Netflix running Presto in the AWS Cloud

Zhenxiao Luo, Senior Software Engineer @ Netflix

Outline
● BigDataPlatform @ Netflix
● Use cases & requirements
● What we did
  ○ Reading/Writing from/to Amazon S3
  ○ Operations
  ○ Deployment
  ○ Performance
● What’s next?

BigDataPlatform @ Netflix

Use Cases
● Big Batch Jobs
  ○ high throughput, fault tolerant, ETL
  ○ data spills to disk
  ○ Hive on Tez, Pig on Tez
● Ad hoc Queries
  ○ low latency, interactive, data exploration
  ○ in-memory, but limited data size
  ○ Impala, Redshift, Spark, Presto

Netflix Requirements
● SQL-like language
● Low latency for ad hoc queries
● Works well on the AWS cloud
● Good integration with the Hadoop stack
● Scales to a 1000+ node cluster
● Open source with community support

What did Netflix do?

Reading/Writing to/from S3
● Option 1: Apache Hadoop NativeS3FileSystem
● Option 2: PrestoS3FileSystem
  ○ retry logic for read timeout
  ○ write directly to final S3 path
● Option 3: emrFileSystem
  ○ disable hadoop logging
  ○ disable hadoop FileSystem cache
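The "retry logic for read timeout" item can be illustrated with a small sketch. This is not the PrestoS3FileSystem code: `ReadTimeout`, `read_with_retry`, the attempt count, and the backoff values are all hypothetical names and parameters chosen for illustration, assuming a simple retry-with-exponential-backoff policy.

```python
import time


class ReadTimeout(Exception):
    """Hypothetical stand-in for a transient S3 read timeout."""


def read_with_retry(read_fn, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call read_fn, retrying on ReadTimeout with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except ReadTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a fake reader that times out twice, then succeeds.
calls = {"n": 0}


def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ReadTimeout("simulated timeout")
    return b"row data"


delays = []  # record the backoff delays instead of actually sleeping
result = read_with_retry(flaky_read, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production reader would also need to resume from the correct byte offset, which is omitted here.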

Bug Fixes
● https://github.com/facebook/presto/commit/cf0b2d66f4050fb1959c832809fa76e323d6d46e
● https://github.com/facebook/presto/commit/594b06c3e93a482dc162d2c49c9bd265795efb86
● https://github.com/facebook/presto/pull/1147
● https://github.com/facebook/presto/pull/1300
● https://github.com/facebook/presto/issues/1285
● https://github.com/facebook/presto/issues/1264

Our Operations Environment
● Launch script on top of EMR
● Ganglia integration
● Usage graphs - concurrent queries & tasks

Current Deployment
● Presto in production @ Netflix
● 100+ node Presto cluster
● 1000+ queries running per day
● Presto queries run against the same petabyte-scale S3 data warehouse as Hive and Pig

Observed Performance @ Netflix
● Data in SequenceFile format
● One MapReduce job, small table scan
  ○ MapReduce overhead dominates the query execution time
  ○ Presto is always ~10X faster than Hive
● One MapReduce job, big table scan
  ○ MapReduce overhead is marginal compared with big table scan time
  ○ Presto performs similarly to Hive
● Multiple MapReduce jobs, aggregation
  ○ Presto is always > 10X faster than Hive
● Joins
  ○ Presto is always > 2X faster than Hive

What we are working on
● Support Parquet file format
  ○ https://github.com/facebook/presto/pull/1147
  ○ Parquet performs similarly to SequenceFile, but is not as fast as RCFile
● ODBC/JDBC driver for Presto
  ○ Support MicroStrategy running on Presto

Some inconveniences ...
● Support server-side “Use Schema”
  ○ Workaround: client-side “Use Schema” or “Schema.Table”
● Recurse the partition directory
  ○ Different behavior from Hive
● Metadata caching
  ○ have to rerun the query a number of times to see the metadata change
● Extend JSON extract functions to allow . notation
  ○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
  ○ Workaround: regexp_extract
● WebUI running slow
  ○ load query task info on demand
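To make the JSON item concrete, the sketch below contrasts the two approaches in plain Python. It does not call Presto: `extract_scalar_by_path` and `regex_workaround` are hypothetical helpers, and the sample document is made up. The first walks a '$.a.b' style path, which is what the requested dot notation would express; the second imitates a regexp_extract-style workaround.

```python
import json
import re


def extract_scalar_by_path(json_text, path):
    """Follow a '$.a.b' style path; return the scalar as a string, or None.

    Hypothetical helper: it only mimics the idea of dot-notation paths and
    only handles nested objects (no array indexes).
    """
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    value = json.loads(json_text)
    for key in path[2:].split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # path does not exist in this document
        value = value[key]
    if isinstance(value, (dict, list)):
        return None  # the path points at a non-scalar value
    return str(value)


def regex_workaround(json_text, leaf_key):
    """Pull out a string value by key name with a regular expression."""
    match = re.search(r'"%s"\s*:\s*"([^"]*)"' % re.escape(leaf_key), json_text)
    return match.group(1) if match else None


doc = '{"namePart1": {"namePart2": "value"}, "other": {"namePart2": "x"}}'
by_path = extract_scalar_by_path(doc, "$.namePart1.namePart2")
by_regex = regex_workaround(doc, "namePart2")
missing = extract_scalar_by_path(doc, "$.namePart1.absent")
```

Both calls find "value" for this document, but the regular expression matches the first key named namePart2 wherever it occurs, so it would return the wrong value if the "other" object came first. That fragility is one reason a path-based function is the nicer interface.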

Features we would like
● Big table join
● User Defined Functions
● Break down one column value into several tuples
  ○ In Hive: lateral view explode json_tuple
● Decimal type
● Scheduler
● Writes
  ○ Insert overwrite
  ○ Alter table add partition
  ○ Parallel writes from workers (not client only)
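The "break down one column value into several tuples" request can be sketched in a few lines of Python. This only illustrates the row-expansion idea behind Hive's lateral view explode; the `explode` function, the column names, and the sample rows are hypothetical.

```python
def explode(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for item in row[column]:
            out = dict(row)      # copy the other columns unchanged
            out[column] = item   # replace the list with a single element
            yield out


rows = [
    {"user": "a", "titles": ["t1", "t2"]},
    {"user": "b", "titles": []},
]
exploded = list(explode(rows, "titles"))
```

In this sketch a row whose list is empty produces no output rows, so user "b" disappears from the result.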

Q & A Thank you!


by Zhenxiao Luo, Software Engineer at Cloudera

on May 15, 2014
