Netflix running Presto in the AWS Cloud

    Presentation Transcript

    • Netflix running Presto in the AWS Cloud
        Zhenxiao Luo, Senior Software Engineer @ Netflix
    • Outline
        ● BigDataPlatform @ Netflix
        ● Use cases & requirements
        ● What we did
            ○ Reading/Writing from/to Amazon S3
            ○ Operations
            ○ Deployment
            ○ Performance
        ● What’s next?
    • BigDataPlatform @ Netflix
    • Use Cases
        ● Big Batch Jobs
            ○ high throughput, fault tolerant, ETL
            ○ data spills to disk
            ○ Hive on Tez, Pig on Tez
        ● Ad Hoc Queries
            ○ low latency, interactive, data exploration
            ○ in-memory, but limited data size
            ○ Impala, Redshift, Spark, Presto
    • Netflix Requirements
        ● SQL-like language
        ● Low latency for ad hoc queries
        ● Works well on the AWS cloud
        ● Good integration with the Hadoop stack
        ● Scales to a 1000+ node cluster
        ● Open source with community support
    • What did Netflix do?
    • Reading/Writing from/to S3
        ● Option 1: Apache Hadoop NativeS3FileSystem
        ● Option 2: PrestoS3FileSystem
            ○ retry logic for read timeout
            ○ write directly to final S3 path
        ● Option 3: emrFileSystem
            ○ disable hadoop logging
            ○ disable hadoop FileSystem cache
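The "retry logic for read timeout" item can be illustrated with a minimal sketch of retrying a read with exponential backoff. This is not the PrestoS3FileSystem implementation: `ReadTimeout`, `read_with_retry`, and the attempt and delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class ReadTimeout(Exception):
    """Hypothetical stand-in for a timeout raised by an S3 read."""


def read_with_retry(read_fn, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call read_fn, retrying with exponential backoff when it times out.

    read_fn is any zero-argument callable that returns the bytes read or
    raises ReadTimeout. The defaults are illustrative, not tuned values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except ReadTimeout:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at the default.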
    • Bug Fixes
        ● https://github.com/facebook/presto/commit/cf0b2d66f4050fb1959c832809fa76e323d6d46e
        ● https://github.com/facebook/presto/commit/594b06c3e93a482dc162d2c49c9bd265795efb86
        ● https://github.com/facebook/presto/pull/1147
        ● https://github.com/facebook/presto/pull/1300
        ● https://github.com/facebook/presto/issues/1285
        ● https://github.com/facebook/presto/issues/1264
    • Our Operations Environment
        ● Launch script on top of EMR
        ● Ganglia integration
        ● Usage graphs - concurrent queries & tasks
    • Current Deployment
        ● Presto in production @ Netflix
        ● 100+ node Presto cluster
        ● 1000+ queries running per day
        ● Presto queries run against the same petabyte-scale S3 data warehouse as Hive and Pig
    • Observed Performance @ Netflix
        ● Data in SequenceFile format
        ● One MapReduce job, small table scan
            ○ MapReduce overhead dominates the query execution time
            ○ Presto is always ~10X faster than Hive
        ● One MapReduce job, big table scan
            ○ MapReduce overhead is marginal compared with big table scan time
            ○ Presto performs similarly to Hive
        ● Multiple MapReduce jobs, aggregation
            ○ Presto is always > 10X faster than Hive
        ● Joins
            ○ Presto is always > 2X faster than Hive
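The small-scan versus big-scan observation follows from a fixed per-job overhead. A rough sketch of that reasoning is below, assuming both engines scan at the same rate and Presto's startup cost is negligible. The function name and all timings are hypothetical illustrations, not measurements from the talk.

```python
def speedup(overhead_s, scan_s, presto_startup_s=0.0):
    """Ratio of Hive time to Presto time under a fixed-overhead model.

    Assumption: Hive pays overhead_s of MapReduce job overhead on top of
    the scan, while Presto pays only presto_startup_s plus the same scan.
    """
    hive_s = overhead_s + scan_s
    presto_s = presto_startup_s + scan_s
    return hive_s / presto_s


# Illustrative numbers only: 27 s of job overhead in both cases.
small_scan = speedup(overhead_s=27, scan_s=3)     # overhead dominates
big_scan = speedup(overhead_s=27, scan_s=3000)    # scan dominates
```

With these made-up inputs the small scan comes out 10 times faster and the big scan under 1% faster, which matches the shape of the slide's two scan cases without claiming to reproduce its numbers.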
    • What we are working on
        ● Support Parquet file format
            ○ https://github.com/facebook/presto/pull/1147
            ○ Parquet performs similarly to SequenceFile, but not as fast as RCFile
        ● ODBC/JDBC driver for Presto
            ○ Support MicroStrategy running on Presto
    • Some inconveniences ...
        ● Support server-side “Use Schema”
            ○ Workaround: client-side “Use Schema” or “Schema.Table”
        ● Recurse the partition directory
            ○ Different behavior from Hive
        ● Metadata caching
            ○ have to rerun the query a number of times to see the metadata change
        ● Extend JSON extract functions to allow . notation
            ○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
            ○ Workaround: regexp_extract
        ● WebUI running slow
            ○ load query task info on demand
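To make the JSON item concrete, here is a small Python sketch of what a dotted path such as '$.namePart1.namePart2' is expected to return, next to a regex-based workaround in the spirit of regexp_extract. Both functions are Python stand-ins written for illustration, not Presto's SQL functions. The regex variant assumes a flat inner object and a string value.

```python
import json
import re


def json_extract_scalar(doc, path):
    """Emulate a '$.a.b' lookup on a JSON string (illustrative only).

    Returns the scalar as text, or None if the path is missing or the
    value is an object, array, or null.
    """
    if not path.startswith("$."):
        raise ValueError("only '$.key.key' paths are handled in this sketch")
    try:
        node = json.loads(doc)
    except ValueError:
        return None
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    if node is None or isinstance(node, (dict, list)):
        return None
    return node if isinstance(node, str) else json.dumps(node)


def regexp_extract_scalar(doc, outer, inner):
    """Regex workaround: find "outer": {... "inner": "value" ...}.

    Assumes the inner object holds no nested braces and the value is a
    string without escaped quotes, so it is far more fragile than parsing.
    """
    pattern = r'"%s"\s*:\s*\{[^{}]*"%s"\s*:\s*"([^"]*)"' % (
        re.escape(outer), re.escape(inner))
    match = re.search(pattern, doc)
    return match.group(1) if match else None
```

The two agree on simple documents, but the regex version breaks on nested objects or non-string values, which is why path support in the extract functions is the cleaner fix.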
    • Features we would like
        ● Big table join
        ● User Defined Functions
        ● Break down one column value into several tuples
            ○ In Hive: lateral view explode json_tuple
        ● Decimal type
        ● Scheduler
        ● Writes
            ○ Insert overwrite
            ○ Alter table add partition
            ○ Parallel writes from workers (not client only)
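The "break down one column value into several tuples" request can be pictured with a short sketch of explode-style semantics: one output row per element of an array column, with the other columns repeated. The `explode` helper and the sample rows are hypothetical. Rows whose array is empty produce no output, as with a non-OUTER lateral view in Hive.

```python
def explode(rows, column):
    """Return one row per element of rows[i][column].

    rows is a list of dicts and column names a list-valued field. Other
    fields are copied unchanged into every output row.
    """
    out = []
    for row in rows:
        for value in row[column]:
            new_row = dict(row)      # copy so the input is not mutated
            new_row[column] = value  # replace the list with one element
            out.append(new_row)
    return out


# Hypothetical sample data.
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
]
exploded = explode(rows, "tags")
```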
    • Q & A Thank you!