How to create Treasure Data #dotsbigdata
Masahiro Nakagawa
August 1, 2015
BigData All Stars 2015
Published on http://eventdots.jp/event/562221
Published in: Engineering

Transcript of "How to create Treasure Data #dotsbigdata"

  1. How to create Treasure Data #dotsbigdata
     Masahiro Nakagawa, August 1, 2015, BigData All Stars 2015
  2. Who are you?
     > Masahiro Nakagawa
     > github/twitter: @repeatedly
     > Treasure Data, Inc.
       > Senior Software Engineer
       > Fluentd / td-agent developer
     > I love OSS :)
       > D language - Phobos committer
       > Fluentd - Main maintainer
       > MessagePack / RPC - D and Python (only RPC)
       > The organizer of Presto Source Code Reading / meetup
       > etc…
  3. Company overview
     http://www.treasuredata.com/opensource
  4. Treasure Data Solution: Ingest, Analyze, Distribute
  5. Treasure Data Service
     > A simplified cloud analytics infrastructure
     > Customers focus on their business
     > SQL interfaces for schema-less data sources
     > Fit for Data Hub / Lake
       > Batch / Low latency / Machine Learning
     > Lots of ingestion and integration solutions
       > Fluentd / Embulk / Data Connector / SDKs
       > Result Output / Prestogres Gateway / BI tools
     > Awesome support for time to value
  6. (image slide)
  7. Plazma - TD’s distributed analytical database
  8. Plazma by the numbers
     > Streaming import: 45 billion records / day
     > Bulk import: 10 billion records / day
     > Hive query: 3+ trillion records / day
       > Machine learning queries (Hivemall) have increased
     > Presto query: 3+ trillion records / day
  9. TD’s resource management
     > Guarantee and boost compute resources
       > Guarantee for stabilizing query performance
       > Boost for sharing free resources
       > Gain multi-tenant benefits
     > Global resource scheduler
       > Manages jobs, resources, and priorities across users
     > Separate storage from compute resources
       > Easy to scale workers
       > We can use S3 / GCS / Azure Storage as a reliable backend
  10. Data Importing
  11. Import flow: td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker
      ✓ Buffering for 5 minutes  ✓ Retrying (at-least once)
      ✓ On-disk buffering on failure  ✓ Unique ID for each chunk
      MessagePack: it’s like JSON, but fast and small.
      unique_id=375828ce5510cadb
      {“time”:1426047906,”uid”:1,…} {“time”:1426047912,”uid”:9,…}
      {“time”:1426047939,”uid”:3,…} {“time”:1426047951,”uid”:2,…} …
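The buffering-plus-retry flow above can be sketched in a few lines of Python. The `queue.put` API, the MD5-derived id, and the backoff constants are illustrative assumptions, not TD's actual implementation; the point is that a chunk's unique_id stays the same across retries, so downstream deduplication works:

```python
import hashlib
import time

def make_chunk(records):
    # Pack records into one chunk; derive a stable unique_id from the
    # payload so a retried upload carries the SAME id and can be
    # deduplicated downstream.
    payload = "\n".join(records).encode("utf-8")
    unique_id = hashlib.md5(payload).hexdigest()[:16]
    return unique_id, payload

def upload_with_retry(queue, records, max_attempts=5, backoff=1.0):
    # At-least-once delivery: retry until the queue accepts the chunk.
    # Duplicates are possible; unique_id lets the server drop them.
    unique_id, payload = make_chunk(records)
    for attempt in range(max_attempts):
        try:
            queue.put(unique_id, payload)  # hypothetical queue API
            return unique_id
        except IOError:
            time.sleep(backoff * 2 ** attempt)  # exponential backoff
    raise RuntimeError("chunk %s not delivered" % unique_id)
```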
  12. Import flow: td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker
      ✓ Buffering for 1 minute  ✓ Retrying (at-least once)
      ✓ On-disk buffering on failure  ✓ Unique ID for each chunk
      The queue records each chunk:
      unique_id          time
      375828ce5510cadb   2015-12-01 10:47
      2024cffb9510cadc   2015-12-01 11:09
      1b8d6a600510cadd   2015-12-01 11:21
      1f06c0aa510caddb   2015-12-01 11:38
  13. Import flow: td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker
      ✓ Buffering for 5 minutes  ✓ Retrying (at-least once)
      ✓ On-disk buffering on failure  ✓ Unique ID for each chunk
      A UNIQUE constraint on unique_id gives at-most once:
      unique_id          time
      375828ce5510cadb   2015-12-01 10:47
      2024cffb9510cadc   2015-12-01 11:09
      1b8d6a600510cadd   2015-12-01 11:21
      1f06c0aa510caddb   2015-12-01 11:38
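The UNIQUE constraint is what turns at-least-once delivery into at-most-once enqueueing. A minimal sketch using SQLite in place of MySQL (the table and column names are hypothetical, not PerfectQueue's actual schema):

```python
import sqlite3

# SQLite stands in for the MySQL table behind PerfectQueue.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE chunks (
    unique_id TEXT UNIQUE,   -- the UNIQUE constraint gives at-most once
    time      TEXT
)""")

def enqueue(unique_id, ts):
    # Insert a chunk; a duplicate unique_id from a retried upload is
    # rejected by the constraint, so each chunk is enqueued exactly once.
    try:
        db.execute("INSERT INTO chunks VALUES (?, ?)", (unique_id, ts))
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: already enqueued

enqueue("375828ce5510cadb", "2015-12-01 10:47")  # first delivery
enqueue("375828ce5510cadb", "2015-12-01 10:47")  # retry is dropped
```

At-least-once on the sending side plus at-most-once on the queue side is what gives effectively exactly-once imports.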
  14. Import Queue feeds multiple Import Workers
      ✓ HA
      ✓ Load balancing
  15. (architecture diagram) Import Queue → Import Worker ×3 → Realtime Storage and Archive Storage on Amazon S3 / Basho Riak CS; Metadata on PostgreSQL
  16. Import Workers write files to Realtime Storage (Amazon S3 / Basho Riak CS); metadata of the records in each file is stored on PostgreSQL:
      uploaded time      file index range                              records
      2015-03-08 10:47   [2015-12-01 10:47:11, 2015-12-01 10:48:13]    3
      2015-03-08 11:09   [2015-12-01 11:09:32, 2015-12-01 11:10:35]    25
      2015-03-08 11:38   [2015-12-01 11:38:43, 2015-12-01 11:40:49]    14
      …                  …                                             …
  17. A Merge Worker (MapReduce) merges Realtime Storage files into Archive Storage every 1 hour.
      Retrying + Unique (at-least-once + at-most-once)
      Realtime Storage metadata (PostgreSQL):
      uploaded time      file index range                              records
      2015-03-08 10:47   [2015-12-01 10:47:11, 2015-12-01 10:48:13]    3
      2015-03-08 11:09   [2015-12-01 11:09:32, 2015-12-01 11:10:35]    25
      2015-03-08 11:38   [2015-12-01 11:38:43, 2015-12-01 11:40:49]    14
      …                  …                                             …
      Archive Storage metadata after the merge:
      file index range                              records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]    3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]    2,143
      …                                             …
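The hourly merge can be sketched as a bucketing step over the realtime metadata. This toy version assumes each file's records fall in the hour of its range start, which matches the example rows above but is a simplification of the real MapReduce job:

```python
from collections import defaultdict
from datetime import datetime

def merge_hourly(realtime_meta):
    # realtime_meta: list of (range_start, range_end, records) rows as
    # in the table above. Bucket record counts into one archive row
    # per hour, truncating each start timestamp to the hour.
    buckets = defaultdict(int)
    for start, _end, records in realtime_meta:
        hour = datetime.strptime(start, "%Y-%m-%d %H:%M:%S").replace(
            minute=0, second=0)
        buckets[hour] += records
    return [(h.strftime("%Y-%m-%d %H:%M:%S"), n)
            for h, n in sorted(buckets.items())]
```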
  18. GiST (R-tree) index on the “time” column of the files.
      Read from Archive Storage if merged; otherwise, from Realtime Storage.
      (same Realtime / Archive metadata tables as slide 17, on PostgreSQL, with files on Amazon S3 / Basho Riak CS)
  19. Data Importing
      > Scalable & reliable importing
        > Fluentd buffers data on a disk
        > The import queue deduplicates uploaded chunks
        > Workers take the chunks and put them into Realtime Storage
      > Instant visibility
        > Imported data is immediately visible to query engines
        > Background workers merge the files every 1 hour
      > Metadata
        > The index is built on PostgreSQL using the RANGE type and a GiST index
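What the RANGE type plus GiST index answers can be sketched as an interval-overlap lookup. The linear scan below is an illustrative stand-in for the logarithmic index lookup PostgreSQL actually performs; ranges are treated as half-open [start, end), and ISO-8601 timestamps compare correctly as plain strings:

```python
def overlapping_files(file_meta, q_start, q_end):
    # file_meta: list of (path, range_start, range_end); ranges are
    # half-open [start, end). Return the files a query over
    # [q_start, q_end) must read - the question the GiST index answers.
    return [path for path, start, end in file_meta
            if start < q_end and end > q_start]
```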
  20. Data processing
  21. Archive Storage: files on Amazon S3 / Basho Riak CS in the MessagePack columnar file format; metadata on PostgreSQL.
      time                  code   method
      2015-12-01 10:02:36   200    GET
      2015-12-01 10:22:09   404    GET
      2015-12-01 10:36:45   200    GET
      2015-12-01 10:49:21   200    POST
      …                     …      …
      time                  code   method
      2015-12-01 11:10:09   200    GET
      2015-12-01 11:21:45   200    GET
      2015-12-01 11:38:59   200    GET
      2015-12-01 11:43:37   200    GET
      2015-12-01 11:54:52   “200”  GET
      …                     …      …
      path   index range                                  records
      …      [2015-12-01 10:00:00, 2015-12-01 11:00:00]   3,312
      …      [2015-12-01 11:00:00, 2015-12-01 12:00:00]   2,143
  22. (same tables as slide 21) Time-based partitioning splits files by hour; column-based partitioning lays out each file column by column. Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL.
  23. (same tables as slide 21) A query reads only the partitions its predicate and column list require:
      SELECT code, COUNT(1) FROM logs
      WHERE time >= 2015-12-01 11:00:00
      GROUP BY code
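The two partitioning axes combine at planning time: the time predicate prunes whole files, and the referenced columns prune blocks within each surviving file. A hypothetical sketch (the metadata layout and the `col_offsets` map are assumptions for illustration):

```python
def plan_scan(metadata, columns, t_min):
    # metadata: list of (path, range_start, range_end, {column: offset}).
    # Time-based partitioning prunes whole files (half-open ranges);
    # column-based partitioning prunes blocks inside each file.
    plan = []
    for path, _start, end, col_offsets in metadata:
        if end <= t_min:          # file is entirely before the predicate
            continue
        for col in columns:       # fetch only the referenced columns
            plan.append((path, col, col_offsets[col]))
    return plan
```

For the query above, only the 11:00 file is kept, and only its `code` column block is read.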
  24. Handling Eventual Consistency
      1. Write data / metadata first
         > At this time, the data is not visible
      2. Check whether the data is available
         > GET, GET, GET…
      3. Data becomes visible
         > Queries include the imported data!
      Ex. Netflix’s case: https://github.com/Netflix/s3mper
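Step 2 above ("GET, GET, GET…") is a polling loop. A minimal sketch, assuming a hypothetical `storage.get` API that returns None until the object becomes consistent:

```python
import time

def wait_until_visible(storage, key, timeout=60.0, interval=1.0):
    # After writing to an eventually consistent object store, poll GET
    # until the object is readable before publishing its metadata.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if storage.get(key) is not None:  # hypothetical storage API
            return True
        time.sleep(interval)
    return False  # still invisible: caller keeps metadata unpublished
```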
  25. Hide network cost
      > Open a lot of connections to Object Storage
        > Use the range feature with columnar offsets
        > Improves scan performance for partitioned data
      > Detect recoverable errors
        > We have error lists for fault tolerance
      > Stall checker
        > Watches the progress of reading data
        > If processing time reaches a threshold, re-connect to Object Storage and re-read the data
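Both the range feature and the stall checker can be sketched briefly. The byte offsets would come from the columnar file header; the threshold default is an illustrative knob, not TD's actual value:

```python
import time

def column_range_header(offset, length):
    # HTTP Range header for one column block: read just those bytes
    # from the object store instead of downloading the whole file.
    return {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}

class StallChecker:
    # Watch read progress; if no bytes arrive within `threshold`
    # seconds, the caller should reconnect and re-read.
    def __init__(self, threshold=30.0):
        self.threshold = threshold
        self.last_progress = time.monotonic()

    def progress(self, nbytes):
        if nbytes > 0:
            self.last_progress = time.monotonic()

    def stalled(self):
        return time.monotonic() - self.last_progress > self.threshold
```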
  26. Optimizing Scan Performance
      > Fully utilize the network bandwidth from S3
      > TD Presto becomes the CPU bottleneck
      > Retry GET requests on: 500 (internal error), 503 (slow down), 404 (not found), eventual consistency
      (diagram: TableScanOperator holds the S3 file list and table schema; a HeaderReader / HeaderParser parses the MPC1 file header for column names and column-block offsets; column-block requests go through a priority Request Queue with a max-connections limit; parallel S3 reads fill reused, size-limited buffers that MessageUnpacker (msgpack-java v0.7) decompresses and pulls records from)
  27. Recoverable errors
      > Error types
        > User error
          > Syntax error, semantic error
        > Insufficient resource
          > Exceeded task memory size
        > Internal failure
          > I/O error of S3 / Riak CS
          > Worker failure
          > etc.
      We can retry these patterns
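The error taxonomy above maps naturally to a retry decision. The `kind` strings and the dict shape below are hypothetical; only the classification logic follows the slide:

```python
RETRYABLE_HTTP = {500, 503, 404}  # matches the GET retry list on slide 26

def classify(error):
    # Map an error description to (category, retryable). Only internal
    # failures are worth retrying; user errors and memory limits are not.
    kind = error.get("kind")
    if kind in ("syntax", "semantic"):
        return ("user error", False)
    if kind == "task_memory_exceeded":
        return ("insufficient resource", False)
    if kind in ("s3_io", "worker_failure") or \
            error.get("status") in RETRYABLE_HTTP:
        return ("internal failure", True)
    return ("unknown", False)
```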
  29. Presto retry on Internal Errors
      > Queries succeed eventually
      (chart; y-axis on a log scale)
  30. time                  code   method
      2015-12-01 10:02:36   200    GET
      2015-12-01 10:22:09   404    GET
      2015-12-01 10:36:45   200    GET
      2015-12-01 10:49:21   200    POST
      …                     …      …
      user   time                  code   method
      391    2015-12-01 11:10:09   200    GET
      482    2015-12-01 11:21:45   200    GET
      573    2015-12-01 11:38:59   200    GET
      664    2015-12-01 11:43:37   200    GET
      755    2015-12-01 11:54:52   “200”  GET
      …      …                     …      …
  31. (same tables as slide 30)
      The MessagePack columnar file format is schema-less
      ✓ Instant schema change
      SQL is schema-full
      ✓ SQL doesn’t work without a schema
      → Schema-on-Read
  32. Realtime Storage and Archive Storage are schema-less; the Query Engine (Hive, Pig, Presto) is schema-full; Schema-on-Read bridges the two.
      {“user”:54, “name”:”plazma”, “value”:”120”, “host”:”local”}
  33. Schema-on-Read: a schema is declared over the schema-less storage, and the Query Engine (Hive, Pig, Presto) applies it at read time.
      {“user”:54, “name”:”plazma”, “value”:”120”, “host”:”local”}
      CREATE TABLE events (
        user INT, name STRING, value INT, host INT
      );
  34. Reading the record through that schema:
      | user | 54 | name | “plazma” | value | 120 | host | NULL |
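Schema-on-Read in miniature: the declared types are applied as each record is read, and values that do not convert (like host = "local" against host INT) become NULL. A sketch of the idea, not Hive's or Presto's actual coercion rules:

```python
def apply_schema(record, schema):
    # Coerce one schema-less JSON record to the declared column types;
    # unconvertible or missing values become NULL (None).
    out = {}
    for col, typ in schema:
        val = record.get(col)
        if typ == "INT":
            try:
                out[col] = int(val)
            except (TypeError, ValueError):
                out[col] = None   # e.g. host="local" against host INT
        else:  # STRING
            out[col] = None if val is None else str(val)
    return out

schema = [("user", "INT"), ("name", "STRING"),
          ("value", "INT"), ("host", "INT")]
row = apply_schema(
    {"user": 54, "name": "plazma", "value": "120", "host": "local"},
    schema)
# row == {"user": 54, "name": "plazma", "value": 120, "host": None}
```

This is why a schema change is instant: the stored files never need rewriting, only the declared schema changes.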
  35. Fluentd: streaming logging layer. Reliable forwarding, pluggable architecture. http://fluentd.org/
  36. Embulk: bulk loading. Parallel processing, pluggable architecture. http://embulk.org/
  37. Hadoop
      > Distributed computing framework
      > Consists of many components…
      http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
  38. Presto
      > A distributed SQL query engine for interactive data analysis against GBs to PBs of data
      > Open sourced by Facebook
      > https://github.com/facebook/presto
  39. Conclusion
      > Build a scalable data analytics platform on the cloud
        > Separate resources and storage
        > Loosely-coupled components
      > We have lots of useful OSS and services :)
      > There are many trade-offs
        > Use an existing component or create a new one?
        > Stick to the basics!
      > If you’re tired, please use Treasure Data ;)
  40. Cloud service for the entire data pipeline.
      https://jobs.lever.co/treasure-data