How to create Treasure Data #dotsbigdata
Masahiro Nakagawa
August 1, 2015
BigData All Stars 2015
Published on http://eventdots.jp/event/562221
Published in: Engineering

Transcript of "How to create Treasure Data #dotsbigdata"

  1. How to create Treasure Data #dotsbigdata
     Masahiro Nakagawa, August 1, 2015, BigData All Stars 2015
  2. Who are you?
     > Masahiro Nakagawa
     > github/twitter: @repeatedly
     > Treasure Data, Inc.
       > Senior Software Engineer
       > Fluentd / td-agent developer
     > I love OSS :)
       > D language - Phobos committer
       > Fluentd - Main maintainer
       > MessagePack / RPC - D and Python (only RPC)
       > The organizer of Presto Source Code Reading / meetup
       > etc…
  3. Company overview
     http://www.treasuredata.com/opensource
  4. Treasure Data Solution: Ingest, Analyze, Distribute
  5. Treasure Data Service
     > A simplified cloud analytics infrastructure
     > Customers focus on their business
     > SQL interfaces for schema-less data sources
     > Fit for Data Hub / Lake
       > Batch / Low latency / Machine Learning
     > Lots of ingestion and integration solutions
       > Fluentd / Embulk / Data Connector / SDKs
       > Result Output / Prestogres Gateway / BI tools
     > Awesome support for time to value
  6. (image slide)
  7. Plazma - TD’s distributed analytical database
  8. Plazma by the numbers
     > Streaming import: 45 billion records / day
     > Bulk import: 10 billion records / day
     > Hive query: 3+ trillion records / day
       > Machine learning queries (Hivemall) have increased
     > Presto query: 3+ trillion records / day
  9. TD’s resource management
     > Guarantee and boost compute resources
       > Guarantee for stabilizing query performance
       > Boost for sharing free resources
       > Gain multi-tenant benefits
     > Global resource scheduler
       > Manages jobs, resources, and priorities across users
     > Separate storage from compute resources
       > Easy to scale workers
       > We can use S3 / GCS / Azure Storage as a reliable backend
  10. Data Importing
  11. Import flow: td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker
      ✓ Buffering for 5 minutes  ✓ Retrying (at-least once)
      ✓ On-disk buffering on failure  ✓ Unique ID for each chunk
      MessagePack: it’s like JSON, but fast and small.
      unique_id=375828ce5510cadb
      {“time”:1426047906,”uid”:1,…} {“time”:1426047912,”uid”:9,…}
      {“time”:1426047939,”uid”:3,…} {“time”:1426047951,”uid”:2,…} …
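The buffering-plus-retry flow above can be sketched in a few lines of Python. The `queue.put` API, the MD5-derived id, and the backoff constants are illustrative assumptions, not TD's actual implementation; the point is that a chunk's unique_id stays the same across retries, so downstream deduplication works:

```python
import hashlib
import time

def make_chunk(records):
    # Pack records into one chunk; derive a stable unique_id from the
    # payload so a retried upload carries the SAME id and can be
    # deduplicated downstream.
    payload = "\n".join(records).encode("utf-8")
    unique_id = hashlib.md5(payload).hexdigest()[:16]
    return unique_id, payload

def upload_with_retry(queue, records, max_attempts=5, backoff=1.0):
    # At-least-once delivery: retry until the queue accepts the chunk.
    # Duplicates are possible; unique_id lets the server drop them.
    unique_id, payload = make_chunk(records)
    for attempt in range(max_attempts):
        try:
            queue.put(unique_id, payload)  # hypothetical queue API
            return unique_id
        except IOError:
            time.sleep(backoff * 2 ** attempt)  # exponential backoff
    raise RuntimeError("chunk %s not delivered" % unique_id)
```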
  12. Import flow: td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker
      ✓ Buffering for 1 minute  ✓ Retrying (at-least once)
      ✓ On-disk buffering on failure  ✓ Unique ID for each chunk
      The queue records each chunk:
      unique_id          time
      375828ce5510cadb   2015-12-01 10:47
      2024cffb9510cadc   2015-12-01 11:09
      1b8d6a600510cadd   2015-12-01 11:21
      1f06c0aa510caddb   2015-12-01 11:38
  13. Import flow: td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker
      ✓ Buffering for 5 minutes  ✓ Retrying (at-least once)
      ✓ On-disk buffering on failure  ✓ Unique ID for each chunk
      A UNIQUE constraint on unique_id gives at-most once:
      unique_id          time
      375828ce5510cadb   2015-12-01 10:47
      2024cffb9510cadc   2015-12-01 11:09
      1b8d6a600510cadd   2015-12-01 11:21
      1f06c0aa510caddb   2015-12-01 11:38
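The UNIQUE constraint is what turns at-least-once delivery into at-most-once enqueueing. A minimal sketch using SQLite in place of MySQL (the table and column names are hypothetical, not PerfectQueue's actual schema):

```python
import sqlite3

# SQLite stands in for the MySQL table behind PerfectQueue.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE chunks (
    unique_id TEXT UNIQUE,   -- the UNIQUE constraint gives at-most once
    time      TEXT
)""")

def enqueue(unique_id, ts):
    # Insert a chunk; a duplicate unique_id from a retried upload is
    # rejected by the constraint, so each chunk is enqueued exactly once.
    try:
        db.execute("INSERT INTO chunks VALUES (?, ?)", (unique_id, ts))
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: already enqueued

enqueue("375828ce5510cadb", "2015-12-01 10:47")  # first delivery
enqueue("375828ce5510cadb", "2015-12-01 10:47")  # retry is dropped
```

At-least-once on the sending side plus at-most-once on the queue side is what gives effectively exactly-once imports.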
  14. Import Queue feeds multiple Import Workers
      ✓ HA
      ✓ Load balancing
  15. (architecture diagram) Import Queue → Import Worker ×3 → Realtime Storage and Archive Storage on Amazon S3 / Basho Riak CS; Metadata on PostgreSQL
  16. Import Workers write files to Realtime Storage (Amazon S3 / Basho Riak CS); metadata of the records in each file is stored on PostgreSQL:
      uploaded time      file index range                              records
      2015-03-08 10:47   [2015-12-01 10:47:11, 2015-12-01 10:48:13]    3
      2015-03-08 11:09   [2015-12-01 11:09:32, 2015-12-01 11:10:35]    25
      2015-03-08 11:38   [2015-12-01 11:38:43, 2015-12-01 11:40:49]    14
      …                  …                                             …
  17. A Merge Worker (MapReduce) merges Realtime Storage files into Archive Storage every 1 hour.
      Retrying + Unique (at-least-once + at-most-once)
      Realtime Storage metadata (PostgreSQL):
      uploaded time      file index range                              records
      2015-03-08 10:47   [2015-12-01 10:47:11, 2015-12-01 10:48:13]    3
      2015-03-08 11:09   [2015-12-01 11:09:32, 2015-12-01 11:10:35]    25
      2015-03-08 11:38   [2015-12-01 11:38:43, 2015-12-01 11:40:49]    14
      …                  …                                             …
      Archive Storage metadata after the merge:
      file index range                              records
      [2015-12-01 10:00:00, 2015-12-01 11:00:00]    3,312
      [2015-12-01 11:00:00, 2015-12-01 12:00:00]    2,143
      …                                             …
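The hourly merge can be sketched as a bucketing step over the realtime metadata. This toy version assumes each file's records fall in the hour of its range start, which matches the example rows above but is a simplification of the real MapReduce job:

```python
from collections import defaultdict
from datetime import datetime

def merge_hourly(realtime_meta):
    # realtime_meta: list of (range_start, range_end, records) rows as
    # in the table above. Bucket record counts into one archive row
    # per hour, truncating each start timestamp to the hour.
    buckets = defaultdict(int)
    for start, _end, records in realtime_meta:
        hour = datetime.strptime(start, "%Y-%m-%d %H:%M:%S").replace(
            minute=0, second=0)
        buckets[hour] += records
    return [(h.strftime("%Y-%m-%d %H:%M:%S"), n)
            for h, n in sorted(buckets.items())]
```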
  18. GiST (R-tree) index on the “time” column of the files.
      Read from Archive Storage if merged; otherwise, from Realtime Storage.
      (same Realtime / Archive metadata tables as slide 17, on PostgreSQL, with files on Amazon S3 / Basho Riak CS)
  19. Data Importing
      > Scalable & reliable importing
        > Fluentd buffers data on a disk
        > The import queue deduplicates uploaded chunks
        > Workers take the chunks and put them into Realtime Storage
      > Instant visibility
        > Imported data is immediately visible to query engines
        > Background workers merge the files every 1 hour
      > Metadata
        > The index is built on PostgreSQL using the RANGE type and a GiST index
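What the RANGE type plus GiST index answers can be sketched as an interval-overlap lookup. The linear scan below is an illustrative stand-in for the logarithmic index lookup PostgreSQL actually performs; ranges are treated as half-open [start, end), and ISO-8601 timestamps compare correctly as plain strings:

```python
def overlapping_files(file_meta, q_start, q_end):
    # file_meta: list of (path, range_start, range_end); ranges are
    # half-open [start, end). Return the files a query over
    # [q_start, q_end) must read - the question the GiST index answers.
    return [path for path, start, end in file_meta
            if start < q_end and end > q_start]
```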
  20. Data processing
  21. Archive Storage: files on Amazon S3 / Basho Riak CS in the MessagePack columnar file format; metadata on PostgreSQL.
      time                  code   method
      2015-12-01 10:02:36   200    GET
      2015-12-01 10:22:09   404    GET
      2015-12-01 10:36:45   200    GET
      2015-12-01 10:49:21   200    POST
      …                     …      …
      time                  code   method
      2015-12-01 11:10:09   200    GET
      2015-12-01 11:21:45   200    GET
      2015-12-01 11:38:59   200    GET
      2015-12-01 11:43:37   200    GET
      2015-12-01 11:54:52   “200”  GET
      …                     …      …
      path   index range                                  records
      …      [2015-12-01 10:00:00, 2015-12-01 11:00:00]   3,312
      …      [2015-12-01 11:00:00, 2015-12-01 12:00:00]   2,143
  22. (same tables as slide 21) Time-based partitioning splits files by hour; column-based partitioning lays out each file column by column. Files on Amazon S3 / Basho Riak CS; metadata on PostgreSQL.
  23. (same tables as slide 21) A query reads only the partitions its predicate and column list require:
      SELECT code, COUNT(1) FROM logs
      WHERE time >= 2015-12-01 11:00:00
      GROUP BY code
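The two partitioning axes combine at planning time: the time predicate prunes whole files, and the referenced columns prune blocks within each surviving file. A hypothetical sketch (the metadata layout and the `col_offsets` map are assumptions for illustration):

```python
def plan_scan(metadata, columns, t_min):
    # metadata: list of (path, range_start, range_end, {column: offset}).
    # Time-based partitioning prunes whole files (half-open ranges);
    # column-based partitioning prunes blocks inside each file.
    plan = []
    for path, _start, end, col_offsets in metadata:
        if end <= t_min:          # file is entirely before the predicate
            continue
        for col in columns:       # fetch only the referenced columns
            plan.append((path, col, col_offsets[col]))
    return plan
```

For the query above, only the 11:00 file is kept, and only its `code` column block is read.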
  24. Handling Eventual Consistency
      1. Write data / metadata first
         > At this time, the data is not visible
      2. Check whether the data is available
         > GET, GET, GET…
      3. Data becomes visible
         > Queries include the imported data!
      Ex. Netflix’s case: https://github.com/Netflix/s3mper
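Step 2 above ("GET, GET, GET…") is a polling loop. A minimal sketch, assuming a hypothetical `storage.get` API that returns None until the object becomes consistent:

```python
import time

def wait_until_visible(storage, key, timeout=60.0, interval=1.0):
    # After writing to an eventually consistent object store, poll GET
    # until the object is readable before publishing its metadata.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if storage.get(key) is not None:  # hypothetical storage API
            return True
        time.sleep(interval)
    return False  # still invisible: caller keeps metadata unpublished
```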
  25. Hide network cost
      > Open a lot of connections to Object Storage
        > Use the range feature with columnar offsets
        > Improves scan performance for partitioned data
      > Detect recoverable errors
        > We have error lists for fault tolerance
      > Stall checker
        > Watches the progress of reading data
        > If processing time reaches a threshold, re-connect to Object Storage and re-read the data
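Both the range feature and the stall checker can be sketched briefly. The byte offsets would come from the columnar file header; the threshold default is an illustrative knob, not TD's actual value:

```python
import time

def column_range_header(offset, length):
    # HTTP Range header for one column block: read just those bytes
    # from the object store instead of downloading the whole file.
    return {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}

class StallChecker:
    # Watch read progress; if no bytes arrive within `threshold`
    # seconds, the caller should reconnect and re-read.
    def __init__(self, threshold=30.0):
        self.threshold = threshold
        self.last_progress = time.monotonic()

    def progress(self, nbytes):
        if nbytes > 0:
            self.last_progress = time.monotonic()

    def stalled(self):
        return time.monotonic() - self.last_progress > self.threshold
```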
  26. Optimizing Scan Performance
      > Fully utilize the network bandwidth from S3
      > TD Presto becomes the CPU bottleneck
      > Retry GET requests on: 500 (internal error), 503 (slow down), 404 (not found), eventual consistency
      (diagram: TableScanOperator holds the S3 file list and table schema; a HeaderReader / HeaderParser parses the MPC1 file header for column names and column-block offsets; column-block requests go through a priority Request Queue with a max-connections limit; parallel S3 reads fill reused, size-limited buffers that MessageUnpacker (msgpack-java v0.7) decompresses and pulls records from)
  27. Recoverable errors
      > Error types
        > User error
          > Syntax error, semantic error
        > Insufficient resource
          > Exceeded task memory size
        > Internal failure
          > I/O error of S3 / Riak CS
          > Worker failure
          > etc.
      We can retry these patterns
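The error taxonomy above maps naturally to a retry decision. The `kind` strings and the dict shape below are hypothetical; only the classification logic follows the slide:

```python
RETRYABLE_HTTP = {500, 503, 404}  # matches the GET retry list on slide 26

def classify(error):
    # Map an error description to (category, retryable). Only internal
    # failures are worth retrying; user errors and memory limits are not.
    kind = error.get("kind")
    if kind in ("syntax", "semantic"):
        return ("user error", False)
    if kind == "task_memory_exceeded":
        return ("insufficient resource", False)
    if kind in ("s3_io", "worker_failure") or \
            error.get("status") in RETRYABLE_HTTP:
        return ("internal failure", True)
    return ("unknown", False)
```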
  29. Presto retry on Internal Errors
      > Queries succeed eventually
      (chart; y-axis on a log scale)
  30. time                  code   method
      2015-12-01 10:02:36   200    GET
      2015-12-01 10:22:09   404    GET
      2015-12-01 10:36:45   200    GET
      2015-12-01 10:49:21   200    POST
      …                     …      …
      user   time                  code   method
      391    2015-12-01 11:10:09   200    GET
      482    2015-12-01 11:21:45   200    GET
      573    2015-12-01 11:38:59   200    GET
      664    2015-12-01 11:43:37   200    GET
      755    2015-12-01 11:54:52   “200”  GET
      …      …                     …      …
  31. (same tables as slide 30)
      The MessagePack columnar file format is schema-less
      ✓ Instant schema change
      SQL is schema-full
      ✓ SQL doesn’t work without a schema
      → Schema-on-Read
  32. Realtime Storage and Archive Storage are schema-less; the Query Engine (Hive, Pig, Presto) is schema-full; Schema-on-Read bridges the two.
      {“user”:54, “name”:”plazma”, “value”:”120”, “host”:”local”}
  33. Schema-on-Read: a schema is declared over the schema-less storage, and the Query Engine (Hive, Pig, Presto) applies it at read time.
      {“user”:54, “name”:”plazma”, “value”:”120”, “host”:”local”}
      CREATE TABLE events (
        user INT, name STRING, value INT, host INT
      );
  34. Reading the record through that schema:
      | user | 54 | name | “plazma” | value | 120 | host | NULL |
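Schema-on-Read in miniature: the declared types are applied as each record is read, and values that do not convert (like host = "local" against host INT) become NULL. A sketch of the idea, not Hive's or Presto's actual coercion rules:

```python
def apply_schema(record, schema):
    # Coerce one schema-less JSON record to the declared column types;
    # unconvertible or missing values become NULL (None).
    out = {}
    for col, typ in schema:
        val = record.get(col)
        if typ == "INT":
            try:
                out[col] = int(val)
            except (TypeError, ValueError):
                out[col] = None   # e.g. host="local" against host INT
        else:  # STRING
            out[col] = None if val is None else str(val)
    return out

schema = [("user", "INT"), ("name", "STRING"),
          ("value", "INT"), ("host", "INT")]
row = apply_schema(
    {"user": 54, "name": "plazma", "value": "120", "host": "local"},
    schema)
# row == {"user": 54, "name": "plazma", "value": 120, "host": None}
```

This is why a schema change is instant: the stored files never need rewriting, only the declared schema changes.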
  35. Fluentd: streaming logging layer. Reliable forwarding, pluggable architecture. http://fluentd.org/
  36. Embulk: bulk loading. Parallel processing, pluggable architecture. http://embulk.org/
  37. Hadoop
      > Distributed computing framework
      > Consists of many components…
      http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
  38. Presto
      > A distributed SQL query engine for interactive data analysis against GBs to PBs of data
      > Open sourced by Facebook
      > https://github.com/facebook/presto
  39. Conclusion
      > Build a scalable data analytics platform on the cloud
        > Separate resources and storage
        > Loosely-coupled components
      > We have lots of useful OSS and services :)
      > There are many trade-offs
        > Use an existing component or create a new one?
        > Stick to the basics!
      > If you’re tired, please use Treasure Data ;)
  40. Cloud service for the entire data pipeline.
      https://jobs.lever.co/treasure-data