
201809 DB tech showcase

Published on

https://www.db-tech-showcase.com/dbts/tokyo on A24

Published in: Engineering

  1. PlazmaDB: Architecture and Operation of the Distributed Storage Behind a Petabyte-Scale Data Analytics Platform. Keisuke Suzuki, Software Engineer. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
  2. Who am I? Keisuke Suzuki: Backend Engineer @ Treasure Data KK (ex-Fujitsu); DB / Distributed systems / Performance optimization; Twitter: @yajilobee
  3. Treasure Data & PlazmaDB
  4. Arm Treasure Data eCDP
  5. PlazmaDB. Inputs: Streaming data, Bulk load. Components: PlazmaDB, Metadata (PostgreSQL), AWS S3 / Riak CS
  6. Daily Workload & Storage Size. Import: 500 Billion Records / day (~5.8 Million Records / sec). Query: 600,000 Queries / day, 15 Trillion Records / day. Storage size: 5 PB (+5~10 TB / day), 55 Trillion Records.
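
The per-second import rate on slide 6 follows directly from the daily figure. A minimal sketch of the arithmetic, assuming the 15 trillion records / day figure counts records processed by queries (the slide does not spell this out):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted on the slide.
records_imported_per_day = 500e9   # 500 billion records / day
queries_per_day = 600_000          # 600,000 queries / day
records_queried_per_day = 15e12    # 15 trillion records / day

import_rate = records_imported_per_day / SECONDS_PER_DAY

# Assumption: the 15 trillion figure is the total number of records
# processed by all queries in a day, so this is a per-query average.
avg_records_per_query = records_queried_per_day / queries_per_day

print(f"import rate: {import_rate / 1e6:.1f} million records / sec")
print(f"average records per query: {avg_records_per_query / 1e6:.0f} million")
```
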
  7. PlazmaDB Features - Columnar format - Partitioned by time and optionally by a user-defined column - Schemaless - Partition index - Partition optimization - Merge partitions - Realtime Storage & Archive Storage - Transaction - Read committed isolation
  8. PlazmaDB Features - Columnar format - Partitioned by time and optionally by a user-defined column - Schemaless - Partition index - Partition optimization - Merge partitions - Realtime Storage & Archive Storage - Transaction - Read committed isolation
  9. Log data: many columns (attributes), accumulating over time.
       time                 orderid  user  region  price  ...
       2018-01-01 10:00:00  1        1     'A'     10000
       2018-01-01 10:03:03  2        7     'C'     40000
       2018-01-01 10:23:03  3        6     'B'     3000
       2018-01-01 10:23:12  4        3     'A'     5500
       2018-01-01 11:04:44  5        1     'A'     20000
       2018-01-01 11:30:00  6        8     'C'     3000
       ...
  10. Analytical Query: SELECT region, SUM(price) FROM orders WHERE time >= '2018-01-01 10:00' AND time <= '2018-01-01 11:00' GROUP BY region. The query uses only a few of the columns and filters by a time window (sample orders table as on slide 9).
  11. Inefficiency of Row-Based Format: for the same query on the sample orders table, the scan proceeds row by row, so the scanned data includes every column of each row, even though the query needs only time, region, and price.
  12. Columnar Format: for the same query, the scan proceeds column by column, so the scanned data is limited to the few columns the query uses, and the time window filters the rows.
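
The difference between slides 11 and 12 can be illustrated with a small in-memory sketch. It is not PlazmaDB's file format; it only assumes that a columnar layout keeps one array per column, and it uses the six sample rows from slide 9.

```python
from datetime import datetime

COLUMNS = ("time", "orderid", "user", "region", "price")
ROWS = [
    ("2018-01-01 10:00:00", 1, 1, "A", 10000),
    ("2018-01-01 10:03:03", 2, 7, "C", 40000),
    ("2018-01-01 10:23:03", 3, 6, "B", 3000),
    ("2018-01-01 10:23:12", 4, 3, "A", 5500),
    ("2018-01-01 11:04:44", 5, 1, "A", 20000),
    ("2018-01-01 11:30:00", 6, 8, "C", 3000),
]

# Columnar layout (illustrative): one list per column instead of one tuple per row.
columns = {name: [row[i] for row in ROWS] for i, name in enumerate(COLUMNS)}


def sum_price_by_region(cols, start, end):
    """SELECT region, SUM(price) ... WHERE time BETWEEN start AND end GROUP BY region,
    reading only the time, region, and price columns."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    totals = {}
    for t, region, price in zip(cols["time"], cols["region"], cols["price"]):
        if lo <= datetime.fromisoformat(t) <= hi:
            totals[region] = totals.get(region, 0) + price
    return totals


result = sum_price_by_region(columns, "2018-01-01 10:00:00", "2018-01-01 11:00:00")

# Values touched by each scan strategy on this tiny table.
values_row_scan = len(ROWS) * len(COLUMNS)  # every column of every row
values_column_scan = len(ROWS) * 3          # only time, region, price

print(result, values_row_scan, values_column_scan)
```

With only five columns the saving is small; the slide's point is that real log tables have many columns, so the gap grows with table width.
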
  13. PlazmaDB Partitions: the application sends logs periodically as JSON records, e.g. {"time": "2018-01-01 10:00:00", "orderid": 1, …}, {"time": "2018-01-01 10:03:03", "orderid": 2, …}; a worker converts each batch to columnar format and stores it as a partition. Partition file: an S3/RiakCS object. A table is a collection of partitions. In the example, records 1-2, records 3-4, and record 5 become three partitions.
  14. PlazmaDB Metadata: partition files are stored in AWS S3 / RiakCS, and the Meta DB (PostgreSQL) holds one row per partition (data_set_id, path, ...). PlazmaDB is multi-tenant. data_set_id: ID for a combination of User, Database, Table. In the example, data set 1 has three partitions and data set 2 has one.
  15. Partition Index: each Meta DB (PostgreSQL) row also records the time range of its partition (data_set_id, time_range, path, ...). Data set 1: [2018-01-01 10:00:00, 2018-01-01 10:03:03]; [2018-01-01 10:23:03, 2018-01-01 10:23:12]; [2018-01-01 11:04:44, 2018-01-01 11:04:44]. Data set 2: one partition. Partition files are in AWS S3 / RiakCS.
  16. Partition Lookup: SELECT region, SUM(price) FROM orders WHERE time >= '2018-01-01 10:00' AND time <= '2018-01-01 11:00' GROUP BY region (assume orders is data set 1). Meta DB (PostgreSQL) rows (data_set_id, time_range, path, ...): 1, [2018-01-01 10:00:00, 2018-01-01 10:03:03]; 1, [2018-01-01 10:23:03, 2018-01-01 10:23:12]; 1, [2018-01-01 11:04:44, 2018-01-01 11:04:44]; and one row for data set 2.
  17. Skip Partition Scan: for the same query, only partitions whose time range overlaps the query window are scanned. Partition with rows at 10:00:00 and 10:03:03: scanned. Partition with rows at 10:23:03 and 10:23:12: scanned. Partition with rows at 11:04:44 and 11:30:00: skipped.
  18. How to find Partitions? The same query on orders (data set 1) has to be matched against the time_range values in the Meta DB (PostgreSQL). The number of partitions in a data set can be large (1M+) for large tables.
  19. Range Type and GiST Index of PostgreSQL: time_range is stored as a range type and indexed with a GiST index. Overlap operator: time_range && [2018-01-01 10:00, 2018-01-01 11:00]. Overlap is checked by index scan.
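
The overlap test behind the `&&` operator on slide 19 can be sketched in plain Python. This is only an illustration of the predicate for closed ranges: it scans the list linearly, whereas the slides rely on a PostgreSQL GiST index to avoid that. The `Partition` type, the `path` values, and the third time range (taken from the sample rows on slide 17) are assumptions made for the example.

```python
from datetime import datetime
from typing import List, NamedTuple


class Partition(NamedTuple):
    data_set_id: int
    start: datetime  # inclusive lower bound of time_range
    end: datetime    # inclusive upper bound of time_range
    path: str        # hypothetical object key, not from the slides


def ts(text: str) -> datetime:
    return datetime.fromisoformat(text)


def overlaps(p: Partition, q_start: datetime, q_end: datetime) -> bool:
    """Closed ranges [a, b] and [c, d] overlap iff a <= d and c <= b."""
    return p.start <= q_end and q_start <= p.end


def find_partitions(parts: List[Partition], data_set_id: int,
                    q_start: datetime, q_end: datetime) -> List[str]:
    # Linear scan for clarity; an index makes this sub-linear in practice.
    return [p.path for p in parts
            if p.data_set_id == data_set_id and overlaps(p, q_start, q_end)]


meta_db = [
    Partition(1, ts("2018-01-01 10:00:00"), ts("2018-01-01 10:03:03"), "part-1"),
    Partition(1, ts("2018-01-01 10:23:03"), ts("2018-01-01 10:23:12"), "part-2"),
    Partition(1, ts("2018-01-01 11:04:44"), ts("2018-01-01 11:30:00"), "part-3"),
    Partition(2, ts("2018-01-01 10:00:00"), ts("2018-01-01 10:03:03"), "part-4"),
]

hits = find_partitions(meta_db, 1, ts("2018-01-01 10:00:00"), ts("2018-01-01 11:00:00"))
print(hits)
```

Only the two partitions of data set 1 that fall inside the 10:00 to 11:00 window are returned; the 11:04 partition and the other tenant's partition are never opened.
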
  20. User Defined Partition (Beta): SELECT … FROM … WHERE time > … AND time < ... AND region = 'A'. Partitions are divided by time and by region (the user-defined key, with values A, B, C).
  21. Partition Fragmentation: Presto workers and Hive workers read the partitions of each data set from PlazmaDB; with many small partitions, every read adds latency (30 ms+) and a GET operation cost.
  22. Realtime Storage & Archive Storage: imported partitions land in Realtime Storage; for each data set they are merged (1 hour) into Archive Storage.
  23. Realtime Storage & Archive Storage: imported partitions land in Realtime Storage; for each data set they are merged (1 hour) into Archive Storage. The number of partitions is reduced to 1/20 - 1/100.
  24. Streaming & Bulk data upload: Streaming data, Bulk load; Realtime Storage, Archive Storage
  25. Fragmentation of Archive Storage. Realtime Storage rows (data_set_id, time_range, import_time): 1, [... 10:00:00, ... 10:03:03], … 10:05:00; 1, [... 10:23:03, ... 10:23:12], … 10:25:00; ...; 1, [... 20:30:00, ... 20:34:44], … 20:35:00; 1, [... 10:00:44, ... 20:44:14], … 20:45:00 (delayed data). Archive Storage rows (data_set_id, time_range), split by 1 hour window: 1, [... 10:00:00, ... 11:00:00]; 1, [... 10:00:44, … 11:00:00]; 1, [... 15:04:38, ... 15:34:44]; 1, [... 20:00:00, ... 21:00:00]; ...
  26. Remerge Partitions (Archive Storage)
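
Slides 22 to 26 can be tied together with a toy model. It is a sketch under stated assumptions, not the actual merge job: each merge run is assumed to write one archive partition per (data set, 1-hour window of record time), and `merge_run`, `remerge`, and `hour_window` are hypothetical names. Partitions are represented only by their record timestamps.

```python
from collections import defaultdict
from datetime import datetime


def ts(text):
    return datetime.fromisoformat(text)


def hour_window(t):
    """Start of the 1-hour window that contains timestamp t."""
    return t.replace(minute=0, second=0, microsecond=0)


def merge_run(realtime_partitions):
    """One merge run: small partitions -> one partition per (data_set_id, hour)."""
    merged = defaultdict(list)
    for data_set_id, stamps in realtime_partitions:
        for t in stamps:
            merged[(data_set_id, hour_window(t))].append(t)
    return [(key, sorted(stamps)) for key, stamps in sorted(merged.items())]


def remerge(archive_partitions):
    """Combine archive partitions that cover the same (data_set_id, hour)."""
    combined = defaultdict(list)
    for key, stamps in archive_partitions:
        combined[key].extend(stamps)
    return [(key, sorted(stamps)) for key, stamps in sorted(combined.items())]


# First run: two small partitions imported shortly after 10:00.
run1 = merge_run([
    (1, [ts("2018-01-01 10:00:00"), ts("2018-01-01 10:03:03")]),
    (1, [ts("2018-01-01 10:23:03"), ts("2018-01-01 10:23:12")]),
])

# Later run: the second partition carries a delayed 10:00:44 record.
run2 = merge_run([
    (1, [ts("2018-01-01 20:30:00"), ts("2018-01-01 20:34:44")]),
    (1, [ts("2018-01-01 10:00:44"), ts("2018-01-01 20:44:14")]),
])

archive = run1 + run2          # the 10:00 window now has two partitions
remerged = remerge(archive)    # remerge folds them back into one

print(len(archive), len(remerged))
```

The delayed record is what fragments the archive: the 10:00 window ends up with two partitions until a remerge combines them.
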
  27. Re: PlazmaDB features - Columnar format - Partitioned by time and optionally by a user-defined column - Schemaless - Partition index - Partition optimization - Merge partitions - Realtime Storage & Archive Storage - Transaction - Read committed isolation
  28. Current PlazmaDB & Future Challenges
  29. Data Volume: Realtime Storage and Archive Storage on AWS S3 / Riak CS: 5 PB. Meta DB (PostgreSQL), holding partition metadata with a GiST index for each storage: 1 TB.
  30. Monitoring: Arm Treasure Data (detailed log analysis); DataDog (metrics visualization)
  31. Read Workload on Meta DB: Metadata size ~1 TB; Shared buffer size ~150 GB; but the hot data size is much smaller than the shared buffer. [Chart: number of read requests per data set per day vs. number of partitions read per day]
  32. Write Workload on Meta DB: 200k transactions / min (~3k transactions / sec); 3 MB / sec
  33. PostgreSQL Auto VACUUM (FREEZE). VACUUM FREEZE: vacuum to prevent transaction ID wraparound failures. It forces a full scan of the relation (as of PostgreSQL 9.4): hot data may be evicted while relations are scanned for vacuum, so the read workload can be affected; PostgreSQL 9.6 or later mitigates the problem. The more transaction IDs are consumed, the more often this vacuum runs; in our case, it happens every 2-3 days.
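
The link between write rate and vacuum frequency on slide 33 can be estimated with simple arithmetic. The sketch below assumes every write transaction from slide 32 consumes one transaction ID and that the anti-wraparound vacuum is triggered by PostgreSQL's `autovacuum_freeze_max_age` setting (default 200 million). The slides do not state the configured value, so the numbers only show what would be consistent with "every 2-3 days".

```python
MINUTES_PER_DAY = 24 * 60
WRITE_TX_PER_MIN = 200_000  # write workload quoted on slide 32

# Assumption: one transaction ID consumed per write transaction.
XIDS_PER_DAY = WRITE_TX_PER_MIN * MINUTES_PER_DAY


def days_until_freeze(freeze_max_age: int) -> float:
    """Days of write traffic needed to consume freeze_max_age transaction IDs."""
    return freeze_max_age / XIDS_PER_DAY


def xids_consumed(days: float) -> int:
    """Transaction IDs consumed over the given number of days."""
    return int(XIDS_PER_DAY * days)


print(f"XIDs per day: {XIDS_PER_DAY:,}")
print(f"days to reach the default 200M threshold: {days_until_freeze(200_000_000):.2f}")
print(f"XIDs consumed in 2 days: {xids_consumed(2):,}")
print(f"XIDs consumed in 3 days: {xids_consumed(3):,}")
```

At this rate the default threshold would be reached in well under a day, so a 2-3 day interval would correspond to a threshold somewhere in the hundreds of millions; that reading is an inference from the arithmetic, not something the slides state.
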
  34. Impact of VACUUM FREEZE
  35. GiST Index Bloat. [Chart: B-tree and GiST index sizes, axis from 50GB to 200GB] Bloat: 30-40 GB/month. Reindex (~20h).
  36. Metadata Table Partitioning: the partition metadata and its GiST index are split into multiple relations, so reindexing one relation takes less time and less space.
  37. Meta DB Scale out: DB1, DB2, DB3
  38. More Partition Skip: SELECT region, SUM(price) FROM orders WHERE time >= '2018-01-01 10:00' AND time <= '2018-01-01 11:00' AND user_age >= 20 AND user_age <= 30 GROUP BY region. The Meta DB (PostgreSQL) rows (data_set_id, time_range, path, ...) hold only time ranges, so the user_age condition is applied by scanning each selected partition and filtering.
  39. More Partition Skip: for the same query, the Meta DB (PostgreSQL) rows gain a user_age_range column (data_set_id, time_range, user_age_range): 1, [2018-01-01 10:00:00, 2018-01-01 10:03:03], [35, 40]; 1, [2018-01-01 10:23:03, 2018-01-01 10:23:12], [25, 30]; 1, [2018-01-01 11:04:44, 2018-01-01 11:04:44]; and one row for data set 2. Store metadata on frequently accessed columns.
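
The idea on slide 39 amounts to min/max pruning on an extra column. A minimal sketch, assuming each metadata row stores the smallest and largest `user_age` in its partition and that a partition without such metadata must still be scanned. `PartitionMeta`, the `path` values, and the third partition's time range are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class PartitionMeta(NamedTuple):
    path: str                                  # hypothetical object key
    time_range: Tuple[str, str]                # inclusive, 'YYYY-MM-DD HH:MM:SS'
    user_age_range: Optional[Tuple[int, int]]  # None when no metadata is stored


def ranges_overlap(a, b) -> bool:
    """Closed ranges overlap iff each one starts before the other ends."""
    return a[0] <= b[1] and b[0] <= a[1]


def partitions_to_scan(metas: List[PartitionMeta],
                       time_window: Tuple[str, str],
                       age_window: Tuple[int, int]) -> List[str]:
    selected = []
    for m in metas:
        # Timestamps share one fixed format, so string comparison is chronological.
        if not ranges_overlap(m.time_range, time_window):
            continue
        # Skip only when the metadata proves that no row can match.
        if m.user_age_range is not None and not ranges_overlap(m.user_age_range, age_window):
            continue
        selected.append(m.path)
    return selected


metas = [
    PartitionMeta("part-1", ("2018-01-01 10:00:00", "2018-01-01 10:03:03"), (35, 40)),
    PartitionMeta("part-2", ("2018-01-01 10:23:03", "2018-01-01 10:23:12"), (25, 30)),
    PartitionMeta("part-3", ("2018-01-01 11:04:44", "2018-01-01 11:30:00"), None),
]

to_scan = partitions_to_scan(
    metas, ("2018-01-01 10:00:00", "2018-01-01 11:00:00"), (20, 30))
print(to_scan)
```

Without the age metadata both 10:00-hour partitions would be read and filtered; with it, the partition whose ages span 35 to 40 is skipped without being opened.
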
  40. Smart Partition Selection for Remerge: Remerge is resource-consuming. Current criteria: data set size, # of partitions. Idea: access frequency, data freshness (fresh data is likely to be hot).
  41. Summary: PlazmaDB is the storage layer of the Arm Treasure Data analytics platform. Optimization for analytical queries: Columnar + Time Partitioning + Partition Index. Optimization for streaming data: Realtime & Archive Storage + Merge Partition. Challenges: reduce the impact of PostgreSQL VACUUM FREEZE; GiST index management; more partition optimization (enrich metadata, smart remerge).
  42. We are Hiring!! https://www.treasuredata.com/company/careers/
  43. Thank You! Danke! Merci! 谢谢! Gracias! Kiitos!
