Lessons from TPC-DS: PostgreSQL's Weaknesses and Future Outlook

1. Lessons from TPC-DS: PostgreSQL's Weaknesses and Future Outlook
   NEC Business Creation Division, The PG-Strom Project
   KaiGai Kohei <kaigai@ak.jp.nec.com>

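2. About me
   ▌Name: KaiGai Kohei
   ▌Company: NEC
   ▌Work:
     - Leader of the PG-Strom project
     - Contributions to PostgreSQL and other OSS projects
     - Launching a business built around PG-Strom
   ▌The PG-Strom project
     - Mission: deliver the results of advances in semiconductor technology, such as heterogeneous computing and non-volatile memory, to every user.
     - Started in 2012 as KaiGai's personal development project; it is now funded by NEC.
     - Fully open source (GPL v2)
   PGconf.JP 2015 - What TPC-DS tells us and further enhancement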

3. Today's topics
   ① Run the TPC-DS queries, which are (said to be) representative of OLAP workloads, on PostgreSQL 9.5β
   ② Which PostgreSQL weaknesses does that reveal?
   ③ How is the developer community addressing these issues? (mainly the work on parallel and distributed processing)

4. Agenda
   1. What is the TPC-DS benchmark?
   2. Benchmark results and analysis
   3. Approaches to improvement
   4. The future beyond that

5. What is the TPC-DS benchmark? (1/3)
   ▌What is TPC?
     - TPC: Transaction Processing Performance Council
     - A vendor-neutral non-profit organization, founded in 1988 and based in California.
     - Its purpose is to define transaction processing and database benchmarks.
   ▌Benchmarks it defines
     - TPC-C   On-line transaction processing (1992~)
     - TPC-H   Ad-hoc decision support system (1999~)
     - TPC-E   Complex on-line transaction processing (2006~)
     - TPC-DS  Complex decision support system (2011~)
     - TPC-DI  Data integration (2013~)
   Callouts:
     - DB update performance: workloads modeled on operational (business transaction) systems
     - DB aggregation performance: workloads modeled on analytical (information) systems

6. What is the TPC-DS benchmark? (2/3)
   ▌What is TPC Benchmark™ DS
     - A decision support benchmark
     - Real-world business questions asked against large volumes of data
       • Covers areas such as ad-hoc queries, reporting, data mining, ...
     - Assumes heavy load on both processors (CPU/GPU) and I/O
     - Defines 99 kinds of queries of various types (103 queries in total)

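7. What is the TPC-DS benchmark? (3/3)
   ▌TPC-DS data/business model
     - A retail and distribution business operating nationwide with stores in each region. It is assumed to sell through three channels: store, Web, and catalog.
     - Business processes
       • Tracking customer purchases and returns
       • Dynamic Web page generation based on customer attributes (CRM)
       • Inventory management
   ▌Data structure
     - A star schema centered on the sales history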

8. Example of a TPC-DS query: Query01

    with customer_total_return as
      (select sr_customer_sk as ctr_customer_sk
             ,sr_store_sk as ctr_store_sk
             ,sum(SR_FEE) as ctr_total_return
         from store_returns
             ,date_dim
        where sr_returned_date_sk = d_date_sk
          and d_year = 2000
        group by sr_customer_sk
                ,sr_store_sk)
    select c_customer_id
      from customer_total_return ctr1
          ,store
          ,customer
     where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
                                      from customer_total_return ctr2
                                     where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
       and s_store_sk = ctr1.ctr_store_sk
       and s_state = 'TN'
       and ctr1.ctr_customer_sk = c_customer_sk
     order by c_customer_id
     limit 100;

   Finds 100 customers who, in the year 2000 at stores in Tennessee, had total returns more than 20% above the store average.

9. Agenda
   1. What is the TPC-DS benchmark?
   2. Benchmark results and analysis
      • Sub-queries that are not flattened
      • Misjudged Nested Loops
      • What TPC-DS taught us
   3. Approaches to improvement
   4. The future beyond that

10. Benchmark conditions
   ▌Software
     - PostgreSQL v9.5β1
       • work_mem = 96GB
       • shared_buffers = 160GB
       • statement_timeout = 3600000 (abort after 1 hour)
     - Red Hat Enterprise Linux Server 6.6
     - NVIDIA CUDA 7.0
   ▌Hardware
     - Dell PowerEdge T630
       • CPU: Intel Xeon E5-2670v3 (2.3GHz, 12C) x 2
       • GPU: NVIDIA Tesla K20c (706MHz, 2496C) x 1
       • RAM: 384GB (16GB RDIMM 2133MT/s x 24)
       • HDD: 300GB (SAS, 10krpm) x 8; RAID5
   ▌Benchmark assumptions
     - Scaling Factor = 100
     - All data preloaded into RAM with pg_prewarm

11. Benchmark result ①: PostgreSQL v9.5β + original TPC-DS
   ▌11 of the 103 queries did not finish even after 1 hour (= 3600 sec).
   [Figure: bar chart of query response time (sec, 0 to 3500) per query; PostgreSQL v9.5β + original TPC-DS]

12. Analysis of the benchmark results (1/5): Query01 as an example

    with customer_total_return as
      (select sr_customer_sk as ctr_customer_sk
             ,sr_store_sk as ctr_store_sk
             ,sum(SR_FEE) as ctr_total_return
         from store_returns
             ,date_dim
        where sr_returned_date_sk = d_date_sk
          and d_year = 2000
        group by sr_customer_sk
                ,sr_store_sk)
    select c_customer_id
      from customer_total_return ctr1
          ,store
          ,customer
     where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
                                      from customer_total_return ctr2
                                     where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
       and s_store_sk = ctr1.ctr_store_sk
       and s_state = 'TN'
       and ctr1.ctr_customer_sk = c_customer_sk
     order by c_customer_id
     limit 100;

13. Analysis of the benchmark results (2/5): execution plan of Query01

    Limit  (cost=433929567.15..433929567.40 rows=100 width=17)
      CTE customer_total_return  (cost=0.00..2773.22 rows=138661 width=48)
        ...(CTE omitted)...
      ->  Sort  (cost=432901698.79..432901711.73 rows=5174 width=17)
            Sort Key: customer.c_customer_id
            ->  Nested Loop  (cost=0.43..432901501.05 rows=5174 width=17)
                  ->  Nested Loop  (cost=0.00..432881293.33 rows=5174 width=8)
                        Join Filter: (ctr1.ctr_store_sk = store.s_store_sk)
                        ->  CTE Scan on customer_total_return ctr1  (cost=0.00..432850070.69 rows=46220 width=16)
                              Filter: (ctr_total_return > (SubPlan 2))
                              SubPlan 2
                                ->  Aggregate  (cost=3121.61..3121.62 rows=1 width=32)
                                      ->  CTE Scan on customer_total_return ctr2  (cost=0.00..3119.87 rows=693 width=32)
                                            Filter: (ctr1.ctr_store_sk = ctr_store_sk)
                        ->  Materialize  (cost=0.00..24.25 rows=45 width=8)
                              ->  Seq Scan on store  (cost=0.00..24.02 rows=45 width=8)
                                    Filter: (s_state = 'TN'::bpchar)
                  ->  Index Scan using customer_pkey on customer  (cost=0.43..3.90 rows=1 width=25)
                        Index Cond: (c_customer_sk = ctr1.ctr_customer_sk)
    (26 rows)

14. Analysis of the benchmark results (3/5): who is the culprit?

    with customer_total_return as
      (select sr_customer_sk as ctr_customer_sk
             ,sr_store_sk as ctr_store_sk
             ,sum(SR_FEE) as ctr_total_return
         from store_returns
             ,date_dim
        where sr_returned_date_sk = d_date_sk
          and d_year = 2000
        group by sr_customer_sk
                ,sr_store_sk)
    select c_customer_id
      from customer_total_return ctr1
          ,store
          ,customer
     where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2
                                      from customer_total_return ctr2
                                     where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
       and s_store_sk = ctr1.ctr_store_sk
       and s_state = 'TN'
       and ctr1.ctr_customer_sk = c_customer_sk
     order by c_customer_id
     limit 100;

   The WHERE clause of the sub-query depends on the contents of the row read from ctr1 (it is parameterized).
   ⇒ The sub-query is repeated as many times as there are rows in ctr1!

15. Analysis of the benchmark results (4/5): replacing the repetition with a Join

    with customer_total_return as (...CTE omitted...)
    select c_customer_id
      from customer_total_return ctr1
          ,store
          ,customer
          ,(select ctr_store_sk
                  ,avg(ctr_total_return)::numeric(7,2) avg_total_return
              from customer_total_return
             group by ctr_store_sk) ctr2
     where ctr1.ctr_store_sk = ctr2.ctr_store_sk
       and ctr1.ctr_total_return > avg_total_return*1.2
       and s_store_sk = ctr1.ctr_store_sk
       and s_state = 'TN'
       and ctr1.ctr_customer_sk = c_customer_sk
     order by c_customer_id
     limit 100;

   First compute the average per ctr_store_sk exactly once, then INNER JOIN it with the result of customer_total_return.
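The difference between the two formulations of Query01 can be sketched in plain Python. This is only an illustration under stated assumptions: the six rows in `ctr` are made-up stand-ins for customer_total_return, and the functions `correlated` and `joined` are hypothetical names, not PostgreSQL internals. The sketch counts how many rows each strategy visits.

```python
from collections import defaultdict

# Toy stand-in for customer_total_return: (customer_sk, store_sk, total_return).
# The values are invented for illustration only.
ctr = [
    (1, 10, 100.0),
    (2, 10, 20.0),
    (3, 10, 30.0),
    (4, 20, 50.0),
    (5, 20, 55.0),
    (6, 20, 200.0),
]


def correlated(rows):
    """Re-evaluate the store average for every outer row (the SubPlan style)."""
    visited = 0
    result = []
    for cust, store, total in rows:
        same_store = []
        for _, inner_store, inner_total in rows:  # one full scan per outer row
            visited += 1
            if inner_store == store:
                same_store.append(inner_total)
        if total > sum(same_store) / len(same_store) * 1.2:
            result.append(cust)
    return result, visited


def joined(rows):
    """Compute each store's average once, then look it up per outer row."""
    visited = 0
    sums = defaultdict(lambda: [0.0, 0])
    for _, store, total in rows:  # single aggregation pass
        visited += 1
        sums[store][0] += total
        sums[store][1] += 1
    avg = {store: s / n for store, (s, n) in sums.items()}
    result = []
    for cust, store, total in rows:  # single probe pass
        visited += 1
        if total > avg[store] * 1.2:
            result.append(cust)
    return result, visited


slow_result, slow_visits = correlated(ctr)
fast_result, fast_visits = joined(ctr)
print(slow_result, slow_visits)
print(fast_result, fast_visits)
```

Both functions select the same customers, but the correlated form visits N × N rows while the join form visits 2N. With millions of rows in the CTE, that gap is what separates a timeout from a query that finishes in seconds.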

16. Analysis of the benchmark results (5/5): execution result of the modified query

    Limit  (cost=1059793.62..1059793.87 rows=100 width=17)
      .....(omitted).....
      ->  Hash Join  (cost=3506.15..7201.47 rows=5174 width=8)
            Hash Cond: (ctr1.ctr_store_sk = customer_total_return.ctr_store_sk)
            Join Filter: (ctr1.ctr_total_return > (avg(customer_total_return.ctr_total_return)::numeric(7,2)) * 1.2)
            Rows Removed by Join Filter: 407779
            ->  CTE Scan on customer_total_return ctr1  (cost=0.00..2773.22 rows=138661 width=48) (actual time=10421.995..10936.477 rows=5435529 loops=1)
            ->  Hash  (cost=3505.87..3505.87 rows=22 width=30)
                  Buckets: 1024  Batches: 1  Memory Usage: 10kB
                  ->  Merge Join  (cost=3504.43..3505.87 rows=22 width=30)
                        Merge Cond: (store.s_store_sk = customer_total_return.ctr_store_sk)
                        ->  Sort  (cost=25.26..25.37 rows=45 width=8) (actual time=0.165..0.168 rows=45 loops=1)
                              Sort Key: store.s_store_sk
                              ->  Seq Scan on store  (cost=0.00..24.02 rows=45 width=8) (actual time=0.012..0.140 rows=45 loops=1)
                                    Filter: (s_state = 'TN'::bpchar)
                                    Rows Removed by Filter: 357
                        ->  Sort  (cost=3479.17..3479.67 rows=200 width=22) (actual time=5266.754..5266.765 rows=199 loops=1)
                              Sort Key: customer_total_return.ctr_store_sk
                              ->  HashAggregate  (cost=3466.53..3469.53 rows=200 width=40) (actual time=5266.589..5266.711 rows=202 loops=1)
                                    Group Key: customer_total_return.ctr_store_sk
                                    ->  CTE Scan on customer_total_return  (cost=0.00..2773.22 rows=138661 width=40) (actual time=0.001..3833.790 rows=5435529 loops=1)
      .....(omitted).....
    Planning time: 10.193 ms
    Execution time: 17775.038 ms

   Callout: this is the processing that used to sit inside the sub-query and was executed an extreme number of times.

17. Benchmark result ②: PostgreSQL v9.5β + modified TPC-DS
   ▌Execution time of Query01: ?? → 17.78 sec
   ▌SubLink rewrite applied to 7 queries: Query01, 06, 10, 30, 35, 81, 95
     - Queries that do not finish within the time limit: 11 → 4
   [Figure: bar chart of query response time (sec, 0 to 3500) per query; PostgreSQL v9.5β + TPC-DS with SubLink rewrite]

18. Post-mortem: why did sub-query execution become so inefficient?
   ▌In the TPC-DS cases, the sub-queries can be rewritten into JOINs mechanically.
   ▌PostgreSQL does have a mechanism for rewriting sub-queries into JOINs.
     - However, the patterns it can rewrite are limited.
   ▌A certain commercial DB with the No.1 market share, on the other hand ... (mumble, mumble)

    subquery_planner(....)
     -> pull_up_sublinks(....)
       -> pull_up_sublinks_jointree_recurse(....)
         -> pull_up_sublinks_qual_recurse(....)
           -> convert_ANY_sublink_to_join(....)
                :
              /*
               * The sub-select must not refer to any Vars of the parent
               * query. (Vars of higher levels should be okay, though.)
               */
              if (contain_vars_of_level((Node *) subselect, 1))
                  return NULL;
                :

   Callout: the pull-up gives up if the sub-query references values from outside itself.

19. Further analysis (1/4): Query16 as an example

    select count(distinct cs_order_number) as "order count"
          ,sum(cs_ext_ship_cost) as "total shipping cost"
          ,sum(cs_net_profit) as "total net profit"
      from catalog_sales cs1
          ,date_dim
          ,customer_address
          ,call_center
     where d_date between '1999-2-01' and (cast('1999-2-01' as date) + '60 days'::interval)
       and cs1.cs_ship_date_sk = d_date_sk
       and cs1.cs_ship_addr_sk = ca_address_sk
       and ca_state = 'IL'
       and cs1.cs_call_center_sk = cc_call_center_sk
       and cc_county in ('Williamson County','Williamson County',
                         'Williamson County','Williamson County',
                         'Williamson County')
       and exists (select *
                     from catalog_sales cs2
                    where cs1.cs_order_number = cs2.cs_order_number
                      and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
       and not exists (select *
                         from catalog_returns cr1
                        where cs1.cs_order_number = cr1.cr_order_number)
     order by count(distinct cs_order_number)
     limit 100;

20. Further analysis (2/4): execution plan of Query16 (run at SF=1...)
   ▌Run-time statistics collected after shrinking the problem size to 1/100.
   ▌The cost of a NestLoop is O(N_outer × N_inner), so an estimation error easily leads to an explosion in the amount of computation.
   ▌The planner expected to read 144 thousand rows, but this ballooned to 1,462 × 66,560 ≒ 97 million rows (!)

    ....(omitted)....
    ->  Nested Loop Anti Join  (cost=83650.17..162632.42 rows=1 width=20) (actual time=975.907..25308.315 rows=495 loops=1)
          Join Filter: (cs1.cs_order_number = cr1.cr_order_number)
          Rows Removed by Join Filter: 97309298
          ->  Nested Loop  (cost=83650.17..155743.64 rows=1 width=20) (actual time=926.407..3219.101 rows=1462 loops=1)
                ->  Nested Loop  (cost=83649.88..155741.95 rows=4 width=28) (actual time=920.546..3133.191 rows=49929 loops=1)
                      Join Filter: (cs1.cs_call_center_sk = call_center.cc_call_center_sk)
                      Rows Removed by Join Filter: 250035
                        :
                      ....(omitted; everything below this takes only 3.13 sec in total)....
                        :
          ->  Seq Scan on catalog_returns cr1  (cost=0.00..5599.67 rows=144067 width=8) (actual time=0.001..6.348 rows=66560 loops=1462)
    Planning time: 1.644 ms
    Execution time: 25310.373 ms

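21. Further analysis (3/4): KaiGai's gripe
   The planner does not take the risk (variance) of its row estimates into account when building an execution plan.
   ▌Estimates and estimation error
     - It is unavoidable that estimates differ from measured values.
     - The impact of an estimation error depends on the type of processing.
       • HashJoin  : O(ΔN + ΔM)
       • MergeJoin : O(ΔN log(ΔN) + ΔM log(ΔM))
       • NestLoop  : O(ΔN × ΔM)
   For lack of a better option, disable Nested Loop, which is an O(NM) operation.

    SET enable_nestloop = off;

   Callout: a gap between estimated and actual values is unavoidable, although the degree of variance and its impact depend on the operation...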
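A rough back-of-the-envelope model shows why the same row-count misestimate hurts a Nested Loop far more than a Hash Join. The row counts below are the ones reported in the Query16 plans at SF=1; the two cost functions are deliberately simplified assumptions (they count rows touched and ignore per-row cost, caching, and early exit), not the planner's actual cost model.

```python
def nested_loop_rows(n_outer, n_inner):
    """Rows compared by a naive nested loop: every outer row rescans the inner side."""
    return n_outer * n_inner


def hash_join_rows(n_outer, n_inner):
    """Rows touched by a hash join: build once over the inner side, probe once per outer row."""
    return n_outer + n_inner


# Figures taken from the Query16 plans at SF=1.
estimated_outer = 1      # planner's estimate for the outer side
actual_outer = 1462      # rows the outer side really produced
inner_estimate = 144067  # rows in catalog_returns (estimate, matched by the hash build)
inner_per_loop = 66560   # average rows actually read per inner rescan

planned = nested_loop_rows(estimated_outer, inner_estimate)
actual_nl = nested_loop_rows(actual_outer, inner_per_loop)
actual_hj = hash_join_rows(actual_outer, inner_estimate)

print(f"nested loop as planned : {planned:>12,}")
print(f"nested loop in reality : {actual_nl:>12,}")
print(f"hash join in reality   : {actual_hj:>12,}")
```

Under this simplified model the Nested Loop work grows by the full factor of the outer-side error, while the Hash Join work grows only by the additional outer rows.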

22. Further analysis (4/4): execution plan of Query16 (SF=1, NestLoop disabled)
   ▌Result of using HashJoin instead of NestLoop
     - The estimation error for N_outer is unchanged: 1 → 1462
     - Even so, the number of rows to be processed does not blow up
   ▌As a result, the query speeds up from 25.3 sec → 1.60 sec

    SET enable_nestloop = off;
    ....(omitted)....
    ->  Hash Anti Join  (cost=95880.57..165398.59 rows=1 width=20) (actual time=804.945..1600.695 rows=495 loops=1)
          Hash Cond: (cs1.cs_order_number = cr1.cr_order_number)
          ->  Hash Join  (cost=88480.07..157998.06 rows=1 width=20) (actual time=746.820..1542.702 rows=1462 loops=1)
                Hash Cond: (cs1.cs_call_center_sk = call_center.cc_call_center_sk)
                  :
                ....(omitted)....
                  :
          ->  Hash  (cost=5599.67..5599.67 rows=144067 width=8) (actual time=56.534..56.534 rows=144067 loops=1)
                Buckets: 262144  Batches: 1  Memory Usage: 7676kB
                ->  Seq Scan on catalog_returns cr1  (cost=0.00..5599.67 rows=144067 width=8) (actual time=0.006..35.087 rows=144067 loops=1)
    Planning time: 1.198 ms
    Execution time: 1601.838 ms

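23. Benchmark result ③: NestedLoop disabled + modified TPC-DS
   ▌With HashJoin forced, no query takes more than 1 hour to respond any more.
   ▌Downsides:
     - JOINs between huge tables consume an enormous amount of memory.
     - Conditions such as X IN (SELECT ...) or EXISTS (SELECT ...) are slow to start up, since the hash table must be built before the first row is returned.
   [Figure: bar chart of query response time (sec, 0 to 3500) per query; NestedLoop disabled, TPC-DS with SubLink rewrite]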

24. Benchmark result ③': breakdown of processing time
   ▌Scan (34.4%), Join (36.9%), and Aggregate (23.3%) account for about 95% of the total processing time.
     - Note, however, that Scan here is an on-memory data transfer.
   ▌What stands out among the others (5.3%)...
     - Sort, SetOp, Window functions
   [Figure: stacked bar chart of query response time (sec, 0 to 1400) per query; NestedLoop disabled, SubLink rewrite; breakdown into Scan, Join, Aggregate, Others. Annotations on individual bars: Append + SetOp, Sort (48M rows), Window function]

25. Findings from the TPC-DS results
   Premise: TPC-DS reflects typical real-world BI workloads
   ▌Hidden boss: Planner
   ▌Final boss: Scan, Join, Aggregation
   ▌Mid-boss: Sort, SetOp, Window functions
   Callout: working hard on these areas should noticeably improve perceived performance (hopefully).

26. Agenda
   1. What is the TPC-DS benchmark?
   2. Benchmark results and analysis
   3. Approaches to improvement
      • Sustaining Innovations
      • Upper Planner Path-Ification
      • Parallelism: scale-up
      • Parallelism: scale-out
      • Distributed Aggregation
   4. The future beyond that

27. v9.5 Sustaining Innovations (1/2): HashJoin
   • Improve in-memory hash performance
   ▌SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [, ...] GROUP BY cat;
     - t0: 100M rows, t1~t10: 100K rows each; all the data was preloaded.
     - CPU: Xeon E5-2670v3, RAM: 384GB, Red Hat Enterprise Linux 7.0
   [Figure: stacked bar chart of query response time (sec, 0 to 1,200), PostgreSQL 9.4 vs 9.5β, for 2 to 9 tables involved in JOINs; breakdown into Scan, Join, Aggregate, Others]

28. v9.5 Sustaining Innovations (2/2): SortSupport
   • Improve the speed of sorting VARCHAR, TEXT, and NUMERIC fields
   • Extend the infrastructure that allows sorting to be performed by inlined...
   ▌SELECT * FROM tbl_[text|char] ORDER BY val;
     - tbl_[text|char]: contains 50M rows of random MD5 strings
     - CPU: Xeon E5-2670v3, RAM: 384GB, Red Hat Enterprise Linux 7.0
   [Figure: bar chart of query response time (sec, 0 to 1,000), PostgreSQL 9.4 vs PostgreSQL 9.5, for Text C, Text JP, Char(n) C, Char(n) JP. Data labels in extraction order: 212.13, 783.89, 290.37, 835.69, 101.96, 126.37, 288.66, 865.81]

29. Upper Planner Path-Ification (1/2): how plans are generated today
   ▌First, try the combinations of Scan + Join
   ▌Then place Aggregate etc. on top of the execution path with the lowest estimated cost
     - Cases where it would be better to insert an Aggregate partway through cannot be handled well.
   [Figure: three candidate join trees over tables A, B, C (Hash Join + Nest Loop, Hash Join + Merge Join, Hash Join + Hash Join); an Aggregate node is then placed on top of a Hash Join + Merge Join tree]

30. Upper Planner Path-Ification (2/2): how it should be
   A capability to consider paths that include Aggregate etc., in the same way as Scan and Join
   ▌What becomes possible?
     - Insert a partial aggregation before a JOIN to reduce the number of rows to be joined
     - Run partial aggregation on the worker-process side to reduce the data volume
     - ...and so on
   ▌The key point is that each candidate path can be compared on a cost basis
   [Figure: Hash Join + Merge Join + Aggregate over tables A, B, C  VS  Hash Join + PartialAggregate + Merge Join + FinalAggregate over tables A, B, C]

31. Other planner improvements
   ▌Large-scale planner improvements are waiting for the Path-Ification feature to be merged
   ▌Intelligent pull-up of SubLinks
   ▌Plan selection that takes the variance of row estimates into account
     - At present these have not turned into concrete efforts
     - Participants on pgsql-hackers wanted!

32. Parallel processing

33. Parallel processing and trends in the developer community
   [Figure: Scale-Out, Scale-Up, Homogeneous Scale-Up + Heterogeneous Scale-Up]

34. Why does parallel processing speed up Scan?
   [Figure: Storage → shared buffer → five Parallel SeqScan nodes running in background worker processes → Gather in the backend process]

35. Why does parallel processing speed up Join?
   [Figure: Storage → shared buffer → four workers, each running a Nested Loop with a Parallel SeqScan as OUTER and an Index Scan as INNER → Gather]

36. GpuHashJoin: parallelism at a finer granularity
   [Figure: Built-in Hash-Join vs GpuHashJoin of PG-Strom. Built-in: inner relation → hash table; outer relation → sequential hash table search by CPU → sequential projection by CPU → next step. GpuHashJoin: inner relation → hash table; outer relation → parallel hash-table search by GPU → parallel projection by GPU → next step; all the CPU does is reference the result of the relation join.]

37. Why does CPU+GPU parallel processing speed up Join?
   [Figure: Storage → shared buffer → four workers, each running a GpuHashJoin with a Parallel SeqScan as OUTER and a Seq Scan as INNER → Gather]

38. Scale-out (1/2): a distributed DB based on FDW
   [Figure: Gather → ParallelAppend → four Foreign Scan nodes. Labels: partition definitions that include foreign tables; Gather node; launching parallel workers; remote JOIN by FDW; swapping the order of Append and Join. Legend: v9.5 features, v9.6 features]

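39. Scale-out (2/2): Target List Push Down
   ▌For queries with complex projections, performance can probably be improved by sparing the local CPU and using external computing resources instead.
     - External computing resources: CPU (other processes, remote servers), GPU, FPGA(?), etc.
     - This is the next step once Upper Planner Path-Ification has improved the planner infrastructure

    SELECT sqrt((x0-y0)^2 + ... + (x99-y99)^2) dist
      FROM items_x, items_y
     WHERE x_id != y_id;

   [Figure: two plans compared. One: Projection over NestLoop over Scan on items_x and Scan on items_y; the values x0~x99 and y0~y99 are transferred to the local side and the final result is computed there. The other: ForeignScan (scanrelid=0) produces the final result; only the computed result is transferred to the local side.]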

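40. Distributed Aggregation (1/4): Map-Reduce
   ▌Definition of the average

    \[
      \frac{1}{N}\sum_{i=1}^{N} x_i
        = \frac{1}{N}\left( \sum_{i=1}^{k_1} x_i + \cdots + \sum_{i=k_{j-1}+1}^{N} x_i \right),
      \qquad 1 \le k_1 < \cdots < k_{j-1} < N
    \]

     - Each Σ term can be split into several parts and computed independently; the final result is the same.
     - In fact, the floating-point rounding error also becomes smaller.
   ▌Partial Aggregate + Final Aggregate
   [Figure: left, Scan (1 million rows {X}) → Aggregate computing count(*), avg(X). Right, Scan (1 million rows {X}) → Partial Aggregate emitting nrows = count(*) and sum(X) (100 rows) → Final Aggregate computing sum(nrows), avg_ex(nrows, sum(X)).]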
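The Partial Aggregate + Final Aggregate split can be illustrated with a few lines of Python. This is a minimal sketch, not PostgreSQL's implementation: `partial_aggregate` and `final_aggregate` are hypothetical names, and the chunking of the input into four pieces is an arbitrary assumption standing in for work handed to separate workers.

```python
def partial_aggregate(chunk):
    """Per-worker step: reduce a chunk of rows to the state (nrows, sum(X))."""
    return len(chunk), sum(chunk)


def final_aggregate(partials):
    """Combine the partial states into count(*) and avg(X)."""
    nrows = sum(n for n, _ in partials)
    total = sum(s for _, s in partials)
    return nrows, total / nrows


values = list(range(1, 101))  # X = 1..100
# Split into uneven chunks (30, 30, 30, 10 rows), as separate workers might see them.
chunks = [values[i:i + 30] for i in range(0, len(values), 30)]

partials = [partial_aggregate(c) for c in chunks]
count, average = final_aggregate(partials)

print(partials)
print(count, average)
```

Only the small (nrows, sum) pairs cross the boundary between the partial and final steps, which is why the same idea applies to CPU workers, GPUs, and remote nodes alike.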

41. Distributed Aggregation (2/4): CPU parallelism
   [Figure: Storage → shared buffer → five Parallel SeqScan nodes, each followed by a Partial Aggregate → Gather → Final Aggregate]
   Callout: because aggregation reduces the number of rows, the load on Gather is reduced.

42. Distributed Aggregation (3/4): implementation with GPU parallelism
   [Figure: shared buffer → Scan → GpuPreAgg → Aggregate. Input data is processed by the GPU cores of Block-0, Block-1, Block-2. 1st Step: partial aggregation in L1 cache. 2nd Step: partial aggregation on DRAM. The reduced output data lightens the CPU-side load.]

43. Distributed Aggregation (4/4): CPU+GPU hybrid implementation
   [Figure: Storage → shared buffer → five Parallel SeqScan nodes, each with a GpuPreAgg and a Partial Aggregate → Gather → Final Aggregate]

44. Agenda
   1. What is the TPC-DS benchmark?
   2. Benchmark results and analysis
   3. Approaches to improvement
   4. The future beyond that

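45. Recap) Findings from the TPC-DS results
   Premise: TPC-DS reflects typical real-world BI workloads
   ▌Hidden boss: Planner
   ▌Final boss: Scan, Join, Aggregation
   ▌Mid-boss: Sort, SetOp, Window functions
   1st Step: Upper Planner Path-Ification
     - With this as the infrastructure, make it possible to generate more advanced execution plans. The basic direction is for an intelligent planner to distribute the work and map it onto parallel processing at various granularities.
   In the PostgreSQL developer community, all of the following are under development:
     • CPU parallelism
     • GPU parallelism
     • Multi-node parallelism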

46. What happens when we combine them all?
   [Figure: three nodes, each with its own storage and shared buffer; multi-node parallelism, CPU parallelism, GPU parallelism]

47. Aside) What happens if we combine them all?

48. Aside) What happens if we combine them all?

49. Distributed Transaction Manager (DTM)
   ▌eXtensible Transaction Manager
     - To guarantee MVCC consistency across nodes, a mechanism that arbitrates transaction state across multiple nodes is required. (Eventually locks as well.)
     - A framework that makes transaction management extensible has been proposed.
   Source: Distributed Transaction Manager, Oleg Bartunov, Alexander Korotkov, PGconf China 2015 (Beijing)

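50. Aggregation Before Join
   ▌A variant of Distributed Aggregation
     - Under certain conditions, a partial aggregate can be built before the Join.

   Tables (original names in parentheses):
     - sales_detail (売上明細): slip_id, item_id, quantity, ...; 1 billion records
     - item_master (品目マスタ): item_id, category_id, unit_price, ...; 10 thousand records

    SELECT category_id, SUM(quantity)*unit_price
      FROM sales_detail f, item_master m
     WHERE f.item_id = m.item_id
     GROUP BY category_id, unit_price

   [Figure: two plans compared.
    Plan A: Scan sales_detail (1 billion rows) and Scan item_master (10 thousand rows) → HashJoin (1 billion rows) → Aggregate (100 rows).
    Plan B: Scan sales_detail (1 billion rows) → Partial Aggregate BY item_id, output: item_id, SUM(quantity) (10 thousand rows) → HashJoin with Scan item_master (10 thousand rows), producing 10 thousand rows → Final Aggregate (100 rows).]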
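To see why aggregating before the join gives the same answer with fewer joined rows, here is a small Python sketch. The tables, their English names (`sales_detail`, `item_master`), and all values are invented for illustration; the two functions are hypothetical and only mimic the shape of the two plans, not any PostgreSQL executor code.

```python
from collections import defaultdict

# Hypothetical toy tables; names are English stand-ins for the slide's tables.
item_master = {  # item_id -> (category_id, unit_price)
    1: ("A", 100),
    2: ("A", 100),
    3: ("B", 250),
}
sales_detail = [  # (slip_id, item_id, quantity)
    (1, 1, 2), (2, 1, 3), (3, 2, 1), (4, 3, 4), (5, 3, 1), (6, 2, 5),
]


def join_then_aggregate(sales, items):
    """Baseline plan: join every sales row, then aggregate."""
    groups = defaultdict(int)
    joined_rows = 0
    for _, item_id, qty in sales:
        category, price = items[item_id]  # hash-join probe per sales row
        joined_rows += 1
        groups[(category, price)] += qty
    return {key: qty * key[1] for key, qty in groups.items()}, joined_rows


def aggregate_then_join(sales, items):
    """Alternative plan: partial aggregate by item_id, join, then final aggregate."""
    partial = defaultdict(int)
    for _, item_id, qty in sales:  # partial aggregation BY item_id
        partial[item_id] += qty
    groups = defaultdict(int)
    joined_rows = 0
    for item_id, qty in partial.items():  # join only one row per item
        category, price = items[item_id]
        joined_rows += 1
        groups[(category, price)] += qty
    return {key: qty * key[1] for key, qty in groups.items()}, joined_rows


baseline, baseline_joined = join_then_aggregate(sales_detail, item_master)
early, early_joined = aggregate_then_join(sales_detail, item_master)
print(baseline, baseline_joined)
print(early, early_joined)
```

The rewrite is valid here because SUM can be computed from per-item subtotals and the join key (item_id) determines the grouping columns, which is one instance of the "certain conditions" the slide mentions.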

51. Columnar storage
   ▌Characteristics
     - Only the required columns are read, so the I/O volume is smaller and compression works well.
     - Very poor at update-heavy (OLTP) workloads
   ▌FDW-based implementation
     - cstore_fdw: an FDW module implemented by CitusData
     - v1.3 was released in Jul-2015, but it has restrictions, for example on INSERT/UPDATE/DELETE.
   ▌Native columnar storage
     - Being proposed by Alvaro Herrera and Tomáš Vondra (2ndQuadrant)
     - How should the MVCC visibility check be handled? (Is using the visibility map the only option?)
   [Figure: system columns (xmin, xmax) control row visibility; columns A, B, C, D; search condition f(A); Visibility Check]

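52. PCI-E SSD-to-GPU Direct
   ▌Benefits expected from SSD-to-GPU Direct
     - Rows that do not match the WHERE conditions can be removed in advance.
     - Columns not referenced by the query can be removed in advance.
     - Data is already joined and aggregated by the time it is loaded into CPU/RAM.
   [Figure: CPU + RAM, GPU, SSD; GPU instruction sequence generated from SQL. Conventional data flow: SSD → CPU/RAM → GPU. New data flow: SSD → GPU → CPU/RAM.]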

53. Features improvement step by step...

