Making Large-Scale Data Processing Easy with No-Ops - A Next-Generation Data Processing Platform with BigQuery and Cloud Dataflow

1. Kiyoshi Fukuda, Customer Engineer, Google Cloud Platform, Google Cloud division. Making Large-Scale Data Processing Easy with No-Ops: A Next-Generation Data Processing Platform with BigQuery and Cloud Dataflow

2. Data + No-Ops

3. Data makes software great. Apps (and companies) win or lose based on how they use it.

4. Better software, faster.

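5. Confidential & Proprietary Big Data architecture in the cloud era: focus on deriving insight from data, not on infrastructure. Typical data warehouse: analysis, creating clusters, managing clusters, upgrading clusters, defining indexes, setting up software, setting up a VPC, managing scale. Cloud era: analysis; spend more time on analysis.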

6. Google in 1 minute: 1000 new devices, 3M Searches, 100 Hours, 1B Activated Devices, 100M GB Search Content

7. The path of innovation since MapReduce: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel

8. More than 15 years of working on data problems. [Timeline, 2002-2015, with rows for Google Research, Open Source, and Google Cloud Products. Shown: GFS, MapReduce]

9. More than 15 years of working on data problems. [Timeline, 2002-2015, with rows for Google Research, Open Source, and Google Cloud Products. Shown: BigQuery, Pub/Sub, Dataflow, Bigtable, ML, GFS, MapReduce, BigTable, Dremel, FlumeJava, Millwheel, TensorFlow, Apache Beam, PubSub]

10. The big data lifecycle: collect, process, store, analyze

11. Mapping the products onto the lifecycle (collect, process, store, analyze, visualize): BigQuery (SQL), Cloud Dataflow (stream and batch), Cloud Storage (objects), Cloud Datastore (NoSQL), BigQuery Storage (structured), Cloud Dataproc (Hadoop & ecosystem), Cloud Bigtable (NoSQL, HBase), Cassandra, HBase, MongoDB, RabbitMQ, Kafka, Cloud Datalab (iPython/Jupyter), Tableau, Pub/Sub, Stackdriver Logging, BQ Streaming, App Engine, Cloud SQL (SQL), Cloud Machine Learning. Other labels: Cloud 2.0, Cloud 3.0.

12. Reference architecture: collecting data. Cloud Pub/Sub: reliable, many-to-many asynchronous messaging. Cloud Storage: simple, cost-effective object storage. Inputs: raw logs, files, output from external systems, etc.; events, metrics, etc.

13. Reference architecture: processing and transformation. Cloud Dataflow: a data processing engine that handles both batch and stream (Stream, Batch). Inputs: raw logs, files, output from external systems, etc.; events, metrics, etc.

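14. Reference architecture: processing and transformation. Cloud Dataproc: a managed service for Spark / Hadoop (Batch). Cloud Dataflow: a data processing engine that handles both batch and stream (Stream, Batch). Inputs: raw logs, files, output from external systems, etc.; events, metrics, etc.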

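15. Reference architecture: analysis and storage. BigQuery: a high-performance query engine for large datasets. Bigtable: a high-performance NoSQL data service for large-scale data. Diagram labels: Stream, Batch, Batch. Inputs: raw logs, files, output from external systems, etc.; events, metrics, etc.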

16. Reference architecture: learning and recommendation. Cloud Machine Learning: a scalable managed platform for machine learning. Diagram labels: Stream, Batch, Batch. Inputs: raw logs, files, output from external systems, etc.; events, metrics, etc.

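17. Reference architecture: learning and recommendation. Diagram labels: Stream, Batch, Batch; external applications; Cloud Datalab; visualization and BI; data sharing (A, B, C). Inputs: raw logs, files, output from external systems, etc.; events, metrics, etc.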

18. A serverless big data stack that scales automatically. Diagram labels: Stream, Batch, Batch; Applications and Reports; Cloud Datalab; Visualization and BI; Co-workers (A, B, C). Inputs: events, metrics, etc.; raw logs, files, assets, Google Analytics data, etc.

19. Cloud Dataflow: Batch/Streaming Processing. BigQuery: Large Scale Analytics.

20. BigQuery Your Enterprise Data Warehouse in the cloud

21. What is BigQuery? A fully managed, No-Ops data warehouse: durable and highly available, with the convenience of SQL, and fast at petabyte scale.

22. Inside BigQuery: SQL queries, a petabit network, BigQuery storage and compute, streaming ingest, fast batch load.

23. Demo?

24. Five years of continuous improvement. [Chart: Code Submits (axis 0 to 1,200) over 2010-2016, annotated with milestones: beta release at Google I/O, public launch, large query results, Dremel X, Big JOIN support, dynamic execution, Capacitor, faster shuffle, 100k qps streaming, user-defined functions]

25. Unstructured data accounts for 90% of enterprise data* *Source: IDC

26. Dataflow: the new default for stream processing

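27. What is Cloud Dataflow? The problem with batch processing: data is generated continuously (unbounded data), so why should we have to wait before processing it? Dataflow (Apache Beam) is the new default for stream processing, and it treats batch processing as a subset of stream processing.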
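The idea on slide 27 that batch is a special case of stream processing can be illustrated with a small sketch in plain Java. It does not use the Dataflow or Beam SDK; the class `WindowedCount`, the `Event` type, and the sample timestamps are all hypothetical. One function counts events per fixed event-time window as they arrive, and a "batch" run is the same function applied to a bounded input with a single window that covers everything. A real streaming engine also has to decide when to emit results for unbounded input, which this sketch does not model.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowedCount {
    // Hypothetical event type: a name plus a non-negative event time in milliseconds.
    public static final class Event {
        public final String name;
        public final long timestampMillis;

        public Event(String name, long timestampMillis) {
            this.name = name;
            this.timestampMillis = timestampMillis;
        }
    }

    // Consumes events one at a time and counts them per fixed window.
    // The map key is the start of the window the event falls into.
    public static Map<Long, Long> countPerWindow(Iterator<Event> events, long windowMillis) {
        Map<Long, Long> counts = new TreeMap<>();
        while (events.hasNext()) {
            Event e = events.next();
            long windowStart = (e.timestampMillis / windowMillis) * windowMillis;
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("click", 0L),
            new Event("click", 500L),
            new Event("click", 1500L),
            new Event("click", 2100L));

        // Stream-style: one-second windows.
        System.out.println(countPerWindow(events.iterator(), 1000L)); // {0=2, 1000=1, 2000=1}

        // Batch-style: the same code with one window spanning the whole bounded input.
        System.out.println(countPerWindow(events.iterator(), Long.MAX_VALUE)); // {0=4}
    }
}
```

The only difference between the two calls is the window size, which is the sense in which the batch result is a degenerate case of the windowed one.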

28. Dataflow is the new default. [Timeline, 2002-2016: MapReduce, GFS, BigTable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow]

29. The Dataflow model and Cloud Dataflow. Dataflow Model & SDKs (Apache Beam): a unified programming model for batch and stream processing. Google Cloud Dataflow: a no-ops, fully managed service.

30. Example: hashtag autocomplete. Input string -> suggested list:
#ar -> #argentina, #arugularocks, #argylesocks
#arg -> #argentina, #argylesocks, #argonauts
#arge -> #argentina, #argentum, #argentine

31. Pipeline: Tweets -> Read -> ExtractTags -> Count -> ExpandPrefixes -> Top(3) -> Write -> Predictions. Data after each stage:
Read: {#argentina scores!, watching #armenia vs #argentina, my #art project, …}
ExtractTags: {argentina, armenia, argentina, art, ...}
Count: {argentina->5M, armenia->2M, art->90M, ...}
ExpandPrefixes: {a->(argentina, 5M), a->(armenia, 2M), …, ar->(argentina, 5M), ar->(armenia, 2M), ...}
Top(3): {a->[apple, art, argentina], ar->[art, argentina, armenia],...}

32. Pipeline: Tweets -> Read -> ExtractTags -> Count -> ExpandPrefixes -> Top(3) -> Write -> Predictions

    Pipeline p = Pipeline.create(new PipelineOptions());
    p.apply(TextIO.Read.from("gs://…"))        // Read the tweets
     .apply(ParDo.of(new ExtractTags()))       // tweet -> hashtags
     .apply(Count.perElement())                // hashtag -> count
     .apply(ParDo.of(new ExpandPrefixes()))    // (hashtag, count) -> one output per prefix
     .apply(Top.largestPerKey(3))              // keep the top 3 per prefix
     .apply(TextIO.Write.to("gs://…"));        // Write the predictions
    p.run();

    class ExpandPrefixes … {
      public void processElement(ProcessContext c) {
        String word = c.element().getKey();
        // Emit the element once for every prefix of its hashtag.
        for (int i = 1; i <= word.length(); i++) {
          String prefix = word.substring(0, i);
          c.output(KV.of(prefix, c.element()));
        }
      }
    }
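To make the stages on slides 31 and 32 concrete, here is a minimal in-memory sketch of the same computation using only the Java standard library. It is not the Dataflow SDK and runs on a single machine; the class name `HashtagAutocomplete`, its method names, the hashtag regular expression, and the three sample tweets are assumptions made for illustration. Ties between equally frequent tags are broken alphabetically, which the slides do not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagAutocomplete {
    // Assumed hashtag syntax: '#' followed by word characters.
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    // ExtractTags: return every hashtag in a tweet, lower-cased, without the '#'.
    public static List<String> extractTags(String tweet) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(tweet);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase());
        }
        return tags;
    }

    // Count: map each hashtag to the number of times it occurs.
    public static Map<String, Long> countTags(List<String> tweets) {
        Map<String, Long> counts = new HashMap<>();
        for (String tweet : tweets) {
            for (String tag : extractTags(tweet)) {
                counts.merge(tag, 1L, Long::sum);
            }
        }
        return counts;
    }

    // ExpandPrefixes + Top(n): for every prefix, keep the n most frequent hashtags.
    public static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String tag : counts.keySet()) {
            for (int i = 1; i <= tag.length(); i++) {
                byPrefix.computeIfAbsent(tag.substring(0, i), k -> new ArrayList<>()).add(tag);
            }
        }
        for (List<String> candidates : byPrefix.values()) {
            // Highest count first; alphabetical order breaks ties.
            candidates.sort((a, b) -> {
                int byCount = Long.compare(counts.get(b), counts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            });
            if (candidates.size() > n) {
                candidates.subList(n, candidates.size()).clear();
            }
        }
        return byPrefix;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "#argentina scores!",
            "watching #armenia vs #argentina",
            "my #art project");
        Map<String, List<String>> suggestions = topPerPrefix(countTags(tweets), 3);
        System.out.println(suggestions.get("ar")); // [argentina, armenia, art]
    }
}
```

The pipeline on the slide expresses the same steps as transforms so that a runner can distribute them; the sketch only shows what each step computes.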

33. Spotify - Big Wins with Google’s Data Product youtu.be/LTVFg6YOjWo

34. ETL using Cloud Dataflow. [Architecture diagram: Clients, Gateway, Data Center, Syslog, File Tailer, Event Delivery Service, Cloud Pub/Sub, Dataflow, Cloud Storage, Hadoop, BigQuery]

35. Where to start?

36. Start simple

37. Start simple. Batch input: raw logs, files, output from external systems, etc. Cloud Datalab; visualization and BI.

38. Start simple: add ETL processing. Batch input: raw logs, files, output from external systems, etc. Cloud Datalab; visualization and BI.

39. Start simple: add stream support. Diagram labels: Stream, Batch. Inputs: raw logs, files, output from external systems, etc.; events, metrics, etc. Cloud Datalab; visualization and BI.

40. The results?

41. cloud.google.com

42. References
● Google Cloud Platform documentation
○ https://cloud.google.com/
● BigQuery
○ https://cloud.google.com/bigquery/
● Dataflow
○ https://cloud.google.com/dataflow/
● Google Big Data Blog
○ https://cloud.google.com/blog/big-data/
● Architecture: Optimized Large-Scale Analytics Ingestion
○ https://cloud.google.com/solutions/architecture/optimized-large-scale-analytics-ingestion

43. Thank You! fukudak@google.com

Kiyoshi Fukuda