Workflow Hacks #1 - dots. Tokyo

Published in: Engineering

  1. Workflow Hacks! #1
      Taro L. Saito, leo@treasure-data.com
      Dec. 14, 2015, dots. Tokyo, Japan
  2. Workflow Hacks! #1
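  3. Survey
      • After the event, we will send you a survey by email
      • Questions
        • What systems are you currently using?
        • What problems do you want to solve with workflows?
      • Respondents will be entered in a drawing to win a Treasure Data hoodie!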
  4. About Me: Taro L. Saito
      • 2007: Ph.D., University of Tokyo. XML DBMS, transaction processing. Relational-Style XML Query [SIGMOD 2008]
      • Until 2014: Assistant Professor at University of Tokyo. Genome science research: big data processing, distributed computing
      • Since Mar. 2014: Treasure Data, Inc., Tokyo
      • Since Jul. 2015: Treasure Data, Inc., Mountain View, CA
  5. Cloud Platform for Data Analytics
      • Importing over 1,000,000 records/sec.
      • Presto (distributed SQL engine)
        • Over 50,000 queries/day
        • Processing 10 trillion records/day
      • http://qiita.com/xerial/items/a9093b60062f2c613fda
      • [Diagram labels: Import, Export, Store, Analyze with Presto/Hive (Distributed SQL Engine), Enterprise Data, BI]
  6. Workflow Fundamental Features
      • Dependency management
        • task1 -> task2 -> task3 …
      • Scheduling
      • Execution monitoring
      • State management
      • Error handling
      • Easy access to logs
      • Notification
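The dependency-management feature on this slide (task1 -> task2 -> task3) can be illustrated with a minimal sketch. It assumes Python 3.9+ for the standard-library `graphlib` module; the task names and the `run_in_order` helper are made up for illustration and do not come from any workflow tool named in the talk.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "task1": set(),
    "task2": {"task1"},
    "task3": {"task2"},
}

def run_in_order(deps, run):
    """Run every task after all of its dependencies; return the order used."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        run(name)            # a real manager would also record state and handle errors here
        order.append(name)
    return order

executed = run_in_order(dependencies, lambda name: None)
```

A real workflow manager layers scheduling, state tracking, retries, and notification on top of this ordering step.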
  7. Workflow Tools
      • Workflow management tools
        • Python: Luigi, Airflow, pinball
        • For Hadoop: Oozie (XML)
        • Script-based: Makefile, Azkaban
        • Biological science: Galaxy (Web UI), nextflow
        • Domestic (Japan): JP1, Hinemos
      • Dataflow DSL
        • Spark, Flink, DryadLINQ, TensorFlow
        • Cascading (Java -> MR), Scalding (Scala -> MR)
  8. Dataflow DSL
      • Translate a data processing program into a cluster computing program
      • [Diagram: a logical dataflow over A, B, C with functions f and g, and its partitioned form (A0 to A2, B0 to B2) with map and reduce stages]
  9. Redbook: Dataflow Engines
      • Chapter 5: Large-Scale Dataflow Engines, by Peter Bailis
        • http://www.redbook.io/ch5-dataflow.html
      • DryadLINQ
        • Most influential interface for dataflow DSLs
        • SQL-like operations
        • Functional style
      • Spark
        • SparkSQL
          • 70% of Spark accesses
        • Dataset API
          • Shift to the DataFrame-based API
  10. Dataflow -> Execution Plan
      • Example: Hive, SQL to MapReduce
        • Mapping SQL stages into a MapReduce program
        • SELECT page, count(*) FROM weblog GROUP BY page
      • [Diagram: HDFS -> split -> map (A0 to A2) -> reduce (B0 to B3) -> merge -> HDFS; stages: TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result]
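How the GROUP BY query on this slide maps onto map and reduce stages can be sketched in plain Python. This is an illustrative single-process model, not Hive's actual plan or code; the `weblog` records and the function names are made up.

```python
from collections import defaultdict

# Made-up input standing in for the weblog table.
weblog = [{"page": "/index"}, {"page": "/about"}, {"page": "/index"}]

def map_phase(records):
    """TableScan + map: emit (page, 1) for every row."""
    for record in records:
        yield record["page"], 1

def shuffle(pairs):
    """GroupBy(hash(page)): collect the values of each key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """count(*): sum the ones collected for each page."""
    return {page: sum(ones) for page, ones in groups.items()}

result = reduce_phase(shuffle(map_phase(weblog)))
```

In a cluster, the shuffle step is what moves rows with the same page to the same reducer.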
  11. Workflows
      • [Diagram: a DAG over nodes A to G, with edges labeled f and g]
  12. Hadoop is not enough
      • C. Olston et al. [SIGMOD 2011]
        • Continuous processing
        • Independent scheduling
      • Incremental processing
        • Google Percolator [OSDI 2010]
        • Naiad: differential dataflow, Microsoft [SOSP 2013]
  13. Continuous Processing
      • The Dataflow Model
        • Akidau et al., Google [VLDB 2015]
      • Unbounded data processing
        • Late-arriving data
      • Integration of
        • batch processing
        • accumulation
  14. Cluster Computing with Dryad, M. Budiu, 2008
  15. Cluster Computing with Dryad, M. Budiu, 2008
      • Workflow Hacks!
  16. Airflow
  17. Airflow
      • Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015)
        • At Silicon Valley Data Engineering Meetup
        • https://youtu.be/dgaoqOZlvEA
  18. Workflow Development
      • Programmatic
        • Generate workflows by code
        • Configuration as code
      • Workflow reuse/overwrite
        • Object-oriented
      • Parameterization
  19. Luigi
      • Workflow management with Luigi (article in Japanese)
        • http://qiita.com/k24d/items/fb9bed08423e6249d376
  20. Nextflow
      • http://www.nextflow.io/
  21. Dataflow DSL vs Workflow DSL
      • Dataflow
        • A -> B -> C -> …
        • Data dependencies
      • Workflow
        • Task A -> Task B -> Task C -> …
        • Task dependencies
        • Data transfer is optional (through a file or DB)
        • + Scheduling
        • + Task names
          • For monitoring, redo, etc.
  22. Weavelet (wvlet)
      • Object-oriented workflow DSL for Scala
        • Workflow reuse, extension, override
        • Parameterization
      • Function := Task, Workflow := Class
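The idea "Function := Task, Workflow := Class" can be mimicked in Python to show what reuse, override, and parameterization look like when a workflow is a class. This is only an analogy: it is not wvlet syntax (wvlet is a Scala DSL whose API is not shown in the slides), and every class, method, and table name below is hypothetical.

```python
class DailyReport:
    """A workflow is a class; each method is a task."""

    source_table = "weblog"  # parameter that subclasses may change

    def extract(self):
        return f"rows from {self.source_table}"

    def summarize(self, rows):
        return f"summary of {rows}"

    def run(self):
        # Task dependencies are expressed as ordinary function calls.
        return self.summarize(self.extract())


class MobileReport(DailyReport):
    """Reuses the parent workflow, changing one parameter and one task."""

    source_table = "mobile_log"

    def summarize(self, rows):
        return f"mobile summary of {rows}"


base_result = DailyReport().run()
mobile_result = MobileReport().run()
```

The subclass inherits `extract` and `run` unchanged, which is the kind of workflow reuse and override the slide lists.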
  23. Isolating DAG generation and its execution
      • Alternatives to MR (MapReduce)
        • Tez
        • Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
        • Asakusa on Hadoop, Spark
      • [Diagram: the DSL generates a DAG, which runs on Local, Hadoop, or Spark to produce the result]
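Separating DAG generation from execution can be sketched as follows: the DSL calls only build a plan, and interchangeable backends decide what to do with it. This is a toy model under stated assumptions; `Node`, `source`, `explain`, and `run_local` are hypothetical names, and a Hadoop or Spark backend would translate the same plan instead of evaluating it in-process.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Node:
    """One step of the plan; building nodes never runs any computation."""
    op: str                              # "source", "map", or "filter"
    fn: Optional[Callable] = None
    data: tuple = ()
    parent: Optional["Node"] = None

    def map(self, fn):
        return Node("map", fn=fn, parent=self)

    def filter(self, fn):
        return Node("filter", fn=fn, parent=self)

def source(items):
    return Node("source", data=tuple(items))

def explain(node):
    """Backend 1: list the operators from source to sink without executing."""
    ops = []
    while node is not None:
        ops.append(node.op)
        node = node.parent
    return list(reversed(ops))

def run_local(node):
    """Backend 2: evaluate the plan in the local process."""
    if node.op == "source":
        return list(node.data)
    upstream = run_local(node.parent)
    if node.op == "map":
        return [node.fn(x) for x in upstream]
    if node.op == "filter":
        return [x for x in upstream if node.fn(x)]
    raise ValueError(f"unknown operator: {node.op}")

plan = source([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
```

Because `plan` is plain data, the same object can be passed to `explain(plan)` or `run_local(plan)` without rebuilding it.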
  24. Stream DSL
      • Add "moving stream" support to Dataflow DSL
        • "Moving" streams and "resting" datasets
      • Examples
        • Spark Streaming
          • Spark DSL + micro-batches for streams
        • Microsoft Azure Stream SQL
          • Windowing support for moving data
        • Norikra
          • Stream processing with SQL
      • Reactive programming
        • ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG)
        • Back-pressure support
          • Controlling data transfer speed from the receiver side
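Back-pressure, as described on this slide, means the receiver controls how fast data is sent. A minimal sketch using only Python's standard library is a bounded queue: the producer blocks whenever the buffer is full, so it can never run ahead of the consumer by more than the buffer capacity. This is not how ReactiveX or Akka Streams implement it; `run_pipeline` and its parameters are hypothetical.

```python
import queue
import threading

def run_pipeline(items, capacity=2):
    """Send items through a bounded buffer; return (received, max buffered)."""
    buffer = queue.Queue(maxsize=capacity)
    done = object()                      # sentinel marking the end of the stream

    def producer():
        for item in items:
            buffer.put(item)             # blocks while the buffer is full
        buffer.put(done)

    thread = threading.Thread(target=producer)
    thread.start()

    received = []
    max_buffered = 0
    while True:
        max_buffered = max(max_buffered, buffer.qsize())
        item = buffer.get()              # receiving frees a slot for the producer
        if item is done:
            break
        received.append(item)
    thread.join()
    return received, max_buffered

received, max_buffered = run_pipeline(range(10), capacity=2)
```

The pace of `buffer.get()` on the receiving side is what bounds the sender, which is the property the slide calls controlling transfer speed from the receiver side.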
  25. Task Execution Retry
      • Design patterns for retries and idempotency (article in Japanese)
        • http://frsyuki.hatenablog.com/entry/2014/06/09/164559
      • System failures
        • Process is not responding
        • Network, hardware failures
      • Middleware failures
        • Provisioning failures, missing components
      • User failures
        • Wrong configuration
        • Programming errors
  26. Retry Example
      • Example: a task calling a REST API /create/xxx
        • Client: first attempt
          • Server returns 200 Success
          • But the client fails to receive the status code
        • Client retries the task
          • Gets a 409 Conflict error (entry xxx has already been created)
      • Solution (application side)
        • Handle the 409 error as success in the client (idempotent execution)
      • Stricter approach
        • Make xxx unique for each request
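The retry scenario on this slide can be reproduced with a small in-memory simulation. Everything here is a stand-in: `FakeServer`, `LostResponse`, `make_transport`, and `create_idempotent` are hypothetical names, and no real HTTP client or REST service is involved. The first response is deliberately lost after the server has already created the entry, so the retry sees 409.

```python
class FakeServer:
    """In-memory stand-in for the /create/xxx endpoint."""

    def __init__(self):
        self.entries = set()

    def create(self, name):
        if name in self.entries:
            return 409                   # Conflict: entry already exists
        self.entries.add(name)
        return 200

class LostResponse(Exception):
    """The request reached the server, but the status never came back."""

def make_transport(server, lose_first_response=True):
    calls = {"count": 0}

    def send(name):
        calls["count"] += 1
        status = server.create(name)     # the server always processes the request
        if lose_first_response and calls["count"] == 1:
            raise LostResponse("connection dropped before the status arrived")
        return status

    return send

def create_idempotent(send, name, max_attempts=3):
    """Retry on lost responses; treat 409 as success. Returns attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = send(name)
        except LostResponse:
            continue                     # outcome unknown, so try again
        if status in (200, 409):         # 409 means an earlier attempt succeeded
            return attempt
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError(f"gave up after {max_attempts} attempts")

server = FakeServer()
attempts = create_idempotent(make_transport(server), "xxx")
```

Treating 409 as success is only safe when a conflict can come solely from the client's own earlier attempt, which is why the slide's stricter approach makes xxx unique for each request.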
  27. Fault Tolerance
      • Presto: distributed query engine developed by Facebook
        • Uses HTTP data transfer
        • No fault tolerance
      • 99.5% of queries finish without any failure
        • For queries processing 10 billion or more rows, this drops to 85%
      • [Diagram: split -> map (A0 to A2) -> reduce (B0 to B3) -> merge; stages: TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result]
  28. Summary
      • Recent workflow tools
        • Driven by the Python community
          • Because of this book! (shown on the slide)
        • Airflow, Luigi, etc.
      • Workflow manager
        • Handles system failures, monitoring
      • Workflow development
        • DAG-based DSL (dataflow, workflow, stream processing) -> Execution
      • Does not cover application logic errors
        • Idempotent execution
        • Requires splitting large tasks into smaller ones