2015 03-12 道玄坂LT祭り第2回 Spark DataFrame Introduction

1. Spark DataFrames Introduction 道玄坂LT祭り第2回(統計、機械学習、データ抽出) Yu Ishikawa 1

2. 発表のゴール • Hadoop MapReduce を書くのめんどいわー • Python ユーザや R ユーザが，ビッグデータを気軽に扱うにはどうすりゃいいの？ 2 それ Spark DataFrame でできるよ

3. Apache Spark 知ってるよね？ 3

4. Speed 4 Hadoop MapReduce に比べて 10x ~ 100x 早い

5. Ease of Use 5 PageRank も数行で実装できる

6. Generality 6 機械学習ライブラリなども標準で利用できる

7. Runs Everywhere 7 Hadoop, Mesos などと豊富な連携

8. 2015 年の重点開発方針 • Data Science – 使いやすい high-level APIs の提供 • scikit-learn のような一貫性をもった APIs • Platform Interfaces – 標準対応以外のデータソースへのアクセスやアルゴリズムをより簡単に利用できるインターフェースの提供 • spark-packages というパッケージ管理ツールのようなもので 3rd party ライブラリを利用できるようにする 8

9. DataFrame APIs の紹介「Spark なにそれ？」って人も多いと思うので表面的な話をします 9

10. Agenda • 導入 • DataFrame APIs とは？ • DataFrame APIs の紹介 • Demo • まとめ 10

12. DataFrame APIs とは • データサイエンスでよく使われる処理を Domain-specific functions にしたもの – Project – Filter – Aggregation – Join – UDFs • Python, Java, Scala and R (via SparkR) から利用できる • Spark 1.3 でリリースされる機能 12 ビッグデータを Spark 上でより簡潔でより高速に処理できる機能

13. Spark ならより簡単に MR を実装できる 13

14. DataFrame APIs で Spark をより簡潔に操作 14

15. DataFrame の最適化機構によってより高速に処理ができる 15

17. 主な DataFrame APIs • Creation • Check Schema • Project • Filter • Aggregation • Join • UDFs 17

18. Creation • DataFrame API で扱うデータの読み込み • JSON, Hive, Purque などが利用できる 18 // Create a SQLContext (sc is an existing SparkContext) val context = new org.apache.spark.sql.SQLContext(sc) // Create a DataFrame for Github events var path = "file:///tmp/github-archive-data/*.json.gz" val event = context.load(path, "json").as('event) // Create a DataFrame for Github users path = "file:///tmp/github-archive-data/github-users.json" val user = context.load(path, "json").as('user)

20. Project • select() で取り出したいカラムを選択 – $”parent.child” の記法でネストされたカラムも選択できる • select(‘key as ‘alias) のように as でエイリアスを作れる 20 // Select a column event("public”) event.select(‘public as ‘PUBLIC) // Select multile columns with aliases event.select('public as 'PUBLIC, 'id as 'ID) // Select nested columns with aliases event.select($"payload.size" as 'size, $"actor.id" as 'actor_id)

21. Filter • filter() は SQL の WHERE のような役割 • 複数の条件を指定する場合は，ひとつの filter() に入れることも，ふたつの filter() のチェーンに分けることもできる 21 // Filter by a condition user.filter("name is not null”) // Filter by a comblination of two conditions event.filter("public = true and type = 'ForkEvent'”) event.filter("public = true").filter("type = 'ForkEvent'”)

22. Aggregation • count() は単純にレコード数を数える • groupBy() は SQL の GROUP BY の働き • agg() を組み合わせることで威力をさらに発揮 22 // Count the number of records user.count // Group by ‘type column and then count event.groupBy("type").count() // Aggregate by ‘id column user.groupBy('id).agg('id, count("*"), sum('id))

23. Join • まず as() で各 DataFrame のエイリアスを登録 • join() と where() で結合と結合条件を指定 • 結合した結果のカラムを取り出すには，登録したエイリアスを利用すれば良い 23 // Register their aliases val user = user.as('user) val pr = event.filter('type === "PullRequestEvent").as('pr) // Join the two data sets val join = pr.join(user).where($"pr.payload.pull_request.user.id" === $"user.id") join.select($"pr.type", $"user.name", $"pr.created_at”)

24. UDF: User Defined Function • udf() で独自関数を定義して，DataFrame の中で利用できる • 例：文字列 “2015-01-01T00:00:00Z” から – “2015-01-01” を抽出する関数を定義 – “00:00:00” を抽出する関数を定義 24 // Define User Defined Functions val toDate = udf((createdAt: String) => createdAt.substring(0, 10)) val toTime = udf((createdAt: String) => createdAt.substring(11, 19)) // Use the UDFs in select() event.select(toDate('created_at) as 'date, toTime('created_at) as 'time)

26. With Machine Learning (1) • Github のコミットメッセージに対して，word2vec を適応 26 val toIter = udf((x: ArrayBuffer[String]) => x.mkString(delimiter)) val messages = event. select($"payload.commits.message" as 'messages).filter("messages is not null"). select(toIter($"messages")). flatMap(row => row.toSeq.map(_.toString).apply(0).split(delimiter)) val message = messages.map(_.replaceAll("""(n|)""", ""). replaceAll("""s+""", " ").split(" ").map(_.replaceAll("""(,|.)$""", "")).toSeq). filter(_.size > 0) // create a model val model = new Word2Vec().fit(message)

27. With Machine Learning (2) • 構築したモデルで類似語を出力 27 > model.findSynonyms("bug", 10).foreach(println) (issue,0.6874246597290039) (typo,0.663004457950592) (bugs,0.599325954914093) (errors,0.5887047052383423) (problem,0.5665265321731567) (fixes,0.5617778897285461) (spelling,0.5353047847747803) (crash,0.5330312848091125) (Fixed,0.5128884315490723) (small,0.5113803744316101)

29. まとめ • DataFrame APIs は，ビッグデータを Spark 上でより簡潔より高速に処理できる機能 • groupBy, agg, count, join などのデータ操作でよく使う関数が準備されている – Pandas を知っている人は，あんな感じをイメージしてくれるとよい • UDF で独自の関数も定義できる • 機械学習ライブラリとの組み合わせられる 29

30. DataFrame の課題 • Apache Spark の機械学習ライブラリ mllib につなぎこむためのデータ変換が面倒くさい • よりシームレスに連携できるような仕組みが必要だと思われる 30

31. DataFrame Introduction • spark-dataframe-introduction • http://goo.gl/Futoi0 31

Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

2015 03-12 道玄坂LT祭り第2回 Spark DataFrame Introduction

Yu Ishikawa

Transcript