Introduction to New Features and Use Cases of Hivemall
Makoto YUI (@myui), Research Engineer
2016/03/30 Treasure Data Techtalk
3. Who am I?
➢ 2015.04 Joined Treasure Data, Inc. 1st Research Engineer in Treasure Data
➢ 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
➢ 2009.03 Ph.D. in Computer Science from NAIST
➢ Head of the TD mountaineering club
➢ 3 members (1 of them a ghost member)
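12. Treasure Data supports ML-as-a-Service
[Figure: Treasure Data platform diagram. Labels: data sources (SQL Server, CRM, RDBMS, ERP, app logs, web logs, sensors); Treasure Data Collectors (Treasure Agent, Embulk, mobile SDK, JS SDK, embedded); batch analytics, ad hoc analytics, Machine Learning; integration with other products and with analytics tools (API, ODBC, JDBC, PUSH); data visualization and sharing.]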
13. Agenda
  1. Introduction to Hivemall
  2. Industrial use cases
  3. How to use Hivemall
  4. Development roadmap
14. What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

15. What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
[Figure: software stack, top to bottom. Machine Learning: Hivemall; Query Processing: Hive / Pig; Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez DAG processing, MR v2; Resource Management: Apache YARN; Distributed File System: Hadoop HDFS.]
16. Hivemall’s Vision: ML on SQL
Classification with Mahout
This SQL query automatically runs in parallel on Hadoop:

  CREATE TABLE lr_model AS
  SELECT
    feature, -- reducers perform model averaging in parallel
    avg(weight) as weight
  FROM (
    SELECT logress(features,label,..) as (feature,weight)
    FROM train
  ) t -- map-only task
  GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
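The reducer-side model averaging in the query above can be illustrated outside Hive. The following is a minimal Python sketch, not Hivemall code: it assumes each map task has already emitted (feature, weight) pairs, and it reproduces only the `GROUP BY feature` / `avg(weight)` step. The function name `average_weights` and the sample weights are made up for illustration.

```python
from collections import defaultdict


def average_weights(emitted):
    """Average per-feature weights emitted by independently trained mappers.

    `emitted` is an iterable of (feature, weight) pairs, mirroring the rows
    produced by the inner SELECT before GROUP BY feature.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    # A feature seen by only one mapper keeps that mapper's weight.
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical output of two map tasks (values are invented).
mapper_a = [("age", 0.5), ("clicks", 1.0)]
mapper_b = [("age", 1.5), ("price", -2.0)]
model = average_weights(mapper_a + mapper_b)
```

Because every mapper trains on its own partition without coordination, the only cross-task communication is the shuffle of (feature, weight) rows, which is what lets the whole job run as a single SQL statement.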
17. List of Features in Hivemall v0.3.x
Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
  ✓ Matrix Factorization
Feature engineering
  ✓ Feature Hashing
  ✓ Feature Scaling (normalization, z-score)
  ✓ TF-IDF vectorizer
  ✓ Polynomial Expansion
Anomaly Detection
  ✓ Local Outlier Factor
Top-k query processing
18. Features supported in Hivemall v0.4.0
  1. RandomForest
     • classification, regression
  2. Factorization Machine
     • classification, regression (factorization)
19. Features supported in Hivemall v0.4.1-alpha
  1. NLP Tokenizer (morphological analysis)
     • Kuromoji
  2. Mini-batch Gradient Descent
  3. RandomForest scalability improvements
Treasure Data is operating Hivemall v0.4.1-alpha.6
The above features are already supported
20. Agenda
  1. Introduction to Hivemall
  2. Industrial use cases
  3. How to use Hivemall
  4. Development roadmap
21. Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
  • Freakout Inc. and more
  • Replaced Spark MLlib w/ Hivemall at company X

22. Industry use cases of Hivemall
➢ Gender prediction of Ad click logs
  • Scaleout Inc.

23. Industry use cases of Hivemall
➢ Value prediction of real estate
  • Livesense

24. Industry use cases of Hivemall
Source:

25. Industry use cases of Hivemall
➢ Churn Detection
  • OISIX
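26. Churn prediction for a membership service
Problem
  • Recurring purchases by 100,000 members drive the company's overall revenue and profit, but there was no way to identify members at risk of churning in advance and prevent it
Approach
  • Applied machine learning to the past month of data to build a list of customers likely to churn in the coming month
  • Specifically, each step (training table creation -> normalization -> model building -> logistic regression) was implemented simply as queries using TD + Hivemall
Results
  • Machine learning without statistical expertise
  • Granting points to members on the predicted-churn list halved the churn rate
  • Surfaced the measures and events that carry churn risk, while also making the characteristic behavior of non-churners visible
  • Also enabled indirect service improvements, such as changing the UI according to the degree of risk
Data used for prediction (Web, Mobile)
  • Attribute information, behavior logs, complaint information, referral source, service usage information
Direct measures
  • Granting points, care calls
Indirect measures
  • Guiding users toward successful experiences, UI changes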
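The churn workflow described on this slide (training table, normalization, model building, logistic regression scoring) can be mimicked end to end in a few lines. This is only a sketch in plain Python under invented assumptions: the two features, the four sample members, and the learning-rate and epoch settings are made up, and the actual case study ran these steps as Hive queries with TD + Hivemall.

```python
import math


def min_max_rescale(rows):
    """Normalization step: rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, eta=0.5, epochs=200):
    """Model building step: plain SGD for logistic regression (bias is last)."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            xb = x + [1.0]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
            w = [wi + eta * (y - p) * xi for wi, xi in zip(w, xb)]
    return w


def churn_probability(w, x):
    """Scoring step: probability that a member churns."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x + [1.0])))


# Invented members: [days since last order, orders in the past month]
raw = [[30, 0], [25, 1], [2, 4], [5, 3]]
churned = [1, 1, 0, 0]

features = min_max_rescale(raw)
weights = train_logistic(features, churned)
# Member indices ordered from most to least likely to churn.
ranked = sorted(
    range(len(raw)), key=lambda i: -churn_probability(weights, features[i])
)
```

The ranked list corresponds to the "customers likely to churn" list on the slide; the business measures (points, care calls, UI changes) would then be targeted at its top entries.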
27. Industry use cases of Hivemall
➢ Recommendation
  • Portal site

28. Agenda
  1. Introduction to Hivemall
  2. Industrial use cases
  3. How to use Hivemall
  4. Development roadmap
29. RandomForest in Hivemall v0.4
Ensemble of Decision Trees

30. RandomForest in Hivemall v0.4
Ensemble of Decision Trees

31. Training of RandomForest

32. Prediction of RandomForest

33. Out-of-bag tests and Variable Importance

34. Out-of-bag tests and Variable Importance
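As background for the out-of-bag slides, the sketch below shows the general idea in plain Python: each tree trains on a bootstrap sample, the rows it never saw are its out-of-bag rows, and the ensemble predicts by majority vote. It is a generic illustration, not Hivemall's implementation; the helper names and the toy vote data are assumptions, and variable importance is not covered.

```python
import random
from collections import Counter


def bootstrap_split(n, rng):
    """Sample n row indices with replacement; unsampled rows are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(predictions):
    """Ensemble step: the most frequent label among the trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def oob_error(tree_votes, labels):
    """Error rate using, for each row, only votes from trees that never saw it.

    tree_votes[i] is the list of such votes for row i; rows without any
    out-of-bag vote are skipped.
    """
    scored = [(majority_vote(v), y) for v, y in zip(tree_votes, labels) if v]
    return sum(pred != y for pred, y in scored) / len(scored)


rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(10, rng)
```

Since out-of-bag rows act as a held-out set for each tree, this error estimate comes for free during training and needs no separate validation split.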
35. Recommendation
Rating prediction of a matrix
Can be applied to user/item recommendation

36. Matrix Factorization
Factorize a matrix into a product of matrices having k latent factors

37. Training of Matrix Factorization
Supports iterative training using a local disk cache

38. Prediction of Matrix Factorization
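To make the factorization idea concrete, here is a small Python sketch of SGD-based matrix factorization with k latent factors. It is a simplified textbook variant (global mean plus dot product, no per-user or per-item bias), not Hivemall's implementation; the function names, hyperparameters, and ratings are all invented for illustration.

```python
import random


def predict(P, Q, mu, user, item):
    """Predicted rating: global mean plus dot product of latent factors."""
    return mu + sum(pf * qf for pf, qf in zip(P[user], Q[item]))


def train_mf(ratings, k=2, eta=0.05, lam=0.01, epochs=500, seed=0):
    """Fit user factors P and item factors Q by SGD on (user, item, rating)."""
    rng = random.Random(seed)
    P, Q = {}, {}
    for user, item, _ in ratings:
        if user not in P:
            P[user] = [rng.uniform(-0.1, 0.1) for _ in range(k)]
        if item not in Q:
            Q[item] = [rng.uniform(-0.1, 0.1) for _ in range(k)]
    mu = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(epochs):
        for user, item, r in ratings:
            err = r - predict(P, Q, mu, user, item)
            for f in range(k):
                pu, qi = P[user][f], Q[item][f]
                # Gradient step with L2 regularization on both factors.
                P[user][f] += eta * (err * qi - lam * pu)
                Q[item][f] += eta * (err * pu - lam * qi)
    return P, Q, mu


# Invented ratings; u3 has not rated i2, so that cell is to be predicted.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 1),
    ("u2", "i1", 4), ("u2", "i2", 1),
    ("u3", "i1", 5),
]
P, Q, mu = train_mf(ratings)
unseen = predict(P, Q, mu, "u3", "i2")
```

Repeated passes over the same ratings are what make training iterative, which is why caching the training rows locally between epochs matters at scale.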
39. Factorization Machines
Matrix Factorization

40. Factorization Machines
Context information (e.g., time) can be considered
Source:

41. Training data for Factorization Machines
Each feature takes a LibSVM-like format: <feature[:weight]>

42. Training of Factorization Machines

43. Prediction of Factorization Machines
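The scoring side of a second-order factorization machine can be sketched in Python as follows. This follows the standard FM formulation (bias, linear terms, and pairwise interactions computed with the usual O(k * n) reformulation) rather than Hivemall's own code; the parser, the parameter values, and the feature names such as `user=u1` are hypothetical.

```python
def parse_features(row):
    """Parse LibSVM-like '<feature[:weight]>' strings; weight defaults to 1.0."""
    parsed = {}
    for token in row:
        name, sep, value = token.partition(":")
        parsed[name] = float(value) if sep else 1.0
    return parsed


def fm_score(x, w0, w, V):
    """Second-order FM score for sparse input x (feature -> value).

    Pairwise terms use the identity
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_d ((sum_i v_id x_i)^2 - sum_i (v_id x_i)^2).
    """
    linear = sum(w.get(f, 0.0) * v for f, v in x.items())
    k = len(next(iter(V.values())))
    pairwise = 0.0
    for d in range(k):
        s = sum(V[f][d] * v for f, v in x.items() if f in V)
        s_sq = sum((V[f][d] * v) ** 2 for f, v in x.items() if f in V)
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Invented model parameters with k = 2 latent factors per feature.
w0 = 0.1
w = {"user=u1": 0.2, "item=i2": -0.1, "hour": 0.4}
V = {"user=u1": [1.0, 0.0], "item=i2": [0.5, 1.0], "hour": [2.0, -1.0]}

x = parse_features(["user=u1", "item=i2", "hour:0.5"])
score = fm_score(x, w0, w, V)
```

Adding a context feature such as `hour` is just one more entry in the sparse row, which is how factorization machines take context into account where plain matrix factorization sees only user and item.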
44. Feature Engineering functions

45. Feature Engineering functions
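The slide bodies are not captured here, but three of the feature engineering techniques listed earlier in the deck (feature hashing, min-max normalization, z-score) are simple enough to sketch. The Python below is a generic illustration with hypothetical function names and signatures; CRC32 is used only because it ships with the standard library, and the bucket count is an arbitrary assumption.

```python
import zlib


def hash_feature(name, num_buckets=2 ** 24):
    """Feature hashing: map a feature name to a fixed range of bucket ids."""
    return zlib.crc32(name.encode("utf-8")) % num_buckets


def rescale(value, lo, hi):
    """Min-max normalization of a value to [0, 1] given the column range."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0


def zscore(value, mean, stddev):
    """Standardization to zero mean and unit variance."""
    return (value - mean) / stddev if stddev > 0 else 0.0


bucket = hash_feature("country=JP")
scaled = rescale(5.0, 0.0, 10.0)
standardized = zscore(12.0, 10.0, 2.0)
```

Hashing bounds the model size regardless of how many distinct feature names appear, at the cost of occasional collisions; scaling keeps gradient-based learners from being dominated by features with large numeric ranges.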
46. Agenda
  1. Introduction to Hivemall
  2. Industrial use cases
  3. How to use Hivemall
  4. Development roadmap
47. Features to be supported in Hivemall v0.4.1
  1. NLP Tokenizer (morphological analysis)
     • Kuromoji integration was requested by Company R
  2. Mini-batch Gradient Descent
  3. RandomForest scalability improvements
  4. Recommendation for Implicit Feedback Dataset
     • Useful when only positive feedback is available
     • BPR: Bayesian Personalized Ranking from Implicit Feedback, Proc. UAI, 2009.
Planned to release v0.4.1 in April.
49. Features to be supported in Hivemall v0.4.2
  1. Gradient Tree Boosting
     • classifier, regression
     • based on Smile
  2. Field-aware Factorization Machine
     • classification, regression (factorization)
Planned to release v0.4.2 in June
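50. Features to be supported in Hivemall v0.5
  1. Mix server on Apache YARN
     • Service for parameter sharing among workers
[Figure: data-parallel training. Partitioned training examples are fed to Learner 1, Learner 2, ..., Learner N, which exchange parameters to produce a trained model.]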
51. Features to be supported in Hivemall v0.5
  1. Mix server on Apache YARN
     • Service for parameter sharing among workers
  2. Online LDA
     • topic modeling, clustering
  3. XGBoost Integration
  4. Generalized Linear Model
     • Ridge/Elastic net/Lasso regularization
     • Supports various loss functions
  5. Alternating Direction Method of Multipliers (ADMM) convex optimization
  6. t-SNE Dimension Reduction
52. Analytics Workflow
Machine learning workflows can be simplified using our new workflow engine, named Digdag

  +main:
    +prepare:
      _parallel: true
      +train:
        td>: ./tasks/train_join.sql
      +test:
        td>: ./tasks/test_join.sql
    +quantify:
      td>: ./tasks/train_quantify.sql
    +model_test_quantify:
      _parallel: true
      +model:
        td>: ./tasks/make_model.sql
      +test_quantify:
        td>: ./tasks/test_quantify.sql
    +pred:
      td>: ./tasks/prediction.sql

CLI version will be released soon. Stay tuned!
53. Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall’s Positioning
  • For SQL users that need ML
  • Ease of use and scalability in mind
Treasure Data provides ML-as-a-Service using Hivemall
Major development leaps in v0.4
  • Random Forest
  • Factorization Machine
More will follow in v0.4.1 and later
54. Blog article about Hivemall
A series on solving Kaggle tasks using TD, Hivemall, Jupyter, and Pandas-TD
55. We support machine learning in the Cloud
Any feature requests or questions?
