Hivemall Talk at TD tech talk #3

2. Who am I?
Copyright ©2015 Treasure Data. All Rights Reserved.
• 2015/04: Joined Treasure Data, Inc.
• 1st Research Engineer in Treasure Data
• My mission in TD is developing ML-as-a-Service (MLaaS)
• 2010/04-2015/03: Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
• Worked on a large-scale Machine Learning project and Parallel Databases
• 2009/03: Ph.D. in Computer Science from NAIST
  • My research topic was about building XML native database and Parallel Database systems
• Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers)
  • Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida

3. Figures of Imported Data in Treasure Data
[Chart: imported records by month, Aug-12 to Oct-14; y-axis in billion records, 0 to 12,000. Marked milestones: Service in; Series A Funding; Reached 100 customers; Selected as "Cool Vendor in Big Data" by Gartner; 5 trillion records; 10 trillion records.]
Figures on Oct. 2014:
• 400 thousand records imported each SECOND!!
• 10+ trillion records: total number of imported records
• 12 billion records: # of records sent by an Ad-tech company

4. The latest numbers in Treasure Data
• 100+: customers in Japan
• 15 trillion: # of stored records
• 4,000: a single company sends data to us from 4,000 nodes
• 500,000: # of records stored per second

6. What is Hivemall
Scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2
[Stack diagram, top layer first]
• Machine Learning: Hivemall
• Query Processing: Hive / PIG
• Parallel Data Processing Framework: MapReduce (MRv1); Apache Tez (DAG processing); MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS
Check http://github.com/myui/hivemall

7. MapReduce and DAG engine
[Diagram: map (M) and reduce (R) tasks under MapReduce, with HDFS between stages, compared with the same tasks under a DAG engine (Tez/Spark)]
DAG engine (Tez/Spark): No intermediate DFS reads/writes!

8. The key characteristic of Hivemall
Very easy to use; Machine Learning on SQL
Classification with Mahout: 100+ lines of code
    CREATE TABLE lr_model AS
    SELECT
      feature, -- reducers perform model averaging in parallel
      avg(weight) as weight
    FROM (
      SELECT logress(features,label,..) as (feature,weight)
      FROM train
    ) t -- map-only task
    GROUP BY feature; -- shuffled to reducers
This SQL query automatically runs in parallel on Hadoop
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction

9. List of functions in Hivemall v0.3
• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer
Treasure Data will support Hivemall v0.3.1 in the next week!
bit.ly/hivemall-mf

13. How to use Hivemall - Data preparation
Define a Hive table for training/testing data
    CREATE EXTERNAL TABLE e2006tfidf_train (
      rowid int,
      label float,
      features ARRAY<STRING>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      COLLECTION ITEMS TERMINATED BY ","
    STORED AS TEXTFILE
    LOCATION '/dataset/E2006-tfidf/train';

15. How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
    create view e2006tfidf_train_scaled
    as
    select
      rowid,
      -- the table defined for this dataset names the column label, not target
      rescale(label,${min_label},${max_label}) as label,
      features
    from
      e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0
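
As a rough illustration of what min-max normalization computes, here is a small Python sketch. The function name `rescale` mirrors the Hivemall UDF used in the view, but the body, the handling of a zero-width range, and the sample label values are assumptions made for illustration, not Hivemall's implementation.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption: a zero-width range maps to 0.5; Hivemall may behave differently.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values; in the slide, min_label and max_label are
# passed in as Hive variables computed beforehand.
labels = [2.0, 5.0, 10.0]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
print(scaled)
```

The smallest label maps to 0.0 and the largest to 1.0, which is the range the slide describes.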

17. How to use Hivemall - Training
Training by logistic regression
    CREATE TABLE lr_model AS
    SELECT
      feature,
      avg(weight) as weight
    FROM (
      SELECT logress(features,label,..) as (feature,weight)
      FROM train
    ) t
    GROUP BY feature
• Inner query: map-only task to learn a prediction model
• GROUP BY: shuffle map outputs to reducers by feature
• avg(weight): reducers perform model averaging in parallel
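
The map, shuffle, and average steps of this query can be sketched in plain Python. This is only a single-process illustration of the data flow: the per-mapper outputs below are made-up numbers, and `average_models` is a hypothetical helper, not part of Hivemall.

```python
from collections import defaultdict

# Hypothetical (feature, weight) rows emitted by three independent map tasks,
# each of which trained its own model on a different split of the data.
mapper_outputs = [
    [("f1", 0.5), ("f2", -1.0)],
    [("f1", 1.0), ("f3", 0.25)],
    [("f1", 1.5), ("f2", -2.0)],
]


def average_models(outputs):
    """Group weights by feature (the shuffle) and average each group (the reduce)."""
    grouped = defaultdict(list)
    for rows in outputs:
        for feature, weight in rows:
            grouped[feature].append(weight)
    return {feature: sum(ws) / len(ws) for feature, ws in grouped.items()}


model = average_models(mapper_outputs)
print(model)
```

A feature that only some mappers saw (here `f3`) is averaged over just the mappers that emitted it, which is also what `avg(weight) ... GROUP BY feature` does.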

18. How to use Hivemall - Training
Training of Confidence Weighted Classifier
    CREATE TABLE news20b_cw_model1 AS
    SELECT
      feature,
      voted_avg(weight) as weight
    FROM (
      SELECT train_cw(features,label) as (feature,weight)
      FROM news20b_train
    ) t
    GROUP BY feature
• train_cw: training for the CW classifier
• voted_avg: vote to use negative or positive weights for avg, e.g., +0.7, +0.3, +0.2, -0.1, +0.7
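
The voting idea can be sketched in Python as follows. This is a reading of the slide's description, not Hivemall's source: the tie-breaking rule and the result for an input with no non-zero weights are assumptions.

```python
def voted_avg(weights):
    """Average only the weights on the side (positive or negative) that has more votes."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: a tie is resolved in favor of the positive side.
    winners = positives if len(positives) >= len(negatives) else negatives
    if not winners:
        return 0.0  # Assumption: no non-zero weights yields 0.0.
    return sum(winners) / len(winners)


# The five weights shown on the slide: four positive votes, one negative.
weights = [0.7, 0.3, 0.2, -0.1, 0.7]
result = voted_avg(weights)
print(round(result, 3))
```

With the slide's values the positive side wins, so the result is the mean of the four positive weights (about 0.475), whereas a plain average of all five values would be 0.36.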

19. Ensemble learning for stable prediction performance
    create table news20mc_ensemble_model1 as
    select
      label,
      cast(feature as int) as feature,
      cast(voted_avg(weight) as float) as weight
    from (
      select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
      from news20mc_train_x3
      union all
      select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
      from news20mc_train_x3
      union all
      select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
      from news20mc_train_x3
    ) t
    group by label, feature;
Just stack prediction models by union all

21. How to use Hivemall - Prediction
    CREATE TABLE lr_predict as
    SELECT
      t.rowid,
      sigmoid(sum(m.weight)) as prob
    FROM
      testing_exploded t
      LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
    GROUP BY t.rowid
• Prediction is done by LEFT OUTER JOIN between test data and prediction model
• No need to load the entire model into memory

23. Type/Purpose Matrix of Machine Learning
Online Prediction
• Online Learning: Algorithm Trade (HFT); Twitter real-time analysis
• Offline Learning: Ad-tech (e.g., CTR/CVR prediction); Real-time recommendation
Offline Prediction
• Online Learning: no/few needs?
• Offline Learning: Daily/weekly batch systems; Business Analytics/Reporting

24. How to use Hivemall
[Diagram: Machine Learning (Batch Training on Hadoop) takes labels and feature vectors and produces a prediction model; the prediction model is exported; Online Prediction on RDBMS takes a feature vector and returns a label]

25. Export Prediction Model to a RDBMS
    hive> desc news20b_cw_model1;
    feature  int
    weight   double
Sample rows (feature, weight):
    103  -0.4896543622016907
    104  -0.0955817922949791
    105  0.12560302019119263
    106  0.09214721620082855
[Diagram: TD export sends the model table to any RDBMS]
Periodic export is very easy in Treasure Data

26. Real-time Prediction on MySQL
#2 Preparing a Test data table
    hive> desc testing_exploded;
    feature  string
    value    float
#3 Online prediction on MySQL
    SELECT
      sigmoid(sum(t.value * m.weight)) as prob
    FROM
      testing_exploded t
      LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
where SIGMOID(x) = 1.0 / (1.0 + exp(-x))
[Diagram labels: Prediction Model, Feature Vector, Label]
• Alternatively, you can define a SQL view for the test target
• Index lookups are very efficient in RDBMSs
http://bit.ly/hivemall-rtp
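
To try the prediction query without a MySQL server, the following Python sketch runs the same join against an in-memory SQLite database. Several things here are assumptions for illustration: SQLite stands in for MySQL, `sigmoid` is registered as a user-defined function because SQLite has no built-in one, the four model rows are the sample weights from the export slide, and the test row (features 103, 105, and an unseen feature 999) is made up.

```python
import math
import sqlite3


def sigmoid(x):
    """SIGMOID(x) = 1.0 / (1.0 + exp(-x)), as defined on the slide."""
    return 1.0 / (1.0 + math.exp(-x))


conn = sqlite3.connect(":memory:")
conn.create_function("sigmoid", 1, sigmoid)

# The primary key gives the model table an index on feature for the join lookups.
conn.execute("CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL)")
conn.execute("CREATE TABLE testing_exploded (feature TEXT, value REAL)")

conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [
        ("103", -0.4896543622016907),
        ("104", -0.0955817922949791),
        ("105", 0.12560302019119263),
        ("106", 0.09214721620082855),
    ],
)
# Hypothetical test row, exploded into one (feature, value) pair per line.
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [("103", 1.0), ("105", 2.0), ("999", 1.0)],
)

(prob,) = conn.execute(
    """
    SELECT sigmoid(sum(t.value * m.weight)) AS prob
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
    """
).fetchone()
print(prob)
```

Feature 999 has no row in the model, so the join yields a NULL weight that `sum` skips; only the matched features contribute to the score. If no feature matched at all, `sum` would return NULL and this simple `sigmoid` would fail, so a real deployment would need to handle that case.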

27. Cost of Amazon Machine Learning
• Data Analysis and Model Building Fees: $0.42/Instance per Hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request
Amazon-ML is suspected to be based on Vowpal Wabbit (single process)
Pay-per-request pricing is apparently not suitable for running a prediction on every web request (e.g., online CTR prediction)
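
A quick back-of-the-envelope calculation shows why per-request pricing adds up. The price is the real-time figure from the slide; the traffic volume of 100 million requests per day is a hypothetical number chosen only to make the arithmetic concrete.

```python
REALTIME_PRICE_PER_REQUEST = 0.0001  # USD per request, from the slide


def realtime_cost(requests):
    """Cost in USD of serving the given number of real-time predictions."""
    return requests * REALTIME_PRICE_PER_REQUEST


# Hypothetical traffic: one prediction per web request, 100 million requests a day.
requests_per_day = 100_000_000
daily_cost = realtime_cost(requests_per_day)
monthly_cost = daily_cost * 30

print(f"daily: ${daily_cost:,.0f}, 30 days: ${monthly_cost:,.0f}")
```

At that assumed volume the bill comes to roughly $10,000 per day, or about $300,000 over 30 days.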


Makoto Yui
