Hivemall Talk at TD tech talk #3
Hivemall talk at Treasure Data Tech Talk #3 at Retty
http://eventdots.jp/event/458211

Published in: Engineering

Transcript

  • 1. 20 min. Introduction to Hivemall
      Makoto YUI (@myui), Research Engineer, Treasure Data Inc.
      2015/05/14, TD tech talk #3 @Retty
      http://myui.github.io/
      Copyright ©2015 Treasure Data. All Rights Reserved.
  • 2. Who am I?
      • 2015/04: Joined Treasure Data, Inc.
          • First Research Engineer at Treasure Data
          • My mission at TD is developing ML-as-a-Service (MLaaS)
      • 2010/04-2015/03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
          • Worked on a large-scale machine learning project and parallel databases
      • 2009/03: Ph.D. in Computer Science from NAIST
          • My research topic was building an XML native database and parallel database systems
      • Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers)
          • Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida
  • 3. Figures of Imported Data in Treasure Data
      [Chart: imported records in billions, Aug 2012 to Oct 2014, annotated with milestones: Service in, Series A Funding, Reached 100 customers, Selected as "Cool Vendor in Big Data" by Gartner, 5 trillion records, 10 trillion records.]
      Figures as of Oct. 2014:
      • 400,000 records imported every second
      • 10+ trillion records imported in total
      • 12 billion records sent by a single ad-tech company
  • 4. The latest numbers in Treasure Data
      • 100+ customers in Japan
      • 15 trillion stored records
      • 4,000 nodes: a single company sends data to us from 4,000 nodes
      • 500,000 records stored per second
  • 5. Plan of the Talk
      1. Brief introduction to Hivemall
      2. How to use Hivemall
      3. Real-time prediction w/ Hivemall and RDBMS
  • 6. What is Hivemall
      Scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2.
      [Stack diagram, top to bottom:]
      • Machine Learning: Hivemall
      • Query Processing: Hive / Pig
      • Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), MR v2
      • Resource Management: Apache YARN
      • Distributed File System: Hadoop HDFS
      Check http://github.com/myui/hivemall
  • 7. MapReduce and DAG engine
      [Diagram: map (M) and reduce (R) stages with HDFS reads and writes, comparing MapReduce with a DAG engine (Tez/Spark).]
      DAG engine: no intermediate DFS reads/writes!
  • 8. The key characteristic of Hivemall
      Very easy to use; machine learning on SQL.
      Classification with Mahout: 100+ lines of code.
        CREATE TABLE lr_model AS
        SELECT
          feature,
          -- reducers perform model averaging in parallel
          avg(weight) as weight
        FROM (
          SELECT logress(features,label,..) as (feature,weight)
          FROM train
        ) t -- map-only task
        GROUP BY feature; -- shuffled to reducers
      This SQL query automatically runs in parallel on Hadoop.
      • Machine learning made easy for SQL developers (ML for the rest of us)
      • APIs are very stable because of the SQL abstraction
  • 9. List of functions in Hivemall v0.3
      • Classification (both binary- and multi-class)
          • Perceptron
          • Passive Aggressive (PA)
          • Confidence Weighted (CW)
          • Adaptive Regularization of Weight Vectors (AROW)
          • Soft Confidence Weighted (SCW)
          • AdaGrad+RDA
      • Regression
          • Logistic Regression (SGD)
          • PA Regression
          • AROW Regression
          • AdaGrad
          • AdaDELTA
      • kNN and Recommendation
          • Minhash and b-Bit Minhash (LSH variant)
          • Similarity Search using K-NN
          • Matrix Factorization
      • Feature engineering
          • Feature hashing
          • Feature scaling (normalization, z-score)
          • TF-IDF vectorizer
      Treasure Data will support Hivemall v0.3.1 next week!
      bit.ly/hivemall-mf
  • 10. Hivemall on Apache Pig
      • Contribution from Daniel Dai (Pig PMC) of Hortonworks
      • To be supported from Pig 0.15
  • 11. Plan of the Talk
      1. Brief introduction to Hivemall
      2. How to use Hivemall
      3. Real-time prediction w/ Hivemall and RDBMS
  • 12. How to use Hivemall
      [Workflow diagram: machine learning training on labels and feature vectors produces a prediction model; prediction applies the model to a feature vector and outputs a label. Current step: Data preparation.]
  • 13. How to use Hivemall - Data preparation
      Define a Hive table for training/testing data:
        CREATE EXTERNAL TABLE e2006tfidf_train (
          rowid int,
          label float,
          features ARRAY<STRING>
        )
        ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          COLLECTION ITEMS TERMINATED BY ","
        STORED AS TEXTFILE
        LOCATION '/dataset/E2006-tfidf/train';
  • 14. How to use Hivemall
      [Workflow diagram: machine learning training on labels and feature vectors produces a prediction model; prediction applies the model to a feature vector and outputs a label. Current step: Feature Engineering.]
  • 15. How to use Hivemall - Feature Engineering
      Applying min-max normalization: transforming a label value to a value between 0.0 and 1.0.
        create view e2006tfidf_train_scaled as
        select
          rowid,
          rescale(label, ${min_label}, ${max_label}) as label,
          features
        from e2006tfidf_train;
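As a rough illustration of the min-max step on slide 15, the following Python sketch maps a value into the range 0.0 to 1.0. The function name `rescale` mirrors the query, but this is a hypothetical stand-in rather than Hivemall's implementation; the sample labels and the handling of a zero-width range are assumptions.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max scaling: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption: a degenerate range maps to the midpoint.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values, chosen only for illustration.
labels = [-4.0, -2.0, 0.0]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
print(scaled)  # [0.0, 0.5, 1.0]
```

In the Hive query, `${min_label}` and `${max_label}` play the role of `lo` and `hi` and would have to be computed beforehand.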
  • 16. How to use Hivemall
      [Workflow diagram: machine learning training on labels and feature vectors produces a prediction model; prediction applies the model to a feature vector and outputs a label. Current step: Training.]
  • 17. How to use Hivemall - Training
      Training by logistic regression:
        CREATE TABLE lr_model AS
        SELECT
          feature,
          avg(weight) as weight
        FROM (
          SELECT logress(features,label,..) as (feature,weight)
          FROM train
        ) t
        GROUP BY feature
      • The inner SELECT is a map-only task that learns a prediction model.
      • GROUP BY shuffles map outputs to reducers by feature.
      • Reducers perform model averaging in parallel.
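The map-then-average pattern on slide 17 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Hivemall's `logress`: the helper names `train_shard` and `average_models`, the learning rate, the epoch count, the binary features, and the toy data are all hypothetical. Each "mapper" trains its own logistic regression with SGD, and the "reducers" average the weights per feature.

```python
import math
from collections import defaultdict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_shard(rows, eta=0.1, epochs=20):
    """One 'mapper': SGD logistic regression over its own shard.

    rows is a list of (features, label) pairs, where features is a list
    of feature names (treated as binary features) and label is 0 or 1.
    Returns (feature, weight) pairs, like the subquery's output.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            predicted = sigmoid(sum(weights[f] for f in features))
            gradient = label - predicted
            for f in features:
                weights[f] += eta * gradient
    return list(weights.items())


def average_models(shard_outputs):
    """The 'reducers': group by feature and average the weights."""
    grouped = defaultdict(list)
    for pairs in shard_outputs:
        for feature, weight in pairs:
            grouped[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}


# Two hypothetical shards, one per mapper.
shards = [
    [(["sunny", "warm"], 1), (["rainy", "cold"], 0)],
    [(["sunny", "cold"], 1), (["rainy", "warm"], 0)],
]
model = average_models(train_shard(rows) for rows in shards)
print(sorted(model))  # ['cold', 'rainy', 'sunny', 'warm']
```

Because every mapper works only on its own shard, the training step needs no coordination; the only communication is the shuffle of (feature, weight) pairs to the averaging step.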
  • 18. How to use Hivemall - Training
      Training of the Confidence Weighted (CW) classifier:
        CREATE TABLE news20b_cw_model1 AS
        SELECT
          feature,
          voted_avg(weight) as weight
        FROM (
          SELECT train_cw(features,label) as (feature,weight)
          FROM news20b_train
        ) t
        GROUP BY feature
      voted_avg: vote to use negative or positive weights for avg, e.g., +0.7, +0.3, +0.2, -0.1, +0.7.
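One reading of the voting step on slide 18 is that the majority sign decides which weights get averaged. The Python sketch below follows that reading; it is not taken from Hivemall's source, and the tie-breaking and zero-handling rules are assumptions.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote.

    Assumption: ties go to the positive side and zeros count as positive;
    Hivemall's actual voted_avg may differ.
    """
    positives = [w for w in weights if w >= 0.0]
    negatives = [w for w in weights if w < 0.0]
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0


# The weights shown on the slide: four positive votes, one negative.
example = [0.7, 0.3, 0.2, -0.1, 0.7]
print(round(voted_avg(example), 3))  # 0.475
print(round(sum(example) / len(example), 3))  # plain average: 0.36
```

Under this reading, the single negative weight is discarded instead of pulling the averaged weight toward zero.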
  • 19. Ensemble learning for stable prediction performance
      Just stack prediction models by union all:
        create table news20mc_ensemble_model1 as
        select
          label,
          cast(feature as int) as feature,
          cast(voted_avg(weight) as float) as weight
        from (
          select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
          from news20mc_train_x3
          union all
          select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
          from news20mc_train_x3
          union all
          select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
          from news20mc_train_x3
        ) t
        group by label, feature;
  • 20. How to use Hivemall
      [Workflow diagram: machine learning training on labels and feature vectors produces a prediction model; prediction applies the model to a feature vector and outputs a label. Current step: Prediction.]
  • 21. How to use Hivemall - Prediction
        CREATE TABLE lr_predict as
        SELECT
          t.rowid,
          sigmoid(sum(m.weight)) as prob
        FROM testing_exploded t
        LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
        GROUP BY t.rowid
      Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model. There is no need to load the entire model into memory.
  • 22. Plan of the Talk
      1. Brief introduction to Hivemall
      2. How to use Hivemall
      3. Real-time prediction w/ Hivemall and RDBMS
  • 23. Type/Purpose Matrix of Machine Learning
      • Online learning, online prediction: algorithmic trading (HFT); Twitter real-time analysis
      • Offline learning, online prediction: ad-tech (e.g., CTR/CVR prediction); real-time recommendation
      • Online learning, offline prediction: no/few needs?
      • Offline learning, offline prediction: daily/weekly batch systems; business analytics/reporting
  • 24. How to use Hivemall
      [Workflow diagram: batch training on Hadoop, using labels and feature vectors, produces a prediction model; the model is exported; online prediction on an RDBMS applies it to a feature vector and outputs a label. Current step: Export prediction model.]
  • 25. Export Prediction Model to an RDBMS
        hive> desc news20b_cw_model1;
        feature  int
        weight   double
      Sample rows (feature, weight):
        103  -0.4896543622016907
        104  -0.0955817922949791
        105  0.12560302019119263
        106  0.09214721620082855
      [Diagram: TD export sends the model table to any RDBMS.]
      Periodical export is very easy in Treasure Data.
  • 26. Real-time Prediction on MySQL
      #2 Preparing a test data table
        hive> desc testing_exploded;
        feature  string
        value    float
      Alternatively, you can define a SQL view for the test target.
      #3 Online prediction on MySQL
        SELECT
          sigmoid(sum(t.value * m.weight)) as prob
        FROM testing_exploded t
        LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
      where SIGMOID(x) = 1.0 / (1.0 + exp(-x)).
      Index lookups are very efficient in RDBMSs.
      http://bit.ly/hivemall-rtp
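The online prediction query on slide 26 can be tried end to end with Python's built-in `sqlite3` module. SQLite stands in for MySQL here purely so the sketch is self-contained; the weights, the test rows, the `TEXT` feature type, and the registered `sigmoid` function are all assumptions made for illustration.

```python
import math
import sqlite3


def sigmoid(x):
    # SUM() over no matching rows yields NULL (None); pass it through.
    if x is None:
        return None
    return 1.0 / (1.0 + math.exp(-x))


conn = sqlite3.connect(":memory:")
# Register sigmoid as a one-argument SQL function for this connection.
conn.create_function("sigmoid", 1, sigmoid)

conn.executescript("""
CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL);
CREATE TABLE testing_exploded (feature TEXT, value REAL);
""")
# Hypothetical model weights and one exploded test example.
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [("103", -0.5), ("104", -0.1), ("105", 0.125)],
)
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [("103", 1.0), ("105", 2.0), ("999", 1.0)],  # "999" is not in the model
)

(prob,) = conn.execute("""
SELECT sigmoid(sum(t.value * m.weight)) AS prob
FROM testing_exploded t
LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()
print(prob)
```

The primary key on `prediction_model.feature` gives the join an index to look up, and the feature missing from the model contributes a NULL product that `sum` ignores, so the score here is sigmoid(-0.5 + 0.25).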
  • 27. Cost of Amazon Machine Learning
      • Data analysis and model building fees: $0.42 per instance per hour
      • Batch prediction: $0.1 per 1,000 requests
      • Real-time prediction: $0.0001 per request
      Pay-per-request pricing does not seem suitable for running a prediction on every web request (e.g., online CTR prediction).
      Amazon ML is suspected to be based on Vowpal Wabbit (single process).
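To make the pay-per-request concern on slide 27 concrete, the Python sketch below multiplies the quoted real-time price by a request rate. Only the $0.0001 price comes from the slide; the traffic level of 1,000 requests per second is a hypothetical figure chosen for illustration.

```python
from decimal import Decimal

PRICE_PER_REALTIME_REQUEST = Decimal("0.0001")  # USD, from the slide


def daily_cost(requests_per_second: int) -> Decimal:
    """Cost in USD of one prediction per web request for a full day."""
    seconds_per_day = 24 * 60 * 60
    return PRICE_PER_REALTIME_REQUEST * requests_per_second * seconds_per_day


# Hypothetical traffic level, chosen only for illustration.
print(daily_cost(1000))  # 8640.0000
```

At that assumed rate, real-time prediction alone would come to $8,640 per day.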
  • 28. Real-time Prediction on Treasure Data
      1. Run the batch training job periodically
      2. Periodical export
      3. Real-time prediction on an RDBMS
  • 29. Beyond Query-as-a-Service!
      We ❤️ Open-source! We Invented ..
      We are Hiring!
