Treasure Data on The YARN - Hadoop Conference Japan 2014

    Presentation Transcript

    • Copyright ©2014 Treasure Data. All Rights Reserved. Treasure Data on The YARN • Ryu Kobayashi • Hadoop Conference Japan 2014 • 8 July 2014
    • Who am I? • Ryu Kobayashi • @ryu_kobayashi • https://github.com/ryukobayashi • Treasure Data, Inc. • Software Engineer • Background • Hadoop, Cassandra, Machine Learning, ... • I developed the Huahin (Hadoop) Framework: http://huahinframework.org/
    • What is Treasure Data?
    • Our Service (diagram) • Data sources: Web Log, App Log, Sensor, RDBMS, CRM, ERP • Data Collection: Open-Source Log Collector (Streaming Upload); Bulk Loader for CSV / TSV, MySQL, Postgres, Oracle, etc. (Bulk Upload, Parallel Upload) • Data Warehouse: schema-less Columnar Storage + Hadoop MapReduce • Data Analysis: SQL (HiveQL) or Pig through TD command / Web Console, REST API, JDBC / ODBC • BI Tools: Tableau, QlikView, Pentaho, Excel, etc. • Result push to External Service / Storage: Custom App, RDBMS, FTP, etc.
    • Our Query Language
    • Our System • Hadoop Cluster • PlazmaDB • HDFS is not used • Customized Hadoop • Customized Hive • Customized Pig • Customized Impala • Customized Presto
    • We have 4 production Hadoop clusters • user1, user4, user5, … • user2, user9, user34, … • user10, user40, user102, … • user50, user88, user1023, …
    • Our Scheduler and Queue • Queue • Scheduler • Hadoop Cluster • Hadoop Cluster
    • We have 4 production Hadoop clusters and a Hadoop cluster on YARN • YARN Cluster
    • MRv1 and YARN • Queue • Queue • Hadoop Cluster • Hadoop Cluster
    • Our Service • About 4,700 users • About 6 trillion records • About 12 million jobs • About 40,000 jobs per day
    • What is YARN?
    • YARN (Yet Another Resource Negotiator) Architecture
    • MRv1 • JobTracker • TaskTracker • YARN • ResourceManager • NodeManager • ApplicationMaster • Job History Server (job log history cannot be viewed unless it is installed)
    • Note!!!
    • Use Hadoop 2.4.0 or later!!!
    • Versions that must not be used • Apache Hadoop 2.2.0 • Apache Hadoop 2.3.0 • HDP 2.0 (2.2.0 based)
    • Current versions • Apache Hadoop 2.4.1 • CDH 5.0.2 (2.3.0 based, with patches) • HDP 2.1 (2.4.0 based)
    • Why should they not be used? • Capacity Scheduler • There is a bug • Fair Scheduler • There is a bug
    • What are the bugs? • Each scheduler can cause a deadlock
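The version advice above can be turned into a small check. This is only a sketch: the function names are hypothetical, and it applies to plain Apache Hadoop version strings. Vendor builds are a separate case, since the slides list CDH 5.0.2 (2.3.0 based, with patches) as a current version.

```python
# Apache Hadoop releases the slides say must not be used
# (Capacity Scheduler and Fair Scheduler deadlock bugs).
VERSIONS_TO_AVOID = {(2, 2, 0), (2, 3, 0)}

# The slides recommend Hadoop 2.4.0 or later.
MINIMUM_VERSION = (2, 4, 0)


def parse_version(text):
    """Turn a version string such as '2.4.1' into the tuple (2, 4, 1)."""
    return tuple(int(part) for part in text.split("."))


def is_recommended(apache_version):
    """Return True if a plain Apache Hadoop version follows the advice."""
    version = parse_version(apache_version)
    return version not in VERSIONS_TO_AVOID and version >= MINIMUM_VERSION


for candidate in ("2.2.0", "2.3.0", "2.4.0", "2.4.1"):
    print(candidate, is_recommended(candidate))
```

Comparing tuples of integers avoids the trap of comparing version strings lexically, where "2.10.0" would sort before "2.4.0".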
    • Distribution • CDH 5.0.2 • Red Hat/CentOS/Oracle 5 • Red Hat/CentOS/Oracle 6 • Ubuntu/Debian • HDP 2.1 • Red Hat/CentOS/SLES (64-bit) • (Ubuntu 12 packages are already in the repository) • Windows Server 2008 & 2012
    • Several configuration files have changed (from MRv1 to YARN) • Reference: http://goo.gl/vBIYQP
    • Deprecated Properties
    • Other notes on configuration files • hadoop-conf-pseudo does not work • Some mistakes, e.g. yarn.nodemanager.aux-services: mapreduce.shuffle -> mapreduce_shuffle • 2.2.0 and 2.4.0 • There are some differences between them
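The aux-services mistake mentioned above is easy to detect mechanically. The sketch below is illustrative only: the sample XML and the function name are hypothetical, and it assumes the usual Hadoop layout of `<property>` elements with `<name>` and `<value>` children.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml content that still uses the old service name.
SAMPLE_YARN_SITE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>"""

OLD_NAME = "mapreduce.shuffle"
NEW_NAME = "mapreduce_shuffle"


def uses_old_shuffle_name(xml_text):
    """Return True if yarn.nodemanager.aux-services lists the old name."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name != "yarn.nodemanager.aux-services":
            continue
        value = prop.findtext("value", default="")
        # The value may be a comma-separated list of services.
        services = [service.strip() for service in value.split(",")]
        return OLD_NAME in services
    return False


fixed_yarn_site = SAMPLE_YARN_SITE.replace(OLD_NAME, NEW_NAME)
print(uses_old_shuffle_name(SAMPLE_YARN_SITE))
print(uses_old_shuffle_name(fixed_yarn_site))
```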
    • What should we do? • Copy the configuration files from the CDH VM and HDP VM • Use Ambari or Cloudera Manager • Work hard on it yourself!
    • Slots have changed (from MRv1 to YARN) • MRv1 • map slot, reduce slot • YARN (MRv2) • resource (container)
    • MRv1 • mapred-site.xml • mapred.tasktracker.map.tasks.maximum • mapred.tasktracker.reduce.tasks.maximum • scheduler.xml • maxMaps, minMaps • maxReduces, minReduces
    • YARN (MRv2) • yarn-site.xml • yarn.nodemanager.resource.memory-mb • (yarn.nodemanager.vmem-pmem-ratio) • (yarn.scheduler.minimum-allocation-mb) • mapred-site.xml • yarn.app.mapreduce.am.resource.mb • mapreduce.map.memory.mb • mapreduce.reduce.memory.mb • fair-scheduler.xml • maxResources, minResources
    • YARN Resource Management • yarn.nodemanager.resource.memory-mb => Memory that the NodeManager can allocate to containers • yarn.app.mapreduce.am.resource.mb => Memory that the ApplicationMaster uses • mapreduce.map.memory.mb => Memory that each Map task uses • mapreduce.reduce.memory.mb => Memory that each Reduce task uses
    • YARN Resource Example • yarn.nodemanager.resource.memory-mb = 4096 • yarn.app.mapreduce.am.resource.mb = 1024 • mapreduce.map.memory.mb = 1024 • mapreduce.reduce.memory.mb = 2048 • MRv2 Application • ApplicationMaster => 1 • Mapper => 3 • Reducer => 1
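The counts in this example follow from simple division. A minimal sketch, assuming a single node, one ApplicationMaster, and memory as the only constraint (it ignores vcores and rounding to yarn.scheduler.minimum-allocation-mb); the function name is hypothetical:

```python
def containers_that_fit(node_memory_mb, am_memory_mb, task_memory_mb):
    """Count task containers that fit on a node next to one ApplicationMaster."""
    remaining_mb = node_memory_mb - am_memory_mb
    return remaining_mb // task_memory_mb


node_memory_mb = 4096    # yarn.nodemanager.resource.memory-mb
am_memory_mb = 1024      # yarn.app.mapreduce.am.resource.mb
map_memory_mb = 1024     # mapreduce.map.memory.mb
reduce_memory_mb = 2048  # mapreduce.reduce.memory.mb

mappers = containers_that_fit(node_memory_mb, am_memory_mb, map_memory_mb)
reducers = containers_that_fit(node_memory_mb, am_memory_mb, reduce_memory_mb)

print(mappers)   # 3: (4096 - 1024) // 1024
print(reducers)  # 1: (4096 - 1024) // 2048
```

After the ApplicationMaster takes 1024 MB, 3072 MB remain: enough for three 1024 MB mappers, or for one 2048 MB reducer with 1024 MB left over.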
    • YARN Resource Example • In addition to this (e.g. Fair Scheduler): • minResources • maxResources • maxRunningApps • schedulingPolicy
    • Fair Scheduler Changes • In addition to this (e.g. Fair Scheduler): • pool -> queue • user.maxRunningJobs -> user.maxRunningApps • userMaxJobsDefault -> userMaxAppsDefault • etc.
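The renames listed on this slide could be applied to an existing allocation file with a short script. This is a sketch under assumptions: the sample file, its values, and the function name are made up for illustration, and only the three renames named on the slide are handled. Settings such as maxMaps or minMaps are not simple renames (they become maxResources and minResources with a different value format) and are left alone here.

```python
import xml.etree.ElementTree as ET

# Element renames named on the slide (MRv1 Fair Scheduler -> YARN Fair Scheduler).
RENAMES = {
    "pool": "queue",
    "maxRunningJobs": "maxRunningApps",
    "userMaxJobsDefault": "userMaxAppsDefault",
}

# Hypothetical MRv1-style allocation file.
MRV1_ALLOCATIONS = """<allocations>
  <pool name="sample">
    <weight>2.0</weight>
  </pool>
  <user name="alice">
    <maxRunningJobs>5</maxRunningJobs>
  </user>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>"""


def rename_elements(xml_text):
    """Return the XML with old element names replaced by the new ones."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = RENAMES.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


yarn_allocations = rename_elements(MRV1_ALLOCATIONS)
print(yarn_allocations)
```

Renaming by element tag rather than by text substitution keeps attribute values and element text untouched, so a user or queue whose name happens to contain "pool" is not rewritten.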
    • yarn.nodemanager.resource.memory-mb
    • YARN Scheduler Management
    • YARN Resource Management • e.g. use the hdp-configuration-utils.py script: http://goo.gl/L2hxyq • Use Ambari: http://ambari.apache.org/ (Ubuntu 12 is not supported yet; support is coming soon)
    • YARN Container Executor • DefaultContainerExecutor • Launches containers as processes • Same as the conventional approach (MRv1) • LinuxContainerExecutor • Linux only • Some restrictions • cgroups, etc.
    • YARN Directory Structure • MRv1 • Initial directories need to be set up • YARN • Initial directories need to be set up • There are changes from MRv1 (e.g. /tmp/hadoop-yarn/)
    • What should we do? • Refer to the HDFS directories of the CDH VM and HDP VM • Use Ambari or Cloudera Manager • Work hard on it yourself!
    • Enjoy YARN!!!
    • We are hiring!!!
    • Thanks!!!