dots. Conference Spring 2016: Technologies Supporting Large-Scale Web Services (mercari)


Published on

Mercari's slides from the session held on Sunday, February 28, 2016, 10:30 to 14:15. Presented under the title "Mercari DevOps Story: Our Battle Has Only Just Begun".

Published in: Engineering


  1. Mercari’s Never Ending Improvements. @siroken3 / KENICHI Sasaki, SRE Team @ Mercari, Inc. 2016/02
  2. Mercari - Your Friendly Mobile Marketplace https://www.mercari.com/
  3. Self Introduction
     • Joined Mercari in July 2014
     • SRE (Site Reliability Engineer)
     • Role: Development Productivity
  4. What is SRE?
     • Site Reliability Engineer
     • A role introduced at Google: “Software Engineers responsible for ensuring that all of Google’s services are super reliable and super fast, all of the time.”
     • Mercari SRE team members are responsible for:
       • Availability
       • Performance
       • Building and operating the log analytics platform
       • Server provisioning and deployment
       • Security
       • Building the development environment
  5. JP Growth https://pixabay.com/photo-918965/
  6. Download Numbers (chart): July 2014: 4M; Feb 2016: 24M (+500%)
  7. Req/Sec (HTTPS: Peak) (chart): July 2014: 3K; Feb. 2016: 20K (+560%)
  8. Servers (APP) (chart): July 2014 vs. Feb. 2016: +50%
  9. Servers (DB) (chart): July 2014 vs. Feb. 2016: +9
  10. SRE Members (chart): July 2014 vs. Feb. 2016
  11. Infrastructure Overview
     • JP: SAKURA Internet Ishikari DC (dedicated server + cloud)
     • US: AWS Oregon
     • Shared:
       • Akamai
       • Amazon Route53, S3, CloudFront
       • Google BigQuery
  12. Infrastructure 2014 (diagram). Labels: internet, global, private, nat, app, mail, DB, Redis, batch, Q4M, Worker, search
  13. Infrastructure 2016 (diagram). Labels: internet, global, private, nat, lb, lb_pascal, lb_push, lb_search, lb_general, app, DB, memcached, batch, Q4M, Worker, push, search, deploybase, monitor, dns, logview, cep, logbatchlog
  14. Software (2016)
     • nginx
     • PHP 5.6
     • Apache + mod_php
     • Go
     • Node
     • MySQL
     • Q4M
     • memcached
     • Solr
     • Gaurun
     • fluentd
     • Norikra
     • Kibana
     • Zabbix
     • kurado
     • etc.
  15. Improve? or Crisis!
     • Continuous Increase in Access
     • Continuous Increase in Data Volume
     • Growth of Specifications
     • Unstable Deployment
  16. Continuous Increase in Access https://pixabay.com/en/traffic-rush-hour-rush-hour-urban-843309/
  17. Continuous Increase in Access: Problem
     • Lack of CPU resources
       • Slower response times
     • Lack of network bandwidth
       • Network congestion
  18. Improvement: Introduce dedicated servers
     • BEFORE
       • SAKURA Cloud
     • AFTER (example)
       • CPU: Xeon 6-core x 2
       • Mem: 32 GB
       • Disk: 240 GB SSD
  19. Improvement: Introduce an nginx-based lb
     • BEFORE
       • All httpd servers were exposed directly to the internet
       • DNS Round Robin
     • AFTER
       • nginx!
       • Reverse proxy, TLS and SPDY terminator
     • (diagram: internet, DNS RR, four lb nodes)
  20. Improvement: Continuous application tuning
     • MySQL index tuning
     • (Ex.) Large 2-dimensional array -> convert the 2nd tier to text data and parse it
     • Of course, there is no silver bullet.
  21. Improvement: Continuous application tuning
     • require_once( master_data.php ); was slow!! (master_data.php is large)
     • http://www.slideshare.net/kazeburo/big-master-data-php-blt-1
  22. Improvement: Continuous application tuning http://www.slideshare.net/kazeburo/big-master-data-php-blt-1
  23. Continuous Increase in Data Volume https://flic.kr/p/miwdvy
  24. Problem
     • Growing DB historical table records
       • Shortage of disk capacity
       • Slower item search throughput
     • Growing access logs
       • Customer service turnaround time becomes too slow
  25. Improvement: Server partitioning (MySQL)
     • DB tables are partitioned across multiple servers
     • Slave servers exist only in the main cluster
     • Using DNS RR
     • (diagram: Master, Slave and Backup nodes; clusters: Main, todolists, l2-db, cs-tool, anon-db)
  26. Improvement: Server partitioning (Solr)
     • solr
       • Master - Slave
       • latest & all cluster
     • nginx
       • load balancer
       • Lua controls cluster access
     • (diagram: app and Worker go through lb_search; each cluster has one Solr Master and two Solr Slaves)
       • Double write: updates go to both clusters
       • latest cluster: the last N days
       • all index cluster: all items
       • Search latest first; if there are not enough hits, search all
  27. Improvement (log pipeline diagram)
     • Sources: app, Worker, Batch
     • Logs: access_log, application_log, app_error_log, error_log, php_log...
     • Other labels: log, nat, BigQuery, logview (Kibana: Log Viewer), cep (Norikra: Stream Processing), Mackerel, Slack
  28. Growth of Specifications https://flic.kr/p/7RrWCg
  29. Problem: In deployment…
     • Large-scale deployment of multiple features
     • Unplanned, rushed deployment
  30. Improvements:
     • Deploy many times per day, instead of once a week
     • Deployment coordinated through both Google Calendar and chat
  31. Improvements:
  32. Improvement: Scheduled, automated deploy http://tech.mercari.com/entry/2015/10/15/183000
  33. Unstable Deployment http://popsych.org/wp-content/uploads/2015/05/jenga-tower.jpg
  34. Problem: Every deploy produced 50x responses
     • Cause
       • Inconsistent PHP OPcache
     • Result
       • Negative customer feedback
  35. Improvement: ngx_dynamic_upstream + rsync
     • ngx_dynamic_upstream
       • Dynamically attaches app servers to and detaches them from the lb
     • Using --rsync-path
       • detach from lb
       • rsync
       • attach to lb
     • (diagram: deploybase pushes to the App, Worker and Batch servers behind three lb nodes)
  36. Conclusion http://s0.geograph.org.uk/geophotos/02/95/15/2951585_5b854214.jpg
  37. We have improved continuously
     • Rome was not built in a day
     • We will keep improving
  38. References
     • Big “Master” Data (http://www.slideshare.net/kazeburo/big-master-data-php-blt-1)
     • ngx_dynamic_upstream (https://github.com/cubicdaiya/ngx_dynamic_upstream)
     • "A grown-up startup can do grown-up releases. That's right, with ChatOps." (http://tech.mercari.com/entry/2015/10/15/183000)
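
The two-cluster Solr search on slide 26 (query the latest cluster first, then fall back to the all-items cluster when there are not enough hits) can be sketched roughly as below. In the slides this routing is done by Lua inside nginx; the Python here is only an illustration. The names `search_with_fallback`, `Searcher`, `make_index` and `min_hits` are hypothetical, and whether the real system replaces or merges the two result sets is not stated, so this sketch assumes it simply re-runs the query against the all-items cluster.

```python
from typing import Callable, Dict, List

# Hypothetical type: a function that takes a query and returns matching item IDs.
Searcher = Callable[[str], List[str]]


def search_with_fallback(query: str, latest: Searcher, all_items: Searcher,
                         min_hits: int) -> List[str]:
    """Query the 'latest' cluster first; if it returns fewer than
    min_hits results, query the 'all' cluster instead (assumed behaviour)."""
    hits = latest(query)
    if len(hits) >= min_hits:
        return hits
    return all_items(query)


def make_index(docs: Dict[str, str]) -> Searcher:
    """Build a toy in-memory searcher that stands in for a Solr cluster."""
    def search(query: str) -> List[str]:
        return [doc_id for doc_id, title in docs.items() if query in title]
    return search


# Toy data: the latest cluster holds only recent items, the all cluster holds everything.
latest_cluster = make_index({"i3": "red bag", "i4": "blue shoes"})
all_cluster = make_index({"i1": "red hat", "i2": "red coat",
                          "i3": "red bag", "i4": "blue shoes"})

# Only one recent "red" item exists, so this query falls back to the all cluster.
print(search_with_fallback("red", latest_cluster, all_cluster, min_hits=2))
# One recent "blue" item is enough here, so the latest cluster answers alone.
print(search_with_fallback("blue", latest_cluster, all_cluster, min_hits=1))
```

Because most queries are satisfied by recent items, a layout like this lets the smaller latest cluster absorb the bulk of the search traffic.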
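
The deploy sequence on slide 35 (detach from lb, rsync, attach to lb) can be sketched as a per-server loop. This is a minimal illustration under stated assumptions, not Mercari's implementation: the slides drive the sequence through rsync's `--rsync-path` option, whereas this sketch uses a plain orchestration loop. `rolling_deploy` and the `detach`, `sync` and `attach` callables are hypothetical stand-ins; the sketch does not call the real ngx_dynamic_upstream API or rsync.

```python
from typing import Callable, Iterable, List

# Hypothetical type: an action performed against one app server, identified by name.
ServerAction = Callable[[str], None]


def rolling_deploy(servers: Iterable[str], detach: ServerAction,
                   sync: ServerAction, attach: ServerAction) -> List[str]:
    """Deploy to one server at a time: take it out of the load balancer,
    sync the new code, then put it back.

    If sync raises, the exception propagates and the server stays detached,
    so a half-updated server never receives traffic.
    """
    deployed: List[str] = []
    for server in servers:
        detach(server)   # stop routing requests to this server
        sync(server)     # copy the new release while it is idle
        attach(server)   # resume routing requests to it
        deployed.append(server)
    return deployed


# Stand-in actions that only record what would happen, in order.
events: List[str] = []
result = rolling_deploy(
    ["app1", "app2"],
    detach=lambda s: events.append(f"detach {s}"),
    sync=lambda s: events.append(f"rsync {s}"),
    attach=lambda s: events.append(f"attach {s}"),
)
print(events)
```

The intent is that a server handles no requests while its files are changing, which plausibly avoids the inconsistent OPcache state that slide 34 names as the cause of the 50x responses.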
