NewsPass Crawler Architecture and Microservices


An explanation of the crawler architecture behind the app "NewsPass".

Published in: Technology

  1. 2016/08/21 #crawler_ops @mosa_siru
  2. @mosa_siru • 2
  3. @mosa_siru as engineer • DeNA • Gunosy • CTO
  4. 1. 2. 3. 4. 5.
  5. • 2016/06 KDDI
  6. • 匠 • 匠
  7. • RSS • t2small 2 ( )
  8. • XML • 1 • DB s3
  9. • RSS2.0, Atom, RDF GunosyFeed Ver. 2
  10.
  11. • JobQueue (Python Celery) • Celery Flower
  12. Celery Flower
  13. Scheduler
  14. Scheduler: • 30 • Fetcher enqueue • HTTP
  15. Fetcher
  16. Fetcher: • XML • XML hash hash • XML s3 Up Parser enqueue • s3
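The Fetcher slide survives only as keywords (XML, hash, s3, Parser enqueue). One plausible reading is that the Fetcher hashes each fetched XML body and uploads it and enqueues a parse job only when the hash differs from the previous one. The sketch below assumes that reading. The class name `FeedFetcher`, the in-memory dict standing in for s3, the list standing in for the Celery queue, and the choice of SHA-256 are all hypothetical and do not come from the slides.

```python
import hashlib


class FeedFetcher:
    """Skips feeds whose XML body is unchanged since the last fetch."""

    def __init__(self):
        self.last_hash = {}     # feed URL -> hash of the last stored XML (hypothetical)
        self.storage = {}       # stand-in for s3: key -> XML bytes
        self.parser_queue = []  # stand-in for the Parser job queue

    def handle(self, feed_url: str, xml_body: bytes) -> bool:
        """Store and enqueue the XML only if it changed. Returns True if enqueued."""
        digest = hashlib.sha256(xml_body).hexdigest()  # hash choice is an assumption
        if self.last_hash.get(feed_url) == digest:
            return False  # unchanged: nothing to upload or parse
        key = f"feeds/{digest}.xml"  # hypothetical key layout
        self.storage[key] = xml_body
        self.parser_queue.append({"feed_url": feed_url, "key": key})
        self.last_hash[feed_url] = digest
        return True
```

Under this design an unchanged feed costs one HTTP request and one hash comparison, and no parse work is queued.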
  17. Parser (Updater)
  18. Parser: • XML parse Python feedparser • RSS2.0, Atom, RDF parse • XML • ( ) • etc…
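The Parser slide names Python feedparser as the parsing library. To stay dependency-free, the sketch below does not use feedparser. It swaps in the standard library's `xml.etree.ElementTree` and handles only plain RSS 2.0 items, so it illustrates the parse step rather than reproducing the talk's code. The function name `parse_rss2` and the chosen fields are assumptions.

```python
import xml.etree.ElementTree as ET


def parse_rss2(xml_text: str) -> list:
    """Extract title, link and guid from each <item> of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            # findtext returns None for a missing child, so fall back to "".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "guid": (item.findtext("guid") or "").strip(),
        })
    return entries
```

Atom and RDF feeds use XML namespaces and different element names, which is one reason a dedicated library such as feedparser is convenient in practice.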
  19. Updater: • parse DB s3 up • DB insert/update/delete • update update • mysql insert on duplicate key update update (1 1 update )
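The Updater slide mentions `mysql insert on duplicate key update`. As an illustration, the helper below builds such a statement as a string. The table and column names in the usage are hypothetical, and the `%s` placeholders assume a driver with the "format" parameter style (as PyMySQL and mysqlclient use). Nothing here is executed against a database.

```python
def build_upsert(table: str, columns: list, key_columns: list) -> str:
    """Build a MySQL INSERT ... ON DUPLICATE KEY UPDATE statement.

    Non-key columns are overwritten when a row with the same unique key exists.
    """
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(
        f"{col} = VALUES({col})" for col in columns if col not in key_columns
    )
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


# Hypothetical table and columns, for illustration only.
sql = build_upsert("articles", ["url", "title", "content_hash"], ["url"])
```

One statement of this shape covers both the insert and the update case, so the caller does not need a separate existence check per row. Recent MySQL 8.0 releases deprecate `VALUES(col)` in this position in favor of row aliases, so treat the exact syntax as version-dependent.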
  20. • url or guid • hash DB hash • url feed, title, ( guid …)
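Slide 20 appears to describe how articles are identified ("url or guid") and how changes are detected (a hash compared with the hash in the DB, computed from fields such as url, feed and title). The following sketch is one way to read it. The precedence of guid over url, the field list, the separator, and all function names are assumptions, not details taken from the slides.

```python
import hashlib


def article_key(entry: dict) -> str:
    """Identify an article by its guid when present, otherwise by its URL (assumed order)."""
    return entry.get("guid") or entry["url"]


def article_hash(entry: dict, fields=("url", "feed", "title")) -> str:
    """Hash the selected fields so a changed article yields a different digest."""
    # \x1f (unit separator) keeps adjacent fields from running together.
    joined = "\x1f".join(str(entry.get(f, "")) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def decide_action(entry: dict, stored_hashes: dict) -> str:
    """Return 'insert', 'update' or 'skip' by comparing with the stored hash."""
    key = article_key(entry)
    if key not in stored_hashes:
        return "insert"
    return "skip" if stored_hashes[key] == article_hash(entry) else "update"
```

Here `stored_hashes` stands in for a lookup of previously saved hashes; comparing digests avoids rewriting rows whose content has not changed.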
  21. Content Generator
  22. Content Generator: • HTML (js ) • URL URL (./hoge.html ) • css path • (img ) s3 URL • hash s3
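The Content Generator slide mentions URLs such as `./hoge.html` and css paths, which suggests that relative links in article HTML are rewritten to absolute ones. Assuming that reading, the standard library's `urljoin` performs the resolution. The function name `absolutize` and the example.com URLs in the usage are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(page_url: str, link: str) -> str:
    """Resolve a possibly relative link against the URL of the page it appeared on."""
    return urljoin(page_url, link)


# Hypothetical page URL, for illustration only.
page = "http://example.com/news/index.html"
relative = absolutize(page, "./hoge.html")
rooted = absolutize(page, "/css/site.css")
```

Links that are already absolute pass through unchanged, so the same call can be applied to every `href` and `src` found in the page.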
  23. Enclosure Fetcher
  24. Enclosure Fetcher: • s3 DB URL • hash s3
  25. HTTP Proxy: • HTTP Request Proxy (Squid) • Response Header • IP (Elastic IP)
  26. Image Cropper
  27. Image Cropper: • Microsoft FaceDetection API • crop
  28. Akamai Image Converter • Akamai URL • Smalllight
  29. Crop • https://…/mychild.png • https://…/mychild.png?crop=200:200;220,210
  30. Quality • https://…/mychild.png?crop=200:200;220,210 • https://…/mychild.png?crop=200:200;220,210&output-quality=10
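The Crop and Quality slides show image transformations expressed purely as query parameters (`crop=200:200;220,210` and `output-quality=10`). A small helper for composing such URLs might look like the sketch below. The function name `image_url` and the example.com host are hypothetical, and the slides do not state what each of the four crop numbers means, so the code only reproduces the shape of the example.

```python
def image_url(base_url: str, crop=None, quality=None) -> str:
    """Append crop and output-quality query parameters to an image URL.

    crop is a 4-tuple (a, b, c, d) rendered as "a:b;c,d", matching the shape
    of the slide's example; the meaning of each number is not given there.
    """
    params = []
    if crop is not None:
        a, b, c, d = crop
        params.append(f"crop={a}:{b};{c},{d}")
    if quality is not None:
        params.append(f"output-quality={quality}")
    return base_url + ("?" + "&".join(params) if params else "")


# Hypothetical host; the slides elide the real one.
base = "https://example.com/mychild.png"
cropped = image_url(base, crop=(200, 200, 220, 210))
small = image_url(base, crop=(200, 200, 220, 210), quality=10)
```

Because the transformation lives in the URL, the crawler only has to store the original image plus the crop coordinates, and variants can be requested on demand.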
 

  31. Title Break Calculator
  32. Title Break Calculator • API ( )
  33. Indexer
  34. Indexer • index Indexer API • Indexer API ElasticSearch
  35. Classifier
  36. Classifier • Classifier API • Classifier API
  37.
  38. • API • http://www.slideshare.net/mosa_siru/ss-64839846
  39. • Article DB write/read • API • Article API • Article DB read Article API
  40. • API
  41. Cache Invalidation • Article API memcached invalidation (memd ) • URL, URL Akamai query string https://…/mychild.png?crop=200:200;220,210&output-quality=10&t=1468557607
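The Cache Invalidation slide ends with an image URL carrying an extra `t=1468557607` parameter. One reading is that a timestamp is appended to the query string so that the CDN sees a new URL once the content changes. The sketch below assumes that reading and that `t` is a Unix timestamp. The function name `with_cache_buster` and the example.com host are hypothetical.

```python
import time


def with_cache_buster(url: str, timestamp=None) -> str:
    """Append a t=<unix time> parameter so a cache keyed on the full URL misses."""
    if timestamp is None:
        timestamp = int(time.time())
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}t={timestamp}"


# Hypothetical host; the timestamp is the one shown on the slide.
busted = with_cache_buster(
    "https://example.com/mychild.png?crop=200:200;220,210&output-quality=10",
    1468557607,
)
```

Using the article's last-update time as the timestamp, rather than the current time, would keep the URL stable between updates so cached copies stay useful.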
  42.
  43. • URL
  44. XML • XML • XML • Crawler API
  45. XML
  46. DEMO
  47. • XML XML • Slack
  48. • XML
  49. • 匠
  50. Gunosy @mosa_siru
