Stream Processing in SmartNews #jawsdays

1. Stream Processing in SmartNews Takumi Sakamoto 2016.03.12

2. Takumi Sakamoto @takus 😍 = ⚽ ✈ 📷

3. http://bit.ly/1MCOyBX JAWSDAYS 2015

4. AWS Case Study http://aws.amazon.com/solutions/case-studies/smartnews/

5. What is SmartNews? • News Discovery App for Mobile • Launched in 2012 • 15M+ Downloads Worldwide https://www.smartnews.com/en/

6. How Do We Deliver News? Internet → Algorithms → Trending News

7. Why Stream Processing?

8. Today’s News Is Tomorrow’s Fish and Chips Wrapper

9. ↑ Yesterday's News http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/

10. News Articles Lifetime https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/

11. Speed Matters for Us

12. System Overview

13. News Delivery Pipeline [diagram labels] Crawler Internet Analyzer Indexer CloudSearch API Search API Gateway Mobile App API Tracker DynamoDB • Index System • Feedback System • 1 minute • 5 minutes

14. Index System • Crawler • collect news articles & social signals • Analyzer • extract title, content, thumbnail... • classify topics (sports, politics, technology...) • Indexer • upload article metadata into CloudSearch

15. Feedback System • API Tracker • receives user activity logs from the mobile app • Spark Streaming • generates various metrics for news ranking • stores the metrics in DynamoDB

16. How Do We Glue the Services Together?

17. Ref: Amazon Kinesis: Real-time Streaming Big data Processing Applications

18. Why Kinesis Streams? • Fully managed service • Multiple consumer applications • Reasonable pricing

19. Multiple Consumers • Kinesis Stream → Spark on EMR, AWS Lambda • Data Scientist: "I want to consume streaming data with Spark" • Application Engineer: "I want to add a streaming monitor with Lambda" • Empowers Engineers to Do Trial and Error

20. News Delivery Pipeline [diagram labels] Crawler Internet Analyzer Indexer CloudSearch API Search API Gateway Mobile App API Tracker DynamoDB • Kinesis Stream • Kinesis Stream • Kinesis Stream

21. Data & Its Numbers • User activities • ~100 GB per day (compressed) • 60+ record types • User demographics, configurations, etc. • 15M+ records • Article metadata • 100K+ records per day

22. How Do We Produce/Consume Kinesis Streams?

23. Index System: Collect, Analyze and Index Articles with Kinesis Libraries (KPL & KCL) [diagram: Crawler (KPL ×3) → Analyzer (KCL ×3, KPL ×3) → Indexer (KCL ×3) → CloudSearch]

24. Kinesis Libraries • Kinesis Producer Library (KPL) • puts records into a stream • asynchronous architecture (buffers records) • Kinesis Client Library (KCL) • consumes and processes data from a stream • handles complex tasks associated with distributed computing
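The buffering idea in the KPL bullet can be sketched roughly as follows. This is not the KPL API: `BufferingProducer`, `max_batch`, and `send_batch` are hypothetical names chosen for illustration, and the sketch flushes synchronously, whereas the real library does its sending in the background.

```python
from typing import Callable, List


class BufferingProducer:
    """Toy producer that buffers records and sends them in batches.

    Hypothetical illustration only; this is not the KPL interface.
    """

    def __init__(self, send_batch: Callable[[List[bytes]], None], max_batch: int = 3) -> None:
        self._send_batch = send_batch  # callback that ships one batch downstream
        self._max_batch = max_batch    # flush once this many records are buffered
        self._buffer: List[bytes] = []

    def put(self, record: bytes) -> None:
        """Buffer a record; flush when the buffer is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single batch."""
        if self._buffer:
            self._send_batch(self._buffer)
            self._buffer = []  # start a fresh list so sent batches are not mutated


# Usage: seven records with max_batch=3 leave the producer as three batches.
sent: List[List[bytes]] = []
producer = BufferingProducer(sent.append, max_batch=3)
for i in range(7):
    producer.put(f"event-{i}".encode())
producer.flush()  # ship the final partial batch
print([len(batch) for batch in sent])
```

Batching trades a little latency for far fewer requests, which is the usual reason to buffer on the producer side.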

25. KPL/KCL Monitoring • KPL and KCL publish custom CloudWatch metrics • Key Metrics for KPL • UserRecordsReceived, UserRecordsPending • AllErrors • Key Metrics for KCL • RecordsProcessed • MillisBehindLatest • RecordProcessor.processRecords.Time https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html

26. Monitoring with Datadog

27. Feedback System: Generate Metrics by User Clusters for Ranking Articles [diagram labels] Amazon CloudSearch API Search API Gateway Kinesis Stream Amazon S3 Hive / Spark DynamoDB User Clusters User Feedback API Tracker Amazon S3 Offline ETL / Machine Learning Push Notification Article Metadata Metrics by Cluster

28. Why Metrics by Cluster? • Consider Each User's Interests • Ensure Diversity to Avoid the Filter Bubble https://en.wikipedia.org/wiki/Filter_bubble
[diagram: User → GET /news/sports → API; the API combines the Article Inventory (Amazon CloudSearch) with Metrics by User Cluster (DynamoDB)]
User: userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball
Article                      raw score   weight   score
San Francisco Giants …       3.5         1        3.5
New York Yankees …           6.2         0.6      3.72
FIFA World Cup …             20.4        0.2      4.08
U.S. Open Championships …    8.4         0.2      1.68
(score = raw score × weight)
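The arithmetic on this slide can be checked with a short sketch. It assumes the final score is simply the raw score multiplied by the cluster weight, which matches the Giants, World Cup, and U.S. Open rows; under that assumption the Yankees row works out to 6.2 × 0.6 = 3.72. The names `raw_scores`, `cluster_weights`, and `rank_for_cluster` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Raw scores and per-cluster weights copied from the slide's example.
raw_scores: Dict[str, float] = {
    "San Francisco Giants": 3.5,
    "New York Yankees": 6.2,
    "FIFA World Cup": 20.4,
    "U.S. Open Championships": 8.4,
}
cluster_weights: Dict[str, float] = {
    "San Francisco Giants": 1.0,
    "New York Yankees": 0.6,
    "FIFA World Cup": 0.2,
    "U.S. Open Championships": 0.2,
}


def rank_for_cluster(raw: Dict[str, float], weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (article, weighted score) pairs, highest score first.

    Assumption: score = raw score * weight; articles without a weight score 0.
    """
    scored = {article: round(score * weights.get(article, 0.0), 2) for article, score in raw.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)


ranked = rank_for_cluster(raw_scores, cluster_weights)
for article, score in ranked:
    print(f"{score:5.2f}  {article}")
```

The weights keep a globally popular article (the World Cup) near the top while lifting items that match the user's cluster, which is how the slide's two goals (personal interests and diversity) meet.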

29. Input Data by Fluentd • Forwarder (running on each instance) • archives events to S3 • forwards events to aggregators • Aggregator (HA Configuration※) • puts events into Kinesis Stream • alerts and reports (not covered here) ※ http://docs.fluentd.org/articles/high-availability

30. Example Configurations http://docs.fluentd.org/articles/kinesis-stream

Forwarder:

    <source>
      @type tail
      tag smartnews.user_activity
      ...
    </source>
    <match smartnews.user_activity>
      @type copy
      <store>
        @type s3
        ...
      </store>
      <store>
        @type forward
        ...
      </store>
    </match>

Aggregator:

    <source>
      @type forward
      ...
    </source>
    <match smartnews.user_activity>
      @type copy
      <store>
        @type kinesis
        ...
      </store>
      <store>
        ...
      </store>
    </match>

31. Offline ETL Flow • Transform Text Files into Columnar Files • Various Machine Learning Tasks • Airflow: Manage Workflow
[diagram: Activities (API) → Amazon S3 → Hive on EMR → Amazon S3 → Spark on EMR; Users (RDS)]

Activities:

    {
      "timestamp": 1453161447,
      "userId": 1234,
      "platform": "ios",
      "edition": "ja_JP",
      "action": "viewArticle",
      "data": { "articleId": 1234, "duration": 30.2 }
    }

Users:

    userId, age, gender, location, …
    1234, 28, M, Tokyo, …
    1235, 32, F, Nagano, …
    1240, 18, F, Kyoto, …
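To make "text files into columnar files" concrete, here is a minimal sketch that flattens activity events like the one on this slide and regroups them column by column. The production flow runs on Hive on EMR; this standalone version only illustrates the idea, and `COLUMNS`, `flatten_event`, and `to_columns` are hypothetical names.

```python
import json
from typing import Any, Dict, List

# Two sample events in the slide's format (the second one is invented for the demo).
RAW_LINES = [
    '{"timestamp": 1453161447, "userId": 1234, "platform": "ios", "edition": "ja_JP",'
    ' "action": "viewArticle", "data": {"articleId": 1234, "duration": 30.2}}',
    '{"timestamp": 1453161450, "userId": 1235, "platform": "android", "edition": "en_US",'
    ' "action": "viewArticle", "data": {"articleId": 5678, "duration": 12.0}}',
]

COLUMNS = ["timestamp", "userId", "platform", "edition", "action", "articleId", "duration"]


def flatten_event(line: str) -> Dict[str, Any]:
    """Parse one JSON line and lift the nested "data" fields to the top level."""
    event = json.loads(line)
    flat = {key: value for key, value in event.items() if key != "data"}
    flat.update(event.get("data", {}))
    return flat


def to_columns(lines: List[str]) -> Dict[str, List[Any]]:
    """Turn row-oriented JSON lines into a column-oriented dict of lists."""
    rows = [flatten_event(line) for line in lines]
    return {column: [row.get(column) for row in rows] for column in COLUMNS}


columns = to_columns(RAW_LINES)
print(columns["articleId"])
print(columns["duration"])
```

A columnar layout lets a query that only needs, say, `duration` read that one list instead of parsing every full record.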

32. Airflow: Workflow Engine • Execute Task A -> Task B -> Task C, D

    5 * * * * app hive -f query_1.hql
    15 * * * * app hive -f query_2.hql
    30 * * * * app hive -f query_3.hql
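The "Task A -> Task B -> Task C, D" ordering is the kind of dependency resolution a workflow engine performs. The sketch below shows that ordering with Python's standard `graphlib` module (Python 3.9+); it is not Airflow's API, and the task names are placeholders.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each task maps to the set of tasks that must finish before it starts.
dependencies: Dict[str, Set[str]] = {
    "task_b": {"task_a"},
    "task_c": {"task_b"},
    "task_d": {"task_b"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect tasks in "waves": everything in one wave can run in parallel.
waves: List[List[str]] = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)                 # mark them finished to unlock successors

print(waves)
```

Compared with fixed cron offsets such as the three entries on this slide, explicit dependencies start a task as soon as its upstream task finishes rather than at a guessed time.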

33. Spark Streaming • Split Streams into Minutely RDDs • Join Each Minutely RDD with the Pre-Computed RDD [diagram: Kinesis Stream (Shard 1, Shard 2, Shard 3) → DStream 1, DStream 2, DStream 3 → Minutely RDD; Minutely RDD + Pre-Computed RDD (Female, Male, Teen, ...) → Minutely Metrics by User Cluster → DynamoDB]
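The join described on this slide can be sketched without Spark: one minute of events is joined against a pre-computed user-to-cluster table, then counted per article and cluster. The data, the use of a simple count as the metric, and the names `user_clusters`, `minute_events`, and `metrics_by_cluster` are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Pre-computed table: which cluster each user belongs to (illustrative data).
user_clusters: Dict[int, str] = {1000: "Male", 1001: "Female", 1002: "Teen"}

# One minute of (userId, articleId) view events (illustrative data).
minute_events: List[Tuple[int, str]] = [
    (1000, "article-1"),
    (1001, "article-1"),
    (1002, "article-2"),
    (1000, "article-1"),
    (9999, "article-2"),  # unknown user: dropped by the join
]


def metrics_by_cluster(
    events: List[Tuple[int, str]], clusters: Dict[int, str]
) -> Dict[Tuple[str, str], int]:
    """Inner-join events with the cluster table and count views per (article, cluster)."""
    counts: Counter = Counter()
    for user_id, article_id in events:
        cluster = clusters.get(user_id)
        if cluster is None:
            continue  # no cluster for this user, so the event does not join
        counts[(article_id, cluster)] += 1
    return dict(counts)


minute_metrics = metrics_by_cluster(minute_events, user_clusters)
print(minute_metrics)
```

In the pipeline on this slide the same step runs once per minutely RDD, and the resulting metrics are written to DynamoDB.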

34. Monitor Spark Streaming Spark UI is Useful for Monitoring

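35. Integrate with CloudWatch • Set an Alert on SchedulingDelay

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

    // Relays Spark Streaming batch events to CloudWatch as custom metrics.
    // putMetricToCloudWatch(name: String, value: Double) is not shown on the slide.
    class CloudWatchRelay(conf: SparkConf) extends StreamingListener {

      override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
        putMetricToCloudWatch("BatchStarted", 1.0)
      }

      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        putMetricToCloudWatch("BatchCompleted", 1.0)
        putMetricToCloudWatch("BatchRecordsProcessed", info.numRecords.toDouble)
        // The delay fields are Option[Long] in milliseconds; report them only when present.
        info.processingDelay.foreach { delay => putMetricToCloudWatch("ProcessingDelay", delay.toDouble) }
        info.schedulingDelay.foreach { delay => putMetricToCloudWatch("SchedulingDelay", delay.toDouble) }
        info.totalDelay.foreach { delay => putMetricToCloudWatch("TotalDelay", delay.toDouble) }
      }
    }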
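Why alert on SchedulingDelay in particular? When a batch takes longer to process than the batch interval, the next batch has to wait, and the wait accumulates until processing catches up. The sketch below is a simplified model of that behaviour (one batch at a time, a fixed interval); `scheduling_delays` and the sample numbers are illustrative, not measurements from the talk.

```python
from typing import List


def scheduling_delays(processing_times: List[float], batch_interval: float) -> List[float]:
    """Return how long each batch waits before it can start.

    Simplified model: batches arrive every `batch_interval` seconds and are
    processed strictly one after another.
    """
    delays: List[float] = []
    backlog = 0.0  # how far behind schedule we are when the next batch arrives
    for processing_time in processing_times:
        delays.append(backlog)
        # Falling behind by (processing_time - interval), never below zero.
        backlog = max(0.0, backlog + processing_time - batch_interval)
    return delays


# 60-second batches; the second and third batches overrun the interval.
delays = scheduling_delays([50.0, 70.0, 80.0, 40.0, 40.0], batch_interval=60.0)
print(delays)
```

A scheduling delay that keeps growing means the job cannot keep up with the input rate, which makes it a natural metric to alert on.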

36. Summary

37. Summary • Fast & stable stream processing is crucial for SmartNews • lifetime of news is very short • process events as fast as possible • Kinesis Stream plays an important role • one-click provision & scale • empowers engineers to do trial & error

38. Discuss More? Join Our Free Lunch in Tokyo Office!!

39. We’re hiring!!! ML/NLP engineer Site reliability engineer Web application engineer iOS/Android engineer Ad engineer http://about.smartnews.com/en/careers/

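40. See Also • The Platform Behind SmartNews's Web Mining (SmartNews の Webmining を支えるプラットフォーム) • Integrating Stream Processing and Offline Processing (Stream 処理と Offline 処理の統合) • Building a Sustainable Data Platform on AWS • AWS meetup "Apache Spark on EMR"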

41. PipelineDB

42. PipelineDB • OSS & enterprise streaming SQL database • PostgreSQL compatible • connects to Chartio 😍 • joins streams to normal PostgreSQL tables • Supports probabilistic data structures • e.g. HyperLogLog https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/

43. Realtime Monitoring • Raw records are discarded soon after being consumed by a Continuous View • Continuous Views are incrementally updated in real time • Access Continuous Views with a PostgreSQL client [diagram labels: API Gateway, Record, Stream, Continuous View (×3), PipelineDB, Chartio, AWS Lambda, Slack]

44. Continuous View

    -- Calculate unique users seen per media each day
    -- Using only a constant amount of space (HyperLogLog)
    CREATE CONTINUOUS VIEW uniques AS
      SELECT day(arrival_timestamp),
             substring(url from '.*://([^/]*)') AS hostname,
             COUNT(DISTINCT user_id::integer)
      FROM activity_stream
      GROUP BY day, hostname;

    -- How many impressions have we served in the last five minutes?
    CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
      SELECT COUNT(*) FROM imps_stream;

    -- What are the 90th, 95th, 99th percentiles of request latency?
    -- (percentile_cont takes fractions between 0 and 1)
    CREATE CONTINUOUS VIEW latency AS
      SELECT percentile_cont(array[0.90, 0.95, 0.99])
             WITHIN GROUP (ORDER BY latency::integer)
      FROM latency_stream;
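The "constant amount of space" claim for unique counts rests on HyperLogLog. Below is a minimal textbook-style sketch of the idea, not PipelineDB's implementation: the class name, the choice of SHA-1 as the hash, and the precision `p=12` (4,096 registers, roughly 1.6% standard error) are assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: estimates distinct counts with a fixed number of registers."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of hash bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, valid for m >= 128

    def add(self, item: object) -> None:
        # Take 64 bits of a SHA-1 digest as the hash value.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                 # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest` (leading zeros + 1).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        # Harmonic-mean estimate over all registers.
        estimate = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# Usage: 20,000 distinct users, each seen twice; duplicates do not change the estimate.
hll = HyperLogLog()
for _ in range(2):
    for user_id in range(20000):
        hll.add(f"user-{user_id}")
estimate = hll.count()
print(round(estimate))
```

The memory used is the 4,096 registers regardless of how many users pass through, at the cost of a small estimation error.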

45. Dashboard in Chartio 1. Build a query (Drag & Drop / SQL) 2. Add steps (filter, sort, modify) 3. Select a visualization (table, graph)


SmartNews, Inc.
