13.
News Delivery Pipeline
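(Diagram components: Internet, Crawler, Analyzer, Indexer, CloudSearch, API, Search, API Gateway, Mobile App, API Tracker, DynamoDB. The components are grouped into an Index System and a Feedback System, and the diagram carries the labels "1 minute" and "5 minute".)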
14.
Index System
• Crawler
  • collect news articles & social signals
• Analyzer
  • extract title, content, thumbnail...
  • classify topics (sports, politics, technology...)
• Indexer
  • upload article metadata into CloudSearch
15.
Feedback System
• API Tracker
  • receive users' activity logs from the mobile app
• Spark Streaming
  • generate various metrics for news ranking
  • store metrics in DynamoDB
19.
Multiple Consumers
(Diagram: one Kinesis Stream feeds two independent consumers, Spark on EMR and AWS Lambda.)
Data Scientist: "I wanna consume streaming data with Spark"
Application Engineer: "I wanna add a streaming monitor with Lambda"
Empowers Engineers to Do Trial and Error
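The idea behind multiple consumers can be sketched in a few lines: each consumer tracks its own read position, so adding a new one does not disturb the others. This is a minimal in-memory illustration, not the Kinesis API. The `Stream` class, the consumer names, and the event names are all hypothetical, and the checkpoints are kept inside the stream object only for brevity.

```python
from typing import Dict, List


class Stream:
    """Hypothetical in-memory stand-in for a single stream shard."""

    def __init__(self) -> None:
        self._records: List[str] = []
        # One read position per consumer name (assumption: kept here for brevity).
        self._checkpoints: Dict[str, int] = {}

    def put(self, record: str) -> None:
        """Append a record; records are never removed by readers."""
        self._records.append(record)

    def poll(self, consumer: str, limit: int = 10) -> List[str]:
        """Return up to `limit` unread records for this consumer only."""
        start = self._checkpoints.get(consumer, 0)
        batch = self._records[start:start + limit]
        self._checkpoints[consumer] = start + len(batch)
        return batch


stream = Stream()
for event in ("open", "click", "share"):
    stream.put(event)

# Two consumers read the same records at their own pace.
spark_batch = stream.poll("spark-metrics")
lambda_first = stream.poll("lambda-monitor", limit=2)
lambda_rest = stream.poll("lambda-monitor")

print(spark_batch)
print(lambda_first, lambda_rest)
```

Because reading does not consume the records, a new experimental consumer can be attached or removed without touching the producers or the existing consumers.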
20.
News Delivery Pipeline
(Diagram components: Internet, Crawler, Analyzer, Indexer, CloudSearch, API, Search, API Gateway, Mobile App, API Tracker, DynamoDB, with three Kinesis Streams added to the pipeline.)
21.
Data & Its Numbers
• User activities
  • ~100 GB per day (compressed)
  • 60+ record types
• User demographics, configurations, etc.
  • 15M+ records
• Article metadata
  • 100K+ records per day
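As a rough sanity check on these volumes, the daily totals can be converted to average per-second rates. The sketch below assumes decimal units (1 GB = 10^9 bytes) and a load spread evenly over 24 hours. The slide states neither, and real traffic is likely burstier than the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(total_per_day: float) -> float:
    """Average rate per second for a quantity given per day."""
    return total_per_day / SECONDS_PER_DAY


# Figures from the slide; the unit interpretation is an assumption.
activity_bytes_per_day = 100 * 10**9   # ~100 GB per day, compressed
article_records_per_day = 100_000      # 100K+ records per day

activity_mb_per_s = per_second(activity_bytes_per_day) / 10**6
articles_per_s = per_second(article_records_per_day)

print(f"user activities: {activity_mb_per_s:.2f} MB/s on average")
print(f"article metadata: {articles_per_s:.2f} records/s on average")
```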
23.
Index System
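(Diagram: Crawler workers write with KPL; Analyzer workers read with KCL and write with KPL; Indexer workers read with KCL and upload to CloudSearch. Each stage is drawn with three workers.)
Collect, Analyze and Index Articles with Kinesis Libraries (KPL & KCL)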
24.
Kinesis Libraries
• Kinesis Producer Library (KPL)
  • put records into a stream
  • asynchronous architecture (buffer records)
• Kinesis Client Library (KCL)
  • consume and process data from a stream
  • handle complex tasks associated with distributed computing
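The "buffer records" point can be illustrated with a small batching producer: records are held until the buffer is full or the oldest record has waited long enough, then sent as one batch. This is a simplified sketch, not the KPL API. `BufferedProducer`, its parameters, and the record names are hypothetical, and the clock is passed in explicitly so the behavior stays deterministic, whereas a real asynchronous producer would flush from a background thread.

```python
from typing import Callable, List


class BufferedProducer:
    """Hypothetical buffering producer; a stand-in for the idea, not for KPL."""

    def __init__(self, send_batch: Callable[[List[bytes]], None],
                 max_records: int = 3, max_delay_s: float = 0.1) -> None:
        self._send_batch = send_batch
        self._max_records = max_records
        self._max_delay_s = max_delay_s
        self._buffer: List[bytes] = []
        self._oldest_at = 0.0

    def put(self, record: bytes, now: float) -> None:
        """Buffer a record; flush if the batch is full or too old."""
        if not self._buffer:
            self._oldest_at = now
        self._buffer.append(record)
        self._maybe_flush(now)

    def tick(self, now: float) -> None:
        """Called periodically so that a partial batch is not held forever."""
        self._maybe_flush(now)

    def flush(self) -> None:
        """Send whatever is buffered as a single batch."""
        if self._buffer:
            self._send_batch(self._buffer)
            self._buffer = []

    def _maybe_flush(self, now: float) -> None:
        full = len(self._buffer) >= self._max_records
        stale = bool(self._buffer) and now - self._oldest_at >= self._max_delay_s
        if full or stale:
            self.flush()


sent: List[List[bytes]] = []
producer = BufferedProducer(sent.append)
for i in range(4):
    producer.put(f"article-{i}".encode(), now=0.0)
producer.tick(now=0.2)  # the leftover record is flushed once it is old enough

print([len(batch) for batch in sent])
```

Batching trades a small, bounded delay for fewer and larger requests, which is the usual reason for buffering on the producer side.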
25.
KPL/KCL Monitoring
• KPL and KCL publish custom CloudWatch metrics
• Key Metrics for KPL
  • User Records Received, User Records Pending
  • All Errors
• Key Metrics for KCL
  • RecordsProcessed
  • MillisBehindLatest
  • RecordProcessor.processRecords.Time
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html
27.
Feedback System
Generate Metrics by User Clusters for Ranking Articles
(Diagram components: API Gateway, API Tracker, Kinesis Stream, Amazon S3 (shown twice), Hive / Spark, DynamoDB, Amazon CloudSearch, API, Search, Push Notification. Labels: User Feedback, User Clusters, Article Metadata, Metrics by Cluster, Offline ETL / Machine Learning.)
28.
Why Metrics by Cluster?
Consider Each User's Interests
Ensure Diversity to Avoid a Filter Bubble
https://en.wikipedia.org/wiki/Filter_bubble
(Diagram: the user sends GET /news/sports to the API, which combines the Article Inventory from Amazon CloudSearch with the Metrics by User Cluster from DynamoDB.)
User
  userId: 1000
  gender: Male
  age: 36
  location: San Francisco, US
  interests: Baseball
Article                    Raw score   Weight   Score
San Francisco Giants       3.5         1        3.5
New York Yankees           6.2         0.6      3
FIFA World Cup             20.4        0.2      4.08
U.S. Open Championships    8.4         0.2      1.68
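The scoring on this slide can be sketched as a per-article product of raw score and cluster weight. That reading is an assumption: three of the four rows match score = raw score × weight, while the New York Yankees row shows 3 where the product is 3.72. The sketch uses the slide's inputs as given, and the `rank` function and its `default_weight` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

# Raw scores per article, as listed on the slide.
raw_scores: Dict[str, float] = {
    "San Francisco Giants": 3.5,
    "New York Yankees": 6.2,
    "FIFA World Cup": 20.4,
    "U.S. Open Championships": 8.4,
}

# Weights for this user's cluster, as listed on the slide.
cluster_weights: Dict[str, float] = {
    "San Francisco Giants": 1.0,
    "New York Yankees": 0.6,
    "FIFA World Cup": 0.2,
    "U.S. Open Championships": 0.2,
}


def rank(raw: Dict[str, float], weights: Dict[str, float],
         default_weight: float = 0.0) -> List[Tuple[str, float]]:
    """Weight each raw score and sort articles from highest to lowest score."""
    scored = [
        (title, round(score * weights.get(title, default_weight), 2))
        for title, score in raw.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank(raw_scores, cluster_weights)
for title, score in ranking:
    print(f"{title}: {score}")
```

The weights pull the user's own interests up without removing the globally popular article, which is how the diversity goal on this slide is served.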
29.
Input Data by Fluentd
• Forwarder (runs on each instance)
  • archive events to S3
  • forward events to aggregators
• Aggregator (HA Configuration※)
  • put events into Kinesis Stream
  • alert and report (not mentioned here)
※ http://docs.fluentd.org/articles/high-availability
33.
Spark Streaming
(Diagram: Kinesis Stream shards 1 to 3 feed DStreams 1 to 3, whose RDDs are combined into a minutely RDD. The minutely RDD is joined with a pre-computed RDD to produce minutely metrics by user cluster (Female, Male, Teen, ...), which are written to DynamoDB.)
Split Streams into Minutely RDD
Join Minutely RDD on PreComputed RDD
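The two steps above can be mimicked in plain Python to show the logic without a Spark cluster. This is not Spark code. The event shape, the user-to-cluster table, and the choice of a simple count as the metric are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (epoch_seconds, user_id, article_id), hypothetical shape


def split_by_minute(events: Iterable[Event]) -> Dict[int, List[Event]]:
    """Step 1: group events into one batch per minute."""
    batches: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        batches[event[0] // 60].append(event)
    return dict(batches)


def metrics_by_cluster(batch: List[Event],
                       user_cluster: Dict[str, str]) -> Counter:
    """Step 2: join a batch with the pre-computed user clusters and count."""
    counts: Counter = Counter()
    for _, user_id, article_id in batch:
        cluster = user_cluster.get(user_id)
        if cluster is not None:  # inner join: skip users without a cluster
            counts[(cluster, article_id)] += 1
    return counts


# Pre-computed table (assumed to come from the offline job).
user_cluster = {"u1": "Female", "u2": "Male", "u3": "Teen"}

events: List[Event] = [
    (0, "u1", "a1"), (30, "u2", "a1"), (59, "u1", "a1"),
    (60, "u3", "a2"), (61, "u9", "a2"),  # u9 has no cluster
]

minutely = {
    minute: metrics_by_cluster(batch, user_cluster)
    for minute, batch in split_by_minute(events).items()
}
print(minutely)
```

Keeping the cluster table pre-computed means the streaming side only does a lookup and a count per event, which is what keeps the per-minute batches cheap.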
34.
Monitor Spark Streaming
Spark UI is Useful for Monitoring
37.
Summary
• Fast & stable stream processing is crucial for SmartNews
  • lifetime of news is very short
  • process events as fast as possible
• Kinesis Stream plays an important role
  • one-click provision & scale
  • empowers engineers to do trial & error
38.
Discuss More?
Join Our Free Lunch in Tokyo Office!!
39.
We’re hiring!!!
ML/NLP engineer
Site reliability engineer
Web application engineer
iOS/Android engineer
Ad engineer
http://about.smartnews.com/en/careers/
40.
See Also
• The Platform Supporting SmartNews's Web Mining
• Integrating Stream Processing and Offline Processing
• Building a Sustainable Data Platform on AWS
• AWS meetup「Apache Spark on EMR」
42.
PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
  • connect to Chartio 😍
  • join stream to normal PostgreSQL table
• supports probabilistic data structures
  • e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
43.
Realtime Monitoring
(Diagram: API Gateway sends records into a PipelineDB stream that feeds several continuous views. The continuous views are incrementally updated in real time, and each raw record is discarded soon after the continuous views consume it. Chartio and AWS Lambda access the continuous views with a PostgreSQL client; Slack appears alongside AWS Lambda.)
44.
Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
  SELECT
    day(arrival_timestamp),
    substring(url from '.*://([^/]*)') AS hostname,
    COUNT(DISTINCT user_id::integer)
  FROM activity_stream
  GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
  SELECT COUNT(*) FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
-- (percentile_cont takes fractions between 0 and 1)
CREATE CONTINUOUS VIEW latency AS
  SELECT
    percentile_cont(array[0.90, 0.95, 0.99])
      WITHIN GROUP (ORDER BY latency::integer)
  FROM latency_stream;
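The sliding-window view above (`max_age = '5 minutes'`) can be pictured as a counter that is updated on every event and only keeps small per-second totals instead of raw records. The Python sketch below shows that idea only. It is not how PipelineDB implements it, the `SlidingCount` class is hypothetical, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, List


class SlidingCount:
    """Hypothetical incremental count of events in the last `max_age_s` seconds."""

    def __init__(self, max_age_s: int = 300) -> None:
        self._max_age_s = max_age_s
        # Each bucket is [second, count]; oldest bucket first.
        self._buckets: Deque[List[int]] = deque()
        self._total = 0

    def add(self, ts: int) -> None:
        """Fold one event into the aggregate; the raw event is not stored."""
        if self._buckets and self._buckets[-1][0] == ts:
            self._buckets[-1][1] += 1
        else:
            self._buckets.append([ts, 1])
        self._total += 1

    def count(self, now: int) -> int:
        """Drop buckets older than the window, then return the running total."""
        cutoff = now - self._max_age_s
        while self._buckets and self._buckets[0][0] <= cutoff:
            self._total -= self._buckets.popleft()[1]
        return self._total


imps = SlidingCount(max_age_s=300)
for ts in (0, 0, 100, 250, 301):
    imps.add(ts)

recent = imps.count(now=301)  # the two events at second 0 have expired
later = imps.count(now=400)   # the event at second 100 has expired as well
print(recent, later)
```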
45.
Dashboard in Chartio
1. Build a query
(Drag & Drop / SQL)
2. Add steps
(filter, sort, modify)
3. Select a visualization
(table, graph)