Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries

Slides from the OSCON talk about the data platform used at Netflix for event collection, aggregation, and analysis. The platform helps Netflix process and analyze billions of events every day. Attendees will learn how to assemble their own large-scale data pipeline and analytics platform using open source software from NetflixOSS and others, such as Kafka, ElasticSearch, Druid from Metamarkets, and Hive.

  • Something happened. Our traffic turned into a hockey stick, and the number of applications exploded. So, log traffic also exploded. Simple log scraping wouldn’t cut it any more.
  • For one thing: interactive exploration. Sometimes we want to get data in real time so we can act quickly. After all, some data is only useful within a small time window. Sometimes we want to run lots of experimental queries just to find the right insights. If we wait too long for a query to come back, we won't be able to iterate fast enough. Either way, we need query results back in seconds.
  • Here is one example: we process more than 150 thousand events per second about user activities. What if we'd like to know, by geography, how many users started playing videos in the past 5 minutes? So I submit my query, and in a few seconds.... But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, to find failed attempts on our Silverlight players that run on PCs and Macs?
  • Note this is different from alerting based on monitoring metrics. Monitoring metrics are great and versatile, but they don't help us catch unexpected errors. When we build an application, we instrument our code diligently, yet it's very likely we miss some critical instrumentation points. There's one thing that we always catch, though: logged errors and unhandled exceptions. The alert provides a precise entry point and the right context for people to drill down into the right problems.

Presentation Transcript

  • Sudhir Tonse (@stonse) Danny Yuan (@g9yuayon) Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Software
  • Data is the most important asset at Netflix
  • If all the data is easily available to all teams, it can be leveraged in new and exciting ways
  • ~1000 Device Types • ~500 Apps/Web Services • ~100 Billion Events/Day • 3.2M messages per second at peak time • 3 GB per second at peak time • Dashboard
  • Type of Events • User Interface Events • Search Event (‘Matrix’ using PS3 …) • Star Rating Event (HoC : 5 stars, Xbox, US, …) • Infrastructural Events • RPC Call (API -> Billing Service, ‘/bill/..’, 200, …) • Log Errors (NPE, “Movie is null”, …, …) • Other Events …
  • Making Sense of Billions of Events
  • http://netflix.github.io +
  • A Humble Beginning
  • Evolution …Scale!
  • [Diagram: multiple boxes labeled "Application"]
  • We Want to Process App Data in Hadoop
  • Our Hadoop Ecosystem
  • @NetflixOSS Big Data Tools
  • Hadoop as a Service
  • Pig Scripting on Steroids
  • Pig Married to Clojure “Map-Reduce for Clojure”
  • S3mper: a library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
  • Efficient ETL with Cassandra
  • Offline Analysis
  • Evolution … Speed!
  • We Want to Aggregate, Index, and Query Data in Real Time
  • Interactive Exploration
  • Let’s walk through some use cases
  • client activity event * /name = “movieStarts”
  • Pipeline Challenges • App owners: send and forget • Data scientists: validation, ETL, batch processing • DevOps: stream processing, targeted search
  • Message Routing
  • We Want to Consume Data Selectively in Different Ways
  • Kafka • Message broker • High-throughput • Persistent and replicated
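
The "Message Routing" and "consume data selectively" slides can be illustrated with a minimal sketch, assuming events are flat field maps and each downstream sink declares a filter such as name = "movieStarts". The class, method, and sink names below are hypothetical and do not come from the talk or from any NetflixOSS library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: all names here are illustrative assumptions,
// not the API of the pipeline described in the talk.
class RoutingSketch {

    /** Builds a flat two-field event, purely for demonstration. */
    static Map<String, String> event(String name, String device) {
        Map<String, String> e = new HashMap<>();
        e.put("name", name);
        e.put("device", device);
        return e;
    }

    /** Delivers each event to every sink whose filter accepts it. */
    static Map<String, List<Map<String, String>>> route(
            List<Map<String, String>> events,
            Map<String, Predicate<Map<String, String>>> filtersBySink) {
        Map<String, List<Map<String, String>>> delivered = new LinkedHashMap<>();
        for (String sink : filtersBySink.keySet()) {
            delivered.put(sink, new ArrayList<>());
        }
        for (Map<String, String> event : events) {
            for (Map.Entry<String, Predicate<Map<String, String>>> entry
                    : filtersBySink.entrySet()) {
                if (entry.getValue().test(event)) {
                    delivered.get(entry.getKey()).add(event);
                }
            }
        }
        return delivered;
    }
}
```

For instance, a "batch" sink with an accept-all filter would receive every event, while a "realtime" sink filtering on `"movieStarts".equals(e.get("name"))` would receive only playback starts. One event can reach several sinks, which is the point of routing by filter rather than by producer.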
  • There Is More
  • Intelligent Alerts
  • Guided Debugging in the Right Context
  • What We Need • Ad-hoc query with different dimensions • Quick aggregations and Top-N queries • Time series with flexible filters • Quick access to raw data using boolean queries
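
As a rough illustration of what a Top-N query along one dimension computes, here is a small in-memory sketch. It assumes events are flat field maps; the class and method names are hypothetical. A real deployment would push this work down to a store such as Druid rather than scan events in application code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a Top-N aggregation; not the query API of any real store.
class TopNSketch {

    /** Counts events per value of one dimension and returns the n most frequent values. */
    static List<Map.Entry<String, Long>> topN(
            List<Map<String, String>> events, String dimension, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, String> event : events) {
            String value = event.get(dimension);
            if (value != null) {                // skip events that lack the dimension
                counts.merge(value, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> ranked = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically so the result is deterministic.
        ranked.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

Calling `topN(events, "country", 2)` on playback-start events would return the two countries with the most starts. Swapping the dimension name to, say, `"device"` is the in-memory analogue of drilling down along a different dimension.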
  • Druid • Rapid exploration of high dimensional data • Fast ingestion and querying • Time series
  • ElasticSearch • Real-time indexing of event streams • Killer feature: boolean search • Great UI: Kibana
  • The Old Pipeline
  • The New Pipeline
  • There Is More
  • It’s Not All About Counters and Time Series
  • RequestId    Parent Id   Node Id   Service Name   Status
    4965-4a74    0           123       Edge Service   200
    4965-4a74    123         456       Gateway        200
    4965-4a74    456         789       Service A      200
    4965-4a74e   456         abc       Service B      200
    Status:200
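
The parent and node ids in the table above are enough to rebuild a call tree. The following is a minimal sketch of that reconstruction, assuming all rows belong to one request, the root call has parent id "0", and the data contains no cycles. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rebuilds a call tree from (parentId, nodeId) pairs.
class TraceTreeSketch {

    /** One row of the tracing table. */
    static final class Span {
        final String parentId;
        final String nodeId;
        final String service;
        final int status;

        Span(String parentId, String nodeId, String service, int status) {
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.service = service;
            this.status = status;
        }
    }

    /** Renders the spans as an indented tree, starting from the given root parent id. */
    static String render(List<Span> spans, String rootParentId) {
        Map<String, List<Span>> childrenByParent = new LinkedHashMap<>();
        for (Span span : spans) {
            childrenByParent.computeIfAbsent(span.parentId, k -> new ArrayList<>()).add(span);
        }
        StringBuilder out = new StringBuilder();
        append(out, childrenByParent, rootParentId, 0);
        return out.toString();
    }

    // Depth-first walk; assumes the parent links form a tree (no cycles).
    private static void append(StringBuilder out, Map<String, List<Span>> childrenByParent,
                               String parentId, int depth) {
        for (Span child : childrenByParent.getOrDefault(parentId, Collections.<Span>emptyList())) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(child.service).append(" (").append(child.status).append(")\n");
            append(out, childrenByParent, child.nodeId, depth + 1);
        }
    }
}
```

Fed the four rows from the table with root parent id "0", `render` places Edge Service at the top, Gateway beneath it, and Service A and Service B as siblings under Gateway.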
  • Distributed Tracing
  • A System that Supports All These
  • A Data Pipeline To Glue Them All
  • Make It Simple
  • Message Producing • Simple and Uniform API • messageBus.publish(event)
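  • Consumption Is Simple Too

        consumer.observe().subscribe(new Subscriber<Ackable<IncomingMessage>>() {
            // If Subscriber is RxJava's rx.Subscriber, onCompleted() and
            // onError(Throwable) must be implemented as well.
            @Override
            public void onNext(Ackable<IncomingMessage> ackable) {
                // Deserialize the payload, handle it, then acknowledge the message
                process(ackable.getEntity(MyEventType.class));
                ackable.ack();
            }
        });

        consumer.pause();
        consumer.resume();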
  • RxJava • Functional reactive programming model • Powerful streaming API • Separation of logic and threading model
  • Design Decisions • Top Priority: app stability and throughput • Asynchronous operations • Aggressive buffering • Drops messages if necessary
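
The design decisions above (asynchronous operations, aggressive buffering, dropping messages if necessary) can be sketched with a bounded in-memory queue whose publish call never blocks the application thread. This is a minimal illustration built on the JDK's `ArrayBlockingQueue`. The class and method names are hypothetical and are not the pipeline's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: favors application stability over delivery guarantees.
class DroppingBufferSketch {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBufferSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks the caller: returns false and counts a drop when the buffer is full. */
    boolean publish(String event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Intended for a background sender thread: removes up to maxBatch buffered events. */
    List<String> drainBatch(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    /** Number of events rejected so far because the buffer was full. */
    long droppedCount() {
        return dropped.get();
    }
}
```

Using `offer` rather than `put` is the key choice: a slow or unavailable downstream fills the buffer and causes drops, but it cannot stall the application that produces the events. Exposing the drop count makes that data loss visible to monitoring.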
  • Anything Can Fail
  • Cloud Resiliency
  • Fault Tolerance Features • Write and forward with auto-reattached EBS (Amazon's Elastic Block Store) • Disk-backed queue: big-queue • Customized scaling down
  • There’s More to Do • Contribute to @NetflixOSS • Join us :-)
  • Summary http://netflix.github.io +
  • You can build your own web-scale data pipeline using open source components
  • Thank You! • Sudhir Tonse http://www.linkedin.com/in/sudhirtonse Twitter: @stonse • Danny Yuan http://www.linkedin.com/pub/danny-yuan/4/374/862 Twitter: @g9yuayon