Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries

Sudhir Tonse (@stonse), Danny Yuan (@g9yuayon)

Data is the most important asset at Netflix

If all the data is easily available to all teams, it can be leveraged in new and exciting ways

• ~1000 Device Types
• ~500 Apps/Web Services
• ~100 Billion Events/Day
• 3.2M messages per second at peak time
• 3 GB per second at peak time
Dashboard
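As a rough sanity check on the figures above, the sketch below derives the average event rate, the peak-to-average ratio, and the average message size at peak. It assumes a 24-hour day, decimal units (1 GB = 10^9 bytes), that one event corresponds to one message, and that the peak message rate and peak byte rate occur together. None of these assumptions is stated on the slide, and the class and method names are illustrative.

```java
final class TrafficEstimate {
    // Figures quoted on the slide.
    static final double EVENTS_PER_DAY = 100e9;
    static final double PEAK_MESSAGES_PER_SEC = 3.2e6;
    // Assumption: "3 GB" means 3 * 10^9 bytes.
    static final double PEAK_BYTES_PER_SEC = 3e9;
    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    /** Average events per second over a full day. */
    static double averageEventsPerSecond() {
        return EVENTS_PER_DAY / SECONDS_PER_DAY;
    }

    /** How much higher the peak message rate is than the daily average. */
    static double peakToAverageRatio() {
        return PEAK_MESSAGES_PER_SEC / averageEventsPerSecond();
    }

    /** Average message size in bytes, assuming both peaks coincide. */
    static double averageBytesPerMessageAtPeak() {
        return PEAK_BYTES_PER_SEC / PEAK_MESSAGES_PER_SEC;
    }

    public static void main(String[] args) {
        System.out.printf("average events/s: %.0f%n", averageEventsPerSecond());
        System.out.printf("peak/average:     %.2f%n", peakToAverageRatio());
        System.out.printf("bytes/message:    %.1f%n", averageBytesPerMessageAtPeak());
    }
}
```

Under these assumptions the average works out to roughly 1.16 million events per second, the peak is about 2.8 times the average, and a message at peak is a little under 1 KB.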

Type of Events
• User Interface Events
  • Search Event (‘Matrix’ using PS3 …)
  • Star Rating Event (HoC: 5 stars, Xbox, US, …)
• Infrastructural Events
  • RPC Call (API -> Billing Service, ‘/bill/..’, 200, …)
  • Log Errors (NPE, “Movie is null”, …, …)
• Other Events …

Making Sense of Billions of Events

http://netflix.github.io

A Humble Beginning

Evolution …Scale!

[Diagram: many Application instances]

We Want to Process App Data in Hadoop

Our Hadoop Ecosystem

@NetflixOSS Big Data Tools

Hadoop as a Service

Pig Scripting on Steroids

Pig Married to Clojure “Map-Reduce for Clojure”

S3MPER S3mper is a library that provides an additional layer of consistency checking on top of Amazon's S3 index through the use of a consistent, secondary index.

Efficient ETL with Cassandra

Offline Analysis

Evolution … Speed!

We Want to Aggregate, Index, and Query Data in Real Time

Interactive Exploration

Let’s walk through some use cases

client activity event * /name = “movieStarts”

Pipeline Challenges • App owners: send and forget • Data scientists: validation, ETL, batch processing • DevOps: stream processing, targeted search

Message Routing

We Want to Consume Data Selectively in Different Ways
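One way to picture selective consumption is a router that hands each event to every sink whose filter matches it, in the spirit of the `name = "movieStarts"` filter shown earlier. The sketch below is a minimal in-memory illustration only: `MessageRouter`, `addRoute`, and `route` are hypothetical names, events are modeled as plain string maps, and nothing here reflects the actual routing implementation used in the pipeline.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

final class MessageRouter {
    /** A filter paired with the sink that receives matching events. */
    private static final class Route {
        final Predicate<Map<String, String>> filter;
        final Consumer<Map<String, String>> sink;

        Route(Predicate<Map<String, String>> filter, Consumer<Map<String, String>> sink) {
            this.filter = filter;
            this.sink = sink;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    /** Registers a sink for all events accepted by the filter. */
    MessageRouter addRoute(Predicate<Map<String, String>> filter,
                           Consumer<Map<String, String>> sink) {
        routes.add(new Route(filter, sink));
        return this;
    }

    /** Delivers the event to every matching sink and returns the number of deliveries. */
    int route(Map<String, String> event) {
        int delivered = 0;
        for (Route r : routes) {
            if (r.filter.test(event)) {
                r.sink.accept(event);
                delivered++;
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        MessageRouter router = new MessageRouter()
                // Hypothetical sinks: one targeted consumer, one catch-all archive.
                .addRoute(e -> "movieStarts".equals(e.get("name")),
                          e -> System.out.println("real-time sink: " + e))
                .addRoute(e -> true,
                          e -> System.out.println("archive sink:   " + e));

        Map<String, String> event = new HashMap<>();
        event.put("name", "movieStarts");
        router.route(event);
    }
}
```

Because each route is evaluated independently, the same event can feed a batch archive and a targeted real-time consumer at once.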

• Message broker • High-throughput • Persistent and replicated

There Is More

Intelligent Alerts

Guided Debugging in the Right Context

What We Need • Ad-hoc query with different dimensions • Quick aggregations and Top-N queries • Time series with flexible filters • Quick access to raw data using boolean queries
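To make the Top-N requirement concrete, the sketch below counts events per value of one dimension and returns the most frequent values. It is an in-memory illustration of the query shape only, not how any particular analytics store executes it. The `TopNQuery` class, the `topN` method, and the `device` dimension are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TopNQuery {
    /**
     * Counts events per value of the given dimension and returns the n most
     * frequent values, highest count first (ties broken by value name).
     * Events that lack the dimension are ignored.
     */
    static List<Map.Entry<String, Long>> topN(List<Map<String, String>> events,
                                              String dimension, int n) {
        Map<String, Long> counts = events.stream()
                .map(e -> e.get(dimension))
                .filter(v -> v != null)
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, String>> events = Arrays.asList(
                Collections.singletonMap("device", "PS3"),
                Collections.singletonMap("device", "Xbox"),
                Collections.singletonMap("device", "PS3"),
                Collections.singletonMap("device", "iPad"),
                Collections.singletonMap("device", "Xbox"),
                Collections.singletonMap("device", "PS3"));
        // Top two devices by event count.
        System.out.println(topN(events, "device", 2));
    }
}
```

At the event volumes quoted in this deck, a full scan like this would not be practical, which is why the requirement calls for a store built for quick aggregations.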

Druid • Rapid exploration of high dimensional data • Fast ingestion and querying • Time series

Elasticsearch • Real-time indexing of event streams • Killer feature: boolean search • Great UI: Kibana

The Old Pipeline

The New Pipeline

There Is More

It’s Not All About Counters and Time Series

RequestId   Parent Id  Node Id  Service Name  Status
4965-4a74   0          123      Edge Service  200
4965-4a74   123        456      Gateway       200
4965-4a74   456        789      Service A     200
4965-4a74e  456        abc      Service B     200

Status:200

Distributed Tracing
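Records with a parent ID and a node ID, like the rows in the table above, are enough to rebuild the call tree of a request. The sketch below shows one way to do that. The `TraceTree` and `Span` names are hypothetical, the spans are assumed to already belong to a single request (so the request ID column is left out), the parent links are assumed to contain no cycles, and a parent ID of `0` is assumed to mark the entry point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TraceTree {
    /** One row of trace data for a single request. */
    static final class Span {
        final String parentId;
        final String nodeId;
        final String serviceName;
        final int status;

        Span(String parentId, String nodeId, String serviceName, int status) {
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.serviceName = serviceName;
            this.status = status;
        }
    }

    /** Renders the spans as an indented call tree, starting below rootParentId. */
    static String render(List<Span> spans, String rootParentId) {
        // Index spans by parent so children can be looked up directly.
        Map<String, List<Span>> childrenByParent = new LinkedHashMap<>();
        for (Span s : spans) {
            childrenByParent.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s);
        }
        StringBuilder out = new StringBuilder();
        append(out, childrenByParent, rootParentId, 0);
        return out.toString();
    }

    // Depth-first walk; assumes the parent links form a tree (no cycles).
    private static void append(StringBuilder out, Map<String, List<Span>> childrenByParent,
                               String parentId, int depth) {
        List<Span> children = childrenByParent.getOrDefault(parentId, Collections.emptyList());
        for (Span s : children) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(s.serviceName).append(" [").append(s.status).append("]\n");
            append(out, childrenByParent, s.nodeId, depth + 1);
        }
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
                new Span("0", "123", "Edge Service", 200),
                new Span("123", "456", "Gateway", 200),
                new Span("456", "789", "Service A", 200),
                new Span("456", "abc", "Service B", 200));
        System.out.print(render(spans, "0"));
    }
}
```

With the four sample rows, the rendered tree places Gateway under Edge Service, and Service A and Service B side by side under Gateway.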

A System that Supports All These

A Data Pipeline To Glue Them All

Make It Simple

Message Producing • Simple and Uniform API • messageBus.publish(event)

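Consumption Is Simple Too

    consumer.observe().subscribe(new Subscriber<Ackable<IncomingMessage>>() {
        @Override
        public void onNext(Ackable<IncomingMessage> ackable) {
            // Process the typed payload, then acknowledge the message.
            process(ackable.getEntity(MyEventType.class));
            ackable.ack();
        }
        // onCompleted() and onError(Throwable) are omitted for brevity.
    });

    consumer.pause();
    consumer.resume();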

RxJava • Functional reactive programming model • Powerful streaming API • Separation of logic and threading model

Design Decisions • Top Priority: app stability and throughput • Asynchronous operations • Aggressive buffering • Drops messages if necessary
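The combination of asynchronous operation, buffering, and dropping under pressure can be sketched with a bounded queue whose `publish` never blocks the calling application thread. This is a minimal illustration of the trade-off, not the pipeline's actual client. `DroppingBuffer` and its methods are hypothetical names, and the background sender that would drain the buffer is left out.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class DroppingBuffer<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Never blocks the caller. Returns false and counts a drop when the
     * buffer is full, so a slow downstream cannot stall the application.
     */
    boolean publish(T event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Intended for a background sender thread; returns null when the buffer is empty. */
    T poll() {
        return queue.poll();
    }

    /** Number of events discarded so far, useful as a health metric. */
    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingBuffer<String> buffer = new DroppingBuffer<>(2);
        buffer.publish("a");
        buffer.publish("b");
        boolean accepted = buffer.publish("c"); // buffer is full, so this event is dropped
        System.out.println("accepted=" + accepted + " dropped=" + buffer.droppedCount());
    }
}
```

Dropping rather than blocking puts application stability ahead of guaranteed delivery, which matches the stated top priority. Exposing the drop count lets operators see when the buffer is undersized.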

Anything Can Fail

Cloud Resiliency

Fault Tolerance Features • Write and forward with auto-reattached EBS (Amazon’s Elastic Block Store) • Disk-backed queue: big-queue • Customized scaling down

There’s More to Do • Contribute to @NetflixOSS • Join us :-)

Summary http://netflix.github.io

You can build your own web-scale data pipeline using open source components

Thank You!
Sudhir Tonse http://www.linkedin.com/in/sudhirtonse Twitter: @stonse
Danny Yuan http://www.linkedin.com/pub/danny-yuan/4/374/862 Twitter: @g9yuayon


by Sudhir Tonse

on Jul 28, 2014
