Real-time Machine Learning Analytics Using Structured Streaming and Kinesis Firehose
Caryl Yuhas (@ckred), Myles Baker (@mydpy)
1. Real-time Machine Learning Analytics Using Structured Streaming and Kinesis Firehose
   Caryl Yuhas (@ckred), Myles Baker (@mydpy)
   June 6th, 2017 • #SFexp6
2. Impact of Real-Time Analytics
   • Capturing customer interactions, user behavior, and sensor readings is rapidly increasing
   • Businesses need to respond immediately to new information as it arrives
   • Real-time analytics is at the core of next-generation IT systems
3. Challenges Building a Solution
   • Performant, scalable real-time analytics requires connecting multiple tools
   • Streaming data comes with all of the problems of static data, plus added complexity
   • Machine learning models need to be trained on historical data and scored with real-time data
4. The Meetup Streaming API
   • Can we explore Meetup data in real time?
   • Can we predict RSVPs for new Meetups using streaming data from the Meetup API?
     – Members
     – Events
     – RSVPs
5. A Data Model for Training and Scoring
   [Diagram: offline training fits a model from Members, Events, and Event RSVPs; real-time scoring applies the model to a Member to produce a Predicted RSVP]
6. Component Integration and Serving
   [Architecture diagram linking the Meetup Stream, a Kinesis Producer, AWS S3, Spark Model Training, Spark Structured Streaming, the Meetup Member API, and the Meetup Prediction output]
7. Producing the Kinesis Firehose Stream
   • requests.get(apiURL, stream=True) makes a request to the Meetup API, keeping the stream open
   • kinesis = boto3.client('firehose') creates a Kinesis Firehose client
   • kinesis.put_record_batch(DeliveryStreamName='meetup', Records=rsvps) writes the streamed records to S3 via the Kinesis Firehose delivery stream 'meetup'
8. Our Meetup ML Pipeline
   [Pipeline diagram: Raw Text → Bucketizer, CountVectorizer, StringIndexer → Feature Buckets + Indexed Labels → VectorAssembler → Feature Vectors → LogisticRegression → Predictions]
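   Slide 8 names the pipeline stages but not their configuration. A minimal Spark MLlib (Scala) sketch, assuming hypothetical column names ("description", "guests", "response"), hypothetical bucket splits and vocabulary size, and a Tokenizer in front of CountVectorizer that the slide does not show (CountVectorizer expects an array-of-strings column):

     import org.apache.spark.ml.classification.LogisticRegression
     import org.apache.spark.ml.feature.{Bucketizer, CountVectorizer, StringIndexer, Tokenizer, VectorAssembler}

     // Bucket a numeric RSVP field (hypothetical "guests" column) into ranges
     val bucketizer = new Bucketizer()
       .setInputCol("guests")
       .setOutputCol("guestBucket")
       .setSplits(Array(Double.NegativeInfinity, 0.0, 1.0, 3.0, Double.PositiveInfinity))

     // CountVectorizer needs tokenized text, so a Tokenizer is assumed here
     val tokenizer = new Tokenizer()
       .setInputCol("description")
       .setOutputCol("words")

     val countVectorizer = new CountVectorizer()
       .setInputCol("words")
       .setOutputCol("textFeatures")
       .setVocabSize(10000)

     // Index the string RSVP response ("yes"/"no") into a numeric label
     val labelIndexer = new StringIndexer()
       .setInputCol("response")
       .setOutputCol("label")

     // Assemble the bucketed and text features into a single feature vector
     val assembler = new VectorAssembler()
       .setInputCols(Array("guestBucket", "textFeatures"))
       .setOutputCol("features")

     // Logistic regression produces the RSVP predictions
     val lr = new LogisticRegression()
       .setFeaturesCol("features")
       .setLabelCol("label")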
9. Create an ML Pipeline
   A Pipeline allows us to simply chain a series of transformations and estimators:
     val pipeline = new Pipeline()
       .setStages(Array(transformers, estimators, models))
   Fit a model based on the pipeline:
     val model = pipeline.fit(meetup)
   Save the model to disk for scoring:
     model.write.overwrite().save(...)
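   Filling in the pieces slide 9 elides, a sketch of the training job, assuming the stage definitions from the sketch above and a hypothetical S3 path where Kinesis Firehose has persisted historical RSVP records:

     import org.apache.spark.ml.Pipeline
     import org.apache.spark.sql.SparkSession

     val spark = SparkSession.builder.appName("meetup-model-training").getOrCreate()

     // Historical Meetup RSVPs that Firehose delivered to S3 (path is hypothetical)
     val meetup = spark.read.json("s3a://my-bucket/meetup/rsvps/")

     // Chain the transformers and the estimator into a single Pipeline
     val pipeline = new Pipeline()
       .setStages(Array(bucketizer, tokenizer, countVectorizer, labelIndexer, assembler, lr))

     // Fitting the Pipeline fits each stage in order and returns a PipelineModel
     val model = pipeline.fit(meetup)

     // Persist the fitted model so the streaming job can load it for scoring
     model.write.overwrite().save("s3a://my-bucket/meetup/models/rsvp-lr")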
10. Scoring the Model in Real-time
    Load the trained model:
      val model = PipelineModel.load(...)
    Stream Meetup event data:
      val events = spark.readStream
        .parquet(...)
    Score the model:
      model.transform(members)
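    A sketch of the scoring skeleton on slide 10, with the elided paths filled in hypothetically and a single consistent input stream (the slide reads an events stream but transforms members). Whether model.transform succeeds on a stream depends on the stages inside the saved pipeline, which is exactly the limitation the next slide covers:

      import org.apache.spark.ml.PipelineModel
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder.appName("meetup-scoring").getOrCreate()

      // Load the trained model written by the batch training job (path is hypothetical)
      val model = PipelineModel.load("s3a://my-bucket/meetup/models/rsvp-lr")

      // Streaming file sources need an explicit schema, so borrow one from a batch read
      val schema = spark.read.parquet("s3a://my-bucket/meetup/members-parquet/").schema

      // Stream the files that Kinesis Firehose keeps delivering to S3
      val members = spark.readStream
        .schema(schema)
        .parquet("s3a://my-bucket/meetup/members-parquet/")

      // Score each micro-batch and continuously write predictions back to S3
      val query = model.transform(members)
        .writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/meetup/predictions/")
        .option("checkpointLocation", "s3a://my-bucket/meetup/checkpoints/scoring/")
        .start()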
11. ML Limitations in Structured Streaming
    • Structured Streaming does not support operations needed by ML methods
      – count, collect, round, aggregate*, etc.
    • Many models, transformers, and estimators are not supported
      – K-Means, SVM, CountVectorizer, VectorAssembler, StringIndexer, etc.
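    To make the first bullet concrete, a small illustration (paths and schema hypothetical) of how eager actions are rejected on a streaming Dataset; estimators such as CountVectorizer and StringIndexer need exactly this kind of full pass over the data to fit, which is why the pipeline is reworked on the next slide:

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types.{LongType, StringType, StructType}

      val spark = SparkSession.builder.appName("streaming-limits").getOrCreate()

      val schema = new StructType()
        .add("member_id", LongType)
        .add("description", StringType)

      val streaming = spark.readStream
        .schema(schema)
        .json("s3a://my-bucket/meetup/rsvps/")

      // Each of these throws org.apache.spark.sql.AnalysisException:
      //   "Queries with streaming sources must be executed with writeStream.start()"
      // streaming.count()
      // streaming.collect()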
12. Our Streaming Meetup ML Pipeline
    [Pipeline diagram: Raw Text → Binarizer → Raw Text + Binary Label → Tokenizer, HashingTF → Features + Binary Label → LogisticRegression → Predictions]
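    A sketch of the reworked, streaming-friendly pipeline on slide 12, reusing the historical meetup DataFrame from the training sketch above. Column names, the binarization threshold, and the number of hash features are assumptions; the point is that every stage is a simple transformer (or a row-wise model at scoring time), so the fitted model can be applied to a streaming DataFrame:

      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{Binarizer, HashingTF, Tokenizer}

      // Turn a numeric response score into the binary label (threshold is hypothetical)
      val binarizer = new Binarizer()
        .setInputCol("responseScore")
        .setOutputCol("label")
        .setThreshold(0.5)

      // Tokenize the raw text and hash tokens straight into a fixed-size feature
      // vector; unlike CountVectorizer, HashingTF needs no fitted vocabulary
      val tokenizer = new Tokenizer()
        .setInputCol("description")
        .setOutputCol("words")

      val hashingTF = new HashingTF()
        .setInputCol("words")
        .setOutputCol("features")
        .setNumFeatures(4096)

      val lr = new LogisticRegression()
        .setFeaturesCol("features")
        .setLabelCol("label")

      val streamingPipeline = new Pipeline()
        .setStages(Array(binarizer, tokenizer, hashingTF, lr))

      // Train on the historical data as before; the resulting PipelineModel can be
      // used in place of the model loaded in the scoring sketch above
      val streamingModel = streamingPipeline.fit(meetup)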
13. Alternative Scoring: Model Export
    1) Fit the ML model in Databricks using Spark MLlib:
         val lrModel = new LogisticRegression().fit(myData)
    2) Export the model (as JSON files) in Databricks:
         ModelExporter.export(lrModel, "s3a:/...")
    3) Deploy the model in an external system:
         import com.databricks.ml.local.ModelImport
         val lrModel = ModelImport.import("s3a:/...")
         val jsonInput = json(...)
         val jsonOutput = lrModel.transform(jsonInput)
14. Thank you
    caryl@databricks.com
    mbaker@databricks.com
    #SFexp6
    mydpy/ss-2017-structured-streaming
