Machine Learning in Snowflake
Snowflake is currently making waves globally as an exciting new data warehouse built from scratch for the cloud.
It already has an impressive number of aggregation and analytic functions of the traditional statistical variety, and the upcoming release of JavaScript Stored Procedures lets you put its efficient, elastic, distributed compute engine to work on more general tasks outside the SQL dialect.
I thought an ML workload would be a great way to road test the feature, and it turns out to be really good at it!
Should we?
Is a database engine a good place to train a model?
There has been a shifting line marking the territory between in-database processing and the computation that belongs in the outside world.
As the term “data mining” rose in popularity in the 1990s, DBMS vendors added statistical modelling functionality to their SQL dialects; for example, IBM introduced “Intelligent Miner” and Oracle introduced “Oracle Data Mining”.
Then, following the dot-com boom, data sets grew to the point where we started calling them “Big”, and Hadoop rose in popularity as a way of scaling out processing, followed by Spark. Data mining workloads had shifted outside the traditional DBMS.
But just as many declared SQL dead, in 2010 Hive was released and suddenly these large piles of data were known as data warehouses again. Fast forward to 2019, and now all the major cloud vendors provide some form of SQL interface over structured and semi-structured data in their object storage.
In mid-2018, Google announced a beta of BigQueryML, and now we’re back using SQL as an abstraction over machine learning, albeit on much larger data sets with much greater (and elastic) processing power. BigQueryML is still in beta as at February 2019; it’s currently fairly basic and limited to just linear regression and logistic regression.
As long as the algorithms can scale appropriately, there are many great reasons to use your data warehouse engine for ML:
- Simplicity — no need to manage another compute platform, integrate between systems and extract/analyse/load data
- Security — The data stays where it is well secured, no need to configure Snowflake credentials in external systems or worry about where copies of data might end up
- Performance — A good data warehouse engine will maintain a lot of metadata that is used to optimise queries, and these could be reused during the ML process to give it an advantage over a general-purpose compute platform
In Snowflake
So let’s see if we can implement a modest, minimal, but legitimate machine learning algorithm using just the Snowflake engine.
A Decision Tree can be used to build regression or classification models by breaking down a dataset into smaller and smaller subsets organised into a tree.
In order to minimise the amount of logic I need to implement, I’m going with a regression tree exercise using the ID3 algorithm, mainly so that I can reuse Snowflake’s existing Standard Deviation functions while building the tree.
Snowflake is not intended to be a general purpose cluster-computing framework like Spark, but it is exceptionally good at parallelising analytical queries.
For this reason, and also because JavaScript is single-threaded, my goal will be to hand as much of the computation over to the query engine as possible, leaving the stored procedure to perform more of a co-ordination role.
The Scenario
I’ve chosen the Bike Sharing in Washington D.C. data set, using the hourly frequency.
We can store it in a table like this:
create table bikes_hours(
    instant numeric, dteday date, season numeric, yr numeric, mnth numeric, hr numeric,
    holiday numeric, weekday numeric, workingday numeric, weathersit numeric,
    temp numeric, atemp numeric, hum numeric, windspeed numeric,
    casual numeric, registered numeric, cnt numeric)
and use the SnowSql command line tool to upload the csv file.
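For reference, a minimal way to do that from SnowSQL might look like the following (the local file path is an assumption; adjust it and the CSV options to match the actual file):

put file:///tmp/hour.csv @%bikes_hours;

copy into bikes_hours
  from @%bikes_hours
  file_format = (type = csv skip_header = 1);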
Training
First we’ll create a table to store our training runs and completed models, just something simple while we prove the concept:
create or replace sequence ml_model_runs_sequence start = 1 increment = 1;

create or replace table ml_model_runs(
    run_id integer,
    table_name varchar(256),
    algorithm varchar(100),
    training_parameters variant,
    start_time timestamp,
    end_time timestamp,
    model_object variant);
Now to implement the ID3 decision tree algorithm in a JavaScript stored procedure. Mathematically, it looks like this:
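For a regression tree built by comparing standard deviations (the approach described below), the split criterion is presumably the standard deviation reduction:

SDR(T, X) = S(T) - \sum_{c \in \mathrm{children}(X)} \frac{|T_c|}{|T|} \, S(T_c)

where S(T) is the standard deviation of the target values at node T, and T_c is the subset of rows that fall into child c when splitting on attribute X.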
We apply that formula at each branch point in order to choose which attribute to split on next.
I’m not one to wallow in complexity: it’s just a formal way of saying we’re going to split on the attribute that best “groups” the values in its children (i.e. has the most homogeneous values). To do this, we’ll compare standard deviations (how much the members of a group differ from the mean value) between parent and children.
Our stored procedure will take four parameters: a table name, a target variable (the name of the column containing the values we want to predict), the names of the columns to use as attributes in the tree, and the training parameters.
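As a rough sketch of the shape this takes (the parameter names and the body here are illustrative, not the definition in the full code linked further down; the real tree-building loop replaces the single statement shown):

create or replace procedure decision_tree_train(
    TABLE_NAME varchar, TARGET varchar, ATTRIBUTE_COLUMNS varchar, TRAINING_PARAMETERS varchar)
  returns varchar
  language javascript
  as
$$
  // Register the training run; the per-depth loop that issues the two queries
  // described below would use the same snowflake.createStatement() pattern.
  snowflake.createStatement({
    sqlText: "insert into ml_model_runs(run_id, table_name, algorithm, start_time) " +
             "select ml_model_runs_sequence.nextval, '" + TABLE_NAME + "', 'decision tree (ID3)', current_timestamp()"
  }).execute();
  return 'training run registered for ' + TABLE_NAME;
$$;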
Starting at the top of the tree (the root node), we will run two queries at each branching point.
The first query gives us:
1) The standard deviation of all the target values from this node down, as we will pick the branch that reduces this value the most
2) The average value of all the target values from this node down, as ultimately the average is used as the prediction when we reach a leaf
3) The coefficient of variation, which can be used to stop building when it gets too small
4) The number of target values from this node down, which can also be used to stop building when it gets too small
5) For each potential branch below (from the list of remaining columns), the median value. This will be used for splitting the data if that attribute is chosen.
At the top of the tree, it looks like this:
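A sketch of the kind of query that fits that description (column names are from the bike sharing table; the generated SQL in the full code will differ in detail):

select stddev(CASUAL)                      as sd,            -- (1) spread of the target below this node
       avg(CASUAL)                         as average,       -- (2) the prediction if this node becomes a leaf
       stddev(CASUAL) / avg(CASUAL) * 100  as cv,            -- (3) coefficient of variation, a stopping condition
       count(*)                            as n,             -- (4) rows below this node, another stopping condition
       median(SEASON)                      as median_season, -- (5) candidate split point for each
       median(HR)                          as median_hr,     --     remaining attribute
       median(TEMP)                        as median_temp
from bikes_hours_training;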
but as the tree gets wider and wider, it starts to look like this:
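One plausible shape for that wider query is a block per branch, each filtered by its position in the tree and stitched together (the split points shown are placeholders):

select 1 as node_id, stddev(CASUAL) as sd, avg(CASUAL) as average, count(*) as n, median(TEMP) as median_temp
from bikes_hours_training
where HR < 12 and WORKINGDAY < 1
union all
select 2, stddev(CASUAL), avg(CASUAL), count(*), median(TEMP)
from bikes_hours_training
where HR < 12 and WORKINGDAY >= 1
union all
select 3, stddev(CASUAL), avg(CASUAL), count(*), median(TEMP)
from bikes_hours_training
where HR >= 12;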
The second query has two parts:
- a subquery which measures the standard deviation of the target below each child branch (each one multiplied by its size in relation to the whole)
- an enclosing query to calculate the reduction in standard deviation, ordered highest to lowest so that we can easily pick the largest
Both queries use WHERE clauses to filter down to their position in the tree.
Where it gets complex is that we not only offload the branch selection to Snowflake, but we also ask it to calculate every branch for the current depth, using a UNION statement. This saves us from traversing the tree node by node.
At the top of the tree, where there is one tree split to calculate and 5 possible attributes to assess, the query looks like this:
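Something along these lines, where the parent standard deviation, parent row count and median split points are placeholder values carried over from the first query’s results, and one UNION ALL block appears per candidate attribute (illustrative, not the exact generated SQL):

select attribute,
       49.2 - sum(weighted_sd) as sd_reduction              -- 49.2 = parent stddev (placeholder from query 1)
from (
    select 'HR' as attribute,
           stddev(CASUAL) * count(*) / 15641 as weighted_sd -- each child's stddev, weighted by its share of the 15641 parent rows (placeholder)
    from bikes_hours_training
    group by iff(HR < 12, 'below', 'above')                 -- 12 = median of HR (placeholder from query 1)
    union all
    select 'TEMP' as attribute,
           stddev(CASUAL) * count(*) / 15641 as weighted_sd
    from bikes_hours_training
    group by iff(TEMP < 0.5, 'below', 'above')              -- 0.5 = median of TEMP (placeholder)
)
group by attribute
order by sd_reduction desc;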
toward the bottom, as it evaluates every possible split for every branch, it looks like this:
Of the 9 available attributes, I’ve used the training parameters to limit it to evaluating 5 at a time.
At the first split, the query plan for the second query looks like this (the values to split each column on were provided by the first query):
and by the time you get down to some decent depth, the query plan starts to look like this:
The full code to do all of this is a bit long to post inline; here’s a link.
Before running, we’ll choose our variables. This isn’t an exercise in how to do data science, so I’ll just pick a bunch of them that look useful.
First, let’s add a score column to the dataset:
alter table bikes_hours add column score numeric
then, using Snowflake’s sample function, split the source table into training (90%):
create temporary table bikes_hours_training as
select * from bikes_hours sample (90)
…and the remaining rows (10%) for testing:
create temporary table bikes_hours_testing as
select * from bikes_hours where instant not in (select instant from bikes_hours_training)
Now we call the decision_tree_train proc to predict the ‘CASUAL’ column (number of casual riders each hour) using 9 of the columns.
call decision_tree_train('bikes_hours_training', 'CASUAL', 'season,hr,weekday,workingday,temp,atemp,weathersit,hum,holiday');
Here’s what the resulting models look like in the ml_model_runs table:
Here’s a snippet of the model object:
Scoring
To generate a prediction for a database record, we need a way to navigate down the tree using the values in its columns.
Initially I thought a User Defined Table Function (UDTF) would be perfect for this (code here), as it should leverage the parallelism you get from the Snowflake engine. I built one that drilled down through the model (as a javascript object) until it reached the leaf. In practice, the UDF time limit (about a minute?) seemed to accumulate over the table and I hit it too often.
Then I figured a decision tree could easily be converted into a mammoth CASE statement and run as plain SQL. So I wrote a stored procedure (code here) which generated the CASE statement and executed an UPDATE query against a chosen table, putting the predicted value in one of its columns.
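The generated statement looks something like this (the split points and leaf averages here are placeholders, not values from the actual trained model):

update bikes_hours_testing
set score = case
              when HR < 7 and WORKINGDAY < 1 then 4.2    -- leaf: average of the training rows in this branch
              when HR < 7 and WORKINGDAY >= 1 then 2.1
              when HR >= 7 and TEMP < 0.5 then 21.7
              else 58.3
            end;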
So we run it, passing in the decision tree json object and the name of the column to put the score in:
call decision_tree_score('bikes_hours_testing',(select model_object from ml_model_runs where run_id=1),'SCORE');
Evaluating
A common way to evaluate a regression model is Root Mean Square Error (RMSE). We can calculate it in Snowflake like so:
select sqrt(sum(square(casual-score))/count(*) ) from bikes_hours_testing
The result is 25.419340847
Considering the values range from 0 to 367, this is not too embarrassing.
Performance
Training
One of the great things about Snowflake is that query performance feedback is built right into the product. So after running my training proc, I just switch to the History tab to see a profile.
I used 9 attributes, but constrained the training to:
- max 5 attributes to compare at a time
- coefficient of variation lower limit of 10
and this yielded a binary tree with 33 leaves.
In total, it took 128 seconds to train on a medium cluster.
The first query (the summary of each node at the current depth) took about 200ms at the root node, gradually increasing to 10 seconds at the final depth of 10.
The second query (evaluation of branch split candidates for each node at each depth) started out at 1.3 seconds, gradually increased to 23.8 seconds at a depth of 7, then shrank back to 14.8 seconds at the final depth of 10 (this is because branches terminate, so the tree does not fan out all the way down).
Scoring
Scoring was, of course, almost instantaneous. A CASE statement passed to the Snowflake query engine (even one this large) is well optimised.
Conclusion
This was a fun exercise, and a good way to road test the new Stored Procedures feature. But most importantly, it works!
At the start of this experiment, I wasn’t sure whether or not it was going to perform very well, but by the end I was very excited by the results.
What next?
There are a number of improvements to be made to this basic implementation, including different splitting methods (gini, entropy) as well as better support for categorical values (instead of just splitting on the median value).
But more importantly, decision tree building forms the basis for the most popular ML algorithms for structured data (random forests, XGBoost), so there shouldn’t be any reason this approach couldn’t extend further.
The other pieces of the puzzle that could allow a data scientist to win a Kaggle competition entirely within Snowflake are:
Feature Engineering
Snowflake should excel here as well. Functions like one-hot encoding should be trivial to build with the existing function library, and bucketing of values is also supported already.
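For instance, a minimal one-hot encoding of the weathersit column could be written with IFF (the category labels are my reading of the data set’s 1-4 coding):

select iff(weathersit = 1, 1, 0) as weather_clear,
       iff(weathersit = 2, 1, 0) as weather_mist,
       iff(weathersit = 3, 1, 0) as weather_light_precip,
       iff(weathersit = 4, 1, 0) as weather_heavy_precip
from bikes_hours;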
Sampling
Stochastic sampling of datasets becomes important with tasks like building random forests. Snowflake appears to have a lot of flexibility with its built-in sampling methods, which is another plus.
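For example, both row-level and block-level sampling are already there, and block-level sampling can take a seed for repeatability:

select * from bikes_hours sample bernoulli (10);         -- ~10% of rows, sampled row by row
select * from bikes_hours sample system (10) seed (42);  -- ~10% of the table's blocks, repeatable via the seed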