Aaron Sim took first place in our recent How Much Did It Rain? II competition. The goal of the challenge was to predict a set of hourly rainfall levels from sequences of weather radar measurements. Aaron and his research lab supervisor were in the midst of developing deep learning tools for their own research when the competition was launched, and there was enough overlap in the statistical tools and datasets to make the competition a great testing ground for their approach on a new dataset. In this blog post, Aaron shares his background, his competition experience and methodology, and his biggest takeaways (hint: Kaggle competitions are anything but covert). For a more detailed technical analysis, take a look at his personal blog post on GitHub.
The Basics
What was your background prior to entering this challenge?
I am a postdoc researcher in the Theoretical Systems Biology Group at Imperial College London in the UK. My background is in theoretical physics where I worked on the geometry of string theory backgrounds (in other words, no data whatsoever). My current research involves the development of mathematical and statistical models of biological and other complex systems such as protein interaction networks and cities (lots of data!).
Profile for Aaron (aka PuPa) on Kaggle
What made you decide to enter this competition?
It was clear to me from the start that this prediction task – hourly rainfall values from variable-length, time-labelled sequences of weather radar observations – was very nearly the classic type of problem that one would not hesitate to throw a recurrent neural network at (more on this below). Since I was in the midst of applying some deep learning methods in my current research project, I saw it as a good opportunity to validate some of my ideas in a different context. That, at least, is my post hoc justification for the time spent on this competition!
Let's Get Technical
What preprocessing and supervised learning methods did you use?
I used recurrent neural networks (RNNs) exclusively, with minimal preprocessing. As alluded to above, the prediction of cumulative values (hourly rainfall) from variable-length sequences of vectors with a time component is highly reminiscent of the so-called Adding Problem in machine learning – a toy sequence regression task that is frequently employed to demonstrate the power of RNNs in learning long-term dependencies (see Le et al., Sec. 4.1, for a recent example).
Figure 1: The prediction target of 1.7 is obtained by adding up the numbers in the top row where the corresponding number in the bottom row is equal to one (i.e. the green boxes). The regression task is to infer this generative model from a training set of random sequences of arbitrary lengths and their targets.
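To make the toy task concrete, here is a minimal NumPy sketch of how Adding Problem training examples are typically generated (the function name and the convention of marking exactly two positions follow the standard formulation of the toy task; they are illustrative and not specific to the competition):

```python
import numpy as np

def make_adding_example(seq_len, rng=np.random):
    """One Adding Problem example: the target is the sum of the 'value'
    row at the positions where the 'marker' row equals one."""
    values = rng.uniform(0.0, 1.0, size=seq_len)       # top row of Figure 1
    markers = np.zeros(seq_len)                        # bottom row of Figure 1
    i, j = rng.choice(seq_len, size=2, replace=False)  # mark two positions
    markers[i] = markers[j] = 1.0
    x = np.stack([values, markers], axis=1)            # shape (seq_len, 2)
    return x, values[i] + values[j]

x, y = make_adding_example(8)  # one length-8 training pair
```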
In the rainfall prediction problem, the situation is somewhat less trivial, as there is still the additional step of inferring the rainfall numbers (top row) from radar measurements. Furthermore, instead of binary 0/1 values (bottom row), one has continuous time readings between 0 and 60 minutes that play a somewhat different role. Nevertheless, the underlying structural similarities are compelling enough to suggest that RNNs, even simple vanilla RNNs with off-the-shelf architectures (see below), would be well suited to the problem.
Figure 2: A basic RNN setup. The bottom layer represents a single input sequence of radar measurements within a single hour. Each number is the time of the measurement in minutes past the top of the hour, which is preserved as a component of the feature vector. The output is the cumulative rainfall in the hour as measured by a rain gauge.
If there's a secret sauce in my approach beyond deepening and widening the above RNN architecture, however, it would probably be the implementation of training- and test-time augmentations of the radar sequences. One common way to reduce overfitting is to augment the training set via label-preserving transformations of the data. The canonical examples are found in image classification tasks, where images are cropped and perturbed to improve the generalization capabilities of the classifier. Since this is a regression problem, it is less obvious what such augmentations should be.
My solution was to implement a form of ‘dropin’ augmentation of the datasets where I lengthened the radar measurement sequences to a single fixed length by duplicating the vectors at random time points. This is, loosely speaking, the opposite of performing dropout on the input layer, hence the name. This is illustrated in the figure below:
Figure 3: Lengthening a length-5 sequence to length-8 sequences. Each coloured box represents a vector of radar measurements. Note that the temporal order of the augmented sequence is preserved.
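In code, the augmentation amounts to sampling duplicate indices and sorting, as in this minimal NumPy sketch (the function name and the choice to sample duplicates with replacement are illustrative details rather than a faithful copy of my implementation):

```python
import numpy as np

def dropin(seq, target_len, rng=np.random):
    """Stretch a (seq_len, n_features) sequence to target_len rows by
    duplicating randomly chosen timepoints; sorting the row indices
    preserves the original temporal order."""
    n = len(seq)
    if n >= target_len:
        return seq
    extra = rng.choice(n, size=target_len - n, replace=True)
    idx = np.sort(np.concatenate([np.arange(n), extra]))
    return seq[idx]

aug = dropin(np.arange(10.0).reshape(5, 2), target_len=8)  # 5 steps -> 8
```

Note that a duplicated row carries the same minutes-past-the-hour value as the row it copies, so the time interval between the two copies is zero.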
The lengths of the sequences in both the training and test sets ranged from 1 to 19 measurements per hour. Over the course of the competition I experimented with fixed augmented sequence lengths of 19, 21, 24, and 32 timepoints. I found that stretching out the sequences beyond 21 steps was too aggressive, as the models began to underfit.
My original intention was to find a way to standardise the sequence lengths to facilitate mini-batch training. However, it soon became clear that this simple generalization of a basic padding operation could be a way to train the network to properly factor in the time intervals between observations; specifically, this is achieved by encouraging the network to ignore readings when the intervals are zero, thereby mimicking the input gate in gated variants of RNNs such as LSTM networks. To the best of my knowledge this is a novel, albeit simple, idea.
To predict each rainfall value at test time, I took the mean of 61 separate rain gauge predictions generated from different random dropin lengthenings of the radar data sequences. Implementing this procedure alone led to a huge improvement (~0.03) in the public leaderboard score, which translates roughly into a jump from 40th position to a top-ten place.
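Sketched in code, the test-time procedure looks something like the following, reusing the dropin helper from above (predict_fn stands in for a trained model and is hypothetical):

```python
import numpy as np

def predict_hourly_rainfall(predict_fn, seq, target_len=21, n_aug=61,
                            rng=np.random):
    """Average the model's outputs over n_aug independent random dropin
    lengthenings of a single radar sequence; predict_fn maps one
    (target_len, n_features) array to a scalar rainfall estimate."""
    preds = [predict_fn(dropin(seq, target_len, rng)) for _ in range(n_aug)]
    return float(np.mean(preds))
```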
I did not perform any data preprocessing beyond replacing missing components in the radar measurement vectors with zeros.
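With pandas this is essentially a one-liner (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # placeholder path to the competition data
df = df.fillna(0.0)            # missing radar components -> zeros
```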
The best architecture I found over the course of the competition was a 5-layer deep bidirectional RNN with 64 to 256 hidden units per layer, with an additional single dense layer after each hidden stack and a single linear layer at the bottom of the network to reduce the dimension of the input vectors. At the top of the network, the vector at each time position is fed into a dense layer with a single output and a ReLU non-linearity. The final output is obtained by taking the mean of the predictions from the entire top layer. This is summarised in the figure below:
Figure 4: The best performing architecture in the final ensemble. The red numbers on the right indicate the size of each layer. The output from this single model had a public leaderboard score of 23.6971, which on its own would have been good enough for 5th place in the competition.
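As a rough illustration of how such a network can be assembled from Lasagne layer classes, here is a sketch of a single bidirectional stack with the per-timestep ReLU readout and time-averaged output; the actual model stacks five such recurrent layers, each followed by a dense layer, and all sizes here are illustrative:

```python
from lasagne.layers import (InputLayer, DenseLayer, RecurrentLayer,
                            ConcatLayer, ReshapeLayer, ExpressionLayer)
from lasagne.nonlinearities import rectify, tanh

SEQ_LEN, N_FEAT, N_HID = 21, 24, 64  # illustrative sizes

# (batch, time, features): one dropin-lengthened radar sequence per row
l_in = InputLayer((None, SEQ_LEN, N_FEAT))

# single linear layer applied per timestep to reduce the input dimension
l = ReshapeLayer(l_in, (-1, N_FEAT))
l = DenseLayer(l, 16, nonlinearity=None)  # None -> identity (linear)
l = ReshapeLayer(l, (-1, SEQ_LEN, 16))

# one bidirectional vanilla-RNN stack; Lasagne's backwards=True layer
# returns its output re-reversed, so the two directions stay aligned
l_fwd = RecurrentLayer(l, N_HID, nonlinearity=tanh)
l_bwd = RecurrentLayer(l, N_HID, nonlinearity=tanh, backwards=True)
l_bi = ConcatLayer([l_fwd, l_bwd], axis=2)

# per-timestep dense readout with a single ReLU output per position...
l_out = ReshapeLayer(l_bi, (-1, 2 * N_HID))
l_out = DenseLayer(l_out, 1, nonlinearity=rectify)
l_out = ReshapeLayer(l_out, (-1, SEQ_LEN))

# ...averaged over time to give the hourly rainfall prediction
l_pred = ExpressionLayer(l_out, lambda X: X.mean(axis=1),
                         output_shape=lambda s: (s[0],))
```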
What were some of the challenges thrown up by this particular dataset?
The physical locations of the rain gauges and the calendar date and hour of their measurements were not provided for this competition. This had the somewhat unusual implication that it was difficult, if not impossible, to separate out a sufficiently independent holdout subset from the training data. This was discussed at some length in the forum.
Indeed, as the competition progressed I became increasingly suspicious of what my local validation scores were telling me, especially as I was struggling to get my models to overfit (most definitely a data science first-world problem). I began to rely on public leaderboard submissions to validate my models – yes, I was one of those people making the maximum number of submissions every day (*yikes*). This goes completely against the conventional wisdom of building a robust local cross-validation setup, holding one's nerve, and trusting it over the public leaderboard scores. I did, however, live in fear of a great leaderboard shakeup, which thankfully for me did not materialise.
Were you surprised by any of your findings?
My biggest surprise was that implementing dropout resulted in consistently poorer scores, contrary to what many others have reported both for RNNs and for other models such as CNNs. I tried many combinations, including varying the dropout percentage and implementing it only at the top or bottom of the network, all without success.
Also, LSTM networks did not appear to work any better than RNNs with standard hidden layers. Perhaps the advantages are only really apparent for much longer sequences with more complex dependencies than the ones here.
Which tools did you use?
I used Python with Theano throughout and relied heavily on the Lasagne layer classes to build the RNN architectures. Additionally, I used scikit-learn to implement the cross-validation splits, and pandas and NumPy to process and format the data and submission files. I trained the models on several NVIDIA GPUs in my lab, including two Tesla K20 cards and three M2090 cards.
Words of Wisdom
What have you taken away from this competition?
- It is impossible to hide your participation in a Kaggle competition from your partner/spouse. (“You’re doing another Kaggle competition, aren’t you?”)
- Throwing the neural network equivalent of ‘everything but the kitchen sink’ at any and every problem is almost never a bad idea.
- Don’t bother with feature engineering; the machines have won.
Do you have any advice for those just getting started in data science?
The field is moving very fast – one can’t just rely on standard statistics or machine learning textbooks (or even year-old blog posts). Most research papers are freely available on arXiv, often many months before they are properly published, so that is a good place to hang out.
Also, everything becomes a lot easier to understand once you’ve learnt how to build it. So get stuck in early and don’t worry about not understanding everything from the start.
Bio
Aaron Sim has a background in theoretical physics and is currently a postdoc researcher in the Theoretical Systems Biology Group at Imperial College London. Read more by Aaron on his GitHub blog.