Predict US stocks closing movements
Thanks to Optiver and Kaggle for hosting this great financial competition, and thanks to the great notebooks and discussions, from which I learned a lot. I am so happy to get my second solo win! 😃😀😀
My final model (CV/Private LB of 5.8117/5.4030) was a combination of CatBoost (5.8240/5.4165), GRU (5.8481/5.4259), and Transformer (5.8619/5.4296), with respective weights of 0.5, 0.3, and 0.2 searched on the validation set. These models share the same 300 features.
Besides, online learning (OL) and post-processing (PP) also played an important role in my final submission.
model name | validation set w/o PP | validation set w/ PP | test set w/o OL w/ PP | test set w/ OL one time w/ PP | test set w/ OL five times w/ PP |
---|---|---|---|---|---|
CatBoost | 5.8287 | 5.8240 | 5.4523 | 5.4291 | 5.4165 |
GRU | 5.8519 | 5.8481 | 5.4690 | 5.4368 | 5.4259 |
Transformer | 5.8614 | 5.8619 | 5.4678 | 5.4493 | 5.4296 |
GRU + Transformer | 5.8233 | 5.8220 | 5.4550 | 5.4252 | 5.4109 |
CatBoost + GRU + Transformer | 5.8142 | 5.8117 | 5.4438 | 5.4157 | 5.4030*(overtime) |
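In code form, the final blend is just a fixed weighted sum of the three models' predictions (the variable names and placeholder arrays below are illustrative, not from my actual notebook):

```python
import numpy as np

# placeholder predictions from the three models (illustrative only)
pred_catboost = np.array([0.10, -0.20])
pred_gru = np.array([0.05, -0.10])
pred_transformer = np.array([0.00, -0.30])

# weights 0.5 / 0.3 / 0.2 searched on the validation set
pred = 0.5 * pred_catboost + 0.3 * pred_gru + 0.2 * pred_transformer
```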
My validation strategy is pretty simple: train on the first 400 days and use the last 81 days as my holdout validation set. The CV score aligned with the leaderboard score very well, which made me believe that this competition wouldn't shake up too much, so I just focused on improving CV most of the time.
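As a minimal sketch of that split, assuming date_id runs from 0 to 480 (400 + 81 days; the file path is a placeholder):

```python
import polars as pl

df = pl.read_parquet('train.parquet')           # placeholder path
train_df = df.filter(pl.col('date_id') < 400)   # first 400 days
valid_df = df.filter(pl.col('date_id') >= 400)  # last 81 days, holdout validation
```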
My models use 300 features in the end. Most of them are commonly used, such as raw prices, mid prices, imbalance features, rolling features, and historical target features.
I will introduce some features that were really helpful and that other teams haven't shared yet.
1 agg features based on seconds_in_bucket_group
# split each day's 55 snapshots into three regimes: [0, 300), [300, 480), [480, end]
pl.when(pl.col('seconds_in_bucket') < 300).then(0).when(pl.col('seconds_in_bucket') < 480).then(1).otherwise(2).cast(pl.Float32).alias('seconds_in_bucket_group'),
# ratio of the group's first value to the current value, per stock per day
*[(pl.col(col).first() / pl.col(col)).over(['date_id', 'seconds_in_bucket_group', 'stock_id']).cast(pl.Float32).alias('{}_group_first_ratio'.format(col)) for col in base_features],
# ratio of the group's running mean (window 100 with min_periods=1, effectively expanding) to the current value
*[(pl.col(col).rolling_mean(100, min_periods=1) / pl.col(col)).over(['date_id', 'seconds_in_bucket_group', 'stock_id']).cast(pl.Float32).alias('{}_group_expanding_mean{}'.format(col, 100)) for col in base_features]
2 rank features grouped by seconds_in_bucket
# ratio of the cross-sectional mean (all stocks at the same snapshot) to the current value
*[(pl.col(col).mean() / pl.col(col)).over(['date_id', 'seconds_in_bucket']).cast(pl.Float32).alias('{}_seconds_in_bucket_group_mean_ratio'.format(col)) for col in base_features],
# cross-sectional rank of each stock's value at the same snapshot, normalized by the number of stocks
*[(pl.col(col).rank(descending=True, method='ordinal') / pl.col(col).count()).over(['date_id', 'seconds_in_bucket']).cast(pl.Float32).alias('{}_seconds_in_bucket_group_rank'.format(col)) for col in base_features],
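For context, the snippets above are polars expressions meant to be spliced into a with_columns call. A self-contained sketch of how that wiring might look, with a toy dataframe and a single base feature standing in for the real data and full feature list:

```python
import polars as pl

# toy stand-in for the raw competition data (illustrative only)
df = pl.DataFrame({
    'date_id': [0, 0, 0, 0],
    'seconds_in_bucket': [0, 10, 0, 10],
    'stock_id': [0, 0, 1, 1],
    'wap': [1.0, 1.01, 0.99, 1.02],
})
base_features = ['wap']

df = df.with_columns(
    # bucket the snapshots into the three groups first...
    pl.when(pl.col('seconds_in_bucket') < 300).then(0)
      .when(pl.col('seconds_in_bucket') < 480).then(1)
      .otherwise(2).cast(pl.Float32).alias('seconds_in_bucket_group'),
).with_columns(
    # ...then the group-relative features (section 1)
    *[(pl.col(col).first() / pl.col(col))
        .over(['date_id', 'seconds_in_bucket_group', 'stock_id'])
        .cast(pl.Float32).alias('{}_group_first_ratio'.format(col)) for col in base_features],
    # ...and the cross-sectional rank features (section 2)
    *[(pl.col(col).rank(descending=True, method='ordinal') / pl.col(col).count())
        .over(['date_id', 'seconds_in_bucket'])
        .cast(pl.Float32).alias('{}_seconds_in_bucket_group_rank'.format(col)) for col in base_features],
)
print(df)
```

The two-step with_columns matters: seconds_in_bucket_group has to exist before it can be used as an over() key.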
Feature selection is important because we have to avoid memory errors and run as many rounds of online training as possible.
I just chose the top 300 features by the CatBoost model's feature importance.
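A sketch of that selection step on placeholder data (the real pipeline trains on the competition features; everything here is illustrative):

```python
import numpy as np
from catboost import CatBoostRegressor

# placeholder training data: 500 named features, 1000 rows
rng = np.random.default_rng(0)
feature_names = ['f{}'.format(i) for i in range(500)]
X = rng.normal(size=(1000, 500)).astype(np.float32)
y = 2.0 * X[:, 0] + rng.normal(size=1000)

model = CatBoostRegressor(iterations=100, verbose=False)
model.fit(X, y)

# rank features by CatBoost importance and keep the top 300
importances = model.get_feature_importance()
top_idx = np.argsort(importances)[::-1][:300]
top_features = [feature_names[i] for i in top_idx]
```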
3 model
GRU input tensor's shape is (batch_size, 55 time steps, dense_feature_dim), followed by 4 GRU layers; the output tensor's shape is (batch_size, 55 time steps). Transformer input tensor's shape is (batch_size, 200 stocks, dense_feature_dim), followed by 4 transformer encoder layers; the output tensor's shape is (batch_size, 200 stocks). The outputs are demeaned across dim 1:
out = out - out.mean(1, keepdim=True)
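As illustration only, a minimal PyTorch sketch of a prediction head with this demeaning, under my assumption that out holds per-stock predictions of shape (batch, n_stocks) (not the author's exact code):

```python
import torch
import torch.nn as nn

class DemeanHead(nn.Module):
    """Linear head whose outputs are demeaned across the stock dimension."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fc = nn.Linear(d_model, 1)

    def forward(self, x):              # x: (batch, n_stocks, d_model)
        out = self.fc(x).squeeze(-1)   # (batch, n_stocks)
        # subtract the cross-sectional mean so each sample's predictions are zero-mean
        return out - out.mean(1, keepdim=True)
```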
4 sample weight
5 online learning
I retrained my models every 12 days during the prediction phase, 5 times in total.
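A rough sketch of such a schedule (maybe_retrain, history, and the fit call are placeholders, not my actual submission code):

```python
RETRAIN_EVERY = 12   # revealed test days between refits
MAX_RETRAINS = 5     # total refits allowed by the time budget

def maybe_retrain(models, history, day_count, retrains_done):
    """Refit every model on all data seen so far, once every RETRAIN_EVERY days.

    history is a placeholder object holding all features/targets revealed so far.
    """
    if day_count > 0 and day_count % RETRAIN_EVERY == 0 and retrains_done < MAX_RETRAINS:
        for m in models:
            m.fit(history.features, history.targets)  # placeholder fit call
        retrains_done += 1
    return retrains_done
```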
I think most teams can only use up to 200 features when training GBDT if the online training strategy is adopted, because concatenating historical data with the online data doubles memory consumption.
The data loading trick below can greatly raise this limit: save the training data as one file per day, and load it back day by day.
data loading trick
import numpy as np
import h5py
from tqdm import tqdm

def load_numpy_data(meta_data, features):
    # pre-allocate the full array once, then fill it day by day;
    # this avoids the 2x peak memory of concatenating per-day arrays
    res = np.empty((len(meta_data), len(features)), dtype=np.float32)
    all_date_id = sorted(meta_data['date_id'].unique())
    data_index = 0
    for date_id in tqdm(all_date_id):
        with h5py.File('/path/to/{}.h5'.format(date_id), 'r') as f:
            tmp = np.array(f['data']['features'], dtype=np.float32)
        res[data_index:data_index + len(tmp), :] = tmp
        data_index += len(tmp)
    return res
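For completeness, a hedged sketch of the matching writer that produces the one-file-per-day layout the loader expects (df here is a pandas dataframe placeholder; the dataset path mirrors the loader above):

```python
import h5py
import numpy as np

def save_daily_files(df, features, out_dir):
    """Write one HDF5 file per date_id containing a data/features dataset."""
    for date_id, day_df in df.groupby('date_id'):
        arr = day_df[features].to_numpy(dtype=np.float32)
        with h5py.File('{}/{}.h5'.format(out_dir, date_id), 'w') as f:
            # the intermediate group 'data' is created automatically
            f.create_dataset('data/features', data=arr)
```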
Actually, my best submission ran overtime at the last update: I skip online training once total inference time exceeds a certain threshold, so only 4 online training updates happened in total. I estimate the best score would have been around 5.400 without the timeout.
Anyway, I am really lucky!
post processing
Subtracting the weighted mean is better than subtracting the simple mean, as the metric definition already tells us.
# map each stock to its weight from the metric, then subtract the weighted cross-sectional mean
test_df['stock_weights'] = test_df['stock_id'].map(stock_weights)
test_df['target'] = test_df['target'] - (test_df['target'] * test_df['stock_weights']).sum() / test_df['stock_weights'].sum()
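The reasoning, in my notation rather than the official one: the target is defined against a weighted index, so the weighted mean of the true targets across stocks should be close to zero, and the code above enforces the same constraint on the predictions $\hat{y}_i$ with weights $w_i$:

$$\hat{y}_i \leftarrow \hat{y}_i - \frac{\sum_j w_j \hat{y}_j}{\sum_j w_j}$$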
Thank you all!
Posted a year ago
· 23rd in this Competition
Thank you for sharing your great ideas! If I understand correctly, the GRU model is designed to capture time-series dynamics, while the Transformer is for the cross-sectional dynamics, and thus the data feeding process is different for these two. Please feel free to correct me if my interpretation is off. Thanks once again!
Posted a year ago
· 45th in this Competition
Thanks for sharing, and congrats again on the result! I am a little bit confused about the GRU and Transformer models' input and output dimensions. I would appreciate it a lot if you could elaborate on the points below.
> GRU input tensor's shape is (batch_size, 55 time steps, dense_feature_dim), followed by 4 GRU layers; output tensor's shape is (batch_size, 55 time steps). Transformer input tensor's shape is (batch_size, 200 stocks, dense_feature_dim), followed by 4 transformer encoder layers; output tensor's shape is (batch_size, 200 stocks).
For the GRU, I am a little bit confused why the output shape is (batch_size, 55 time steps). I think usually we just use the last timestep (x = x[:, -1, :]), so the dimension will be (batch_size, d), and then it becomes (batch_size, 1) with some fully connected layer?
For the Transformer, if the input tensor's shape is (batch_size, 200 stocks, dense_feature_dim), will the label still be the target for each stock at each timestamp, or do you predict the targets for all 200 stocks together?
Posted a year ago
Congratulations! @hydantess
What are these operators: `over`, `when`, `then`, `otherwise`, `cast`, `alias`? Are they from a third-party package or functions that you have designed yourself?
I actually have some trouble understanding the function `over`. It looks like `groupby` in pandas. Can you show your magic features in LaTeX?
Posted a year ago
Congratulations! @hydantess
I have two questions:
Posted a year ago
I am a novice who has just started with Kaggle, and I noticed that many competitions necessitate the use of an Online Learning Strategy. Could you explain the principles of this strategy or how to learn and apply it?
Posted a year ago
· 1st in this Competition
It means retraining the models on the additional test data revealed during the prediction phase.
Posted a year ago
· 106th in this Competition
Thanks a lot for sharing, and congrats on the victory. Any chance you can share the winning code? Also, why is the output of the GRU of size (batch_size, time steps)? Did you do seq2seq modelling? I thought the output would be (batch_size, 1), i.e. predicting the target for each data point?
Posted a year ago
· 1st in this Competition
Sorry, I don't plan to share my code.
Yes, it's a seq2seq model, but not bidirectional, to avoid label leakage.
Posted a year ago
Thanks for sharing your solution, it is a great one indeed. However, I am curious why you didn't use a meta model to combine and ensemble your models. I think in the energy behaviour prediction competition you didn't use a meta model either.
Posted a year ago
· 1st in this Competition
Thanks for your suggestion. I am not familiar with meta models. What does this mean? Stacking or something else?
Posted a year ago
· 1st in this Competition
I see what you mean. It's very time-consuming to train a second-level model, and in fact it doesn't help much. So I just used a weighted sum.
Posted a year ago
· 14th in this Competition
Thanks for sharing!
The weighted post-processing improved our selected sub from 5.4457 to 5.4405, good for 11th place. This was the magic PP; zero-sum and zero-mean did not make much difference to our subs. We couldn't make it work during the competition 😓 we should have spent more time trying to fix it.
Posted a year ago
· 3110th in this Competition
Should the GRU be trained and its states reset every day?
Posted a year ago
· 1st in this Competition
The states are just zero-initialized at the beginning of every day.
Posted a year ago
Congratulations, and thanks for sharing your great idea. I'm kind of new to ML; if I got it correctly, the input data for the Transformer part of your model was different from the input data of the GRU or the CatBoost, right?
And what do you mean by "GBDT" in "predict target bucket mean by GBDT"?
Thanks
Posted a year ago
· 1st in this Competition
Thanks~
Posted a year ago
· 172nd in this Competition
Congratulations @hydantess, you blew it away! Thanks for sharing the score differences from the PP/OL additions too, very informative.
Relocating to Netherlands/Optiver next step? 😅
Posted a year ago
Congratulations on your win! I did not have the chance to participate in the competition (I got into Kaggle after it closed), but I am interested in keeping up with the solutions.
> My final model (CV/Private LB of 5.8117/5.4030) was a combination of CatBoost (5.8240/5.4165), GRU (5.8481/5.4259), and Transformer (5.8619/5.4296), with respective weights of 0.5, 0.3, and 0.2 searched on the validation set. And these models share the same 300 features.
By Transformer, do you mean Transformer models that are specific to tabular data, namely FT-Transformer or TabTransformer?