
Optiver - Trading at the Close

Predict US stocks closing movements


hyd · 1st in this Competition · Posted a year ago
This post earned a gold medal

1st place solution

Thanks to Optiver and Kaggle for hosting this great financial competition. And thanks to the
great notebooks and discussions, I learned a lot. I am so happy to win my second solo win! 😃😀😀

Overview

My final model (CV/Private LB of 5.8117/5.4030) was a combination of CatBoost (5.8240/5.4165), GRU (5.8481/5.4259), and Transformer (5.8619/5.4296), with respective weights of 0.5, 0.3, and 0.2 searched on the validation set. These models share the same 300 features.
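In code, the blend is just a weighted sum of the per-model predictions (a minimal sketch; the function and argument names here are only illustrative, not from my training code):

import numpy as np

def blend(pred_catboost, pred_gru, pred_transformer):
    # weights 0.5 / 0.3 / 0.2 were searched on the holdout validation set
    return (0.5 * np.asarray(pred_catboost)
            + 0.3 * np.asarray(pred_gru)
            + 0.2 * np.asarray(pred_transformer))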

Besides, online learning (OL) and post-processing (PP) also played an important role in my final submission.

| model name | validation (w/o PP) | validation (w/ PP) | test (w/o OL, w/ PP) | test (OL once, w/ PP) | test (OL five times, w/ PP) |
| --- | --- | --- | --- | --- | --- |
| CatBoost | 5.8287 | 5.8240 | 5.4523 | 5.4291 | 5.4165 |
| GRU | 5.8519 | 5.8481 | 5.4690 | 5.4368 | 5.4259 |
| Transformer | 5.8614 | 5.8619 | 5.4678 | 5.4493 | 5.4296 |
| GRU + Transformer | 5.8233 | 5.8220 | 5.4550 | 5.4252 | 5.4109 |
| CatBoost + GRU + Transformer | 5.8142 | 5.8117 | 5.4438 | 5.4157 | 5.4030* (overtime) |

Validation Strategy

My validation strategy is pretty simple: train on the first 400 days and use the last 81 days as my holdout validation set. The CV score aligns with the leaderboard score very well, which made me believe that this competition wouldn't shake up too much. So I just focused on improving CV most of the time.
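In code, the split looks roughly like this (a minimal sketch, assuming the full training frame is already loaded as a polars DataFrame named df):

import polars as pl

# first 400 days for training, the remaining 81 days as the holdout validation set
train_df = df.filter(pl.col('date_id') < 400)
valid_df = df.filter(pl.col('date_id') >= 400)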

Magic Features

My models have 300 features in the end. Most of them are commonly used, such as raw prices, mid price, imbalance features, rolling features, and historical target features.
I will introduce some features that were really helpful and that other teams haven't shared yet.
1 agg features based on seconds_in_bucket_group

pl.when(pl.col('seconds_in_bucket') < 300).then(0).when(pl.col('seconds_in_bucket') < 480).then(1).otherwise(2).cast(pl.Float32).alias('seconds_in_bucket_group'),
 *[(pl.col(col).first() / pl.col(col)).over(['date_id', 'seconds_in_bucket_group', 'stock_id']).cast(pl.Float32).alias('{}_group_first_ratio'.format(col)) for col in base_features],
 *[(pl.col(col).rolling_mean(100, min_periods=1) / pl.col(col)).over(['date_id', 'seconds_in_bucket_group', 'stock_id']).cast(pl.Float32).alias('{}_group_expanding_mean{}'.format(col, 100)) for col in base_features]

2 rank features grouped by seconds_in_bucket

 *[(pl.col(col).mean() / pl.col(col)).over(['date_id', 'seconds_in_bucket']).cast(pl.Float32).alias('{}_seconds_in_bucket_group_mean_ratio'.format(col)) for col in base_features],
 *[(pl.col(col).rank(descending=True,method='ordinal') / pl.col(col).count()).over(['date_id', 'seconds_in_bucket']).cast(pl.Float32).alias('{}_seconds_in_bucket_group_rank'.format(col)) for col in base_features],
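These snippets are expression fragments; here is a minimal usage sketch of how they plug into polars (df and base_features are assumptions here: df is the competition frame, base_features a list of raw columns; the group column is created in a separate with_columns call so the window expressions can see it):

import polars as pl

base_features = ['wap', 'bid_price', 'ask_price']  # illustrative subset only

df = df.with_columns(
    pl.when(pl.col('seconds_in_bucket') < 300).then(0)
      .when(pl.col('seconds_in_bucket') < 480).then(1)
      .otherwise(2).cast(pl.Float32).alias('seconds_in_bucket_group')
).with_columns(
    # ratio of each column to its first value within the seconds_in_bucket_group
    *[(pl.col(col).first() / pl.col(col))
        .over(['date_id', 'seconds_in_bucket_group', 'stock_id'])
        .cast(pl.Float32).alias('{}_group_first_ratio'.format(col)) for col in base_features],
    # cross-sectional rank of each column among all stocks at the same second
    *[(pl.col(col).rank(descending=True, method='ordinal') / pl.col(col).count())
        .over(['date_id', 'seconds_in_bucket'])
        .cast(pl.Float32).alias('{}_seconds_in_bucket_group_rank'.format(col)) for col in base_features],
)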

Feature Selection

Feature selection is important because we have to avoid memory errors and run as many rounds of online training as possible.
I just chose the top 300 features by the CatBoost model's feature importance.
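A minimal sketch of that selection step (assuming model is an already-fitted CatBoost model and feature_names lists its training columns in the same order):

import numpy as np

importances = model.get_feature_importance()   # one score per training column
order = np.argsort(importances)[::-1]          # most important first
top_300_features = [feature_names[i] for i in order[:300]]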

Model

  1. Nothing to say about CatBoost as usual, just simply train and predict.
  2. GRU: the input tensor's shape is (batch_size, 55 time steps, dense_feature_dim), followed by 4 GRU layers; the output tensor's shape is (batch_size, 55 time steps).
  3. Transformer: the input tensor's shape is (batch_size, 200 stocks, dense_feature_dim), followed by 4 transformer encoder layers; the output tensor's shape is (batch_size, 200 stocks). A small trick that makes the output zero-mean is helpful (a minimal sketch of both network heads follows this list):
out = out - out.mean(1, keepdim=True)

  4. Sample weight.
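Here is a minimal sketch of the two network heads (hidden sizes, number of attention heads, and layer details are illustrative assumptions, not my exact configuration):

import torch
import torch.nn as nn

class GRUHead(nn.Module):
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        # unidirectional GRU over the 55 time steps of one day (bidirectional would leak future labels)
        self.gru = nn.GRU(feat_dim, hidden, num_layers=4, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch_size, 55, feat_dim)
        out, _ = self.gru(x)                 # (batch_size, 55, hidden)
        return self.fc(out).squeeze(-1)      # (batch_size, 55): one prediction per time step

class TransformerHead(nn.Module):
    def __init__(self, feat_dim, d_model=128, nhead=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.fc = nn.Linear(d_model, 1)

    def forward(self, x):                    # x: (batch_size, 200 stocks, feat_dim)
        out = self.fc(self.encoder(self.proj(x))).squeeze(-1)  # (batch_size, 200)
        return out - out.mean(1, keepdim=True)                 # the zero-mean trick across stocks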

Online Learning Strategy

I retrain my models every 12 days, 5 times in total.
I think most teams can only use up to 200 features when training a GBDT if an online training strategy is adopted, because concatenating the historical data with the online data doubles memory consumption.
The data loading trick can greatly raise this limit: save the training data as one file per day and load it day by day (a sketch of the saving side follows).
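A minimal sketch of the saving side (the path and HDF5 layout here are assumptions chosen to mirror the loader below):

import h5py
import numpy as np

def save_day(date_id, day_features):
    # one HDF5 file per day, so online training can stream days instead of
    # holding the full concatenated history in memory at once
    with h5py.File('/path/to/{}.h5'.format(date_id), 'w') as f:
        f.create_dataset('data/features', data=np.asarray(day_features, dtype=np.float32))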

data loading trick

import h5py
import numpy as np
from tqdm import tqdm

def load_numpy_data(meta_data, features):
    # pre-allocate the full feature matrix once to avoid a second in-memory copy
    res = np.empty((len(meta_data), len(features)), dtype=np.float32)
    all_date_id = sorted(meta_data['date_id'].unique())
    data_index = 0
    for date_id in tqdm(all_date_id):
        # each day's features live in their own HDF5 file
        with h5py.File('/path/to/{}.h5'.format(date_id), 'r') as f:
            tmp = np.array(f['data']['features'], dtype=np.float32)
        res[data_index:data_index + len(tmp), :] = tmp
        data_index += len(tmp)
    return res

Actually, my best submission ran over the time limit at the last update: I skip online training once the total inference time exceeds a certain threshold, so only 4 online training updates ran in total. I estimate the best score would have been around 5.400 without the overtime.
Anyway, I am really lucky!

Post Processing

Subtracting the weighted mean is better than subtracting the simple mean, as the metric definition already suggests.

# zero-center the predictions using the index weights from the metric
test_df['stock_weights'] = test_df['stock_id'].map(stock_weights)
test_df['target'] = test_df['target'] - (test_df['target'] * test_df['stock_weights']).sum() / test_df['stock_weights'].sum()

What did not work for me

  1. Ensembling with a 1D-CNN or MLP.
  2. Multi-day input instead of single-day input for the GRU models.
  3. Larger transformers, e.g. DeBERTa.
  4. Predicting the target bucket mean with GBDT.

Thank you all!


Posted a year ago

This post earned a silver medal

Excuse me for another question: every 12 days you re-train the CatBoost from scratch and fine-tune the two NNs, am I right?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal

Right. ~~~~

Posted a year ago

· 23rd in this Competition

This post earned a silver medal

Thank you for sharing your great ideas! If I understand correctly, the GRU model is designed to capture time-series dynamics, while the Transformer is for the cross-sectional dynamics, and thus the data feeding process is different for these two. Please feel free to correct me if my interpretation is off. Thanks once again!

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal

You are right!

Posted a year ago

This post earned a bronze medal

Awesome sharing @hydantess

Posted a year ago

This post earned a bronze medal

Thank you for sharing and congrats on the result! Awesome work @hydantess!

Posted a year ago

· 45th in this Competition

This post earned a bronze medal

Thanks for sharing and congrats again on the result! I am a little bit confused about the GRU and Transformer input and output dimensions in your model. I would appreciate it a lot if you could elaborate on the points below.

GRU input tensor's shape is (batch_size, 55 time steps, dense_feature_dim), followed by 4 layers GRU, output tensor's shape is (batch_size, 55 time steps). Transformer input tensor's shape is (batch_size, 200 stocks, dense_feature_dim), followed by 4 layers transformer encoder layers, output tensor's shape is (batch_size, 200 stocks).

  1. For the GRU, I am a little bit confused why the output shape is (batch_size, 55 time steps)? I think usually we just use the last timestep (x = x[:, -1, :]), so the dimension will be (batch_size, d), and then it becomes (batch_size, 1) with some fully connected layer?

  2. For the Transformer, if the input tensor's shape is (batch_size, 200 stocks, dense_feature_dim), will the label still be the target for each stock at each timestamp, or do you predict the targets for all 200 stocks together?

Posted a year ago

· 106th in this Competition

This post earned a bronze medal

Yes, I tried to ask the same question. Apparently some seq2seq modelling was used, but I am personally still a bit confused (how many targets were used for the seq2seq?).

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal
  1. You're right. I take the last timestep at inference.
  2. Yes.

Posted a year ago

This post earned a bronze medal

Congratulations! @hydantess

  1. What are these operators: over, when, then, otherwise, cast, alias? Are they from a third-party package, or functions that you have designed yourself?

  2. I actually have some trouble understanding the function over. It looks like groupby in pandas. Can you show your magic features in LaTeX?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal
  1. This is polars.
  2. polars over == pandas groupby().transform()
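For example, these two produce the same group-wise column (a tiny sketch with made-up columns g and x):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({'g': [1, 1, 2], 'x': [1.0, 2.0, 3.0]})

# pandas: group-wise transform keeps the original row count
pdf['x_mean'] = pdf.groupby('g')['x'].transform('mean')

# polars: the same result via a window expression
pl.DataFrame({'g': [1, 1, 2], 'x': [1.0, 2.0, 3.0]}).with_columns(
    pl.col('x').mean().over('g').alias('x_mean')
)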

Posted a year ago

· 57th in this Competition

This post earned a bronze medal

Does this help a lot in preventing overtime during submission?

Posted a year ago

This post earned a bronze medal

Congratulations! @hydantess
I have two questions:

  1. How can we make the model engage in online learning during the prediction phase when there is no target?
  2. Why doesn't the transformer output a sequence like the GRU does (batch_size, 55), instead of directly predicting the targets of all 200 stocks at that moment?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal
  1. We can receive historical target info during the inference phase; you can refer to this demo.
  2. I want the transformer to learn info across different stocks and the GRU to learn sequence info. You can of course combine these two modules in one model.

Posted a year ago

This post earned a bronze medal

Thank you for sharing, it's amazing to see unique techniques. @hydantess

Posted a year ago

· 54th in this Competition

This post earned a bronze medal

Congrats and thank you for sharing this!!

Posted a year ago

This post earned a bronze medal

I am a novice who has just started with Kaggle, and I noticed that many competitions necessitate the use of an Online Learning Strategy. Could you explain the principles of this strategy or how to learn and apply it?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

It means retraining models on additional test data during the prediction phase.

Posted a year ago

This post earned a bronze medal

Congratulations!

Posted a year ago

· 106th in this Competition

This post earned a bronze medal

Thanks a lot for sharing and congrats on the victory. Any chance you can share the winning code? Also, why is the output of the GRU of size (batch_size, time steps)? Did you do seq2seq modelling? I thought the output would be (batch_size, 1) (predict the target for each data point)?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal

Sorry, I don't plan to share my code.
Yes, it's a seq2seq model, but not bidirectional, to avoid label leakage.

Posted a year ago

This post earned a bronze medal

Congratulations man, I hope you will take first place in the Kaggle world ranking next month when the Enefit competition ends. Can you share your notebook?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal

Sorry, I don't plan to share my code.😂

Posted a year ago

This post earned a bronze medal

Thanks for sharing your solution, it is a great one indeed. However, I am curious why you don't use a meta model to combine and ensemble your models. I think you didn't use a meta model in the energy behaviour prediction competition either.

hyd

Topic Author

Posted a year ago

· 1st in this Competition

Thanks for your suggestion. I am not familiar with meta models. What does this mean? Stacking, or something else?

Posted a year ago

It can be done in countless ways, but in short it is when you use the predictions of a model or group of models as features for a single model that makes the final prediction; this model is the meta model.

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal

I know what you mean. It's very time-consuming to train a second-level model, and in fact it doesn't help much, so I just use a weighted sum.

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

Thanks for sharing!

The weighted postprocessing improved our selected sub from 5.4457 to 5.4405, good for 11th place. This was the magic PP; zero-sum and zero-mean did not make much difference to our subs. We couldn't make it work during the competition 😓 should have spent more time trying to fix it.

Posted a year ago

Can I ask what you mean by "sub from 5.4457 to 5.4405"?

Posted a year ago

· 14th in this Competition

Our selected competition submission scored 5.4457 without any postprocessing.
Adding the weighted-mean postprocessing improved it from 5.4457 to 5.4405.

Posted a year ago

This post earned a bronze medal

Great sir keep it up!

Posted a year ago

· 3110th in this Competition

This post earned a bronze medal

Should the GRU be trained and its states reset every day?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

Just zero-initialized at the beginning of every day.

Posted a year ago

· 108th in this Competition

This post earned a bronze medal

congratulations and thanks a lot for sharing

Posted a year ago

This post earned a bronze medal

Why did you choose 300 features, and not 200 or 400?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a bronze medal

Based on the CV score, 300 is better than 200.
400 would run into memory issues.

Posted a year ago

This post earned a bronze medal

Congratulations, and thanks for sharing your great idea. I'm kind of new to ML, and if I understand correctly, the input data for the Transformer part of your model was different from the input data of the GRU or the CatBoost, right?
And what do you mean by "GBDT" in "predict target bucket mean by GBDT"?
Thanks

hyd

Topic Author

Posted a year ago

· 1st in this Competition

Thanks~

  1. Yes
  2. It means Gradient Boosting Decision Tree.

Posted a year ago

This post earned a bronze medal

Congrats @hydantess on taking the top spot!

Posted a year ago

· 172nd in this Competition

This post earned a bronze medal

Congratulations @hydantess, you blew it away! Thanks for sharing the score differences between the PP/OL additions too, very informative.

Relocating to Netherlands/Optiver next step? 😅

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a silver medal

I love China. 😀

Posted a year ago

This post earned a bronze medal

Very informative

Posted a year ago

This post earned a bronze medal

Congratulations on your win! I did not have the chance to participate in the competition (I got into Kaggle after the competition closed), however I am interested in keeping up with the solutions.

My final model (CV/Private LB of 5.8117/5.4030) was a combination of CatBoost (5.8240/5.4165), GRU (5.8481/5.4259), and Transformer (5.8619/5.4296), with respective weights of 0.5, 0.3, and 0.2 searched on the validation set. These models share the same 300 features.

By Transformer, do you mean Transformer models specific to tabular data, such as FT-Transformer or TabTransformer?

hyd

Topic Author

Posted a year ago

· 1st in this Competition

This post earned a silver medal

No, it's torch.nn.TransformerEncoder ~

Posted a year ago

This post earned a bronze medal

Congratulations and continued success @hydantess