Optiver · Featured Code Competition · a year ago

Optiver - Trading at the Close

Predict US stocks closing movements

Daniel FG · 6th in this Competition · Posted a year ago
This post earned a gold medal

6th Place Solution

Thank you to Optiver and Kaggle for hosting this competition. This time I wanted to gain hands-on experience applying deep learning to trading and time-series data, so I didn't focus on extensive feature engineering or gradient-boosted tree models.

  • Data Preprocessing: Zero imputation to handle missing values, and standard scaling to normalize the features.

  • Feature Engineering: A total of 35-36 features for the models, including the raw input features, binary flags derived from the 'seconds_in_bucket' variable, and additional features borrowed from public notebooks such as volume, mid price, and various imbalance measures (liquidity, matched, size, pairwise, harmonic).

  • Modeling Approach:

    • Sequence-to-sequence transformers (3 slightly varying models): dimensionality 64, an encoder covering 2 days of historical data, a decoder with 4 stacked transformer layers, and a head with a simple linear output layer. Both encoder and decoder use stock-wise (attention) layers; a sketch of such a layer is shown after this list.
    • GRU (1 model): a similar seq2seq architecture (decoder-only), dimensionality 128, a decoder with 2 GRU layers, and a head consisting of 2 fully-connected layers. Both decoder and head use stock-wise (attention) layers.

    All models produced outputs of shape (batch_size, number_stocks, 55). To leverage the time-series nature of the competition with revealed targets, online incremental learning was performed: the models were updated each day using only the newly revealed, previously unseen data (for the decoder). A rough sketch of the daily update loop is shown after this list.

  • Validation Strategy: A simple time-based split was used, with the first 359 days for training and the last 121 days for validation. Because the evaluation metric was unstable from epoch to epoch, an exponential moving average was used to smooth the values when comparing models (see the smoothing sketch below). To assess online incremental learning, models were also validated on the latest 20 days of data.

  • Postprocessing: All models except one were trained with an additional constraint enforcing that the sum of the model outputs is zero (a possible implementation is sketched below).

  • Ensembling: The final ensemble consisted of an average of predictions from the 3 transformer models and 1 GRU model.

  • Final Results: Final submission placed 6th on the private leaderboard with a mean absolute error (MAE) of 5.4285.
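
To make the stock-wise (cross-sectional) attention idea concrete, below is a minimal sketch of one way such a layer could look. PyTorch, the head count, and the residual/norm placement are illustrative assumptions, not the author's actual code; the essential point is that attention runs across the stock dimension at each time step rather than along the time axis.

import torch
import torch.nn as nn

class StockWiseAttention(nn.Module):
    """Self-attention over the stock dimension, applied independently at each time step."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_stocks, seq_len, dim)
        b, s, t, d = x.shape
        # Fold time into the batch so the attention tokens are the stocks.
        x_ = x.permute(0, 2, 1, 3).reshape(b * t, s, d)
        out, _ = self.attn(x_, x_, x_)
        out = self.norm(x_ + out)  # residual connection + layer norm
        return out.reshape(b, t, s, d).permute(0, 2, 1, 3)

Because the token axis is the ~200 stocks rather than the full sequence, the attention matrices stay small, which also speaks to the GPU-memory question raised in the comments below.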
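
Similarly, a rough sketch of the daily online update described in the modeling bullet: once a day's targets are revealed, the model is fine-tuned on that day's data alone before predicting the next day. The optimizer, loss, and number of steps here are illustrative assumptions.

import torch
import torch.nn.functional as F

def online_daily_update(model, optimizer, day_features, day_targets, n_steps=1):
    # day_features / day_targets: tensors for a single newly revealed day,
    # e.g. targets of shape (num_stocks, 55)
    model.train()
    for _ in range(n_steps):
        optimizer.zero_grad()
        preds = model(day_features)
        loss = F.l1_loss(preds, day_targets)  # MAE, the competition metric
        loss.backward()
        optimizer.step()
    return model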

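The exponential moving average used to smooth the noisy per-epoch validation metric is also easy to reproduce; a small sketch, with the smoothing factor being an assumption:

def ema_smooth(values, alpha=0.3):
    # Exponentially smooth a noisy per-epoch metric so model comparisons are more stable.
    smoothed, ema = [], None
    for v in values:
        ema = v if ema is None else alpha * v + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed

# e.g. compare models on ema_smooth(val_mae_per_epoch)[-1] rather than the raw last value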
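
As for the zero-sum constraint in the postprocessing bullet, one simple way to impose it is to de-mean the outputs across the stock dimension at each step; this is one possible reading (with equal stock weights assumed), not necessarily the author's exact implementation:

import torch

def zero_sum(preds: torch.Tensor) -> torch.Tensor:
    # preds: (batch, num_stocks, horizon); shift so the outputs sum to zero across stocks
    return preds - preds.mean(dim=1, keepdim=True)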

20 Comments

Posted a year ago

· 5th in this Competition

This post earned a bronze medal

Really impressive stuff - well done on the result. Is it at all possible to see the model code for either of the models? Would be really interested.

Daniel FG

Topic Author

Posted a year ago

· 6th in this Competition

This post earned a bronze medal

Thank you, and congrats to your team for an exceptional 5th position. I require some time to prepare the code for sharing, but I'm more than happy to address any specific questions you might have in the meantime.

Posted a year ago

· 344th in this Competition

This post earned a bronze medal

Congratulations on 6th rank in this competition. Thanks for sharing the insights of your solution.

Posted a year ago

This post earned a bronze medal

Congrats @danielfg for 6th rank in the competition.

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

Great work @danielfg
Congratulations on the prize-winning solution and best wishes for the future.
I appreciate the different approach and the complete reliance on deep learning models rather than the typical boosted-tree solutions for such problems.

Daniel FG

Topic Author

Posted a year ago

· 6th in this Competition

Thank you, and congratulations to your team too.

Posted a year ago

This post earned a bronze medal

Congratulations

Posted a year ago

· 30th in this Competition

This post earned a bronze medal

Congrats! Very cool DL approach. Would you like to share the code for your solution?

Posted a year ago

· 443rd in this Competition

Congratulations!

Posted a year ago

· 106th in this Competition

Hi Daniel, I personally think what you have done here is amazing. Achieving 6th place using fewer than 40 features… Had you used ~100ish or more, you would probably have won the competition. Anyway, please post the modelling part of the code if/when you find time, if it is still something that you intend to do. I, as well as many other fellow Kagglers, am eager to learn the tricks of transformers for time series (as in: 'stuff that really works').

Posted a year ago

Congratulations!
As for the cross-sectional (stock-wise) attention, do you incorporate it into every cross-section during sequence modeling, or only apply it after temporal dimension reduction (pooling or using the last timestamp)? I guess using it for every cross-section is not GPU-memory friendly…

Daniel FG

Topic Author

Posted a year ago

· 6th in this Competition

This post earned a bronze medal

do you incorporate it into every cross-section

Yes

Posted a year ago

Congratulations! But why do you call the seq2seq GRU-based model decoder-only? I think it might be better described as an encoder-decoder paradigm?

Daniel FG

Topic Author

Posted a year ago

· 6th in this Competition

This post earned a bronze medal

I initially designed the network as seq2seq, but ultimately opted for a single-decoder approach in the GRU model for the following reasons:
- It blends better with the transformer model.
- The encoder didn't contribute significantly, and controlling overfitting became challenging.

Posted a year ago

· 34th in this Competition

Sequence-to-sequence transformers on tabular data. A real GM solution. Congratulations!

Posted a year ago

· 163rd in this Competition

Congratulations! I also built a transformer based on axial attention, but only with the raw features, and the performance was not that good. It's unusual to do feature engineering for a deep learning solution, so why do you think feature engineering helped in this project?

Posted a year ago

Congratulations on 6th rank in this competition🥳

Posted a year ago

· 108th in this Competition

The DL models mentioned sound very impressive given that the input features are simple.
Looking forward to learning more from your shared code later.

Thanks a lot!!

Posted a year ago

· 552nd in this Competition

Hello Daniel,

Impressive work on the Kaggle Optiver competition! Your approach, especially with the seq2seq transformers and GRU model, caught my eye. Any chance you could share your code or dive a bit into how you tackled the feature engineering and preprocessing? I'm keen to learn from your techniques and apply some insights to my own projects.

Thanks a ton!

Daniel FG

Topic Author

Posted a year ago

· 6th in this Competition

I just used features that are already in most of the public notebooks.

# Binary flags for seconds_in_bucket thresholds (60 seconds before and at 300s and 480s)
df['seconds_in_bucket_flag_1'] = df['seconds_in_bucket'] >= 300 - 60
df['seconds_in_bucket_flag_2'] = df['seconds_in_bucket'] >= 300
df['seconds_in_bucket_flag_3'] = df['seconds_in_bucket'] >= 480 - 60
df['seconds_in_bucket_flag_4'] = df['seconds_in_bucket'] >= 480

# Volume, mid price and imbalance features commonly used in public notebooks
df["volume"] = df['ask_size'] + df['bid_size']
df["mid_price"] = (df['ask_price'] + df['bid_price']) / 2
df["liquidity_imbalance"] = (df['bid_size'] - df['ask_size']) / df["volume"]
df["matched_imbalance"] = (df['imbalance_size'] - df['matched_size']) / (df['imbalance_size'] + df['matched_size'])
df["size_imbalance"] = df['bid_size'] / df['ask_size']
df['harmonic_imbalance'] = 2 / ((1 / df['bid_size']) + (1 / df['ask_size']))

# Pairwise price imbalances for every pair of price columns
from itertools import combinations
prices = ["reference_price", "far_price", "near_price", "ask_price", "bid_price", "wap"]
for c in combinations(prices, 2):
    df[f"{c[0]}_{c[1]}_imb"] = df.eval(f"({c[0]} - {c[1]})/({c[0]} + {c[1]})")

For preprocessing, please refer to StandardScaler and SimpleImputer.
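
A minimal sketch of that preprocessing, assuming scikit-learn and an illustrative column selection (in practice, fit only on the training split):

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

feature_cols = [c for c in df.columns if c not in ('stock_id', 'date_id', 'target')]
imputer = SimpleImputer(strategy='constant', fill_value=0)  # zero imputation
scaler = StandardScaler()                                   # standard scaling
X = scaler.fit_transform(imputer.fit_transform(df[feature_cols]))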