
Optiver - Trading at the Close

Predict US stocks closing movements


ADAM. · 9th in this Competition · Posted a year ago
This post earned a gold medal

9th Place Solution

A big thanks to Optiver and Kaggle for hosting this competition. This competition had a really stable correlation between local CV and the LB.

Actually I entered this competition a little late, about 30 days before it ended, and I am not good at NNs, so I only focused on gradient-boosted tree models and their feature engineering. I noticed that many of the top solutions use NNs, so this is a really good opportunity for me to learn them.

Model

  • XGBoost with 3 different seeds and the same 157 features
    • There is not much difference between XGBoost and LightGBM in LB score, but GPU XGBoost trains faster than GPU LightGBM.

Feature Engineering

  • Firstly, create some "basic features" from the raw features (i.e. add, subtract, multiply, and divide raw features). Also, create some median-scaled raw size features:
size_col = ['imbalance_size', 'matched_size', 'bid_size', 'ask_size']
for col in size_col:
    # scale each size column by its per-stock median
    train[f"scale_{col}"] = train[col] / train.groupby(['stock_id'])[col].transform('median')
  • Secondly, do further feature engineering/aggregation on the raw features and "basic features" (a rough sketch of a few of these feature families follows the list)
    • imb1, imb2 features
    • market_urgency features I copied from a public notebook
    • diff features over different time windows
    • shift features over different time windows
    • rolling_mean/std features over different time windows
    • using historical wap to calculate the target from 6 seconds before, then applying some rolling means
    • some global date_id + seconds weighted features
    • MACD features
    • target rolling_mean over stock_id + seconds_in_bucket
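
A rough sketch of a few of these feature families (not my exact code; the column names imb1/imb2 and the window sizes below are just for illustration):

import polars as pl

# imbalance-style ratio features
df = df.with_columns([
    ((pl.col("bid_size") - pl.col("ask_size"))
        / (pl.col("bid_size") + pl.col("ask_size"))).alias("imb1"),
    ((pl.col("imbalance_size") - pl.col("matched_size"))
        / (pl.col("imbalance_size") + pl.col("matched_size"))).alias("imb2"),
])

# diff / shift / rolling_mean features over a few time windows,
# computed per stock within each trading day
for window in [1, 3, 6]:
    df = df.with_columns([
        pl.col("wap").diff(window).over("stock_id", "date_id").alias(f"wap_diff_{window}"),
        pl.col("wap").shift(window).over("stock_id", "date_id").alias(f"wap_shift_{window}"),
        pl.col("wap").rolling_mean(window_size=window).over("stock_id", "date_id").alias(f"wap_roll_mean_{window}"),
    ])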

Feature Selection

  • Because we have limits on inference time and memory, it's essential to do some feature selection. I added features group by group and checked whether the local CV improved. Each feature group usually has 10-30 features. If a group improved the local CV, I added its features one by one and usually kept only the 5-10 most effective ones (a rough sketch of this procedure follows the list).
  • I kept 157 features in my final model.
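
A rough sketch of this group-wise greedy procedure (not my exact code; cv_score is a hypothetical helper that trains the model on a feature list and returns the local CV MAE):

def select_features(base_features, feature_groups, cv_score):
    selected = list(base_features)
    best = cv_score(selected)
    for group in feature_groups:               # each group has roughly 10-30 features
        if cv_score(selected + group) < best:  # the group helps as a whole
            for feat in group:                 # then add its features one by one
                score = cv_score(selected + [feat])
                if score < best:               # usually only the 5-10 best survive
                    selected.append(feat)
                    best = score
    return selected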

Post-processing:

  • Subtract the weighted sum. From the definition of the target, we know the weighted sum of the target across all stocks should be zero (see the note after the code below).
test_df['pred'] = lgb_predictions
test_df['w_pred'] = test_df['weight'] * test_df['pred']
# weighted mean of the predictions within each (date_id, seconds_in_bucket) group
test_df["post_num"] = (
    test_df.groupby(["date_id", "seconds_in_bucket"])['w_pred'].transform('sum')
    / test_df.groupby(["date_id", "seconds_in_bucket"])['weight'].transform('sum')
)
test_df['pred'] = test_df['pred'] - test_df['post_num']
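
As a quick sanity check on the zero-sum claim (a sketch, assuming the target of stock $i$ is its 60-second wap return $r_i$ minus the index return, where the index return is the weighted average of the stock returns with weights $w_i$):

$$\sum_i w_i t_i = \sum_i w_i r_i - \Big(\sum_i w_i\Big)\frac{\sum_j w_j r_j}{\sum_j w_j} = 0,$$

so subtracting the weighted mean of the predictions within each (date_id, seconds_in_bucket) group enforces the same constraint on the submission.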

Others:

  • xgb MAE objective
  • xgb sample_weight: 1.5x weight for the latest 45 days of data (see the sketch after this list)
  • Online training. I only retrained the model twice: once on day N (the start date of the private LB) and once on day N+30.
  • polars and a reduce_mem_usage function helped a lot
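
A rough sketch of the MAE objective plus the 1.5x weight on the latest 45 days (not the exact training code; train, features, and the hyperparameters below are placeholders):

import numpy as np
import xgboost as xgb

# up-weight rows from the most recent 45 trading days
weights = np.where(train["date_id"] >= train["date_id"].max() - 45, 1.5, 1.0)

model = xgb.XGBRegressor(
    objective="reg:absoluteerror",  # MAE objective (XGBoost >= 1.7)
    tree_method="hist",
    device="cuda",                  # GPU training (XGBoost >= 2.0)
    n_estimators=3000,              # placeholder hyperparameters
    learning_rate=0.01,
)
model.fit(train[features], train["target"], sample_weight=weights)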

Codes

train: https://github.com/ChunhanLi/9th-kaggle-optiver-trading-close
inference: https://www.kaggle.com/code/hookman/9th-submission


Posted 10 months ago

· 1240th in this Competition

Hi, is the shift in the feature generation written incorrectly? -mock_period seems to use future information.

The target mock series:

for mock_period in [1,3,12,6]:

    df = df.with_columns([
        pl.col("wap").shift(-mock_period).over("stock_id","date_id").alias(f"wap_shift_n{mock_period}")
    ])
    df = df.with_columns([
        (pl.col(f"wap_shift_n{mock_period}")/pl.col("wap")).alias("target_single")
    ])

ADAM.

Topic Author

Posted 10 months ago

· 9th in this Competition

It shouldn't be an issue. I double-checked this part carefully, though I've honestly forgotten some of the details.

Roughly, the logic is as follows.

For example, take the 10s, 20s, and 30s buckets.

shift(-3) -> the wap at 30s is moved to the 10s position, and some calculations there produce feature A.

I then shift it back afterwards:

pl.col("target_mock").shift(mock_period).over("stock_id","date_id").alias(f"target_mock_shift{mock_period}")

This shifts feature A from the 10s position to the 30s position, so feature A at 30s (the one shifted up from 10s) is still computed from data at or before 30s; there is no leakage.

My way of writing it isn't very readable; moving the 10s wap to 30s and computing feature A there is equivalent, but easier to read.
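
In other words (a rough sketch, assuming feature A boils down to the k-step wap ratio), the more readable, backward-looking form would be:

# wap(t) / wap(t - mock_period), per stock within a trading day --
# equivalent to computing the ratio at 10s and shifting it forward to 30s
df = df.with_columns([
    (pl.col("wap") / pl.col("wap").shift(mock_period))
        .over("stock_id", "date_id")
        .alias(f"wap_ratio_back_{mock_period}")
])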

Posted 10 months ago

· 1240th in this Competition

Thanks for your reply. Indeed, shifting target_mock back in your code produces the target_mock_shift{mock_period} features, and those don't leak information.
pl.col("target_mock").shift(mock_period).over("stock_id","date_id").alias(f"target_mock_shift{mock_period}")
However, while building the target_mock_shift{mock_period} features, the intermediate features wap_shift_n{mock_period} and target_single are also created. Those features are not deleted and seem to use future information…
That is, the 30s wap is moved to the 10s position and stays there; the intermediate features are not removed after the final features are constructed.

ADAM.

Topic Author

Posted 10 months ago

· 9th in this Competition

The wap_shift_n{mock_period} and target_single features were not added to my model. You can check the add_cols variable; only the variables listed there are fed into the model.

Posted a year ago

· 13th in this Competition

This post earned a bronze medal

Congrats! Let's look forward to the next new GM.

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Thanks. haha

Posted a year ago

· 23rd in this Competition

This post earned a bronze medal

Thank you for sharing! In my case, GPU XGBoost trains much faster than GPU LightGBM as well.

Posted a year ago

· 12th in this Competition

This post earned a bronze medal

Thanks for sharing. Do you have any insight into how important the "1.5 weight for latest 45 days data" is? I tried to put weights on my LGB model but did not succeed. I think it might be one of the keys, provided that the recent data is very relevant for prediction.

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

As far as I remember, setting the sample weight improved my local CV by around 0.001.

Posted a year ago

This post earned a bronze medal

Watching and learning! Very helpful for beginners.

Posted a year ago

Can I ask how you handle NaN values, @hookman?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

fillna(-9e10)

Posted a year ago

A question about the initial weights:

weight_df['weight'] = [
0.004, 0.001, 0.002, 0.006, 0.004, 0.004, 0.002, 0.006, 0.006, 0.002, 0.002, 0.008,
0.006, 0.002, 0.008, 0.006, 0.002, 0.006, 0.004, 0.002, 0.004, 0.001, 0.006, 0.004,
0.002, 0.002, 0.004, 0.002, 0.004, 0.004, 0.001, 0.001, 0.002, 0.002, 0.006, 0.004,
0.004, 0.004, 0.006, 0.002, 0.002, 0.04 , 0.002, 0.002, 0.004, 0.04 , 0.002, 0.001,
0.006, 0.004, 0.004, 0.006, 0.001, 0.004, 0.004, 0.002, 0.006, 0.004, 0.006, 0.004,
0.006, 0.004, 0.002, 0.001, 0.002, 0.004, 0.002, 0.008, 0.004, 0.004, 0.002, 0.004,
0.006, 0.002, 0.004, 0.004, 0.002, 0.004, 0.004, 0.004, 0.001, 0.002, 0.002, 0.008,
0.02 , 0.004, 0.006, 0.002, 0.02 , 0.002, 0.002, 0.006, 0.004, 0.002, 0.001, 0.02,
0.006, 0.001, 0.002, 0.004, 0.001, 0.002, 0.006, 0.006, 0.004, 0.006, 0.001, 0.002,
0.004, 0.006, 0.006, 0.001, 0.04 , 0.006, 0.002, 0.004, 0.002, 0.002, 0.006, 0.002,
0.002, 0.004, 0.006, 0.006, 0.002, 0.002, 0.008, 0.006, 0.004, 0.002, 0.006, 0.002,
0.004, 0.006, 0.002, 0.004, 0.001, 0.004, 0.002, 0.004, 0.008, 0.006, 0.008, 0.002,
0.004, 0.002, 0.001, 0.004, 0.004, 0.004, 0.006, 0.008, 0.004, 0.001, 0.001, 0.002,
0.006, 0.004, 0.001, 0.002, 0.006, 0.004, 0.006, 0.008, 0.002, 0.002, 0.004, 0.002,
0.04 , 0.002, 0.002, 0.004, 0.002, 0.002, 0.006, 0.02 , 0.004, 0.002, 0.006, 0.02,
0.001, 0.002, 0.006, 0.004, 0.006, 0.004, 0.004, 0.004, 0.004, 0.002, 0.004, 0.04,
0.002, 0.008, 0.002, 0.004, 0.001, 0.004, 0.006, 0.004,
]

How were they determined?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Check this discussion.

Posted a year ago

Thanks for sharing. I have a question: why do you build the CV using days > 390?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Actually, I forgot. Maybe some public discussion or notebook used this and I wanted to compare with them. I don't think it matters whether you use the last 45/60/90 days as CV; they all align with the LB.

Posted a year ago

But I get this result:
With 15 features:
validation score: 5.89235
public LB score: 5.4095

With 66 features:
validation score: 5.90648
validation score after tuning: 5.89542
public LB score: 5.3968


Posted a year ago

Any idea why your solution crashes when I run it on my computer? It seems to work fine if I do something like:
train = train[train["date_id"] >= 350]
to reduce the dataset size, but otherwise it crashes all the time.
(I have 64 GB of RAM and 16 cores)

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

I have 128 GB of RAM, so I didn't spend time reducing memory in the training phase. Maybe you can apply the reduce_mem_usage function to the feature dataframe before passing it to XGBoost. Besides, you can also cast dtypes in generate_features_no_hist_polars (e.g. float64 -> float32) to save memory; a rough sketch of the cast is below.
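
A rough sketch of the float64 -> float32 downcast in polars (df here stands for the feature dataframe; not the exact code):

import polars as pl

# downcast all float64 columns to float32 before handing the frame to XGBoost
df = df.with_columns([
    pl.col(name).cast(pl.Float32)
    for name, dtype in zip(df.columns, df.dtypes)
    if dtype == pl.Float64
])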

Posted a year ago

Nice work sir

Posted a year ago

Can you explain why you do this step?

# stage 1
(pl.col('ask_size') * pl.col('ask_price')).alias("ask_money"),
(pl.col('bid_size') * pl.col('bid_price')).alias("bid_money"),
(pl.col('ask_size') + pl.col("auc_ask_size")).alias("ask_size_all"),
(pl.col('bid_size') + pl.col("auc_bid_size")).alias("bid_size_all"),
(pl.col('ask_size') + pl.col("auc_ask_size") + pl.col('bid_size') + pl.col("auc_bid_size")).alias("volumn_size_all"),
(pl.col('reference_price') * pl.col('auc_ask_size')).alias("ask_auc_money"),
(pl.col('reference_price') * pl.col('auc_bid_size')).alias("bid_auc_money"),
(pl.col('ask_size') * pl.col('ask_price') + pl.col('bid_size') * pl.col('bid_price')).alias("volumn_money"),
(pl.col('ask_size') + pl.col('bid_size')).alias('volume_cont'),
(pl.col('ask_size') - pl.col('bid_size')).alias('diff_ask_bid_size'),
(pl.col('imbalance_size') + 2 * pl.col('matched_size')).alias('volumn_auc'),
((pl.col('imbalance_size') + 2 * pl.col('matched_size')) * pl.col("reference_price")).alias('volumn_auc_money'),
((pl.col('ask_price') + pl.col('bid_price'))/2).alias('mid_price'),
((pl.col('near_price') + pl.col('far_price'))/2).alias('mid_price_near_far'),
(pl.col('ask_price') - pl.col('bid_price')).alias('price_diff_ask_bid'),
(pl.col('ask_price') / pl.col('bid_price')).alias('price_div_ask_bid'),
(pl.col('imbalance_buy_sell_flag') * pl.col('scale_imbalance_size')).alias('flag_scale_imbalance_size'),
(pl.col('imbalance_buy_sell_flag') * pl.col('imbalance_size')).alias('flag_imbalance_size'),
(pl.col('imbalance_size') / pl.col('matched_size') * pl.col('imbalance_buy_sell_flag')).alias("div_flag_imbalance_size_2_balance"),
((pl.col('ask_price') - pl.col('bid_price')) * pl.col('imbalance_size')).alias('price_pressure'),
((pl.col('ask_price') - pl.col('bid_price')) * pl.col('imbalance_size') * pl.col('imbalance_buy_sell_flag')).alias('price_pressure_v2'),
((pl.col("ask_size") - pl.col("bid_size")) / (pl.col("far_price") - pl.col("near_price"))).alias("depth_pressure"),
(pl.col("bid_size") / pl.col("ask_size")).alias("div_bid_size_ask_size"),
])"

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

I wanted to create some simple and useful features, guided by cross-validation and business knowledge.

Posted a year ago

Thank you for your summary of the solution, as well as for publishing the code.

Posted a year ago

I still cannot understand "Subtract weighted sum". Could you please provide more explanation?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Check the definition of the target. The weighted sum of the target across all stocks is 0.

Posted 4 months ago

Thank you so much, I learned a lot from your code.

Posted a year ago

Were any research papers published for this competition, especially by the leaders? I want to read more about the subject.

Posted a year ago

Reading the code is so much more efficient.


Appreciation (2)

Posted a year ago

· 183rd in this Competition

This post earned a bronze medal

Congrats! and Thanks for sharing.

Posted 7 months ago

Thanks for sharing