
Optiver - Trading at the Close

Predict US stocks closing movements


ADAM. · 9th in this Competition · Posted a year ago
This post earned a gold medal

9th Place Solution

A big thanks to Optiver and Kaggle for hosting this competition. This competition had a really stable correlation between local CV and the LB.

Actually I entered this competition a little late, about 30 days before it ended, and I am not good at NNs, so I only focused on gradient-boosted tree models and their feature engineering. I noticed that many of the top solutions use NNs, so this is a really good opportunity for me to learn them.

Model

  • XGBoost with 3 different seeds and the same 157 features
    • There is not much difference between XGBoost and LightGBM in LB score, but GPU XGBoost trains faster than GPU LightGBM.

Feature Engineering

  • Firstly, create some "basic features" from the raw features (i.e. add, subtract, multiply, and divide raw features). Also, create some median-scaled raw size features:
size_col = ['imbalance_size', 'matched_size', 'bid_size', 'ask_size']
for col in size_col:
    # scale each size column by its per-stock median
    train[f"scale_{col}"] = train[col] / train.groupby(['stock_id'])[col].transform('median')
  • Secondly, do further feature engineering/aggregation on the raw features and "basic features" (a rough sketch of a few of these feature families follows the list)
    • imb1, imb2 features
    • market_urgency features I copied from a public notebook
    • diff features over different time windows
    • shift features over different time windows
    • rolling_mean/std features over different time windows
    • using historical wap to calculate the target from 6 seconds before, then applying some rolling means
    • some global date_id + seconds weighted features
    • MACD features
    • target rolling_mean over stock_id + seconds_in_bucket
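
A rough sketch of a few of these feature families (not my exact code; the column names imb1/imb2 and the window sizes below are just for illustration):

import polars as pl

# imbalance-style ratio features
df = df.with_columns([
    ((pl.col("bid_size") - pl.col("ask_size"))
        / (pl.col("bid_size") + pl.col("ask_size"))).alias("imb1"),
    ((pl.col("imbalance_size") - pl.col("matched_size"))
        / (pl.col("imbalance_size") + pl.col("matched_size"))).alias("imb2"),
])

# diff / shift / rolling_mean features over a few time windows,
# computed per stock within each trading day
for window in [1, 3, 6]:
    df = df.with_columns([
        pl.col("wap").diff(window).over("stock_id", "date_id").alias(f"wap_diff_{window}"),
        pl.col("wap").shift(window).over("stock_id", "date_id").alias(f"wap_shift_{window}"),
        pl.col("wap").rolling_mean(window_size=window).over("stock_id", "date_id").alias(f"wap_roll_mean_{window}"),
    ])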

Feature Selection

  • Because we have limits on inference time and memory, it's essential to do some feature selection. I added features group by group and checked whether the local CV improved. Each feature group usually has 10-30 features. If a group improved the local CV, I added its features one by one and usually kept only the 5-10 most effective ones (a rough sketch of this procedure follows the list).
  • I kept 157 features in my final model.
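
A rough sketch of this group-wise greedy procedure (not my exact code; cv_score is a hypothetical helper that trains the model on a feature list and returns the local CV MAE):

def select_features(base_features, feature_groups, cv_score):
    selected = list(base_features)
    best = cv_score(selected)
    for group in feature_groups:               # each group has roughly 10-30 features
        if cv_score(selected + group) < best:  # the group helps as a whole
            for feat in group:                 # then add its features one by one
                score = cv_score(selected + [feat])
                if score < best:               # usually only the 5-10 best survive
                    selected.append(feat)
                    best = score
    return selected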

Post-processing:

  • Subtract the weighted sum. From the definition of the target, we know the weighted sum of the target across all stocks should be zero (see the note after the code below).
test_df['pred'] = lgb_predictions
test_df['w_pred'] = test_df['weight'] * test_df['pred']
# weighted mean of the predictions within each (date_id, seconds_in_bucket) group
test_df["post_num"] = (
    test_df.groupby(["date_id", "seconds_in_bucket"])['w_pred'].transform('sum')
    / test_df.groupby(["date_id", "seconds_in_bucket"])['weight'].transform('sum')
)
test_df['pred'] = test_df['pred'] - test_df['post_num']
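
As a quick sanity check on the zero-sum claim (a sketch, assuming the target of stock $i$ is its 60-second wap return $r_i$ minus the index return, where the index return is the weighted average of the stock returns with weights $w_i$):

$$\sum_i w_i t_i = \sum_i w_i r_i - \Big(\sum_i w_i\Big)\frac{\sum_j w_j r_j}{\sum_j w_j} = 0,$$

so subtracting the weighted mean of the predictions within each (date_id, seconds_in_bucket) group enforces the same constraint on the submission.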

Others:

  • xgb MAE objective
  • xgb sample_weight: 1.5x weight for the latest 45 days of data (see the sketch after this list)
  • Online training. I only retrained the model twice: once on day N (the start date of the private LB) and once on day N+30.
  • polars and a reduce_mem_usage function helped a lot
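
A rough sketch of the MAE objective plus the 1.5x weight on the latest 45 days (not the exact training code; train, features, and the hyperparameters below are placeholders):

import numpy as np
import xgboost as xgb

# up-weight rows from the most recent 45 trading days
weights = np.where(train["date_id"] >= train["date_id"].max() - 45, 1.5, 1.0)

model = xgb.XGBRegressor(
    objective="reg:absoluteerror",  # MAE objective (XGBoost >= 1.7)
    tree_method="hist",
    device="cuda",                  # GPU training (XGBoost >= 2.0)
    n_estimators=3000,              # placeholder hyperparameters
    learning_rate=0.01,
)
model.fit(train[features], train["target"], sample_weight=weights)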

Codes

train: https://github.com/ChunhanLi/9th-kaggle-optiver-trading-close
inference: https://www.kaggle.com/code/hookman/9th-submission


Posted 10 months ago

· 1240th in this Competition

Hi, is the shift in the feature generation written incorrectly? -mock_period seems to use future information.

The target mock series:

for mock_period in [1,3,12,6]:

    df = df.with_columns([
        pl.col("wap").shift(-mock_period).over("stock_id","date_id").alias(f"wap_shift_n{mock_period}")
    ])
    df = df.with_columns([
        (pl.col(f"wap_shift_n{mock_period}")/pl.col("wap")).alias("target_single")
    ])

ADAM.

Topic Author

Posted 10 months ago

· 9th in this Competition

It shouldn't be an issue. I double-checked this part carefully, though I've honestly forgotten some of the details.

Roughly, the logic is as follows.

For example, take the 10s, 20s, and 30s buckets.

shift(-3) -> the wap at 30s is moved to the 10s position, and some calculations there produce feature A.

I then shift it back afterwards:

pl.col("target_mock").shift(mock_period).over("stock_id","date_id").alias(f"target_mock_shift{mock_period}")

This shifts feature A from the 10s position to the 30s position, so feature A at 30s (the one shifted up from 10s) is still computed from data at or before 30s; there is no leakage.

My way of writing it isn't very readable; moving the 10s wap to 30s and computing feature A there is equivalent, but easier to read.
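
In other words (a rough sketch, assuming feature A boils down to the k-step wap ratio), the more readable, backward-looking form would be:

# wap(t) / wap(t - mock_period), per stock within a trading day --
# equivalent to computing the ratio at 10s and shifting it forward to 30s
df = df.with_columns([
    (pl.col("wap") / pl.col("wap").shift(mock_period))
        .over("stock_id", "date_id")
        .alias(f"wap_ratio_back_{mock_period}")
])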

Posted 10 months ago

· 1240th in this Competition

Thanks for your reply. Indeed, shifting target_mock back in your code produces the target_mock_shift{mock_period} features, and those don't leak information.
pl.col("target_mock").shift(mock_period).over("stock_id","date_id").alias(f"target_mock_shift{mock_period}")
However, while building the target_mock_shift{mock_period} features, the intermediate features wap_shift_n{mock_period} and target_single are also created. Those features are not deleted and seem to use future information…
That is, the 30s wap is moved to the 10s position and stays there; the intermediate features are not removed after the final features are constructed.

ADAM.

Topic Author

Posted 10 months ago

· 9th in this Competition

The wap_shift_n{mock_period} and target_single features were not added to my model. You can check the add_cols variable; only the variables listed there are fed into the model.

Posted a year ago

· 13th in this Competition

This post earned a bronze medal

Congrats! Let's look forward to the next new GM.

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Thanks. haha

Posted a year ago

· 23rd in this Competition

This post earned a bronze medal

Thank you for sharing! In my case, GPU XGBoost trains much faster than GPU LightGBM as well.

Posted a year ago

· 12th in this Competition

This post earned a bronze medal

Thanks for sharing. Do you have any insight into how important the "1.5 weight for latest 45 days data" is? I tried to put weights on my LGB model but did not succeed. I think it might be one of the keys, provided that the recent data is very relevant for prediction.

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

As far as I remember, setting the sample weight improved my local CV by around 0.001.

Posted a year ago

This post earned a bronze medal

Watching and learning! Very helpful for beginners.

Posted a year ago

Can I ask how you handle NaN values, @hookman?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

fillna(-9e10)

Posted a year ago

A question about the initial weights:

weight_df['weight'] = [
0.004, 0.001, 0.002, 0.006, 0.004, 0.004, 0.002, 0.006, 0.006, 0.002, 0.002, 0.008,
0.006, 0.002, 0.008, 0.006, 0.002, 0.006, 0.004, 0.002, 0.004, 0.001, 0.006, 0.004,
0.002, 0.002, 0.004, 0.002, 0.004, 0.004, 0.001, 0.001, 0.002, 0.002, 0.006, 0.004,
0.004, 0.004, 0.006, 0.002, 0.002, 0.04 , 0.002, 0.002, 0.004, 0.04 , 0.002, 0.001,
0.006, 0.004, 0.004, 0.006, 0.001, 0.004, 0.004, 0.002, 0.006, 0.004, 0.006, 0.004,
0.006, 0.004, 0.002, 0.001, 0.002, 0.004, 0.002, 0.008, 0.004, 0.004, 0.002, 0.004,
0.006, 0.002, 0.004, 0.004, 0.002, 0.004, 0.004, 0.004, 0.001, 0.002, 0.002, 0.008,
0.02 , 0.004, 0.006, 0.002, 0.02 , 0.002, 0.002, 0.006, 0.004, 0.002, 0.001, 0.02,
0.006, 0.001, 0.002, 0.004, 0.001, 0.002, 0.006, 0.006, 0.004, 0.006, 0.001, 0.002,
0.004, 0.006, 0.006, 0.001, 0.04 , 0.006, 0.002, 0.004, 0.002, 0.002, 0.006, 0.002,
0.002, 0.004, 0.006, 0.006, 0.002, 0.002, 0.008, 0.006, 0.004, 0.002, 0.006, 0.002,
0.004, 0.006, 0.002, 0.004, 0.001, 0.004, 0.002, 0.004, 0.008, 0.006, 0.008, 0.002,
0.004, 0.002, 0.001, 0.004, 0.004, 0.004, 0.006, 0.008, 0.004, 0.001, 0.001, 0.002,
0.006, 0.004, 0.001, 0.002, 0.006, 0.004, 0.006, 0.008, 0.002, 0.002, 0.004, 0.002,
0.04 , 0.002, 0.002, 0.004, 0.002, 0.002, 0.006, 0.02 , 0.004, 0.002, 0.006, 0.02,
0.001, 0.002, 0.006, 0.004, 0.006, 0.004, 0.004, 0.004, 0.004, 0.002, 0.004, 0.04,
0.002, 0.008, 0.002, 0.004, 0.001, 0.004, 0.006, 0.004,
]

How were they determined?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Check this discussion.

Posted a year ago

Thanks for sharing. I have a question: why do you build the CV using days > 390?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Actually, I forgot. Maybe some public discussion or notebook used this and I wanted to compare with them. I don't think it matters whether you use the last 45/60/90 days as CV; they all align with the LB.

Posted a year ago

But I get this result:
With 15 features:
validation score: 5.89235
public LB score: 5.4095

With 66 features:
validation score: 5.90648
validation score after tuning: 5.89542
public LB score: 5.3968


Posted a year ago

Any idea why your solution crashes when I run it on my computer? It seems to work fine if I do something like:
train = train[train["date_id"] >= 350]
to reduce the dataset size, but otherwise it crashes all the time.
(I have 64 GB of RAM and 16 cores)

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

I have 128 GB of RAM, so I didn't spend time reducing memory in the training phase. Maybe you can apply the reduce_mem_usage function to the feature dataframe before passing it to XGBoost. Besides, you can also cast dtypes in generate_features_no_hist_polars (e.g. float64 -> float32) to save memory; a rough sketch of the cast is below.
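
A rough sketch of the float64 -> float32 downcast in polars (df here stands for the feature dataframe; not the exact code):

import polars as pl

# downcast all float64 columns to float32 before handing the frame to XGBoost
df = df.with_columns([
    pl.col(name).cast(pl.Float32)
    for name, dtype in zip(df.columns, df.dtypes)
    if dtype == pl.Float64
])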

Posted a year ago

Nice work sir

Posted a year ago

Can you explain why you do this step?

# stage 1
(pl.col('ask_size') * pl.col('ask_price')).alias("ask_money"),
(pl.col('bid_size') * pl.col('bid_price')).alias("bid_money"),
(pl.col('ask_size') + pl.col("auc_ask_size")).alias("ask_size_all"),
(pl.col('bid_size') + pl.col("auc_bid_size")).alias("bid_size_all"),
(pl.col('ask_size') + pl.col("auc_ask_size") + pl.col('bid_size') + pl.col("auc_bid_size")).alias("volumn_size_all"),
(pl.col('reference_price') * pl.col('auc_ask_size')).alias("ask_auc_money"),
(pl.col('reference_price') * pl.col('auc_bid_size')).alias("bid_auc_money"),
(pl.col('ask_size') * pl.col('ask_price') + pl.col('bid_size') * pl.col('bid_price')).alias("volumn_money"),
(pl.col('ask_size') + pl.col('bid_size')).alias('volume_cont'),
(pl.col('ask_size') - pl.col('bid_size')).alias('diff_ask_bid_size'),
(pl.col('imbalance_size') + 2 * pl.col('matched_size')).alias('volumn_auc'),
((pl.col('imbalance_size') + 2 * pl.col('matched_size')) * pl.col("reference_price")).alias('volumn_auc_money'),
((pl.col('ask_price') + pl.col('bid_price'))/2).alias('mid_price'),
((pl.col('near_price') + pl.col('far_price'))/2).alias('mid_price_near_far'),
(pl.col('ask_price') - pl.col('bid_price')).alias('price_diff_ask_bid'),
(pl.col('ask_price') / pl.col('bid_price')).alias('price_div_ask_bid'),
(pl.col('imbalance_buy_sell_flag') * pl.col('scale_imbalance_size')).alias('flag_scale_imbalance_size'),
(pl.col('imbalance_buy_sell_flag') * pl.col('imbalance_size')).alias('flag_imbalance_size'),
(pl.col('imbalance_size') / pl.col('matched_size') * pl.col('imbalance_buy_sell_flag')).alias("div_flag_imbalance_size_2_balance"),
((pl.col('ask_price') - pl.col('bid_price')) * pl.col('imbalance_size')).alias('price_pressure'),
((pl.col('ask_price') - pl.col('bid_price')) * pl.col('imbalance_size') * pl.col('imbalance_buy_sell_flag')).alias('price_pressure_v2'),
((pl.col("ask_size") - pl.col("bid_size")) / (pl.col("far_price") - pl.col("near_price"))).alias("depth_pressure"),
(pl.col("bid_size") / pl.col("ask_size")).alias("div_bid_size_ask_size"),
])"

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

I wanted to create some simple and useful features, guided by cross-validation and business knowledge.

Posted a year ago

Thank you for your summary of the solution, as well as for publishing the code.

Posted a year ago

I still cannot understand "Subtract weighted sum". Could you please provide more explanation?

ADAM.

Topic Author

Posted a year ago

· 9th in this Competition

Check the definition of the target. The weighted sum of the target across all stocks is 0.

Posted 4 months ago

Thank you so much, I learned a lot from your code.

Posted a year ago

Were any research papers published for this competition, especially by the leaders? I want to read more about the subject.

Posted a year ago

Reading the code is so much more efficient.


Appreciation (2)

Posted a year ago

· 183rd in this Competition

This post earned a bronze medal

Congrats! and Thanks for sharing.

Posted 7 months ago

Thanks for sharing