Predict US stocks closing movements
A big thanks to Optiver and Kaggle for hosting this competition. This competition had a really stable correlation between local CV and LB.
Actually I entered this game a little late, about 30 days before it ended, and I am not good at NN, so I focused only on gradient boosting tree models and their feature engineering. I noticed that many top solutions use NN, so this is a really good opportunity for me to learn NN.
# Scale each size column by its per-stock median so sizes are comparable across stocks.
size_col = ['imbalance_size', 'matched_size', 'bid_size', 'ask_size']
for col in size_col:
    train[f"scale_{col}"] = train[col] / train.groupby(['stock_id'])[col].transform('median')
# Post-processing: subtract the index-weighted mean prediction at each timestamp,
# so the weighted predictions sum to zero within every (date_id, seconds_in_bucket) bucket.
test_df['pred'] = lgb_predictions
test_df['w_pred'] = test_df['weight'] * test_df['pred']
test_df["post_num"] = test_df.groupby(["date_id", "seconds_in_bucket"])['w_pred'].transform('sum') / test_df.groupby(["date_id", "seconds_in_bucket"])['weight'].transform('sum')
test_df['pred'] = test_df['pred'] - test_df['post_num']
The reduce_mem_usage function helps a lot.
train: https://github.com/ChunhanLi/9th-kaggle-optiver-trading-close
inference: https://www.kaggle.com/code/hookman/9th-submission
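For reference, here is a minimal sketch in the spirit of the usual Kaggle reduce_mem_usage helper; it is illustrative only, not necessarily the exact version in the repo above.

import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to smaller dtypes to cut memory usage."""
    for col in df.columns:
        col_type = df[col].dtype
        if not pd.api.types.is_numeric_dtype(col_type):
            continue
        c_min, c_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(col_type):
            if c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
        else:
            # float64 -> float32 roughly halves memory and is fine for GBDT inputs
            df[col] = df[col].astype(np.float32)
    return df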
Posted 10 months ago
· 1240th in this Competition
Hi, is the shift in the feature generation written incorrectly? shift(-mock_period) seems to use future information.
for mock_period in [1, 3, 12, 6]:
    df = df.with_columns([
        pl.col("wap").shift(-mock_period).over("stock_id", "date_id").alias(f"wap_shift_n{mock_period}")
    ])
    df = df.with_columns([
        (pl.col(f"wap_shift_n{mock_period}") / pl.col("wap")).alias("target_single")
    ])
Posted 10 months ago
· 9th in this Competition
This shouldn't be a problem. I checked it carefully at the time, though I have forgotten some of the details.
Roughly, the logic is as follows.
Take the timestamps 10s, 20s, 30s as an example.
shift(-3) moves the 30s wap to the 10s row; some calculations there produce feature A.
Later I shift it back:
pl.col("target_mock").shift(mock_period).over("stock_id","date_id").alias(f"target_mock_shift{mock_period}")
This moves feature A from the 10s row to the 30s row, so feature A at 30s (the value shifted up from 10s) is still computed only from data at or before 30s; there is no leakage.
My way of writing it is not very reader-friendly. Shifting the 10s wap to the 30s row and computing feature A there would be equivalent and easier to follow.
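Concretely, a minimal sketch of the equivalence described above (column names and mock_period follow the snippets in this thread; the _readable column is only added here for comparison and is not part of the original code):

import polars as pl

mock_period = 3

df = df.with_columns(
    # as in the solution: pull the future wap back to the current row ...
    (pl.col("wap").shift(-mock_period).over("stock_id", "date_id") / pl.col("wap"))
    .alias("target_mock")
).with_columns(
    # ... then shift the result forward again, so at time t it only depends on
    # wap(t) and wap(t - mock_period): no future information,
    pl.col("target_mock").shift(mock_period).over("stock_id", "date_id")
    .alias(f"target_mock_shift{mock_period}"),
    # which is the same as the more readable "current wap over past wap"
    (pl.col("wap") / pl.col("wap").shift(mock_period).over("stock_id", "date_id"))
    .alias(f"target_mock_shift{mock_period}_readable"),
)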
Posted 10 months ago
· 1240th in this Competition
Thanks for the reply. Indeed, in your code target_mock is shifted back to produce the target_mock_shift{mock_period} features, and those do not leak information.
pl.col("target_mock").shift(mock_period).over("stock_id","date_id").alias(f"target_mock_shift{mock_period}")
However, while building the target_mock_shift{mock_period} features, the intermediate features wap_shift_n{mock_period} and target_single are also created. They are never dropped, and they seem to use future information…
That is, the 30s wap moved to the 10s row stays in the 10s row, and after the features are constructed these intermediate columns are not removed.
Posted 10 months ago
· 9th in this Competition
The wap_shift_n{mock_period} and target_single features are not fed into my model. You can check the add_cols variable: only the columns listed there are added to the model.
Posted a year ago
· 12th in this Competition
Thanks for sharing. Do you have any insight into how important the "1.5 weight for the latest 45 days of data" is? I tried applying sample weights in my LGB model but did not succeed. I think it might be one of the keys, given that recent data is very relevant for prediction.
Posted a year ago
· 9th in this Competition
As far as I remember, setting the sample weight improved my local CV by around 0.001.
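For reference, a minimal sketch of that weighting scheme. The 1.5x / last-45-days figures come from the question above; the cutoff computation, feature_cols, target column name, and LightGBM parameters are placeholders, not the actual training code.

import numpy as np
import lightgbm as lgb

# 1.5x sample weight for the most recent 45 trading days, 1.0 elsewhere.
cutoff = train["date_id"].max() - 45
sample_weight = np.where(train["date_id"] > cutoff, 1.5, 1.0)

# feature_cols: placeholder for the engineered feature names used for training.
model = lgb.LGBMRegressor(objective="regression_l1", n_estimators=500)
model.fit(train[feature_cols], train["target"], sample_weight=sample_weight)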
Posted a year ago
May I ask how the initial weights below were determined?
weight_df['weight'] = [
0.004, 0.001, 0.002, 0.006, 0.004, 0.004, 0.002, 0.006, 0.006, 0.002, 0.002, 0.008,
0.006, 0.002, 0.008, 0.006, 0.002, 0.006, 0.004, 0.002, 0.004, 0.001, 0.006, 0.004,
0.002, 0.002, 0.004, 0.002, 0.004, 0.004, 0.001, 0.001, 0.002, 0.002, 0.006, 0.004,
0.004, 0.004, 0.006, 0.002, 0.002, 0.04 , 0.002, 0.002, 0.004, 0.04 , 0.002, 0.001,
0.006, 0.004, 0.004, 0.006, 0.001, 0.004, 0.004, 0.002, 0.006, 0.004, 0.006, 0.004,
0.006, 0.004, 0.002, 0.001, 0.002, 0.004, 0.002, 0.008, 0.004, 0.004, 0.002, 0.004,
0.006, 0.002, 0.004, 0.004, 0.002, 0.004, 0.004, 0.004, 0.001, 0.002, 0.002, 0.008,
0.02 , 0.004, 0.006, 0.002, 0.02 , 0.002, 0.002, 0.006, 0.004, 0.002, 0.001, 0.02,
0.006, 0.001, 0.002, 0.004, 0.001, 0.002, 0.006, 0.006, 0.004, 0.006, 0.001, 0.002,
0.004, 0.006, 0.006, 0.001, 0.04 , 0.006, 0.002, 0.004, 0.002, 0.002, 0.006, 0.002,
0.002, 0.004, 0.006, 0.006, 0.002, 0.002, 0.008, 0.006, 0.004, 0.002, 0.006, 0.002,
0.004, 0.006, 0.002, 0.004, 0.001, 0.004, 0.002, 0.004, 0.008, 0.006, 0.008, 0.002,
0.004, 0.002, 0.001, 0.004, 0.004, 0.004, 0.006, 0.008, 0.004, 0.001, 0.001, 0.002,
0.006, 0.004, 0.001, 0.002, 0.006, 0.004, 0.006, 0.008, 0.002, 0.002, 0.004, 0.002,
0.04 , 0.002, 0.002, 0.004, 0.002, 0.002, 0.006, 0.02 , 0.004, 0.002, 0.006, 0.02,
0.001, 0.002, 0.006, 0.004, 0.006, 0.004, 0.004, 0.004, 0.004, 0.002, 0.004, 0.04,
0.002, 0.008, 0.002, 0.004, 0.001, 0.004, 0.006, 0.004,
]
Posted a year ago
Thanks for sharing. I have a question: why do you build the CV using days > 390?
Posted a year ago
· 9th in this Competition
Actually, I forget. Maybe some public discussion or notebook used this split and I wanted to compare with them. I don't think it matters whether you use the last 45/60/90 days as CV; they all align with the LB.
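In code, that hold-out is just a cut on date_id (a minimal sketch; 390 is the cutoff from the question, and the exact cutoff is not critical):

trn = train[train["date_id"] <= 390]   # fit on earlier days
val = train[train["date_id"] > 390]    # validate on the most recent days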
Posted a year ago
Any idea why your solution crashes when running on my computer? It seems to work fine if I do something like:
train = train[train["date_id"] >= 350]
to reduce the dataset size, but otherwise it crashes every time.
(I have 64 GB RAM and 16 cores)
Posted a year ago
· 9th in this Competition
I have 128 GB RAM, so I didn't spend time reducing memory in the training phase. Maybe you can apply the reduce_mem_usage function to the feature dataframe before passing it to XGBoost. Besides, you can also cast dtypes in generate_features_no_hist_polars (e.g. float64 -> float32) to save memory.
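For example, a one-liner version of the float downcast suggested above (illustrative, not the exact code in generate_features_no_hist_polars):

import polars as pl

# Cast every Float64 column to Float32 before handing the frame to the GBDT library.
df = df.with_columns(pl.col(pl.Float64).cast(pl.Float32))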
Posted a year ago
Can you explain why you do this step?
# stage 1
(pl.col('ask_size') * pl.col('ask_price')).alias("ask_money"),
(pl.col('bid_size') * pl.col('bid_price')).alias("bid_money"),
(pl.col('ask_size') + pl.col("auc_ask_size")).alias("ask_size_all"),
(pl.col('bid_size') + pl.col("auc_bid_size")).alias("bid_size_all"),
(pl.col('ask_size') + pl.col("auc_ask_size") + pl.col('bid_size') + pl.col("auc_bid_size")).alias("volumn_size_all"),
(pl.col('reference_price') * pl.col('auc_ask_size')).alias("ask_auc_money"),
(pl.col('reference_price') * pl.col('auc_bid_size')).alias("bid_auc_money"),
(pl.col('ask_size') * pl.col('ask_price') + pl.col('bid_size') * pl.col('bid_price')).alias("volumn_money"),
(pl.col('ask_size') + pl.col('bid_size')).alias('volume_cont'),
(pl.col('ask_size') - pl.col('bid_size')).alias('diff_ask_bid_size'),
(pl.col('imbalance_size') + 2 * pl.col('matched_size')).alias('volumn_auc'),
((pl.col('imbalance_size') + 2 * pl.col('matched_size')) * pl.col("reference_price")).alias('volumn_auc_money'),
((pl.col('ask_price') + pl.col('bid_price'))/2).alias('mid_price'),
((pl.col('near_price') + pl.col('far_price'))/2).alias('mid_price_near_far'),
(pl.col('ask_price') - pl.col('bid_price')).alias('price_diff_ask_bid'),
(pl.col('ask_price') / pl.col('bid_price')).alias('price_div_ask_bid'),
(pl.col('imbalance_buy_sell_flag') * pl.col('scale_imbalance_size')).alias('flag_scale_imbalance_size'),
(pl.col('imbalance_buy_sell_flag') * pl.col('imbalance_size')).alias('flag_imbalance_size'),
(pl.col('imbalance_size') / pl.col('matched_size') * pl.col('imbalance_buy_sell_flag')).alias("div_flag_imbalance_size_2_balance"),
((pl.col('ask_price') - pl.col('bid_price')) * pl.col('imbalance_size')).alias('price_pressure'),
((pl.col('ask_price') - pl.col('bid_price')) * pl.col('imbalance_size') * pl.col('imbalance_buy_sell_flag')).alias('price_pressure_v2'),
((pl.col("ask_size") - pl.col("bid_size")) / (pl.col("far_price") - pl.col("near_price"))).alias("depth_pressure"),
(pl.col("bid_size") / pl.col("ask_size")).alias("div_bid_size_ask_size"),
])
Posted a year ago
· 9th in this Competition
I wanted to create some simple, useful features based on cross-validation and business knowledge.