Predict US stocks closing movements
Hi everyone, our 14th-place solution, which earned many of us our first gold medal, can be found here:
https://www.kaggle.com/code/cody11null/14th-place-gold-cat-lgb-refit
For this competition, my team of @ravi20076, @yeoyunsianggeremie, and @mcpenguin used an XGBoost and LightGBM ensemble with as much refitting as we could spare. We had 170 features, and our original parameters came from an Optuna study similar to those in my GitHub. I had a great time in this competition, and it was a pleasure to work alongside my friends. I learned a lot about the fundamentals here: writing fast code and focusing on techniques that extract better signal rather than overfitting on the training data.
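For anyone curious, here is a minimal sketch of that kind of Optuna tuning (not the exact search space from my GitHub; X_train, y_train, X_valid, and y_valid are assumed to be your own splits):

import lightgbm as lgb
import optuna
from sklearn.metrics import mean_absolute_error

def objective(trial):
    # illustrative search space only
    params = {
        "objective": "mae",
        "n_estimators": 500,
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 20, 200),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "subsample_freq": 1,
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = lgb.LGBMRegressor(**params)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="minimize")  # lower MAE is better
study.optimize(objective, n_trials=50)
print(study.best_params)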
Posted a year ago
Congrats @cody11null on securing 14th place in this competition, and thank you for sharing such a valuable resource.
Posted a year ago
· 507th in this Competition
Huge congrats! I assume this is similar to the notebook on GitHub you used for tuning? Curious which objective function you used: the competition scoring was MAE, but I heard some folks got better results with RMSE. Thanks!
Posted a year ago
· 14th in this Competition
We used MAE for everything, including tuning, model training, and evaluation, @tztang.
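For reference, these are the MAE objective names in the libraries mentioned in this thread (a sketch, assuming reasonably recent versions):

lgb_params = {"objective": "regression_l1", "metric": "mae"}           # LightGBM
xgb_params = {"objective": "reg:absoluteerror", "eval_metric": "mae"}  # XGBoost >= 1.7
cat_params = {"loss_function": "MAE", "eval_metric": "MAE"}            # CatBoost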
Posted a year ago
· 29th in this Competition
Thanks so much for sharing. I really want to see more gold-medal write-ups in this competition, mainly on how people validated online learning.
The features are pretty standard. df.groupby('stock_id') is also known to perform worse than grouping within a day in many people's local validation. Did I miss any magic features?
It seems that online learning was very important (a lot more than I thought).
How did you validate this setup?
Posted a year ago
· 14th in this Competition
In our local validation, grouping by stock_id was better than grouping by both stock_id and date_id. We did not use any postprocessing in our evaluation. We used dates 436-480.
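To make that concrete, here are the two grouping variants for a rolling feature (a sketch in pandas; df is assumed to be the training frame, and the 'wap' column and window length are just illustrative):

import pandas as pd

# per-stock rolling mean computed across all days
df["wap_roll6_stock"] = (
    df.groupby("stock_id")["wap"].transform(lambda s: s.rolling(6).mean())
)
# the same feature restricted to within a single day
df["wap_roll6_stock_day"] = (
    df.groupby(["stock_id", "date_id"])["wap"].transform(lambda s: s.rolling(6).mean())
)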
For online learning, we saved the data for each date_id and time_id in a CSV file, then loaded them one by one for prediction. Evaluation was done per group of date_id and time_id, and the predictions were concatenated to calculate the MAE over the same dates, 436-480. Whether we used online learning or not, we evaluated with this same method whenever we wanted to stay in sync with the API.
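A minimal sketch of that replay-style evaluation (names are placeholders; valid_df is assumed to hold dates 436-480 with features and target already built, and model is a fitted booster):

import numpy as np
from sklearn.metrics import mean_absolute_error

preds, targets = [], []
# one group == one API call; iterate in time order and never peek ahead
for (date_id, time_id), batch in valid_df.groupby(["date_id", "time_id"], sort=True):
    preds.append(model.predict(batch[feature_names]))
    targets.append(batch["target"].to_numpy())

print("MAE:", mean_absolute_error(np.concatenate(targets), np.concatenate(preds)))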
Posted a year ago
· 5th in this Competition
Hey,
Our approach was different and based on LightGBM's .refit() and .train(init_model); .refit() got us the better score (5.4344):
We retrained daily at the start of the day, based on the previous day's data and the same weekday the previous week. I.e. if we were to predict a Friday, we would refit with data from Thursday and from the Friday a week before (assuming no holidays etc.), checking this condition:
#If start of new day
if test.seconds_in_bucket.unique()[0] == 0:
    print(f"Day Completed in : {time.time()-start}")
    start = time.time()
    #Process revealed target
    revealed_targets_df = process_revealed_targets(new_revealed_targets=revealed_targets, old_revealed_targets=revealed_targets_df)
    #Update model if new day + after 480
    if test.date_id.max() > 476:
        UPDATE_MODELS=True
    else:
        UPDATE_MODELS=False
#Don't update if not new day
else:
    UPDATE_MODELS=False
And then our actual online learning with refit was pretty simple - we found this was better than booster.train(init_model):
######################################################INFERENCE REFITTING START######################################################
if UPDATE_MODELS:
    try:
        input_start_time = time.time()
        #NUM DAYS NEEDS TO BE T+1 -> starts at current day
        lgb_input = get_raw_lgb_input(X_input = train, num_days = 6, feature_names = feature_names, _revealed_targets_df = revealed_targets_df)
        dt = revealed_targets_df['date_id'][-1]
        X_refit_input = lgb_input.filter((pl.col('date_id') == dt-4) | (pl.col('date_id') == dt))[feature_names]
        y_refit_input = lgb_input.filter((pl.col('date_id') == dt-4) | (pl.col('date_id') == dt))['target'].to_numpy()
        # X_refit_input = lgb_input.filter((pl.col('date_id') == dt))[feature_names]
        # y_refit_input = lgb_input.filter((pl.col('date_id') == dt))['target'].to_numpy()
        print(f"Completed getting 5 day input in {time.time() - input_start_time}")
        for idx in range(len(lgb_models)):
            input_start_time = time.time()
            lgb_models[idx] = lgb_models[idx].refit(data = X_refit_input, label = y_refit_input, feature_name = feature_names, decay_rate=0.993)
            print(f"Completed 1 day refit for fold {idx} in {time.time() - input_start_time}")
    except Exception as e:
        print(e)
        print('Failed to refit')
######################################################INFERENCE REFITTING END######################################################
#If no need to update, then just take the last 1 day
else:
    lgb_input = get_raw_lgb_input(X_input = train, num_days = 1, feature_names = feature_names, _revealed_targets_df = revealed_targets_df)

try:
    input_start_time = time.time()
    for mdl in lgb_models:
        lgb_prediction = mdl.predict(lgb_input[feature_names][-n_inputs : ])
        ensemble_predictions.append(lgb_prediction)
    print(f"Completed LGB prediction loop in {time.time() - input_start_time}")
except Exception as e:
    print(e)
And our worse submission (5.4270) was the same flavour but with a different decay parameter:
lgb_models[idx] = lgb_models[idx].refit(data = X_refit_input, label = y_refit_input, feature_name = feature_names, decay_rate=0.993)
refit effectively keeps the tree structures the same but updates the leaf values according to the new data, so the initial fitting parameters are extremely important for regularization.
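For context, the decay_rate in LightGBM's Booster.refit blends old and new leaf outputs while reusing the tree structure (per the LightGBM docs); a tiny illustration:

def blended_leaf(old_leaf_output, refit_leaf_output, decay_rate):
    # leaf-value update applied by Booster.refit; the split structure is left unchanged
    return decay_rate * old_leaf_output + (1.0 - decay_rate) * refit_leaf_output

# with decay_rate=0.993, one refit nudges each leaf only 0.7% toward the new data
print(blended_leaf(old_leaf_output=1.0, refit_leaf_output=0.0, decay_rate=0.993))  # 0.993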
Our validation process was to log everything via MLflow and look at the impact of different decay rates over a testing period of days 440-480.
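Roughly, the sweep looked like this (a sketch rather than our exact harness; load_validation_days and base_model are hypothetical placeholders, and feature_names is the feature list used above):

import mlflow
import numpy as np
from sklearn.metrics import mean_absolute_error

for decay_rate in (0.90, 0.95, 0.99, 0.993):  # candidate rates, illustrative
    with mlflow.start_run(run_name=f"refit_decay_{decay_rate}"):
        mlflow.log_param("decay_rate", decay_rate)
        booster = base_model  # hypothetical pre-trained lgb.Booster; refit returns a new booster, so this is not mutated
        daily_mae = []
        for day_df in load_validation_days(440, 480):  # hypothetical per-day loader
            preds = booster.predict(day_df[feature_names])
            daily_mae.append(mean_absolute_error(day_df["target"], preds))
            # once the day's targets are revealed, refit and move on to the next day
            booster = booster.refit(day_df[feature_names], day_df["target"], decay_rate=decay_rate)
        mlflow.log_metric("mae_440_480", float(np.mean(daily_mae)))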
Posted a year ago
· 14th in this Competition
@yusefkaggle your idea to use the previous week's data was extremely good and is a key takeaway from the competition. I think this matters a lot, especially at week starts and ends, where several stocks show up/down trend movements. This is a key lesson for future challenges!!
Posted a year ago
· 5th in this Competition
Great solution - ours was very similar, but we used CatBoost and LGB and we refit a little differently.
Special thanks to @ravi20076 (!) for the hard work in posting many public resources throughout - it really upped the competition standard a lot, I think.
Posted a year ago
· 14th in this Competition
What postprocessing did you use in your submission? I wonder if it made a difference; for us, there is barely any difference between our zero-sum and no-postprocessing submissions. We did not use the zero-mean version, as its local CV and public LB were worse than zero-sum with refitting.
EDIT: the zero-sum postprocessing does help; the difference comes from the two feature sets having different base accuracies. Our EWM and Bollinger band features worsened the private LB even though the CV (436-480) and public LB were better:
| PostProcess | Feature Set 1 | Feature Set 2 |
| --- | --- | --- |
| None | 5.4475 | 5.4457 |
| Zero Sum | 5.4458 | 5.4440 |
Posted a year ago
· 5th in this Competition
I decomposed the target, and effectively what we know is that the targets are relative performances. A strong up/down movement in one stock will actually affect all targets, depending on their index weight.
So we should roughly have the condition:
sum_i (weight_i * target_i) = 0
where i runs over the individual stocks.
So we did this:
def weight_adjustment(pred, test_weights, adjustment=True):
    if adjustment:
        # subtract the index-weighted average so the weighted sum of predictions is ~0
        pred = pred - np.average(pred, weights = test_weights)
    return pred
From the earliest test I can see, this took us from 5.3329 -> 5.3297, so it was a pretty helpful change.
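A quick sanity check of the adjustment (illustrative numbers):

import numpy as np

preds = np.array([0.5, -0.2, 0.1])
weights = np.array([3.0, 1.0, 1.0])

adjusted = weight_adjustment(preds, weights)
print(np.average(adjusted, weights=weights))  # ~0.0 by construction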
Posted a year ago
· 14th in this Competition
@yusefkaggle thanks for the mention! I try to help others as best I can. These responses motivate me to share more and to help better!
Best wishes!
BTW, our solution also used CatBoost and LightGBM. We discarded XGBoost as it did not give a good CV score. My gold medal submission notebook highlights this approach.
As mentioned by @yeoyunsianggeremie, post-processing did not make a significant change for us; both our chosen submissions would have given us the same rank. We did not try the zero-mean version and used zero-sum and no post-processing.
Posted a year ago
· 552nd in this Competition
Thanks for sharing.
I encountered an error when trying to execute the notebook solution. The error indicates that a file (LGBM1R_V27.model) was not found in the specified directory (/kaggle/input/optivermodels/). This could be due to the file not being present in the directory, a typo in the file or path name, or a file access permission issue. If this was part of the notebook solution you were following, all necessary files need to be correctly placed and accessible; if the solution doesn't mention the need for this file or provide a way to obtain it, it would be reasonable to downvote it to alert others to the issue.
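A quick way to check that the model dataset is attached and the filename matches (a sketch):

import os

print(os.listdir("/kaggle/input"))                # datasets attached to the notebook
print(os.listdir("/kaggle/input/optivermodels"))  # expect LGBM1R_V27.model to be listed here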
Posted a year ago
· 552nd in this Competition
--> 650 with open(filename, 'rb') as f:
    651     with _read_fileobject(f, filename, mmap_mode) as fobj:
    652         if isinstance(fobj, str):
    653             # if the returned file object is a string, this means we
    654             # try to load a pickle file generated with an version of
    655             # Joblib so we load it with joblib compatibility function.

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/optivermodels/LGBM1R_V27.model'
Posted a year ago
· 14th in this Competition
@juanricardop running my original (root) notebook will resolve the issue. The notebook in this post was a copy-edit of that base work, which perhaps created some unknown permission issue/error. I have made the dataset with the model public as well.
Best wishes and happy learning - thanks for bringing this to our attention!
Posted a year ago
· 14th in this Competition
@achaosss that is just an experiment number:
LGBM regression 1, version 27 becomes LGBM1R_V27