Optiver · Featured Code Competition · a year ago

Optiver - Trading at the Close

Predict US stocks closing movements


Cody_Null · 14th in this Competition · Posted a year ago
This post earned a gold medal

14th Place Gold Solution

Hi everyone, our solution for 14th place (and, for many of us, our first gold medal) can be found here:

https://www.kaggle.com/code/cody11null/14th-place-gold-cat-lgb-refit

For this competition my team of @ravi20076, @yeoyunsianggeremie, and @mcpenguin used an XGBoost and LightGBM ensemble with as much refitting as we could spare. We had 170 features, and our original parameters came from an Optuna trial similar to the ones in my GitHub. I had a great time in this competition and it was a pleasure to work alongside my friends. I learned a lot here about the fundamentals: writing fast code, and focusing on which techniques extract better signal rather than overfitting on the training data.
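For anyone curious what that kind of Optuna search can look like, here is a minimal sketch with a LightGBM regressor and an MAE objective. The search space, early-stopping setting, and data splits are illustrative placeholders, not the team's actual configuration:

    import lightgbm as lgb
    import optuna
    from sklearn.metrics import mean_absolute_error

    def make_objective(X_train, y_train, X_valid, y_valid):
        def objective(trial):
            # Hypothetical search space (not the one used by the team)
            params = {
                "objective": "mae",
                "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.10, log=True),
                "num_leaves": trial.suggest_int("num_leaves", 63, 511),
                "min_child_samples": trial.suggest_int("min_child_samples", 20, 200),
                "subsample": trial.suggest_float("subsample", 0.6, 1.0),
                "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
                "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
                "n_estimators": 2000,
            }
            model = lgb.LGBMRegressor(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_valid, y_valid)], eval_metric="mae",
                      callbacks=[lgb.early_stopping(100, verbose=False)])
            return mean_absolute_error(y_valid, model.predict(X_valid))
        return objective

    # study = optuna.create_study(direction="minimize")
    # study.optimize(make_objective(X_train, y_train, X_valid, y_valid), n_trials=100)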


Posted a year ago

This post earned a bronze medal

Congrats @cody11null on securing 14th place in this competition, and thank you for sharing such a valuable resource.

Posted a year ago

· 507th in this Competition

This post earned a bronze medal

Huge congrats! I assume this is similar to the notebook on your GitHub that you used for tuning? Curious which objective function you used, given that the competition scoring was MAE, but I heard some folks got better results with RMSE? Thanks

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

We used MAE for everything, including tuning, model training, and evaluation, @tztang.

Posted a year ago

· 29th in this Competition

This post earned a bronze medal

Thanks so much for sharing. I really want to see more gold solutions shared in this competition, mainly on how people validated online learning.

The features are pretty standard. df.groupby('stock_id') is also known to perform worse than "grouping by within a day" in many people's local validation. Did I miss any magic features?

It seems that online learning was very important (a lot more than I thought).

How did you validate this setup?

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

In our local validation, grouping by stock_id was better than grouping by both stock_id and date_id. We did not use any postprocessing in our evaluation. We used dates 436-480 as the validation window.
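As a concrete (hypothetical) illustration of the two grouping choices being compared, e.g. for an EWM-style feature on the wap column (column names follow the competition data; the span is arbitrary):

    import pandas as pd

    def add_ewm_features(df: pd.DataFrame, col: str = "wap", span: int = 10) -> pd.DataFrame:
        # Grouped by stock only: the rolling statistic carries state across days
        df[f"{col}_ewm_by_stock"] = (
            df.groupby("stock_id")[col].transform(lambda s: s.ewm(span=span).mean())
        )
        # Grouped by stock within each day: the statistic resets at every new date_id
        df[f"{col}_ewm_by_stock_and_day"] = (
            df.groupby(["stock_id", "date_id"])[col].transform(lambda s: s.ewm(span=span).mean())
        )
        return df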

For online learning, we saved the data for each date_id and time_id in a CSV file, then loaded them one by one for prediction. Evaluation was done for each group of date_id and time_id, and the predictions were concatenated together to calculate the MAE over the same dates 436-480. Whether we used online learning or not, we used this same method to evaluate whenever we wanted to stay in sync with the API.
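A rough sketch of that replay-style evaluation, assuming one saved CSV per (date_id, time_id) batch (the file layout and helper name here are made up; the point is just to predict batch by batch, exactly as the API serves data, and score the concatenation):

    import glob
    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_absolute_error

    def replay_mae(model, feature_names, replay_dir="replay_436_480"):
        # Replay saved batches in order, one prediction call per (date_id, time_id) group,
        # then concatenate everything and compute a single MAE over dates 436-480.
        preds, targets = [], []
        for path in sorted(glob.glob(f"{replay_dir}/batch_*.csv")):
            batch = pd.read_csv(path)
            preds.append(model.predict(batch[feature_names]))
            targets.append(batch["target"].to_numpy())
        return mean_absolute_error(np.concatenate(targets), np.concatenate(preds))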

Posted a year ago

· 29th in this Competition

Thank you! How did you decide on splitting at 436?

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

Based on public discussion posts which determined that the size of the public LB is 45 days. We think this split, where the validation set is the same size as the public LB, will have a better CV/LB correlation.
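In code this is just a cut on date_id (a sketch; df stands for the training frame):

    import pandas as pd

    def split_by_date(df: pd.DataFrame, cutoff: int = 436):
        # Hold out the last 45 trading days (date_id 436-480) to mirror the 45-day public LB
        return df[df["date_id"] < cutoff], df[df["date_id"] >= cutoff]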

Posted a year ago

· 5th in this Competition

This post earned a bronze medal

Hey,

Our approach was different and based on .refit() and .train(init_model=...) in LightGBM; .refit() got us the better score (5.4344):

We retrained daily at the start of the day, based on the previous day's data and the same day of the previous week. I.e. if we were to predict a Friday, we would refit with data from Thursday and from the Friday of the week before (assuming no holidays, etc.), checking on this condition:

    #If start of new day
    if test.seconds_in_bucket.unique()[0] == 0:
        print(f"Day Completed in : {time.time()-start}")
        start = time.time()

        #Process revealed target
        revealed_targets_df = process_revealed_targets(new_revealed_targets=revealed_targets, old_revealed_targets=revealed_targets_df)

        #Update models on a new day once date_id > 476
        if test.date_id.max() > 476:
            UPDATE_MODELS=True
        else:
            UPDATE_MODELS=False

    #Don't update if not new day 
    else:
        UPDATE_MODELS=False

And then our actual online learning was pretty simple for refit - we found this was better than booster.train(init_model):

                ######################################################INFERENCE REFITTING START######################################################
                if UPDATE_MODELS:
                    try:
                        input_start_time = time.time()
                        #NUM DAYS NEEDS TO BE T+1 -> starts at current day
                        lgb_input = get_raw_lgb_input(X_input = train, num_days =6, feature_names = feature_names, _revealed_targets_df = revealed_targets_df)
                        dt = revealed_targets_df['date_id'][-1]
                        X_refit_input = lgb_input.filter((pl.col('date_id') == dt-4) | (pl.col('date_id') == dt))[feature_names]
                        y_refit_input = lgb_input.filter((pl.col('date_id') == dt-4) | (pl.col('date_id') == dt))['target'].to_numpy()

#                         X_refit_input = lgb_input.filter((pl.col('date_id') == dt))[feature_names]
#                         y_refit_input = lgb_input.filter((pl.col('date_id') == dt))['target'].to_numpy()
                        print(f"Completed getting 5 day input in {time.time() - input_start_time}")

                        for idx in range(len(lgb_models)):
                            input_start_time = time.time()
                            lgb_models[idx] = lgb_models[idx].refit(data = X_refit_input, label = y_refit_input, feature_name = feature_names, decay_rate=0.993)
                            print(f"Completed 1 day refit for fold {idx} in {time.time() - input_start_time}")
                    except Exception as e:
                        print(e)
                        print('Failed to refit')

                ######################################################INFERENCE REFITTING END######################################################
                #If no need to update, then just take the last 1 day
                else:
                    lgb_input = get_raw_lgb_input(X_input = train, num_days = 1, feature_names = feature_names, _revealed_targets_df = revealed_targets_df)

                try:
                    input_start_time = time.time()
                    for mdl in lgb_models:
                        lgb_prediction = mdl.predict(lgb_input[feature_names][-n_inputs : ])
                        ensemble_predictions.append(lgb_prediction)
                    print(f"Completed LGB prediction loop in {time.time() - input_start_time}")
                except Exception as e:
                    print(e)

And our worse submission (5.4270) was the same flavour but with a different decay parameter:

    lgb_models[idx] = lgb_models[idx].refit(data = X_refit_input, label = y_refit_input, feature_name = feature_names, decay_rate=0.993)

refit() will effectively keep the tree structures the same but update the leaf values according to the new data, so the initial fitting parameters are extremely important for regularization.
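For context (based on the LightGBM documentation rather than the code above; the booster and the new-day arrays are placeholders), the daily update boils down to a single refit call that blends old and new leaf values:

    import lightgbm as lgb

    def daily_refit(booster: lgb.Booster, X_new_day, y_new_day, decay_rate: float = 0.993):
        # Per the LightGBM docs, refit keeps the tree structure and, for every leaf, sets
        #   new_leaf_output = decay_rate * old_leaf_output + (1 - decay_rate) * leaf_output_fit_on_new_data
        # so a decay_rate close to 1 nudges the model only slightly toward each new day's data.
        return booster.refit(data=X_new_day, label=y_new_day, decay_rate=decay_rate)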

Our validation process was to log everything via MLflow and look at the impact of different decay rates over a testing period of days 440-480.
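For reference, that kind of tracking can be as light as one run per decay rate; the run name, parameter names, and placeholder metric below are illustrative, not the actual MLflow setup:

    import mlflow

    decay_rate = 0.993
    valid_mae = 5.43   # placeholder: MAE measured on the days 440-480 replay

    with mlflow.start_run(run_name=f"refit_decay_{decay_rate}"):
        mlflow.log_param("decay_rate", decay_rate)
        mlflow.log_param("valid_days", "440-480")
        mlflow.log_metric("valid_mae", valid_mae)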

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

@yusefkaggle your idea to use the previous week's data was extremely good and is a key takeaway from the competition. I think this matters a lot, especially at week starts and ends, where several stocks show up/down trend movements. This is a key lesson for future challenges!!

Posted a year ago

This post earned a bronze medal

Congrats!

Posted a year ago

· 5th in this Competition

This post earned a bronze medal

Great solution - ours was very similar - but we used Catboost and LGB and we refit a little differently.
Special thanks to @ravi20076 (!) for the hard work in posting many public resources throughout - really upped the competition standard a lot I think.

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

What postprocessing did you use in your submission? I wonder if it made a difference; for us, there is barely any difference between our zero-sum and no-postprocessing submissions. We did not use the zero-mean version, as its local CV and public LB were worse than zero-sum with refitting.

EDIT: the zero-sum postprocessing does help; the differences in the table come from the feature sets having different base accuracies. Our EWM and Bollinger band features worsened the private LB even though the CV (436-480) and the public LB were better.

| PostProcess | Feature Set 1 | Feature Set 2 |
| --- | --- | --- |
| None | 5.4475 | 5.4457 |
| Zero Sum | 5.4458 | 5.4440 |

Posted a year ago

· 5th in this Competition

This post earned a bronze medal

I decomposed the target, and effectively what we know is that the targets are relative performances: a strong up/down movement in one stock will actually affect all targets, depending on the index weights.

So we should have roughly the condition:

sum_i ( weight_i * target_i ) = 0
where i runs over the individual stocks.

So we did this:

    import numpy as np

    def weight_adjustment(pred, test_weights, adjustment=True):
        # Subtract the index-weighted average so that sum_i(weight_i * pred_i) = 0
        if adjustment:
            pred = pred - np.average(pred, weights=test_weights)
        return pred

From the earliest test I can see, this took us from 5.3329 to 5.3297, so it was a pretty helpful change.

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

@yusefkaggle thanks for the mention! I try to help others as best I can. Responses like these motivate me to keep sharing and helping!
Best wishes!

BTW, our solution also used Catboost and LightGBM. We discarded XGBoost as it did not give a good CV score. My gold medal submission notebook highlights this approach.

As mentioned by @yeoyunsianggeremie, post-processing did not make a significant change for us. Both our chosen submissions would have given us the same rank. We did not try the zero-mean submission; we used zero-sum and no post-processing.

Posted a year ago

· 344th in this Competition

This post earned a bronze medal

Congratulations on securing 14th place in this competition. Thanks for your proactive sharing of the details.


Posted a year ago

This post earned a bronze medal

Nice work, congratulations

Posted a year ago

· 552nd in this Competition

Thanks for sharing.
I encountered an error when trying to execute the notebook solution. The error indicates that a file (LGBM1R_V27.model) was not found in the specified directory (/kaggle/input/optivermodels/). This could be because the file is not present in the directory, there is a typo in the file or path name, or there is an issue with file access permissions. If this file is part of the solution, it's important to ensure that it is correctly placed and accessible. If the solution didn't mention the need for this file or provide a way to obtain it, it would be reasonable to consider downvoting the solution to alert others to the issue.

Posted a year ago

· 552nd in this Competition

This post earned a bronze medal

    --> 650 with open(filename, 'rb') as f:
        651     with _read_fileobject(f, filename, mmap_mode) as fobj:
        652         if isinstance(fobj, str):
        653             # if the returned file object is a string, this means we
        654             # try to load a pickle file generated with an older version of
        655             # Joblib so we load it with the joblib compatibility function.

    FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/optivermodels/LGBM1R_V27.model'
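If anyone else hits this, listing what is actually attached under /kaggle/input is usually the quickest way to tell a missing attachment from a path typo (a generic debugging snippet, not part of the original notebook):

    import os

    # Print every file under /kaggle/input to confirm the dataset and model file are attached
    for root, _, files in os.walk("/kaggle/input"):
        for name in files:
            print(os.path.join(root, name))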

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

@juanricardop using my original (root) notebook will resolve the issue. The notebook in the post was a copy-edit of that base work, which perhaps created some unknown permission issue/error. I have made the dataset with the models public as well.

Best wishes and happy learning, thanks for bringing this to our attention!

Posted a year ago

This post earned a bronze medal

Thank you very much. I have noticed that the LGB model trained by the training code is LGBM_V27, but the inference code uses LGBM1R_V27. What does "1R" mean?

Posted a year ago

· 14th in this Competition

This post earned a bronze medal

@achaosss that is just an experiment number:
LGBM regression 1, version 27, becomes LGBM1R_V27.


Posted a year ago

This post earned a bronze medal

Thank you for sharing your great work!

Posted a year ago

· 23rd in this Competition

This post earned a bronze medal

Thank you for sharing your great work!