Predict US stocks closing movements
Hi everyone, our 14th-place solution, which earned many of us our first gold medal, can be found here:
https://www.kaggle.com/code/cody11null/14th-place-gold-cat-lgb-refit
For this competition, my team of @ravi20076, @yeoyunsianggeremie, and @mcpenguin used an XGBoost and LightGBM ensemble with as much refitting as we could spare. We had 170 features, and our original parameters came from an Optuna study similar to those in my GitHub. I had a great time in this competition, and it was a pleasure to work alongside my friends. I learned a lot about the fundamentals here: writing fast code and focusing on techniques that extract better signal rather than overfitting on the training data.
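For anyone curious, here is a minimal sketch of that kind of Optuna tuning (not the exact search space from my GitHub; X_train, y_train, X_valid, and y_valid are assumed to be your own splits):

import lightgbm as lgb
import optuna
from sklearn.metrics import mean_absolute_error

def objective(trial):
    # illustrative search space only
    params = {
        "objective": "mae",
        "n_estimators": 500,
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 20, 200),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "subsample_freq": 1,
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }
    model = lgb.LGBMRegressor(**params)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="minimize")  # lower MAE is better
study.optimize(objective, n_trials=50)
print(study.best_params)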
Posted a year ago
Congrats @cody11null on securing 14th place in this competition, and thank you for sharing such a valuable resource.
Posted a year ago
· 507th in this Competition
Huge congrats! I assume this is similar to the notebook on GitHub you used for tuning? Curious which objective function you used: the competition scoring was MAE, but I heard some folks got better results with RMSE. Thanks!
Posted a year ago
· 14th in this Competition
We used MAE for everything, including tuning, model training, and evaluation, @tztang.
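For reference, these are the MAE objective names in the libraries mentioned in this thread (a sketch, assuming reasonably recent versions):

lgb_params = {"objective": "regression_l1", "metric": "mae"}           # LightGBM
xgb_params = {"objective": "reg:absoluteerror", "eval_metric": "mae"}  # XGBoost >= 1.7
cat_params = {"loss_function": "MAE", "eval_metric": "MAE"}            # CatBoost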
Posted a year ago
· 29th in this Competition
Thanks so much for sharing. I really want to see more gold-medal write-ups in this competition, mainly on how people validated online learning.
The features are pretty standard. df.groupby('stock_id') is also known to perform worse than grouping within a day in many people's local validation. Did I miss any magic features?
It seems that online learning was very important (a lot more than I thought).
How did you validate this setup?
Posted a year ago
· 14th in this Competition
In our local validation, grouping by stock_id was better than grouping by both stock_id and date_id. We did not use any postprocessing in our evaluation. We used dates 436-480.
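To make that concrete, here are the two grouping variants for a rolling feature (a sketch in pandas; df is assumed to be the training frame, and the 'wap' column and window length are just illustrative):

import pandas as pd

# per-stock rolling mean computed across all days
df["wap_roll6_stock"] = (
    df.groupby("stock_id")["wap"].transform(lambda s: s.rolling(6).mean())
)
# the same feature restricted to within a single day
df["wap_roll6_stock_day"] = (
    df.groupby(["stock_id", "date_id"])["wap"].transform(lambda s: s.rolling(6).mean())
)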
For online learning, we saved the data for each date_id and time_id in a CSV file, then loaded them one by one for prediction. Evaluation was done per group of date_id and time_id, and the predictions were concatenated to calculate the MAE over the same dates, 436-480. Whether we used online learning or not, we evaluated with this same method whenever we wanted to stay in sync with the API.
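A minimal sketch of that replay-style evaluation (names are placeholders; valid_df is assumed to hold dates 436-480 with features and target already built, and model is a fitted booster):

import numpy as np
from sklearn.metrics import mean_absolute_error

preds, targets = [], []
# one group == one API call; iterate in time order and never peek ahead
for (date_id, time_id), batch in valid_df.groupby(["date_id", "time_id"], sort=True):
    preds.append(model.predict(batch[feature_names]))
    targets.append(batch["target"].to_numpy())

print("MAE:", mean_absolute_error(np.concatenate(targets), np.concatenate(preds)))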
Posted a year ago
· 5th in this Competition
Hey,
Our approach was different and based on LightGBM's .refit() and .train(init_model); .refit() got us the better score (5.4344):
We retrained daily at the start of the day, based on the previous day's data and the same weekday the previous week. I.e. if we were to predict a Friday, we would refit with data from Thursday and from the Friday a week before (assuming no holidays etc.), checking this condition:
#If start of new day
if test.seconds_in_bucket.unique()[0] == 0:
    print(f"Day Completed in : {time.time()-start}")
    start = time.time()
    #Process revealed target
    revealed_targets_df = process_revealed_targets(new_revealed_targets=revealed_targets, old_revealed_targets=revealed_targets_df)
    #Update model if new day + after 480
    if test.date_id.max() > 476:
        UPDATE_MODELS=True
    else:
        UPDATE_MODELS=False
#Don't update if not new day
else:
    UPDATE_MODELS=False
And then our actual online learning with refit was pretty simple - we found this was better than booster.train(init_model):
######################################################INFERENCE REFITTING START######################################################
if UPDATE_MODELS:
    try:
        input_start_time = time.time()
        #NUM DAYS NEEDS TO BE T+1 -> starts at current day
        lgb_input = get_raw_lgb_input(X_input = train, num_days = 6, feature_names = feature_names, _revealed_targets_df = revealed_targets_df)
        dt = revealed_targets_df['date_id'][-1]
        X_refit_input = lgb_input.filter((pl.col('date_id') == dt-4) | (pl.col('date_id') == dt))[feature_names]
        y_refit_input = lgb_input.filter((pl.col('date_id') == dt-4) | (pl.col('date_id') == dt))['target'].to_numpy()
        # X_refit_input = lgb_input.filter((pl.col('date_id') == dt))[feature_names]
        # y_refit_input = lgb_input.filter((pl.col('date_id') == dt))['target'].to_numpy()
        print(f"Completed getting 5 day input in {time.time() - input_start_time}")
        for idx in range(len(lgb_models)):
            input_start_time = time.time()
            lgb_models[idx] = lgb_models[idx].refit(data = X_refit_input, label = y_refit_input, feature_name = feature_names, decay_rate=0.993)
            print(f"Completed 1 day refit for fold {idx} in {time.time() - input_start_time}")
    except Exception as e:
        print(e)
        print('Failed to refit')
######################################################INFERENCE REFITTING END######################################################
#If no need to update, then just take the last 1 day
else:
    lgb_input = get_raw_lgb_input(X_input = train, num_days = 1, feature_names = feature_names, _revealed_targets_df = revealed_targets_df)

try:
    input_start_time = time.time()
    for mdl in lgb_models:
        lgb_prediction = mdl.predict(lgb_input[feature_names][-n_inputs : ])
        ensemble_predictions.append(lgb_prediction)
    print(f"Completed LGB prediction loop in {time.time() - input_start_time}")
except Exception as e:
    print(e)
And our worse submission (5.4270) was the same flavour but with a different decay parameter:
lgb_models[idx] = lgb_models[idx].refit(data = X_refit_input, label = y_refit_input, feature_name = feature_names, decay_rate=0.993)
refit effectively keeps the tree structures the same but updates the leaf values according to the new data, so the initial fitting parameters are extremely important for regularization.
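For context, the decay_rate in LightGBM's Booster.refit blends old and new leaf outputs while reusing the tree structure (per the LightGBM docs); a tiny illustration:

def blended_leaf(old_leaf_output, refit_leaf_output, decay_rate):
    # leaf-value update applied by Booster.refit; the split structure is left unchanged
    return decay_rate * old_leaf_output + (1.0 - decay_rate) * refit_leaf_output

# with decay_rate=0.993, one refit nudges each leaf only 0.7% toward the new data
print(blended_leaf(old_leaf_output=1.0, refit_leaf_output=0.0, decay_rate=0.993))  # 0.993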
Our validation process was to log everything via MLflow and look at the impact of different decay rates over a testing period of days 440-480.
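Roughly, the sweep looked like this (a sketch rather than our exact harness; load_validation_days and base_model are hypothetical placeholders, and feature_names is the feature list used above):

import mlflow
import numpy as np
from sklearn.metrics import mean_absolute_error

for decay_rate in (0.90, 0.95, 0.99, 0.993):  # candidate rates, illustrative
    with mlflow.start_run(run_name=f"refit_decay_{decay_rate}"):
        mlflow.log_param("decay_rate", decay_rate)
        booster = base_model  # hypothetical pre-trained lgb.Booster; refit returns a new booster, so this is not mutated
        daily_mae = []
        for day_df in load_validation_days(440, 480):  # hypothetical per-day loader
            preds = booster.predict(day_df[feature_names])
            daily_mae.append(mean_absolute_error(day_df["target"], preds))
            # once the day's targets are revealed, refit and move on to the next day
            booster = booster.refit(day_df[feature_names], day_df["target"], decay_rate=decay_rate)
        mlflow.log_metric("mae_440_480", float(np.mean(daily_mae)))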
Posted a year ago
· 14th in this Competition
@yusefkaggle your idea to use the previous week's data was extremely good and is a key takeaway from the competition. I think this matters a lot, especially at week starts and ends, where several stocks show up/down trend movements. This is a key lesson for future challenges!!
Posted a year ago
· 5th in this Competition
Great solution - ours was very similar, but we used CatBoost and LGB and we refit a little differently.
Special thanks to @ravi20076 (!) for the hard work in posting many public resources throughout - it really upped the competition standard a lot, I think.
Posted a year ago
· 14th in this Competition
What postprocessing did you use in your submission? I wonder if it made a difference; for us, there is barely any difference between our zero-sum and no-postprocessing submissions. We did not use the zero-mean version, as its local CV and public LB were worse than zero-sum with refitting.
EDIT: the zero-sum postprocessing does help; the difference comes from the two feature sets having different base accuracies. Our EWM and Bollinger band features worsened the private LB even though the CV (436-480) and public LB were better:
| PostProcess | Feature Set 1 | Feature Set 2 |
| --- | --- | --- |
| None | 5.4475 | 5.4457 |
| Zero Sum | 5.4458 | 5.4440 |
Posted a year ago
· 5th in this Competition
I decomposed the target, and effectively what we know is that the targets are relative performances. A strong up/down movement in one stock will actually affect all targets, depending on their index weight.
So we should roughly have the condition:
sum_i (weight_i * target_i) = 0
where i runs over the individual stocks.
So we did this:
def weight_adjustment(pred, test_weights, adjustment=True):
    if adjustment:
        # subtract the index-weighted average so the weighted sum of predictions is ~0
        pred = pred - np.average(pred, weights = test_weights)
    return pred
From the earliest test I can see, this took us from 5.3329 -> 5.3297, so it was a pretty helpful change.
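A quick sanity check of the adjustment (illustrative numbers):

import numpy as np

preds = np.array([0.5, -0.2, 0.1])
weights = np.array([3.0, 1.0, 1.0])

adjusted = weight_adjustment(preds, weights)
print(np.average(adjusted, weights=weights))  # ~0.0 by construction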
Posted a year ago
· 14th in this Competition
@yusefkaggle thanks for the mention! I try to help others as best I can. These responses motivate me to share more and to help better!
Best wishes!
BTW, our solution also used CatBoost and LightGBM. We discarded XGBoost as it did not give a good CV score. My gold medal submission notebook highlights this approach.
As mentioned by @yeoyunsianggeremie, post-processing did not make a significant change for us; both our chosen submissions would have given us the same rank. We did not try the zero-mean version and used zero-sum and no post-processing.
Posted a year ago
· 552nd in this Competition
Thanks for sharing.
I encountered an error when trying to execute the notebook solution. The error indicates that a file (LGBM1R_V27.model) was not found in the specified directory (/kaggle/input/optivermodels/). This could be due to the file not being present in the directory, a typo in the file or path name, or a file access permission issue. If this was part of the notebook solution you were following, all necessary files need to be correctly placed and accessible; if the solution doesn't mention the need for this file or provide a way to obtain it, it would be reasonable to downvote it to alert others to the issue.
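A quick way to check that the model dataset is attached and the filename matches (a sketch):

import os

print(os.listdir("/kaggle/input"))                # datasets attached to the notebook
print(os.listdir("/kaggle/input/optivermodels"))  # expect LGBM1R_V27.model to be listed here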
Posted a year ago
· 552nd in this Competition
--> 650 with open(filename, 'rb') as f:
    651     with _read_fileobject(f, filename, mmap_mode) as fobj:
    652         if isinstance(fobj, str):
    653             # if the returned file object is a string, this means we
    654             # try to load a pickle file generated with an version of
    655             # Joblib so we load it with joblib compatibility function.

FileNotFoundError: [Errno 2] No such file or directory: '/kaggle/input/optivermodels/LGBM1R_V27.model'
Posted a year ago
· 14th in this Competition
@juanricardop running my original (root) notebook will resolve the issue. The notebook in this post was a copy-edit of that base work, which perhaps created some unknown permission issue/error. I have made the dataset with the model public as well.
Best wishes and happy learning - thanks for bringing this to our attention!
Posted a year ago
· 14th in this Competition
@achaosss that is just an experiment number:
LGBM regression 1, version 27 becomes LGBM1R_V27