Who Will Win IPL 2025? Using Machine Learning to Predict the Season 18 Champion
After my previous attempt two years ago to predict the champions of S16, I’m back with another ML project to predict this year's winners. I got some great engagement on LinkedIn and here last time, and I have tried to implement the inputs I received in this project. Unlike last time, I made this attempt early in the season, with 19 of 74 matches completed. If you’re curious about my previous attempt, more details here.
So, if you’re ready, let's get started with the journey to predict who will be crowned champions this year. But before that,
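Disclaimer: Please note that you should not use these results to place bets. I created this as a simple mathematical exercise to better grasp the capabilities of ML, and out of my passion for the game.
The project follows the standard machine learning workflow: data collection, data exploration and cleaning, data pre-processing, model development, and prediction.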
Where is the code?
The notebook for the project can be found here. I also uploaded the newly created dataset to Kaggle. Have a look at the notebook for an in-depth analysis of the project.
1. Data Collection
I used the dataset here, which has records of all the matches from 2008–2024. The data for the 2025 season was recorded manually from completed match data on Cricbuzz, and a new prediction dataset was prepared.
Here is the snapshot of the data
2. Data Exploration and Cleaning
A few of the columns had a high share of null values. I addressed these nulls and also dropped columns that are not relevant to the analysis. (More details in the notebook.)
Some franchises also appeared under their old names, which had to be updated so that each team is named consistently across the dataset.
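As a rough illustration of that clean-up step, here is a minimal pandas sketch. The column names (`team1`, `team2`, `winner`) and the sample rows are assumptions for illustration, not necessarily what the notebook uses; the renames themselves are the well-known franchise name changes.

```python
import pandas as pd

# Hypothetical column names; the real dataset may use different ones.
TEAM_COLUMNS = ["team1", "team2", "winner"]

# Former franchise names mapped to their current names.
RENAMES = {
    "Delhi Daredevils": "Delhi Capitals",
    "Kings XI Punjab": "Punjab Kings",
    "Royal Challengers Bangalore": "Royal Challengers Bengaluru",
}

def normalise_team_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stray whitespace stripped and old names replaced."""
    out = df.copy()
    for col in TEAM_COLUMNS:
        out[col] = out[col].str.strip().replace(RENAMES)
    return out

# Made-up sample rows, including stray leading/trailing spaces.
matches = pd.DataFrame({
    "team1": ["Delhi Daredevils", "Kings XI Punjab "],
    "team2": ["Mumbai Indians", "Royal Challengers Bangalore"],
    "winner": [" Delhi Daredevils", "Kings XI Punjab"],
})
clean = normalise_team_names(matches)
print(clean)
```

Stripping whitespace before replacing matters: a name with a trailing space would otherwise not match the mapping and would survive as a separate "team".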
3. Data Pre-processing
To make the dataset more meaningful and help the model learn the patterns efficiently, I created a ‘Team Rank’ column that ranks the IPL teams just like the ICC ranking, but based on their win percentage over the years. I added this column to my dataset.
I concatenated the S18 records up to Match 19 onto the main dataset and kept the remaining matches ready for predicting game winners.
Once the dataset is curated and ready, it's time to encode the data to make it machine-readable. Since most of my data is categorical, I used label encoding to convert it into numbers the model can work with.
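A win-percentage ranking of this kind could be computed roughly as follows. This is only a sketch: the column names and the four sample matches are made up, and the notebook may handle ties or no-result matches differently.

```python
import pandas as pd

def team_rank_by_win_pct(matches: pd.DataFrame) -> pd.DataFrame:
    """Rank teams by win percentage (rank 1 = highest win percentage)."""
    # Count every appearance of a team, whether listed first or second.
    played = pd.concat([matches["team1"], matches["team2"]]).value_counts()
    won = matches["winner"].value_counts()
    # Teams with no wins are missing from `won`, so fill them with 0.
    table = pd.DataFrame({"played": played, "won": won}).fillna({"won": 0})
    table["win_pct"] = 100 * table["won"] / table["played"]
    table["team_rank"] = (
        table["win_pct"].rank(ascending=False, method="min").astype(int)
    )
    return table.sort_values("team_rank")

# Made-up sample: A wins 2 of 3, B wins 1 of 3, C wins 1 of 2.
matches = pd.DataFrame({
    "team1": ["A", "A", "B", "B"],
    "team2": ["B", "C", "C", "A"],
    "winner": ["A", "A", "C", "B"],
})
ranks = team_rank_by_win_pct(matches)
print(ranks)
```

The resulting `team_rank` can then be merged back onto each match row, once for each of the two teams.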
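Here is a small sketch of label encoding with scikit-learn's `LabelEncoder`. The column names and team rows are invented for illustration. The idea it shows is fitting a single encoder on all team names and reusing it, so a team gets the same number in every column and in both the training and the prediction data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample data with hypothetical column names.
train = pd.DataFrame({
    "team1": ["CSK", "MI", "GT"],
    "team2": ["MI", "GT", "CSK"],
    "winner": ["CSK", "GT", "GT"],
})
upcoming = pd.DataFrame({"team1": ["GT"], "team2": ["MI"]})

# Fit one encoder on every team name that appears in the training data.
team_encoder = LabelEncoder()
team_encoder.fit(pd.concat([train["team1"], train["team2"]]).unique())

# Reuse the fitted encoder everywhere instead of fitting a new one per column.
for col in ["team1", "team2", "winner"]:
    train[col] = team_encoder.transform(train[col])
for col in ["team1", "team2"]:
    upcoming[col] = team_encoder.transform(upcoming[col])

# Show which integer each team was assigned.
print(dict(zip(team_encoder.classes_, range(len(team_encoder.classes_)))))
```

`LabelEncoder` assigns integers in sorted order of the labels, and `transform` raises an error for a name it has not seen, which is a useful early warning for inconsistent team names.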
Once the data is encoded, it's time to feed the data to the model to train it.
4. Model development
My use case is a classification task on a dataset of categorical values with complex relationships, so tree-based models such as Decision Tree, Random Forest, and Gradient Boosting are good choices. I also tried training KNN and Logistic Regression models on my data.
After training and testing multiple models, XGBoost, which also handles null values well, performed best with an accuracy of 74%.
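A comparison like this can be sketched as a simple loop over candidate models. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` stands in for XGBoost here, and the data is synthetic, so the accuracies it prints say nothing about the 74% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded match data (an assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Train each model on the same split and score it on the held-out part.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from best to worst held-out accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```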
Hyper-parameter tuning
Hyperparameter tuning is the process of selecting the optimal set of hyperparameters for a machine learning algorithm to improve its performance on a given dataset. Hyperparameters are parameters that are not learned during training but are set before training begins.
I used RandomizedSearchCV to find the best hyperparameters for the model. It works by:
- Randomly sampling combinations of parameters from a defined grid
- Evaluating them using cross-validation
- Returning the best-performing configuration
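The three steps above map onto scikit-learn's `RandomizedSearchCV` roughly as shown below. The estimator, the parameter ranges, and the synthetic data are all assumptions chosen to keep the sketch small; they are not the settings used in the notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the encoded match data (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Search space: distributions or lists to sample from (illustrative values).
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 6),
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random combinations to try
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)           # best-performing configuration
print(f"{search.best_score_:.3f}")   # its mean cross-validated accuracy
```

Unlike an exhaustive grid search, the cost here is fixed by `n_iter`, so the search space can be wide without the run time growing with it.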
5. Prediction on the testing set
Now comes the part that you’ve been waiting for: getting prediction results for the upcoming matches (from Game 20 onwards). I encoded my testing data with the same encoders used in training so that each team name maps to the same value in both. The challenges here were keeping the train and test data consistent and avoiding data leakage by not fitting the model on the full data.
Once I have the test data, it's time to ask my best-fitted model to predict the results and voila, I have prediction results for all the upcoming matches in IPL 2025!
I used these predicted results to prepare the final points table for the end of the season.
In the IPL, the top 4 teams from the league stage advance to the knockouts, and here they are: DC, GT, RCB and PBKS (ironically, 3 teams that haven't won the title before!).
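Turning a list of predicted winners into a points table can be sketched in a few lines. The winners below are made up, the helper name `points_table` is hypothetical, and ties are broken alphabetically only to keep the output stable (the real tournament uses net run rate).

```python
from collections import Counter

POINTS_PER_WIN = 2  # the IPL league stage awards 2 points for a win

def points_table(teams, winners):
    """Return (team, points) pairs sorted by points, highest first."""
    wins = Counter(winners)
    # Include every team so that sides with zero wins still appear.
    table = [(team, wins[team] * POINTS_PER_WIN) for team in teams]
    # Alphabetical tie-break is a simplification; the IPL uses net run rate.
    return sorted(table, key=lambda row: (-row[1], row[0]))

# Made-up sample data, not the model's actual predictions.
teams = ["DC", "GT", "MI", "RCB"]
predicted_winners = ["GT", "DC", "GT", "RCB", "DC", "GT"]

table = points_table(teams, predicted_winners)
for team, points in table:
    print(f"{team}: {points}")

top_four = [team for team, _ in table[:4]]
```

In practice the points from the matches already played would be added to the points from the predicted matches before sorting.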
The knockout stage in IPL follows the fixtures below.
I took the top 4 from the points table and asked the model to predict the winner of each knockout game in the order of these fixtures. The same encoding and prediction steps were used for each game.
According to this model, the Gujarat Titans are likely to win IPL 2025.
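The IPL playoff format (Qualifier 1, Eliminator, Qualifier 2, Final) can be walked through with a small helper. In this sketch `predict_winner` is a hypothetical stand-in for the trained model, and the team names and strengths are placeholders, not real predictions.

```python
from typing import Callable

# A predictor takes two team names and returns the name of the winner.
Predictor = Callable[[str, str], str]

def simulate_playoffs(top_four: list[str], predict_winner: Predictor) -> str:
    """Play out the IPL playoff bracket and return the champion."""
    first, second, third, fourth = top_four

    # Qualifier 1: 1st vs 2nd. The winner goes straight to the final.
    q1_winner = predict_winner(first, second)
    q1_loser = second if q1_winner == first else first

    # Eliminator: 3rd vs 4th. The loser is knocked out.
    eliminator_winner = predict_winner(third, fourth)

    # Qualifier 2: loser of Qualifier 1 vs winner of the Eliminator.
    q2_winner = predict_winner(q1_loser, eliminator_winner)

    # Final: winner of Qualifier 1 vs winner of Qualifier 2.
    return predict_winner(q1_winner, q2_winner)

# Toy predictor: the team with the higher made-up strength always wins.
strength = {"Team A": 3, "Team B": 4, "Team C": 2, "Team D": 1}

def toy_predictor(team_a: str, team_b: str) -> str:
    return team_a if strength[team_a] >= strength[team_b] else team_b

champion = simulate_playoffs(["Team A", "Team B", "Team C", "Team D"], toy_predictor)
print(champion)
```

Swapping `toy_predictor` for a function that encodes the two teams and calls the fitted model's `predict` would reproduce the approach described above.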
Learnings:
- This project gave me good exposure to wrangling unclean data. At one point I dug into a function for 30 minutes, only to find that stray whitespace in a column was the root cause
- It refreshed my understanding of tree-based models like Decision Tree and Random Forest, along with the concepts of entropy and information gain
- This was my first time implementing Gradient Boosting and XGBoost
Future enhancements and limitations:
- Model accuracy can be improved by adding more features when training the model (e.g. head-to-head stats of the teams, venue performance stats, etc.)
- Label encoding can impose an artificial ordering on categorical data (e.g. CSK -> 1, MI -> 3), which might confuse the tree-based models
- This model doesn't take net run rate (NRR) into account, which plays a major role in deciding knockout contenders.
If you have any other inputs or suggestions, please feel free to hit me up.
Again, thanks for reading this far!
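On the label-encoding limitation above: one common alternative, which this project did not use, is one-hot encoding, where each team gets its own 0/1 column instead of an integer. A minimal pandas sketch with made-up rows and hypothetical column names:

```python
import pandas as pd

# Made-up sample data with hypothetical column names.
matches = pd.DataFrame({
    "team1": ["CSK", "MI", "GT"],
    "team2": ["MI", "GT", "CSK"],
})

# One 0/1 column per (column, team) pair, so no team is "larger" than another.
encoded = pd.get_dummies(matches, columns=["team1", "team2"], dtype=int)
print(encoded)
```

The trade-off is a wider dataset: with 10 current teams plus defunct franchises, each team column expands into one column per team.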
Aditya