NCAA 2025 tournament prediction

This project aimed to turn the chaos of March Madness into measurable probabilities. Using Kaggle’s official NCAA dataset (teams, seeds, boxscores, rankings, and coaching history), we engineered features and trained models to predict the probability of one team defeating another. The best submission achieved a Brier score of 0.126 (lower is better), outperforming the public benchmark through feature engineering and LightGBM modeling.

Skills

Sports Analytics · Predictive Modeling · Feature Engineering · Machine Learning

Tools

Python · Pandas · LightGBM · Scikit-learn · Matplotlib · Kaggle

Case Challenge

March Madness is famously unpredictable, with upsets shaping the bracket each year. The challenge was to combine decades of NCAA data into a robust model that could balance historical patterns, team strength, coaching experience, and player stats to generate accurate win probabilities for the 2025 men’s tournament.

The dataset, provided and pre-cleaned by Kaggle, included team histories, tournament seeds, game boxscores, geography, rankings from over 40 systems, and coaching records — offering both depth and variety for feature engineering.

Solution / Approach

Data Exploration (EDA)

  • Seeds & Upsets: Upsets were most common among seeds 9–13.

  • Conference Trends: Some conferences consistently over/under-perform.

  • Coach Analysis: Experience & clutch history strongly influence deep runs.

  • Ranking Systems: AP & Sagarin showed high predictive power.
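A minimal sketch of how the seed-and-upset pattern above might be tabulated. The game records are toy rows, not Kaggle data, and the `upset_counts_by_seed` helper and the (winner seed, loser seed) layout are hypothetical names introduced for illustration. An upset is assumed here to mean the higher-numbered seed winning.

```python
from collections import Counter

# Toy (winner_seed, loser_seed) pairs; illustrative only, not real results.
games = [(1, 16), (12, 5), (8, 9), (11, 6), (2, 15), (13, 4), (3, 14), (10, 7)]

def upset_counts_by_seed(games):
    """Count wins by the worse (higher-numbered) seed, keyed by the winner's seed."""
    counts = Counter()
    for winner_seed, loser_seed in games:
        if winner_seed > loser_seed:  # assumed definition of an upset
            counts[winner_seed] += 1
    return counts

upsets = upset_counts_by_seed(games)
for seed in sorted(upsets):
    print(f"seed {seed}: {upsets[seed]} upset win(s)")
```

On real data the same count could be divided by the number of games each seed played to get an upset rate rather than a raw tally.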

Feature Engineering

  • Tournament history & deep-run frequency.

  • Elo ratings and differences in seed.

  • Win margins, upset counts.

  • Coaching experience and stability.

  • Boxscore aggregates (FG%, rebounds, steals, turnovers).
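To make two of the features above concrete, here is a rough sketch of an Elo rating update and of seed and Elo differences for a matchup. It uses the standard Elo formula; the step size `K`, the starting rating, the team labels, and the `matchup_features` helper are all assumptions for illustration, not the values or code used in the project.

```python
from collections import defaultdict

K = 20          # assumed update step; not taken from the project
BASE = 1500.0   # assumed starting rating for every team

def expected_score(rating_a, rating_b):
    """Standard Elo win expectation for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, winner, loser, k=K):
    """Shift both ratings after one game; the total rating is conserved."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = defaultdict(lambda: BASE)
# Toy game log of (winner, loser) labels; illustrative only.
for winner, loser in [("A", "B"), ("A", "C"), ("B", "C")]:
    update_elo(ratings, winner, loser)

def matchup_features(team_a, team_b, seeds, ratings):
    """Difference features for a matchup, oriented as team_a minus team_b."""
    return {
        "elo_diff": ratings[team_a] - ratings[team_b],
        "seed_diff": seeds[team_a] - seeds[team_b],
    }

seeds = {"A": 1, "B": 8, "C": 16}  # hypothetical seeds
print(matchup_features("A", "C", seeds, ratings))
```

Expressing features as differences keeps each row tied to one orientation of the matchup, so swapping the two teams simply flips the sign.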

Modeling

  • Used LightGBM as the main classifier (fast, and handles categorical features and non-linear relationships well).

  • Trained on historical tournament matchups.

  • Predicted probabilities for each 2025 matchup.

How It Works

The steps and features below were applied to the men’s dataset. The women’s dataset has some missing data, so the two models differ in places.

Seed & Ranking Features – Capture the historical gap between seed expectations and real outcomes.

Coaching Records – Measure clutch factor & late-stage success.

Boxscore Aggregates – Provide a team-strength snapshot beyond win/loss records.

LightGBM Predictions – Output calibrated win probabilities for every possible matchup.
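As an illustration of what "every possible matchup" means in practice, the sketch below enumerates each unordered pair of teams once. The ID layout (`Season_LowerTeamID_HigherTeamID`) is an assumption about the submission format, and the TeamIDs and helper names are hypothetical.

```python
from itertools import combinations

def all_matchup_ids(season, team_ids):
    """Every unordered pair once, lower TeamID first (assumed ID convention)."""
    return [f"{season}_{a}_{b}" for a, b in combinations(sorted(team_ids), 2)]

def flip(prob_first_wins):
    """Probability for the same matchup seen from the other team's side."""
    return 1.0 - prob_first_wins

team_ids = [1104, 1101, 1103, 1102]  # hypothetical TeamIDs
ids = all_matchup_ids(2025, team_ids)
print(len(ids), ids[0])
```

For n teams this yields n(n-1)/2 rows, so only one orientation per pair needs a prediction; the other side follows from `flip`.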

Key Outcomes

  • Achieved a Brier score of 0.126 on the men’s bracket.

  • Outperformed Kaggle’s public benchmark with stronger feature engineering.

  • Found that coaching experience + ranking aggregates were the strongest predictors.

  • Demonstrated that historical deep runs often signal under-valued teams.
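For readers unfamiliar with the metric behind the headline result, the Brier score is the mean squared difference between the predicted probability and the actual 0/1 outcome, so lower is better and always predicting 50% scores 0.25. A small sketch with toy numbers (not the project's predictions):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 result."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Toy predictions; illustrative only.
probs = [0.9, 0.7, 0.4, 0.2]
outcomes = [1, 1, 0, 1]
print(brier_score(probs, outcomes))
print(brier_score([0.5] * 4, outcomes))  # always-50% baseline
```

The last toy prediction (0.2 for a game that was won) contributes most of the error, which shows how heavily confident misses are penalized.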

Project Structure

  • ncaa_eda.ipynb → Exploratory Data Analysis & visualizations.

  • ncaa_feature_engineering_and_model.ipynb → Feature engineering + LightGBM modeling.

  • README.md → Full documentation.

View the full code on GitHub.

Acknowledgments

This project was built by my amazing mate Dylan and me (Thao)!
We absolutely loved exploring this dataset — from the richness of the historical records to the excitement of seeing data turn into real March Madness insights.

We extend our gratitude to:

  • Kenneth Massey for providing much of the historical ranking data

  • Jeff Sonas of Sonas Consulting for his support in assembling the full competition dataset

And big thanks to the Kaggle competition organizers for making this incredible basketball dataset available to the public.

Stay Connected

+61 421 718 726

thanhthao.chu05@gmail.com
