
abdibasidadan-byte/MachineLearningPipelines


============================================================

NCAA Match Prediction Script Using Logistic Regression

============================================================

Script Description

This script aims to predict the probability that a college basketball team wins against another in a match. It leverages historical data on teams, regular season results, and tournament seeds to build a predictive model.

Purpose of the exercise:

- Understand how to transform raw match data into features usable by a machine learning model.

- Build a supervised model capable of predicting the match winner.

- Evaluate the model using standard metrics (log loss, ROC-AUC) and apply it to tournament simulations.

Data context:

- "teams": team information (TeamID, TeamName, first and last Division 1 season)

- "results": regular season match results (winning team, losing team, score, match day)

- "seed_round_slots": information on tournament seeds and match slots

Key variables:

- "team_stats": number of wins and losses per team per season

- "match_data": prepared match dataset for model training

- "X", "y": features and target for training

- "model": trained logistic regression model

- "matchup_example": sample tournament matches for prediction

Model:

- Logistic Regression

- It is supervised because it learns from labeled data: each historical match has a label "1" if Team1 wins, "0" otherwise.

- Suitable for binary classification and allows estimating the probability of a team winning.
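
To make the probability estimate concrete, here is a minimal sketch of how a fitted logistic regression turns feature values into a win probability: it computes a weighted sum of the features and passes it through the logistic (sigmoid) function. The function name, weights, intercept, and feature values below are made-up for illustration and are not the values learned by the script.

```python
import math

def win_probability(features, weights, intercept):
    """Return P(Team1 wins) as the logistic (sigmoid) of a linear score."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical standardized features, in the order
# [Team1_Wins, Team1_Losses, Team2_Wins, Team2_Losses].
example_features = [1.2, -0.8, -0.5, 0.4]

# Hypothetical weights: signs chosen so that more Team1 wins (or more
# Team2 losses) raise the probability that Team1 wins.
example_weights = [0.9, -0.9, -0.9, 0.9]

p = win_probability(example_features, example_weights, intercept=0.0)
print(f"P(Team1 wins) = {p:.3f}")
```

With all features at zero and a zero intercept, the score is zero and the function returns exactly 0.5, which is the "no information" prediction.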

Objectives:

1. Load the necessary CSV files.

2. Compute wins and losses for each team and season.

3. Create a match dataset ready for training.

4. Normalize the data and split into training and test sets.

5. Train a supervised Logistic Regression model.

6. Evaluate the model using log loss and ROC-AUC.

7. Prepare a sample tournament matchup and predict win probabilities.

# Imports
import pandas as pd
from pathlib import Path
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Load CSV files
DATA_DIR = Path("/home/abdibasidadan/Téléchargements")
teams = pd.read_csv(DATA_DIR / "MTeams.csv")
results = pd.read_csv(DATA_DIR / "MRegularSeasonCompactResults.csv")
seed_round_slots = pd.read_csv(DATA_DIR / "MNCAATourneySeedRoundSlots.csv")

# Compute team statistics
team_stats = results.groupby(['Season', 'WTeamID']).size().reset_index(name='W')
team_stats_L = results.groupby(['Season', 'LTeamID']).size().reset_index(name='L')
team_stats = pd.merge(team_stats, team_stats_L,
                      left_on=['Season', 'WTeamID'],
                      right_on=['Season', 'LTeamID'],
                      how='outer')
team_stats['TeamID'] = team_stats['WTeamID'].combine_first(team_stats['LTeamID'])
team_stats['Wins'] = team_stats['W'].fillna(0)
team_stats['Losses'] = team_stats['L'].fillna(0)
team_stats = team_stats[['Season', 'TeamID', 'Wins', 'Losses']]
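
As a sanity check on the win/loss aggregation, the following sketch runs an equivalent computation on a tiny invented results table. It uses `pd.concat` on two indexed Series rather than the outer merge used in the script; this is an alternative formulation for illustration, and the toy rows and the name `toy_stats` are assumptions, not part of the original data.

```python
import pandas as pd

# Toy results table with the same column names as the regular season
# results used in the script (the rows themselves are invented).
toy_results = pd.DataFrame({
    "Season":  [2024, 2024, 2024],
    "WTeamID": [1101, 1101, 1102],
    "LTeamID": [1102, 1103, 1103],
})

# Count wins and losses separately, indexed by (Season, TeamID).
wins = (toy_results.groupby(["Season", "WTeamID"]).size()
        .rename("Wins").rename_axis(["Season", "TeamID"]))
losses = (toy_results.groupby(["Season", "LTeamID"]).size()
          .rename("Losses").rename_axis(["Season", "TeamID"]))

toy_stats = (
    pd.concat([wins, losses], axis=1)  # outer-aligns on the shared index
    .fillna(0)                         # teams with no wins (or no losses) get 0
    .astype(int)
    .reset_index()
)
print(toy_stats)
```

Team 1101 never loses and team 1103 never wins in the toy table, so both exercise the `fillna(0)` path that the script also relies on.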

# Prepare match dataset
def create_match_dataset(results):
    data = []
    for _, row in results.iterrows():
        data.append([row['Season'], row['WTeamID'], row['LTeamID'], 1])
        data.append([row['Season'], row['LTeamID'], row['WTeamID'], 0])
    df = pd.DataFrame(data, columns=['Season', 'Team1', 'Team2', 'Target'])
    return df

match_data = create_match_dataset(results)
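
Each match is recorded twice, once from the winner's perspective (Target = 1) and once from the loser's (Target = 0), which keeps the classes balanced. The sketch below shows one possible loop-free way to build the same rows; the function name `create_match_dataset_vectorized` and the toy data are hypothetical and not part of the original script.

```python
import pandas as pd

def create_match_dataset_vectorized(results: pd.DataFrame) -> pd.DataFrame:
    """Build one row per (match, perspective) without a Python-level loop."""
    # Winner's perspective: Team1 is the winner, so Target = 1.
    winners_first = pd.DataFrame({
        "Season": results["Season"],
        "Team1": results["WTeamID"],
        "Team2": results["LTeamID"],
        "Target": 1,
    })
    # Loser's perspective: Team1 is the loser, so Target = 0.
    losers_first = pd.DataFrame({
        "Season": results["Season"],
        "Team1": results["LTeamID"],
        "Team2": results["WTeamID"],
        "Target": 0,
    })
    return pd.concat([winners_first, losers_first], ignore_index=True)

# Invented rows for illustration.
toy_results = pd.DataFrame({
    "Season":  [2024, 2024, 2024],
    "WTeamID": [1101, 1101, 1102],
    "LTeamID": [1102, 1103, 1103],
})
toy_matches = create_match_dataset_vectorized(toy_results)
print(toy_matches)
```

This version places all winner-first rows before all loser-first rows, whereas the loop interleaves them. The set of rows is the same, but a train/test split with a fixed `random_state` would select different rows, so metrics may differ slightly.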

# Merge team stats
match_data = pd.merge(match_data, team_stats,
                      left_on=['Season', 'Team1'],
                      right_on=['Season', 'TeamID'],
                      how='left')
match_data = match_data.rename(columns={'Wins': 'Team1_Wins', 'Losses': 'Team1_Losses'}).drop(columns=['TeamID'])
match_data = pd.merge(match_data, team_stats,
                      left_on=['Season', 'Team2'],
                      right_on=['Season', 'TeamID'],
                      how='left')
match_data = match_data.rename(columns={'Wins': 'Team2_Wins', 'Losses': 'Team2_Losses'}).drop(columns=['TeamID'])
match_data.fillna(0, inplace=True)

X = match_data[['Team1_Wins', 'Team1_Losses', 'Team2_Wins', 'Team2_Losses']]
y = match_data['Target']

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
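
The scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training. One way to keep the two steps tied together is scikit-learn's `make_pipeline`; the sketch below is an optional alternative and not what the script does. The synthetic data and the names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four win/loss features (not real NCAA data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 30, size=(200, 4)).astype(float)
# Synthetic label: Team1 "wins" when its win count exceeds Team2's.
y_demo = (X_demo[:, 0] > X_demo[:, 2]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales using training statistics only, then trains the classifier;
# predict_proba() applies the same scaling before predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

proba = pipe.predict_proba(X_te)[:, 1]
print(proba[:5])
```

With a pipeline, new matchups can be passed in unscaled, which removes the need to remember a separate `scaler.transform` call at prediction time.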

# Evaluation
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
loss = log_loss(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Log Loss: {loss:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
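
To help interpret the log loss value, here is a minimal hand-written version of the binary log loss formula applied to invented labels and probabilities. The function name `binary_log_loss` and the sample values are assumptions for illustration; the script itself uses `sklearn.metrics.log_loss`.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Invented labels and probabilities for illustration.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly right and fairly sure
coin_flip = [0.5, 0.5, 0.5, 0.5]   # always predicts 50/50

print(f"confident predictions: {binary_log_loss(labels, confident):.4f}")
print(f"coin-flip predictions: {binary_log_loss(labels, coin_flip):.4f}")
```

Always predicting 0.5 gives a log loss of ln 2, about 0.693, so on a balanced dataset any value below that indicates the model is doing better than a coin flip.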

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

# Tournament example
matchup_example = pd.DataFrame({
    'Season': [2024, 2024],
    'Team1': [1101, 1102],
    'Team2': [1103, 1104]
})

# Merge stats
matchup_example = pd.merge(matchup_example, team_stats,
                           left_on=['Season', 'Team1'],
                           right_on=['Season', 'TeamID'],
                           how='left')
matchup_example = matchup_example.rename(columns={'Wins': 'Team1_Wins', 'Losses': 'Team1_Losses'}).drop(columns=['TeamID'])
matchup_example = pd.merge(matchup_example, team_stats,
                           left_on=['Season', 'Team2'],
                           right_on=['Season', 'TeamID'],
                           how='left')
matchup_example = matchup_example.rename(columns={'Wins': 'Team2_Wins', 'Losses': 'Team2_Losses'}).drop(columns=['TeamID'])
matchup_example.fillna(0, inplace=True)

X_tourney = matchup_example[['Team1_Wins', 'Team1_Losses', 'Team2_Wins', 'Team2_Losses']]
X_tourney_scaled = scaler.transform(X_tourney)
matchup_example['Prob_Team1_Win'] = model.predict_proba(X_tourney_scaled)[:, 1]
print(matchup_example[['Team1', 'Team2', 'Prob_Team1_Win']])

# Conclusion

# The supervised logistic regression model demonstrates a reasonable
# ability to predict winners of college basketball matches. On the
# test set, it achieves a Log Loss of approximately 0.515 and an ROC-AUC
# of 0.822, indicating the model can distinguish winning and losing
# teams based on historical win/loss records.
#
# Observations:
# 1. Predicted probabilities for the example tournament matches show
#    realistic but simplified trends, as only wins and losses are used
#    as features.
# 2. Matches with probabilities near 0.5 indicate balanced contests
#    or limited historical data.
# 3. For accurate predictions of real tournaments, it is necessary
#    to correctly map seeds to TeamIDs and simulate all tournament rounds.
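
The seed-to-TeamID mapping mentioned in the last observation could look roughly like the sketch below. The script does not load a table that links seeds to teams, so the `toy_seeds` table, its column names, the seed labels, and the pairings are all placeholders invented for illustration.

```python
import pandas as pd

# Hypothetical seed table: one row per seeded team in a season.
toy_seeds = pd.DataFrame({
    "Season": [2024, 2024, 2024, 2024],
    "Seed":   ["W01", "W02", "W03", "W04"],
    "TeamID": [1101, 1102, 1103, 1104],
})

# Lookup from (Season, Seed) to TeamID.
seed_to_team = toy_seeds.set_index(["Season", "Seed"])["TeamID"].to_dict()

# First-round pairings expressed as seeds, then resolved to TeamIDs.
pairings = [("W01", "W04"), ("W02", "W03")]
matchups = pd.DataFrame(
    [(2024, seed_to_team[(2024, a)], seed_to_team[(2024, b)]) for a, b in pairings],
    columns=["Season", "Team1", "Team2"],
)
print(matchups)
```

The resulting frame has the same `Season`, `Team1`, `Team2` columns as `matchup_example`, so it could be passed through the same merge and prediction steps. Later rounds would repeat the process with the predicted or sampled winners.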


# Data Sources

  1. MTeams.csv, MRegularSeasonCompactResults.csv, MNCAATourneySeedRoundSlots.csv

About

Probabilistic Prediction of NCAA Basketball Match Outcomes using Reproducible Machine Learning Pipelines
