BaptisteMERESSE/Hackathon-2025

# PISA Math Score Prediction - Hackathon Solution

A machine learning pipeline for predicting PISA math scores, built around ensemble methods and extensive feature engineering.

## 🎯 Goal

Predict students' PISA math scores from ~300 features, including parent income, reading scores, and socioeconomic indicators.

## 📊 Dataset

- `X_train.csv`: ~1 million rows × 300 columns of features
- `y_train.csv`: math scores (the target variable)
- Features include PISA reading/science scores, socioeconomic status, parent education, and wealth indicators

## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Preprocess Data (One Time Only)

**Important:** run this first to preprocess your data once:

```bash
cd ML
python preprocessing.py
```

This will create `X_train_processed.csv`, which all models will use.

### 3. Train Models

Now you can train any model individually or all at once:

```bash
# Train individual models
python XGBoost.py
python Random_forest.py
python advanced_models.py

# Or train all models at once
python master_training.py
```

This will:

- ✅ Preprocess the data (handle missing values, scale features, engineer new features)
- ✅ Train XGBoost with hyperparameter tuning
- ✅ Train Random Forest and ExtraTrees
- ✅ Train LightGBM and CatBoost
- ✅ Create ensemble models (weighted average, median, stacking)
- ✅ Save all models to the `models/` directory


### 4. Check for Overfitting

After training, visualize train/validation metrics:

```bash
cd ML
python check_overfitting.py
```

This will generate plots for each model showing training vs validation performance.

### 5. Generate Predictions

```bash
cd ML
python predict.py ../data/X_test.csv
```

Predictions will be saved to `results/predictions.csv`.

## 📁 Project Structure

```
Hackathon/
├── data/
│   ├── X_train.csv              # Training features
│   ├── y_train.csv              # Training targets
│   ├── X_test.csv               # Test features (for prediction)
│   ├── X_train_processed.csv    # Preprocessed training data (generated)
│   └── preprocessor.pkl         # Fitted preprocessor (generated)
├── ML/
│   ├── data_exploration.py      # Exploratory data analysis
│   ├── preprocessing.py         # Data preprocessing & feature engineering
│   ├── XGBoost.py               # XGBoost model with Optuna tuning
│   ├── Random_forest.py         # Random Forest & ExtraTrees models
│   ├── advanced_models.py       # LightGBM & CatBoost models
│   ├── ensemble.py              # Ensemble & stacking models
│   ├── master_training.py       # Main training pipeline
│   └── predict.py               # Prediction script
├── models/                      # Saved models (generated)
├── results/                     # Predictions (generated)
└── requirements.txt             # Python dependencies
```

## 🔧 Key Features

### 1. Advanced Preprocessing

- **Missing value imputation:** KNN imputer for numeric features
- **Feature scaling:** robust scaling (resilient to outliers)
- **Correlation removal:** drops one of each pair of highly correlated features (>0.95)
- **Categorical encoding:** label encoding for categorical features
- **Feature selection:** optional mutual-information-based selection
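The first three steps above can be sketched with scikit-learn. This is a minimal illustration, not the repository's `preprocessing.py` (encoding and feature selection are omitted; the 0.95 threshold is the one from the list):

```python
# Sketch of the preprocessing described above: KNN imputation, robust
# scaling, and dropping one feature from each highly correlated pair.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

def preprocess(X: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    numeric = X.select_dtypes(include=np.number)
    # 1. Impute missing values from the nearest rows in feature space
    imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(numeric),
                           columns=numeric.columns, index=numeric.index)
    # 2. Robust scaling: centre on the median, scale by the IQR
    scaled = pd.DataFrame(RobustScaler().fit_transform(imputed),
                          columns=imputed.columns, index=imputed.index)
    # 3. Keep only the upper triangle of the correlation matrix, then drop
    #    every column that correlates > threshold with an earlier column
    corr = scaled.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return scaled.drop(columns=to_drop)

demo = pd.DataFrame({
    "a": [1.0, 10.0, np.nan, 3.0],   # has a missing value
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [2.0, 4.0, 6.0, 8.0],       # perfectly correlated with b
})
clean = preprocess(demo)             # "c" is dropped; no NaNs remain
```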

### 2. Feature Engineering

- **PISA score aggregations:** mean, std, max, and min of reading/science scores
- **Socioeconomic indicators:** aggregations of income, education, and wealth features
- **Polynomial features:** squared and square-root transformations of key predictors
- **Interaction features:** multiplicative interactions between top features
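A small sketch of these ideas with pandas. The column names (`score_reading`, `score_science`, `income`) are illustrative assumptions, not the dataset's actual column names:

```python
# Illustrative feature engineering: score aggregations, polynomial
# transforms, and a multiplicative interaction.
import numpy as np
import pandas as pd

def engineer_features(X: pd.DataFrame) -> pd.DataFrame:
    X = X.copy()
    score_cols = ["score_reading", "score_science"]
    # Aggregations of the other PISA scores
    X["score_mean"] = X[score_cols].mean(axis=1)
    X["score_std"] = X[score_cols].std(axis=1)
    X["score_max"] = X[score_cols].max(axis=1)
    X["score_min"] = X[score_cols].min(axis=1)
    # Polynomial transforms of a key predictor
    X["income_sq"] = X["income"] ** 2
    X["income_sqrt"] = np.sqrt(X["income"].clip(lower=0))
    # Multiplicative interaction between two strong predictors
    X["reading_x_income"] = X["score_reading"] * X["income"]
    return X

demo = pd.DataFrame({"score_reading": [400.0, 500.0],
                     "score_science": [420.0, 480.0],
                     "income": [4.0, 9.0]})
features = engineer_features(demo)
```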

### 3. Multiple Models

- **XGBoost:** gradient boosting with GPU support and Optuna tuning
- **Random Forest:** classic bagged ensemble with randomized search
- **ExtraTrees:** more randomized trees for diversity
- **LightGBM:** fast gradient boosting with leaf-wise growth
- **CatBoost:** gradient boosting with built-in categorical handling
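Each model script follows the same train/validate pattern. A minimal sketch with scikit-learn's `RandomForestRegressor` on synthetic data (the real scripts load the preprocessed CSV and add hyperparameter tuning and model saving on top):

```python
# Generic train/validate loop, here with a Random Forest on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_tr, y_tr)

# Held-out validation RMSE, the metric the pipeline optimizes
rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(f"Validation RMSE: {rmse:.2f}")
```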

### 4. Ensemble Methods

- **Weighted average:** optimally weighted combination of models
- **Median ensemble:** robust to outlier predictions
- **Stacking:** a Ridge meta-learner on top of base-model predictions
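The three strategies can be sketched as follows, using simulated base-model predictions in place of the real validation outputs (in practice the stacker should be fit on out-of-fold predictions, not the same data it is evaluated on):

```python
# Weighted-average, median, and stacking ensembles over base predictions.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
y_val = rng.normal(500, 90, size=200)                  # "true" math scores
# Three base models simulated as truth plus increasing amounts of noise
preds = np.column_stack([y_val + rng.normal(0, s, 200) for s in (20, 30, 40)])

# 1. Weighted average: weight each model by its inverse validation MSE
mse = ((preds - y_val[:, None]) ** 2).mean(axis=0)
weights = (1 / mse) / (1 / mse).sum()
weighted = preds @ weights

# 2. Median ensemble: robust to one model's outlier predictions
median = np.median(preds, axis=1)

# 3. Stacking: a Ridge meta-learner fit on the base predictions
stacked = Ridge(alpha=1.0).fit(preds, y_val).predict(preds)
```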

## 🎛️ Hyperparameter Tuning

All models use automated hyperparameter optimization:

- **XGBoost, LightGBM, and CatBoost:** Optuna with 30-50 trials
- **Random Forest:** RandomizedSearchCV with 20-50 iterations
- **Cross-validation:** 5-fold CV for robust evaluation
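The Random Forest search combines randomized sampling with 5-fold CV; a sketch on synthetic data (the search space and iteration count here are cut down for illustration, the repo's scripts use 20-50 iterations):

```python
# Randomized hyperparameter search with 5-fold cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,                                 # candidates sampled at random
    cv=5,                                     # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",    # maximize negative RMSE
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```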

## 📈 Expected Performance

The ensemble models typically achieve the best performance by combining the strengths of different algorithms. The pipeline reports three metrics:

- **RMSE:** root mean squared error (the competition's target metric)
- **MAE:** mean absolute error
- **R²:** coefficient of determination
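All three metrics follow standard scikit-learn conventions; on a toy example:

```python
# Computing RMSE, MAE, and R² with scikit-learn.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([480.0, 510.0, 495.0, 530.0])
y_pred = np.array([470.0, 515.0, 500.0, 525.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # ≈ 6.61
mae = mean_absolute_error(y_true, y_pred)           # 6.25
r2 = r2_score(y_true, y_pred)                       # ≈ 0.872
```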

## 🛠️ Advanced Usage

### Train Individual Models

```bash
cd ML

# Preprocess data only
python preprocessing.py

# Train specific models
python XGBoost.py
python Random_forest.py
python advanced_models.py

# Create ensembles
python ensemble.py
```

### Generate Predictions with All Models

```python
from predict import predict_with_all_models
predict_with_all_models('../data/X_test.csv')
```

This creates separate prediction files for each model.

### Data Exploration

```bash
cd ML
python data_exploration.py
```

Generates detailed statistics about the dataset.

## 🔍 Model Selection Strategy

1. **For maximum accuracy:** use `stacking_model.pkl` or `ensemble_weighted.pkl`
2. **For speed:** use `lightgbm_model.pkl` or `xgboost_model.pkl`
3. **For interpretability:** use `random_forest_model.pkl` and check its feature importances

## 💡 Tips for Best Results

1. **GPU acceleration:** if you have a CUDA-capable GPU, XGBoost, LightGBM, and CatBoost will use it automatically
2. **More tuning:** increase `n_trials` in the model files for better hyperparameters (takes longer)
3. **Feature engineering:** add domain-specific features in `preprocessing.py`
4. **Ensemble tuning:** adjust the ensemble weights in `ensemble.py`

## 📊 Monitoring Training

The training pipeline prints:

- Data shapes at each step
- Cross-validation scores
- Feature importance rankings
- Training/validation metrics
- The best hyperparameters found

πŸ› Troubleshooting

Out of Memory

  • Reduce use_knn_imputer=False in preprocessing
  • Use smaller datasets for tuning, then retrain on full data
  • Reduce n_trials in hyperparameter tuning
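Two of these tactics can be sketched with pandas: streaming a large CSV in chunks rather than loading it whole, and sampling rows for tuning. The example uses an in-memory CSV so it is self-contained; with the real files you would pass the path instead:

```python
# Memory-saving tactics: chunked CSV reading and subsampling for tuning.
import io
import pandas as pd

csv = io.StringIO("\n".join(["a,b"] + [f"{i},{i * 2}" for i in range(10_000)]))

# 1. Stream the file in fixed-size chunks instead of loading it all at once
n_rows = sum(len(chunk) for chunk in pd.read_csv(csv, chunksize=1_000))

# 2. Tune hyperparameters on a random 10% sample, then retrain on full data
csv.seek(0)
df = pd.read_csv(csv)
sample = df.sample(frac=0.1, random_state=0)
```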

### Missing Dependencies

```bash
pip install --upgrade -r requirements.txt
```

### GPU Not Detected

- Install the CUDA toolkit and cuDNN
- Install GPU-enabled versions: `pip install xgboost[gpu] lightgbm[gpu]`

## 📝 Next Steps

1. Run `master_training.py` to train all models
2. Check the `models/` directory for the saved models
3. Use `predict.py` with your test data
4. Submit `results/predictions.csv` to the competition

πŸ† Competition Strategy

The stacking ensemble typically performs best because it:

  • Combines diverse models (tree-based + gradient boosting)
  • Learns optimal weighting through meta-model
  • Reduces variance through ensemble averaging
  • Captures different patterns with different algorithms

Good luck with your hackathon! πŸš€
