A comprehensive machine learning pipeline for predicting PISA math scores with maximum accuracy using ensemble methods and advanced feature engineering.
Predict PISA math scores for students based on ~300 features including parent income, reading scores, socioeconomic indicators, etc.
- `X_train.csv`: ~1 million rows × 300 columns of features
- y_train.csv: Math scores (target variable)
- Features include: PISA reading/science scores, socioeconomic status, parent education, wealth indicators, etc.
```
pip install -r requirements.txt
```

Important: Run this first to preprocess your data once:

```
cd ML
python preprocessing.py
```

This will create `X_train_processed.csv`, which all models will use.
Now you can train any model individually or all at once:

```
# Train individual models
python XGBoost.py
python Random_forest.py
python advanced_models.py

# Or train all models at once
python master_training.py
```

This will:
- ✅ Preprocess the data (handle missing values, scale features, engineer new features)
- ✅ Train XGBoost with hyperparameter tuning
- ✅ Train Random Forest and ExtraTrees
- ✅ Train LightGBM and CatBoost
- ✅ Create ensemble models (weighted average, median, stacking)
- ✅ Save all models to the `models/` directory
After training, visualize train/validation metrics:
```
cd ML
python check_overfitting.py
```

This will generate plots for each model showing training vs. validation performance.
```
cd ML
python predict.py ../data/X_test.csv
```

Predictions will be saved to `results/predictions.csv`.
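Saved `.pkl` models can be reloaded outside the scripts as well. A minimal stand-alone sketch, assuming the models were pickled with joblib (here `demo_model.pkl` and the tiny linear model are placeholders for the real `models/*.pkl` files):

```python
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

# Train and save a tiny stand-in model; the real pipeline writes to models/*.pkl
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
joblib.dump(LinearRegression().fit(X, y), "demo_model.pkl")

# Any saved model can be reloaded and used the same way
model = joblib.load("demo_model.pkl")
pred = model.predict(np.array([[4.0]]))
```

The same `joblib.load(...)` / `model.predict(...)` pattern applies to any of the saved ensemble or base models.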
```
Hackaton/
├── data/
│   ├── X_train.csv              # Training features
│   ├── y_train.csv              # Training targets
│   ├── X_test.csv               # Test features (for prediction)
│   ├── X_train_processed.csv    # Preprocessed training data (generated)
│   └── preprocessor.pkl         # Fitted preprocessor (generated)
├── ML/
│   ├── data_exploration.py      # Exploratory data analysis
│   ├── preprocessing.py         # Data preprocessing & feature engineering
│   ├── XGBoost.py               # XGBoost model with Optuna tuning
│   ├── Random_forest.py         # Random Forest & ExtraTrees models
│   ├── advanced_models.py       # LightGBM & CatBoost models
│   ├── ensemble.py              # Ensemble & stacking models
│   ├── master_training.py       # Main training pipeline
│   └── predict.py               # Prediction script
├── models/                      # Saved models (generated)
├── results/                     # Predictions (generated)
└── requirements.txt             # Python dependencies
```
- Missing value imputation: KNN imputer for numeric features
- Feature scaling: Robust scaling (resilient to outliers)
- Correlation removal: Drops highly correlated features (>0.95)
- Categorical encoding: Label encoding for categorical features
- Feature selection: Optional mutual information-based selection
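The correlation-removal step can be sketched as follows. This is an illustration, not the exact `preprocessing.py` code, and the column names are made up:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"read": a, "read_copy": a * 1.01, "wealth": rng.normal(size=200)})
reduced = drop_correlated(df)
# read_copy is nearly identical to read, so it is removed
```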
- PISA score aggregations: Mean, std, max, min of reading/science scores
- Socioeconomic indicators: Aggregations of income, education, wealth features
- Polynomial features: Squared and sqrt transformations of key predictors
- Interaction features: Multiplicative interactions between top features
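A sketch of the score-aggregation, polynomial, and interaction features (the column names `read1`/`sci1` are illustrative; the real features live in `preprocessing.py`):

```python
import pandas as pd

def add_score_aggregates(df: pd.DataFrame, score_cols: list) -> pd.DataFrame:
    """Row-wise aggregations over PISA reading/science score columns."""
    out = df.copy()
    scores = out[score_cols]
    out["score_mean"] = scores.mean(axis=1)
    out["score_std"] = scores.std(axis=1)
    out["score_max"] = scores.max(axis=1)
    out["score_min"] = scores.min(axis=1)
    out["score_mean_sq"] = out["score_mean"] ** 2                   # polynomial feature
    out["score_interact"] = scores.iloc[:, 0] * scores.iloc[:, 1]   # interaction feature
    return out

df = pd.DataFrame({"read1": [500.0, 420.0], "sci1": [510.0, 400.0]})
feat = add_score_aggregates(df, ["read1", "sci1"])
```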
- XGBoost: Gradient boosting with GPU support and Optuna tuning
- Random Forest: Classic ensemble with randomized search
- ExtraTrees: More randomized trees for diversity
- LightGBM: Fast gradient boosting with leaf-wise growth
- CatBoost: Gradient boosting with built-in categorical handling
- Weighted Average: Optimally weighted combination of models
- Median Ensemble: Robust to outlier predictions
- Stacking: Meta-learner (Ridge) on top of base model predictions
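A minimal stacking sketch in scikit-learn with a Ridge meta-learner, using synthetic data and two stand-in base models (the real pipeline stacks the XGBoost/LightGBM/CatBoost models as well):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PISA data
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Ridge meta-learner fitted on out-of-fold predictions of the base models
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X_tr, y_tr)
score = r2_score(y_te, stack.predict(X_te))
```

The `cv=5` argument makes the meta-learner train on out-of-fold base predictions, which prevents it from simply memorizing overfit base-model outputs.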
All models use automated hyperparameter optimization:
- XGBoost, LightGBM, and CatBoost: Optuna with 30-50 trials
- Random Forest: RandomizedSearchCV with 20-50 iterations
- Cross-validation: 5-fold CV for robust evaluation
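The Random Forest search can be sketched with `RandomizedSearchCV`; the Optuna tuning for the boosters follows the same idea. The parameter ranges and iteration count here are illustrative and kept small:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(3, 15),
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=10,  # the pipeline uses 20-50 iterations; reduced for the sketch
    cv=5,       # 5-fold CV, as in the real evaluation
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```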
The ensemble models typically achieve the best performance by combining the strengths of different algorithms. Metrics reported:
- RMSE: Target metric for competition
- MAE: Mean absolute error
- R²: Coefficient of determination
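For reference, the three metrics computed on a toy prediction vector (the score values are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([500.0, 420.0, 480.0, 530.0])  # illustrative math scores
y_pred = np.array([505.0, 415.0, 470.0, 540.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # competition target metric
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```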
```
cd ML

# Preprocess data only
python preprocessing.py

# Train specific models
python XGBoost.py
python Random_forest.py
python advanced_models.py

# Create ensembles
python ensemble.py
```

You can also generate predictions from every model at once:

```python
from predict import predict_with_all_models

predict_with_all_models('../data/X_test.csv')
```

This creates separate prediction files for each model.
```
cd ML
python data_exploration.py
```

Generates detailed statistics about the dataset.
- For maximum accuracy: Use `stacking_model.pkl` or `ensemble_weighted.pkl`
- For speed: Use `lightgbm_model.pkl` or `xgboost_model.pkl`
- For interpretability: Use `random_forest_model.pkl` and check feature importance
- GPU Acceleration: If you have a CUDA-capable GPU, XGBoost, LightGBM, and CatBoost will automatically use it
- More Tuning: Increase `n_trials` in model files for better hyperparameters (takes longer)
- Feature Engineering: Add domain-specific features in `preprocessing.py`
- Ensemble Tuning: Adjust ensemble weights in `ensemble.py`
The training pipeline will print:
- Data shapes at each step
- Cross-validation scores
- Feature importance rankings
- Training/validation metrics
- Best hyperparameters found
- Set `use_knn_imputer=False` in preprocessing
- Use smaller datasets for tuning, then retrain on full data
- Reduce `n_trials` in hyperparameter tuning
```
pip install --upgrade -r requirements.txt
```

For GPU support:
- Install CUDA toolkit and cuDNN
- Install GPU versions: `pip install xgboost[gpu] lightgbm[gpu]`
- Run `master_training.py` to train all models
- Check the `models/` directory for saved models
- Use `predict.py` with your test data
- Submit `results/predictions.csv` to the competition
The stacking ensemble typically performs best because it:
- Combines diverse models (tree-based + gradient boosting)
- Learns optimal weighting through meta-model
- Reduces variance through ensemble averaging
- Captures different patterns with different algorithms
Good luck with your hackathon!