This project implements a machine learning pipeline to predict flight departure delays of 15 minutes or more (DepDel15) using integrated operational and weather data. By analyzing 105,783 flight records across 10 major U.S. airports, the system identifies high-risk delay windows to support proactive airline operations planning.
- ROC-AUC: 0.735
- Recall: 0.543
- PR-AUC: 0.448
The system utilizes a modular processing architecture to merge disparate datasets:
- Flight Data Processing:
  - Ingested 3 months of Bureau of Transportation Statistics (BTS) data.
  - Filtered to 10 primary hubs (ATL, EWR, JFK, LAS, LAX, MCO, MIA, ORD, SEA, SFO).
  - Feature selection reduced 100+ raw columns to 15 high-impact predictors.
- Weather Data Integration:
  - Flattened hourly JSON observations into structured features.
  - Extracted: `tempC`, `windspeedKmph`, `precipMM`, `visibility`, and `weatherCode`.
- Feature Engineering:
  - Temporal alignment: Merged datasets on `Airport`, `Date`, and `Departure Hour`.
  - Engineered `dep_hour` and `DayOfWeek` to capture cyclic delay patterns.
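The weather flattening and temporal alignment steps could look roughly like the sketch below. The JSON layout (a list of days, each with an `hourly` list and a `time` field encoded as `"0"` to `"2300"`), the flight column names `FlightDate` and `CRSDepTime`, and all sample values are assumptions for illustration; only the weather field names and the merge keys come from the description above.

```python
import pandas as pd

# Hypothetical raw weather payload for one airport; the real JSON layout may differ.
weather_json = {
    "airport": "JFK",
    "days": [
        {
            "date": "2024-01-01",
            "hourly": [
                {"time": "0", "tempC": "3", "windspeedKmph": "14",
                 "precipMM": "0.0", "visibility": "10", "weatherCode": "113"},
                {"time": "1800", "tempC": "1", "windspeedKmph": "32",
                 "precipMM": "1.2", "visibility": "4", "weatherCode": "296"},
            ],
        }
    ],
}

WEATHER_COLS = ["tempC", "windspeedKmph", "precipMM", "visibility", "weatherCode"]


def flatten_weather(payload: dict) -> pd.DataFrame:
    """Turn nested hourly observations into one row per airport, date, and hour."""
    rows = []
    for day in payload["days"]:
        for obs in day["hourly"]:
            row = {
                "Airport": payload["airport"],
                "Date": day["date"],
                # Assumed encoding: "0", "100", ..., "2300" maps to hour 0..23.
                "dep_hour": int(obs["time"]) // 100,
            }
            row.update({col: float(obs[col]) for col in WEATHER_COLS})
            rows.append(row)
    return pd.DataFrame(rows)


# Hypothetical flight rows; CRSDepTime is the scheduled departure as an HHMM integer.
flights = pd.DataFrame({
    "Origin": ["JFK", "JFK"],
    "FlightDate": ["2024-01-01", "2024-01-01"],
    "CRSDepTime": [15, 1845],
    "DepDel15": [0, 1],
})

# Engineer the temporal features used as merge key and predictors.
flights["dep_hour"] = flights["CRSDepTime"] // 100
flights["DayOfWeek"] = pd.to_datetime(flights["FlightDate"]).dt.dayofweek + 1  # 1 = Monday

# Align key names, then left-join so flights without a weather match are kept.
flights = flights.rename(columns={"Origin": "Airport", "FlightDate": "Date"})
weather = flatten_weather(weather_json)
merged = flights.merge(weather, on=["Airport", "Date", "dep_hour"], how="left")
```

A left join keeps every flight even when an hourly observation is missing, which makes gaps visible as `NaN` rather than silently dropping rows.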
- Propagation Effect: Delay probability scales from ~10% in the morning to ~30% in the evening.
- Non-Linearity: Weather impacts (wind/visibility) exhibit threshold-based behavior rather than linear correlations.
- Interaction Driven: The strongest predictors are interactions between `Departure Hour` and `Day of Week`.
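Patterns like these can be checked by grouping the merged table by departure hour and by binned weather values. The sketch below uses a tiny made-up frame; the values and the bin edges are illustrative assumptions, not results from this project.

```python
import pandas as pd

# Toy stand-in for the merged flight/weather table (values are made up).
df = pd.DataFrame({
    "dep_hour":      [6, 6, 6, 6, 18, 18, 18, 18],
    "windspeedKmph": [5, 12, 35, 8, 10, 40, 45, 15],
    "DepDel15":      [0, 0, 1, 0, 0, 1, 1, 0],
})

# Delay rate per departure hour: the mean of a 0/1 target is the delayed share.
rate_by_hour = df.groupby("dep_hour")["DepDel15"].mean()

# Bin wind speed to look for threshold-like jumps rather than a linear trend.
# The bin edges are arbitrary illustration values.
wind_bins = pd.cut(df["windspeedKmph"], bins=[0, 20, 30, 60])
rate_by_wind = df.groupby(wind_bins, observed=False)["DepDel15"].mean()

print(rate_by_hour)
print(rate_by_wind)
```

Binning before averaging is what exposes a step change at a threshold; a plain correlation coefficient would flatten it into a weak linear signal.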
We compared four models, prioritizing Recall so that the early-warning use case misses as few delayed flights as possible.
| Model | ROC-AUC | PR-AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost (Final) | 0.735 | 0.448 | 0.403 | 0.543 | 0.463 |
| Random Forest | 0.730 | 0.436 | 0.443 | 0.461 | 0.452 |
| Gradient Boosting | 0.730 | 0.440 | 0.461 | 0.404 | 0.431 |
| Logistic Regression | 0.679 | 0.321 | 0.423 | 0.221 | 0.291 |
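The columns in this table can be computed with `scikit-learn` as sketched below. The labels and scores are toy values, the 0.5 decision threshold is an assumption, and using `average_precision_score` for PR-AUC is one common choice rather than a confirmed detail of this project.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy labels and predicted delay probabilities (made up for illustration).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.05])


def evaluate(y_true, y_score, threshold=0.5):
    """Return the table's metrics; threshold-based ones use hard predictions."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "ROC-AUC": roc_auc_score(y_true, y_score),
        "PR-AUC": average_precision_score(y_true, y_score),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }


metrics = evaluate(y_true, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

ROC-AUC and PR-AUC are threshold-free, while Precision, Recall, and F1 depend on the cutoff, so lowering the threshold is one way to trade precision for recall.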
- Algorithm: XGBoost
- Hyperparameters: `learning_rate: 0.1`, `max_depth: 4`, `n_estimators: 300`, `subsample: 0.8`.
- Interpretability: SHAP analysis identifies `dep_hour`, `Origin`, and `windgustKmph` as the primary drivers of model decisions.
- Language: Python 3.13.5
- Data Science: `pandas`, `numpy`, `scikit-learn`
- ML Framework: `xgboost`
- Visualization: `matplotlib`, `seaborn`, `shap`
- Data Expansion: Incorporate full-year seasonality and aircraft tail-number rotation tracking.
- Advanced Features: Integrate real-time METAR/TAF weather alerts and airport congestion metrics.
- Deployment: Implement temporal cross-validation and an automated early-warning alert system.
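For the temporal cross-validation item, one option is `TimeSeriesSplit` from `scikit-learn`, sketched below on eight placeholder rows. This assumes the records are sorted by scheduled departure time beforehand; the fold count is arbitrary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical: 8 flight records already sorted by scheduled departure time.
X = np.arange(8).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in splitter.split(X)
]

# In every fold, all training rows precede all validation rows,
# so the model is never evaluated on data older than what it was fit on.
no_leakage = all(max(train) < min(test) for train, test in folds)

for train, test in folds:
    print("train:", train, "validate:", test)
```

Unlike a shuffled k-fold split, each training window only grows forward in time, which mirrors how a deployed model would be retrained and then scored on later flights.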
`pip install -r requirements.txt`

Run the following notebooks sequentially to reproduce the results:
- `00_data_directory.ipynb`
- `01_flight_data_preprocessing.ipynb`
- `02_weather_data_preprocessing.ipynb`
- `03_merge_flight_weather_data.ipynb`
- `04_exploratory_data_analysis.ipynb`
- `05_baseline_modeling_data.ipynb`
- `06_final_modeling_data.ipynb`
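The notebook sequence can also be executed without opening each file. The sketch below builds one `jupyter nbconvert --execute` command per notebook; it assumes Jupyter with nbconvert is installed (whether `requirements.txt` provides it is not stated) and that the script runs from the directory containing the notebooks. `build_commands` and `run_all` are hypothetical helper names.

```python
import subprocess

# Notebooks in the order they must run.
NOTEBOOKS = [
    "00_data_directory.ipynb",
    "01_flight_data_preprocessing.ipynb",
    "02_weather_data_preprocessing.ipynb",
    "03_merge_flight_weather_data.ipynb",
    "04_exploratory_data_analysis.ipynb",
    "05_baseline_modeling_data.ipynb",
    "06_final_modeling_data.ipynb",
]


def build_commands(notebooks):
    """Return one nbconvert command per notebook, preserving order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb]
        for nb in notebooks
    ]


def run_all(commands):
    """Execute the commands one by one, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands(NOTEBOOKS)
# Call run_all(commands) to execute the whole pipeline.
```

With `check=True`, a failing notebook stops the run, so later steps never execute against missing intermediate files.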