This project implements a complete data engineering and machine learning pipeline to analyze historical Formula 1 data and estimate each driver’s likelihood of becoming World Champion at different points in the season, based on performance patterns from past champions.
In simple terms:
The model evaluates how closely each driver's performance resembles that of past World Champions, producing a score between 0 and 1 that represents their championship likelihood at that moment.
The pipeline is structured into several layers that follow modern data architecture patterns:
- Raw Layer
  - Contains the original CSV files exactly as downloaded.
- Bronze Layer (Normalized Raw Data)
  - Converts raw CSVs into Delta Lake tables.
  - Ensures consistent schemas and efficient columnar storage.
- Silver Layer (Cleaned & Enriched Data)
  - Contains curated datasets derived from SQL transformations.
  - Includes multi-season performance summaries and race-level stats.
- Feature Store
  - Contains driver-level, time-aware features such as:
    - rolling performance metrics
    - season-to-date performance
    - sprint and race averages
    - gain/loss indicators
    - historical form vs. current form
- ABT (Analytical Base Table)
  - Consolidates all drivers, features, and labels into a training-ready dataset.
- Machine Learning Model
  - Trains a Random Forest classifier to estimate championship likelihood.
  - Generates:
    - OOT (Out-of-Time) evaluation
    - Future season predictions
    - Line charts and bar chart race GIFs showing probability evolution.
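The time-aware features listed under the Feature Store can be illustrated with a small sketch. This is not the project's SQL; it is a plain-Python approximation that assumes a driver's points per round are available as a list. The function name `time_aware_features`, the output keys, and the window size are hypothetical. Each row uses only results up to and including that round, which is what keeps the features free of information from later in the season.

```python
from collections import deque


def time_aware_features(points_by_round, window=3):
    """Return one feature row per round, using only results up to that round.

    points_by_round: points scored by one driver, in race order (assumed input).
    window: number of most recent rounds in the rolling average (assumed value).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` results
    season_total = 0.0
    rows = []
    for round_number, points in enumerate(points_by_round, start=1):
        recent.append(points)
        season_total += points
        rows.append(
            {
                "round": round_number,
                "season_points_to_date": season_total,
                "season_avg_to_date": season_total / round_number,
                "rolling_avg_points": sum(recent) / len(recent),
            }
        )
    return rows


# Illustrative points sequence, not real results.
rows = time_aware_features([25, 18, 0, 25, 15], window=3)
for row in rows:
    print(row)
```

In the project itself these features come from `sql/feature_store_drivers.sql`; the sketch only shows the "as of this round" idea behind rolling and season-to-date metrics.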
F1_DATA_ENGINEERING/
│
├── data/
│ ├── raw/ # Original CSV files
│ ├── bronze/
│ │ └── results/ # Delta table (Bronze)
│ ├── silver/
│ │ ├── abt_champions/ # ABT (Analytical Base Table)
│ │ ├── champions/ # Silver table
│ │ └── feature_store_drivers/ # Driver-level feature store
│
├── figures/
│ ├── combined_history.png
│ ├── future_bar_race.gif
│ ├── future_top5.png
│ ├── oot_bar_race.gif
│ └── oot_top_drivers.png
│
├── scripts/
│ ├── 01_raw.py
│ ├── 02_bronze.py
│ ├── 03_feature_store.py
│ ├── 04_silver.py
│ ├── 05_ml_model.py
│ └── spark_ops.py
│
├── sql/
│ ├── abt_champions.sql
│ ├── champions.sql
│ └── feature_store_drivers.sql
│
├── .gitignore
├── pyproject.toml
├── README.md
└── uv.lock
- PySpark
- Delta Lake
- SparkSQL
- Rich CLI
- Scikit-learn
- feature_engine
- Pandas / NumPy
- Matplotlib, Seaborn
- bar_chart_race (GIF animation)
uv run scripts/01_raw.py --start 1990 --stop 2025
uv run scripts/02_bronze.py
uv run scripts/03_feature_store.py --query sql/feature_store_drivers.sql --start 1990-01-01 --stop 2026-01-01
uv run scripts/04_silver.py --query sql/champions.sql
uv run scripts/04_silver.py --query sql/abt_champions.sql
uv run scripts/05_ml_model.py
Outputs are saved to the figures/ directory.
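Because the steps depend on each other, it can help to run them from one place and stop at the first failure. The following is a minimal sketch of such a wrapper. It is not part of the repository: the names `PIPELINE_STEPS`, `run_pipeline`, and `dry_run` are hypothetical, while the commands and flags are copied from the commands listed above.

```python
import subprocess

# Commands copied from the run instructions, in execution order.
PIPELINE_STEPS = [
    ["uv", "run", "scripts/01_raw.py", "--start", "1990", "--stop", "2025"],
    ["uv", "run", "scripts/02_bronze.py"],
    [
        "uv", "run", "scripts/03_feature_store.py",
        "--query", "sql/feature_store_drivers.sql",
        "--start", "1990-01-01", "--stop", "2026-01-01",
    ],
    ["uv", "run", "scripts/04_silver.py", "--query", "sql/champions.sql"],
    ["uv", "run", "scripts/04_silver.py", "--query", "sql/abt_champions.sql"],
    ["uv", "run", "scripts/05_ml_model.py"],
]


def run_pipeline(steps=PIPELINE_STEPS, dry_run=True):
    """Run each step in order and return the command lines that were handled.

    With dry_run=True nothing is executed; the commands are only collected.
    With dry_run=False a failing step raises CalledProcessError, which stops
    the remaining steps from running.
    """
    handled = []
    for argv in steps:
        handled.append(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
    return handled


# Preview the plan without executing anything.
planned = run_pipeline(dry_run=True)
for command in planned:
    print(command)
```

Calling `run_pipeline(dry_run=False)` from the project root would execute the steps for real, assuming `uv` is installed and on the PATH.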
The model estimates how strongly each driver’s current season performance resembles the profile of a typical F1 World Champion.
Given a driver’s historical performance up to any point in the season, the model assigns a score between 0 and 1 representing how likely that driver is to end the year as World Champion, based on patterns extracted from past champions.
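Important: Probabilities for different drivers do not have to sum to 1, because each driver is scored independently by a binary classifier (champion vs. not champion) rather than ranked against the rest of the field.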
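A short worked example may make this concrete. The scores below are invented for illustration and are not model output, and `normalize_scores` is a hypothetical helper that the project does not provide. It shows that independent scores can total more than 1, and how they could be rescaled if a "share of the field" view were wanted.

```python
def normalize_scores(scores):
    """Rescale independent scores so they sum to 1 (optional post-processing)."""
    total = sum(scores.values())
    if total == 0:
        # No driver has a non-zero score, so there is nothing to share out.
        return {driver: 0.0 for driver in scores}
    return {driver: score / total for driver, score in scores.items()}


# Invented example scores for three placeholder drivers.
raw_scores = {"Driver A": 0.70, "Driver B": 0.55, "Driver C": 0.05}

raw_total = sum(raw_scores.values())  # about 1.30, so not a distribution
shares = normalize_scores(raw_scores)

print(f"raw total: {raw_total:.2f}")
for driver, share in shares.items():
    print(f"{driver}: raw={raw_scores[driver]:.2f} share={share:.3f}")
```

The rescaled values keep the ordering of the drivers but are no longer the classifier's own estimates, so the raw scores remain the quantity the model actually reports.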
- Probability curves
- OOT predictions
- Future season forecasts
- Top-5 trends
- Animated bar chart race GIF
- Binary classification (champion vs. not champion)
- Out-of-time validation
- Driver-year sampling
- Rolling windows
- Season aggregates
- Sprint & race metrics
- Consistency indicators
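Out-of-time validation, mentioned in the list above, means the evaluation rows come from seasons later than every training row. The sketch below shows that split on toy data. The field names (`driver`, `season`, `champion`), the cutoff year, and the function name `out_of_time_split` are assumptions made for illustration, not the schema of the project's ABT.

```python
def out_of_time_split(rows, cutoff_season):
    """Split rows by season: earlier seasons train, later seasons evaluate."""
    train = [row for row in rows if row["season"] < cutoff_season]
    oot = [row for row in rows if row["season"] >= cutoff_season]
    return train, oot


# Toy driver-season rows with placeholder drivers and labels.
abt_rows = [
    {"driver": "A", "season": 2019, "champion": 1},
    {"driver": "B", "season": 2019, "champion": 0},
    {"driver": "A", "season": 2020, "champion": 0},
    {"driver": "B", "season": 2020, "champion": 1},
    {"driver": "A", "season": 2021, "champion": 0},
    {"driver": "B", "season": 2021, "champion": 1},
]

train_rows, oot_rows = out_of_time_split(abt_rows, cutoff_season=2021)
print(len(train_rows), "training rows;", len(oot_rows), "out-of-time rows")
```

Splitting by season rather than at random keeps rounds from the same championship out of both sets at once, so the evaluation is closer to how the model is used on a season it has not seen.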

