A modular Python project for extracting, processing, analyzing, and visualizing football (soccer) event data, with a focus on passes, carries, and shots. The project features advanced machine learning models for predicting pass success, alongside robust support for feature engineering, exploratory data analysis (EDA), and animated visualizations of match events using StatsBomb open data.
Demo: https://youtu.be/oa1YaRwfkO4
- Data Pipeline: Load, filter, and process StatsBomb event data for passes, carries, and shots.
- Feature Engineering: Add tactical, spatial, and contextual features for modeling (e.g., xG, pass success).
- Machine Learning Model: Advanced models for predicting pass success, tracked via MLflow.
- Exploratory Data Analysis (EDA): Utilities for missing values, class distribution, data types, duplicate removal, and visualizations.
- Visualization: Animated and static visualizations of passes, carries, and shots on a football pitch.
- Web App: Interactive Streamlit dashboard for match insights, pass networks, shot maps, and model leaderboards.
football_stream_processor/
├── .github/ # Contains CI workflows
├── open-data/ # Git submodule: StatsBomb open data
│ └── data/
| └── events/ # Contains events json files
| └── matches/ # Contains matches json files
├── scripts/ # Standalone scripts
│ └── eda.py # Run EDA from the command line
│
├── resources/ # Contains visualizations, figures, and generated assets
│
├── src/ # Source code for the project
| ├── app/ # Streamlit application
| │ │ ├── components/ # Reusable UI components
| │ │ ├── webpages/ # Page modules (overview, analysis, insights)
| │ │ └── main.py # Entry-point for Streamlit
│ └── __init__.py
│
| ├── football_stream_processor # Core library
│ │ ├── animation/ # Contains animation visualizations script
│ │ │ ├── __init.py__
│ │ │ ├── animate_match.py # Animates the entire match
│ │ │ └── animate_passes.py # Animates match passes
│ │ │
│ │ ├── match/ # Contains match related scripts
│ │ │ ├── __init.py__
│ │ │ ├── match_summary.py # Gives match summary
│ │ │ └── simualte.py # Simulates match events in real time
│ │ │
│ │ ├── models/ # Machine learning models
│ │ │ ├── __init__.py
│ │ │ ├── xg_model/ # Pass prediction model
│ │ │
│ └── └── utils/ # Utility functions and classes
│ ├── __init__.py
│ ├── animation_utils.py
│ └── eda_utils.py # EDA utility functions
│
├── tests/ # Unit and integration tests
├── Dockerfile # Multi-stage Docker build for the web app
├── .dockerignore # Docker ignore file
├── .gitignore # Git ignore file
├── .gitmodules # Git modules file
├── README.md # Project README
├── poetry.lock # poetry lock file
├── pyproject.toml # Contains dependencies
├── .python-version # Contains python versions
└── LICENCE
-
Clone the repo and submodule:
git clone https://github.com/SmarthBakshi/Stream-Processor.git cd Stream-Processor git submodule update --init --recursive -
Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 - -
Install dependencies:
poetry install
To prepare the pass data dataframe, run:
poetry run python data_pipeline.py --limit 100The --limit argument specifies the number of event JSON files to use for preparing the pass_data dataframe.
poetry run python scripts/eda.pypython animate/animate_passes.py --file open-data/data/events/22912.json --save
python animate/animate_match.py --file open-data/data/events/22912.json --saveTo train the xG prediction model, run:
python src/football_stream_processor/models/xg_model/train.pypoetry run streamlit run src/app/main.py \
--server.port=8501 --server.enableCORS=false- Run the
train.pyscript to have the mlflow runs and experiments stored inmlflow/ - Build the Docker Image
docker build -t football-dashboard .- Run the docker Image
docker run --rm \
-p 8501:8501 \
-v $(pwd)/open-data:/app/open-data \
-v $(pwd)/mlflow/mlruns:/app/mlflow/mlruns football-dashboard- GitHub Actions (.github/workflows/ci.yml):
- Checks out code
- Installs Poetry & dependencies
- Runs pytest & linting (ruff)
You can deploy the Docker image to any container host (Render, AWS ECS/Fargate, etc.). Be sure to mount or include the open-data/data folder and set environment variables: MLFLOW_TRACKING_URI & MLFLOW_EXPERIMENT_NAME
- Add new features: Extend
feat_engineering.pywith new feature functions. - Add new EDA steps: Extend
eda_utils.pyor subclassPassDataEDA. - Visualizations: Modify or add new plotting functions in the
animate/directory. - Log MLFlow results: Change the 'MLFLOW_TRACKING_URI' in config.py to log locally or to a remote server
This project is licensed under the MIT License.