This repo contains a modular and extensible demo stack for Data Engineering workflows, using open-source tools. It is built around two sample datasets:
- Drink distribution company case study from PwC. Download here
- Taxi industry TLC Trip Record Data from NYC. Download here
To automatically download, unzip, and normalize the tables and save them into a PostgreSQL database (requires Docker):

```bash
docker-compose build
docker-compose up -d
docker-compose logs pyspark-notebook --follow
```
- Run the Jupyter notebooks from `notebooks/` via VSCode or at http://localhost:8888/ + token
- Use Ctrl+C to stop, or run `docker-compose down`
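
At a high level, the notebooks read the raw CSVs, split them into normalized tables with PySpark, and write the results to PostgreSQL over JDBC. A minimal sketch of that pattern follows; the file path, `vendor_id` column, JDBC driver version, and database credentials are illustrative assumptions, not the repo's actual values:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark session with the PostgreSQL JDBC driver (package version is an example).
spark = (
    SparkSession.builder
    .appName("normalize-tables")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0")
    .getOrCreate()
)

# Read a raw, unzipped CSV into a DataFrame.
raw = spark.read.csv("storage/raw/trips.csv", header=True, inferSchema=True)

# Example normalization step: extract a vendor dimension from the wide raw table.
vendors = raw.select(F.col("vendor_id")).distinct()

# Write the normalized table to PostgreSQL over JDBC (host and credentials are placeholders).
(
    vendors.write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/demo")
    .option("dbtable", "vendors")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save()
)
```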
To visualize the data in Metabase:

```bash
docker-compose up -d
docker-compose logs metabase --follow
```
- Access the Metabase dashboard at http://localhost:3000/
- Use Ctrl+C to stop, or run `docker-compose down`
To orchestrate the pipeline with Airflow (requires Astronomer's `astro` CLI):

```bash
cd airflow
astro dev start
# workaround: give the Spark containers write access to the shared storage folder
docker exec -it da-spark-master chmod 777 storage
```
- Access the Airflow UI via http://localhost:8080/
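
For orientation, the ingestion DAG pairs a daily schedule with a Spark job submission. A sketch of that shape, assuming Airflow 2.4+ and the Apache Spark provider; the `dag_id`, schedule, application path, and connection id are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Daily ingestion DAG that submits the PySpark normalization job.
with DAG(
    dag_id="data_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    normalize_tables = SparkSubmitOperator(
        task_id="normalize_tables",
        application="include/normalize_tables.py",  # PySpark script packaged with the project
        conn_id="spark_default",                    # connection pointing at the Spark master
    )
```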
- Set up multi-container stack with Jupyter notebook + Spark for local development
- Download .csv files and unzip them
- Normalize tables using PySpark
- Configure PostgreSQL DB and write table outputs
- Use psycopg2 to add PK and FK constraints to the database (see the psycopg2 sketch after this list)
- Finish basic Airflow (Astro) configuration for master and worker setup
- Set up basic DAG example for data ingestion
- Fix spark-worker lacking write permissions
- Migrate notebook files to a new DAG in the Airflow environment
- Configure a daily schedule and conditional download based on filename (DAG)
- Set up Metabase app for easy-to-use data visualization
- Create meaningful visualizations in Metabase
- Migrate Parquet to Delta Lake to allow for efficient storage of historical records (see the Delta Lake sketch after this list)
- Deploy Airflow (Astro) setup to Astronomer Cloud
- Databricks integration: Migrate SparkSubmitOperator to DatabricksSubmitRunOperator (see the Databricks sketch after this list)
- Storage migration: use AWS S3 buckets instead of local storage
- Database server migration: use AWS RDS service
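
For the PK/FK constraint step, a minimal psycopg2 sketch; the table names, columns, and connection parameters are assumptions and should be matched to the actual schema and compose setup:

```python
import psycopg2

# Connection parameters are placeholders; match them to the compose PostgreSQL service.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="demo", user="postgres", password="postgres"
)
with conn, conn.cursor() as cur:
    # Primary key on the dimension table, foreign key from the fact table.
    cur.execute("ALTER TABLE vendors ADD CONSTRAINT vendors_pk PRIMARY KEY (vendor_id);")
    cur.execute(
        "ALTER TABLE trips "
        "ADD CONSTRAINT trips_vendor_fk FOREIGN KEY (vendor_id) REFERENCES vendors (vendor_id);"
    )
conn.close()
```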
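
For the planned Parquet-to-Delta migration, the change is mostly session configuration plus a different write format. A sketch under those assumptions; paths and the delta-core package version are examples:

```python
from pyspark.sql import SparkSession

# Enable Delta Lake on a plain Spark session (package coordinates are an example).
spark = (
    SparkSession.builder
    .appName("parquet-to-delta")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Rewrite an existing Parquet table as a Delta table to keep versioned history.
trips = spark.read.parquet("storage/tables/trips")
trips.write.format("delta").mode("overwrite").save("storage/delta/trips")
```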
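
For the Databricks item, the migration would mainly swap the operator and point it at a job cluster. A hedged sketch using the Databricks provider; the connection id, cluster spec, and DBFS script path are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="data_ingestion_databricks",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Replaces the SparkSubmitOperator task; all values below are illustrative.
    normalize_tables = DatabricksSubmitRunOperator(
        task_id="normalize_tables",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
        spark_python_task={"python_file": "dbfs:/scripts/normalize_tables.py"},
    )
```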
