A modular and extensible demo stack for Data Engineering workflows, using open-source tools.

fideldalmasso/data_engineering_stack_demo

Data Engineering Stack Demo

This repo contains a modular and extensible demo stack for Data Engineering workflows, using open-source tools.

Source datasets

Stack components

1. PySpark + Jupyter = ETL

Automatically downloads and unzips the source files, normalizes the tables, and saves them to a PostgreSQL database. Requires Docker.

docker-compose build
docker-compose up -d
docker-compose logs pyspark-notebook --follow
  • Run the Jupyter notebook files from notebooks/ via VSCode, or in a browser at http://localhost:8888/ using the access token
  • Press Ctrl+C to stop following the logs, or run docker-compose down to stop the containers
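
The unzip step of the ETL described above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the code from the notebooks: the function name `extract_csvs` and the paths in the usage note are hypothetical, and the notebooks may handle archives differently.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, dest_dir):
    """Extract only the .csv members of a zip archive into dest_dir.

    Hypothetical helper for illustration. Returns the sorted list of
    extracted file paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip directory entries and anything that is not a CSV file.
            if member.endswith("/") or not member.lower().endswith(".csv"):
                continue
            # extract() returns the path it actually wrote to.
            extracted.append(Path(archive.extract(member, dest)))
    return sorted(extracted)
```

For example, `extract_csvs("storage/raw.zip", "storage/csv")` (both paths are assumptions) would return the CSV paths that a later PySpark step could read.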

2. Metabase = Data Visualization

docker-compose up -d
docker-compose logs metabase --follow

3. Airflow (Astro CLI) = Workflow Orchestration

Requires the Astronomer Astro CLI

cd airflow
astro dev start
docker exec -it da-spark-master chmod 777 storage 

TODO List

  • Set up multi-container stack with Jupyter notebook + Spark for local development
  • Download .csv files and unzip them
  • Normalize tables using PySpark
  • Configure PostgreSQL DB and write table outputs
  • Use psycopg2 to add PK and FK constraints to the database
  • Finish basic Airflow (Astro) configuration for master and worker setup
  • Set up basic DAG example for data ingestion
  • Fix spark-worker lacking write permissions
  • Migrate notebook files to a new DAG in the Airflow environment
  • Configure daily scheduler and conditional download based on filename (DAG)
  • Set up Metabase app for easy-to-use data visualization
  • Create meaningful visualizations in Metabase
  • Migrate Parquet to Delta Lake to allow for efficient storage of historical records
  • Deploy Airflow (Astro) setup to Astronomer Cloud
  • Databricks integration: Migrate SparkSubmitOperator to DatabricksSubmitRunOperator
  • Storage migration: use AWS S3 buckets instead
  • Database server migration: use AWS RDS service
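
The "add PK and FK constraints" item in the list above can be sketched as follows. This is an illustration under assumptions rather than the repo's actual code: the helper names and the table and column names (`customer`, `orders`, `customer_id`) are hypothetical, and `apply_constraints` expects an already open psycopg2 connection.

```python
def primary_key_sql(table, column):
    """Build an ALTER TABLE statement that adds a primary key."""
    return f'ALTER TABLE "{table}" ADD PRIMARY KEY ("{column}");'


def foreign_key_sql(table, column, ref_table, ref_column):
    """Build an ALTER TABLE statement that adds a named foreign key."""
    return (
        f'ALTER TABLE "{table}" ADD CONSTRAINT "fk_{table}_{column}" '
        f'FOREIGN KEY ("{column}") REFERENCES "{ref_table}" ("{ref_column}");'
    )


def apply_constraints(conn, statements):
    """Run each statement on an open psycopg2 connection, then commit."""
    with conn.cursor() as cur:
        for statement in statements:
            cur.execute(statement)
    conn.commit()


# Hypothetical tables: primary keys go first so the foreign key has a
# unique column to reference.
statements = [
    primary_key_sql("customer", "customer_id"),
    foreign_key_sql("orders", "customer_id", "customer", "customer_id"),
]
```

The identifiers here are interpolated directly, which is only reasonable for trusted, hard-coded names. For names that come from outside the code, the `psycopg2.sql` module is the safer way to compose identifiers.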

Drink distribution company: ERD

ERD diagram

Stack Screenshots

  • Metabase auto-generated graphs
  • Metabase dashboard
  • Jupyter notebook with PySpark
  • PostgreSQL DB in DBeaver
  • Airflow DAG execution
