AnnieFiB/DataEngineering

Data Engineering Stack Portfolio

Python · Airflow · Spark · PostgreSQL · Great Expectations · API

Repository Structure

/
└── DataEngineering/
    ├── airflow_dags/              # Apache Airflow DAGs & scheduling workflows
    │   ├── app/
    │   ├── dags/
    │   ├── Dockerfile
    │   └── requirements.txt
    ├── pyspark/                   # PySpark jobs and transformation logic
    ├── DWHmodelling/              # Database design & data warehouse schemas
    ├── projects_python_scripts/   # Python scripts for ETL/data ops
    ├── API_WebScr/                # APIs, crawling and data collection scripts
    │   └── Dockerfile
    ├── assets/                    # Images, diagrams, or templates
    ├── .gitignore
    ├── cleanup.bat
    ├── requirements.txt
    ├── docker-compose.yml         # Master Compose file
    └── README.md


Key Components

1. Data Engineering

| Feature | Description |
|---|---|
| Data Models | Database and data warehouse schema design (star/snowflake) |
| Airflow | DAG-based orchestration and task scheduling |
| Spark | Distributed data processing and transformation |
| Data Quality | Great Expectations for rule-based validation and profiling |
| APIs & Webscraping | Collecting structured/unstructured data from web & endpoints |
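
To make the Data Quality row more concrete, here is a minimal sketch of rule-based validation in plain Python. It does not use the Great Expectations API and is not taken from this repository. The `Rule` type, the `validate` function, and the sample column names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """A named check applied to one column of every row (hypothetical helper)."""
    name: str
    column: str
    check: Callable[[Any], bool]


def validate(rows: List[Dict[str, Any]], rules: List[Rule]) -> Dict[str, List[int]]:
    """Return, for each rule name, the indices of the rows that fail it."""
    failures: Dict[str, List[int]] = {rule.name: [] for rule in rules}
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row.get(rule.column)):
                failures[rule.name].append(index)
    return failures


# Illustrative rules; the column names are assumptions, not the repository's schema.
RULES = [
    Rule("order_id_not_null", "order_id", lambda v: v is not None),
    Rule("amount_non_negative", "amount",
         lambda v: isinstance(v, (int, float)) and v >= 0),
]

SAMPLE_ROWS = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": None, "amount": 5.00},
    {"order_id": 3, "amount": -2.50},
]

if __name__ == "__main__":
    # Prints a mapping from rule name to the indices of failing rows.
    print(validate(SAMPLE_ROWS, RULES))
```

Keeping each rule as data (a name, a column, and a predicate) is what lets a validation suite be extended without touching the loop that applies it. Dedicated tools such as Great Expectations build on the same idea and add profiling and reporting.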

Workflow Example

# 1. Run the data cleaning script
python DataEngineering/pipelines/data_cleaning.py
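
The contents of `data_cleaning.py` are not shown in this README. As a rough, hypothetical sketch of the kind of step such a script might perform, the following uses only the Python standard library to trim whitespace, drop rows missing a required field, and remove exact duplicates. The function name `clean_rows` and the sample fields are assumptions for illustration, not the repository's code.

```python
import csv
import io
from typing import Dict, Iterable, List


def clean_rows(rows: Iterable[Dict[str, str]], required: List[str]) -> List[Dict[str, str]]:
    """Trim whitespace, drop rows missing a required field, drop exact duplicates."""
    seen = set()
    cleaned: List[Dict[str, str]] = []
    for row in rows:
        # Normalise every value; treat a missing value as an empty string.
        stripped = {key: (value or "").strip() for key, value in row.items()}
        if any(not stripped.get(field) for field in required):
            continue  # a required field is empty or absent
        fingerprint = tuple(sorted(stripped.items()))
        if fingerprint in seen:
            continue  # exact duplicate of an earlier row
        seen.add(fingerprint)
        cleaned.append(stripped)
    return cleaned


# Hypothetical sample data standing in for a raw input file.
RAW_CSV = """id,name
1,  Alice
2,
1,Alice
3,Bob
"""

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(RAW_CSV))
    for row in clean_rows(reader, required=["id", "name"]):
        print(row)
```

In this sample, the row with an empty `name` is dropped, and the second `Alice` row is treated as a duplicate once whitespace has been trimmed.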

## Maintenance

```bash
# Run cleanup script (Windows)
cleanup.bat

# Run cleanup script (Linux/macOS)
./cleanup.sh
```
