/
├── DataEngineering/
├── airflow_dags/                # Apache Airflow DAGs & scheduling workflows
│   ├── app/
│   ├── dags/
│   ├── Dockerfile
│   └── requirements.txt
├── pyspark/                     # PySpark jobs and transformation logic
├── DWHmodelling/                # Database design & data warehouse schemas
├── projects_python_scripts/     # Python scripts for ETL/data ops
├── API_WebScr/                  # APIs, crawling and data collection scripts
│   └── Dockerfile
├── assets/                      # Images, diagrams, or templates
├── .gitignore
├── cleanup.bat
├── requirements.txt
├── docker-compose.yml           # Master Compose file
└── README.md

| Feature | Description |
|---|---|
| Data Models | Database and data warehouse schema design (star/snowflake) |
| Airflow | DAG-based orchestration and task scheduling |
| Spark | Distributed data processing and transformation |
| Data Quality | Great Expectations for rule-based validation and profiling |
| APIs & Web Scraping | Collecting structured and unstructured data from websites and API endpoints |
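
To make the "Data Models" row concrete, here is a minimal sketch of a star schema using Python's built-in `sqlite3` module. The table names (`dim_date`, `dim_product`, `fact_sales`), the columns, and the sample rows are all hypothetical and chosen for illustration only; they are not taken from the `DWHmodelling/` folder, and a real warehouse would likely target a different database engine.

```python
import sqlite3

# Hypothetical star schema: one fact table that references two dimension tables.
SCHEMA = """
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- surrogate key, e.g. 20240115
    full_date TEXT    NOT NULL,
    month     INTEGER NOT NULL,
    year      INTEGER NOT NULL
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    category    TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    date_key    INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    quantity    INTEGER NOT NULL,
    amount      REAL    NOT NULL
);
"""


def build_demo_warehouse():
    """Create an in-memory database with the schema and a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?, ?)",
        [(20240115, "2024-01-15", 1, 2024), (20240210, "2024-02-10", 2, 2024)],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Keyboard", "Hardware"), (2, "Editor licence", "Software")],
    )
    conn.executemany(
        "INSERT INTO fact_sales (date_key, product_key, quantity, amount) "
        "VALUES (?, ?, ?, ?)",
        [(20240115, 1, 2, 100.0), (20240115, 2, 1, 50.0), (20240210, 1, 1, 50.0)],
    )
    conn.commit()
    return conn


def revenue_by_category(conn):
    """Join the fact table to a dimension and aggregate, the typical star-schema query."""
    return conn.execute(
        """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()


if __name__ == "__main__":
    for category, revenue in revenue_by_category(build_demo_warehouse()):
        print(category, revenue)
```

Measures (`quantity`, `amount`) live in the fact table and descriptive attributes live in the dimensions, so most reports reduce to one join per dimension plus a `GROUP BY`. A snowflake variant would further normalize a dimension, for example by moving `category` into its own table.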

1. Run the data cleaning script: `python DataEngineering/pipelines/data_cleaning.py`

## Maintenance
```bash
# Run cleanup script (Windows)
cleanup.bat

# Run cleanup script (Linux/macOS)
bash cleanup.sh