This repository demonstrates a complete ETL (Extract, Transform, Load) pipeline using Python. It covers the workflow from raw data extraction to building a data mart, performing data quality checks, and generating insights with visualizations.
Project Structure:

```text
DB_Connection/      # Database connection scripts
DataLake/           # Raw data storage
Information_Mart/   # Final processed data
Visualizations/     # Generated charts
extracted/          # Extracted datasets
staging_1/          # First staging layer
staging_2/          # Second staging layer
schema_model.db     # Database schema
data_mart.db        # Final data mart
Schema_Diagram.png  # ER diagram of the database
requirements.txt    # Python dependencies
Extraction.py       # Data extraction script
Transformation.py   # Data cleaning and transformation script
Modeling.py         # Aggregation / modeling script
Quality_check.py    # Data quality validation
Visualization.py    # Generate charts & visualizations
main.py             # Main pipeline execution
```
Pipeline Steps:

- Extraction: Raw data is collected from different sources and stored in `DataLake/` and `extracted/`. `Extraction.py` automates the extraction process.
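A minimal sketch of the idea, assuming the sources are CSV files; the source names and URLs here are hypothetical, and the actual logic lives in `Extraction.py`:

```python
import pandas as pd
from pathlib import Path

# Hypothetical source list; the real sources are defined in Extraction.py.
SOURCES = {
    "orders": "https://example.com/data/orders.csv",
    "customers": "data/customers.csv",
}

def extract(out_dir: str = "extracted") -> None:
    """Pull each raw source and persist it unchanged to the data lake."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for name, source in SOURCES.items():
        df = pd.read_csv(source)  # pandas reads local paths and URLs alike
        df.to_csv(out / f"{name}.csv", index=False)

if __name__ == "__main__":
    extract()
```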
- Transformation: `Transformation.py` performs data cleaning and preparation:
  - Remove duplicates and invalid records
  - Handle missing values
  - Convert datatypes
  - Create staging tables in `staging_1/` & `staging_2/`
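A minimal sketch of such a cleaning pass with pandas; the column names and cleaning rules here are assumptions, not the actual rules in `Transformation.py`:

```python
import pandas as pd
from pathlib import Path

def transform(src: str = "extracted/orders.csv",
              dst: str = "staging_1/orders_clean.csv") -> None:
    """Clean one extracted dataset and write it to the first staging layer."""
    df = pd.read_csv(src)

    df = df.drop_duplicates()                # remove duplicate records
    df = df.dropna(subset=["customer_id"])   # key column must be present
    df["amount"] = df["amount"].fillna(0)    # fill optional numeric gaps
    df = df[df["amount"] >= 0]               # drop invalid negative amounts

    # Convert datatypes for downstream modeling.
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["customer_id"] = df["customer_id"].astype(str)

    Path(dst).parent.mkdir(exist_ok=True)
    df.to_csv(dst, index=False)

if __name__ == "__main__":
    transform()
```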
- Modeling: `Modeling.py` loads the transformed data into the data mart (`Information_Mart/`). The database schema is stored in `schema_model.db` and visualized in `Schema_Diagram.png`.
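Since the data mart is a SQLite file, the load step might look like the following sketch; the staging file, table, and column names are hypothetical:

```python
import sqlite3
import pandas as pd

def load_to_mart(staging_csv: str = "staging_2/orders_clean.csv",
                 db_path: str = "data_mart.db") -> None:
    """Aggregate the cleaned staging data and load it into the SQLite data mart."""
    df = pd.read_csv(staging_csv)
    # Example aggregation: total amount per customer.
    mart = df.groupby("customer_id", as_index=False)["amount"].sum()
    # The `with` block commits the transaction on success.
    with sqlite3.connect(db_path) as conn:
        mart.to_sql("fact_sales", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load_to_mart()
```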
- Quality check: `Quality_check.py` validates that:
  - No data is missing or inconsistent
  - Data types and formats are correct
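Checks like these can be expressed as assertions that fail loudly; a minimal sketch, again with hypothetical table and column names:

```python
import sqlite3
import pandas as pd

def run_quality_checks(db_path: str = "data_mart.db") -> None:
    """Fail loudly if the data mart violates basic quality rules."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql("SELECT * FROM fact_sales", conn)

    # No missing values anywhere in the table.
    assert not df.isnull().any().any(), "Found missing values in fact_sales"
    # Expected datatypes.
    assert pd.api.types.is_numeric_dtype(df["amount"]), "'amount' must be numeric"
    # No duplicate keys.
    assert df["customer_id"].is_unique, "Duplicate customer_id rows"
    print("All quality checks passed.")

if __name__ == "__main__":
    run_quality_checks()
```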
- Visualization: `Visualization.py` generates charts and saves them in `Visualizations/`.
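A sketch of one such chart with matplotlib, assuming the hypothetical `fact_sales` table from the modeling sketch above:

```python
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

def plot_revenue(db_path: str = "data_mart.db",
                 out_dir: str = "Visualizations") -> None:
    """Chart the total amount per customer and save it as a PNG."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql("SELECT * FROM fact_sales", conn)

    Path(out_dir).mkdir(exist_ok=True)
    ax = df.plot.bar(x="customer_id", y="amount", legend=False)
    ax.set_ylabel("Total amount")
    ax.set_title("Revenue per customer")
    plt.tight_layout()
    plt.savefig(Path(out_dir) / "revenue_per_customer.png")

if __name__ == "__main__":
    plot_revenue()
```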
Sample charts are saved in `Visualizations/`, and the database schema diagram is available in `Schema_Diagram.png`.
Challenges:

- Handling inconsistent and missing data across multiple sources.
- Designing multiple staging layers to keep transformations incremental and easy to debug.
- Keeping the ETL scripts modular and reusable.
- Optimizing queries and transformations for performance.
Installation:

- Clone the repository:

  ```bash
  git clone https://github.com/keroloshany47/End_To_End_ETL_Using_Python.git
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the database connection in `DB_Connection/` if needed (SQLite databases are included); a connection sketch follows this list.

- Run the full pipeline:

  ```bash
  python main.py
  ```

- Check `Visualizations/` for the generated charts.
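Because the bundled databases are SQLite, a connection helper in `DB_Connection/` could be as simple as the following sketch; the function name and interface are assumptions:

```python
import sqlite3

def get_connection(db_path: str = "data_mart.db") -> sqlite3.Connection:
    """Open a SQLite connection with foreign-key enforcement enabled."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")
    return conn
```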
To run the stages individually, execute the scripts in the following order:

- Extraction:

  ```bash
  python Extraction.py
  ```

- Transformation:

  ```bash
  python Transformation.py
  ```

- Modeling / load to the data mart:

  ```bash
  python Modeling.py
  ```

- Data quality check:

  ```bash
  python Quality_check.py
  ```

- Visualization:

  ```bash
  python Visualization.py
  ```

After each step, outputs and processed data are saved in their respective folders (`staging_1/`, `staging_2/`, `Information_Mart/`, `Visualizations/`).
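For reference, `main.py` could orchestrate these stages along the lines of the sketch below; the actual orchestration logic lives in `main.py`, this only assumes each stage is the standalone script listed above:

```python
import subprocess
import sys

# The order mirrors the pipeline stages described above.
STEPS = [
    "Extraction.py",
    "Transformation.py",
    "Modeling.py",
    "Quality_check.py",
    "Visualization.py",
]

def main() -> None:
    for script in STEPS:
        print(f"Running {script} ...")
        # Stop the pipeline immediately if any stage fails.
        subprocess.run([sys.executable, script], check=True)

if __name__ == "__main__":
    main()
```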
This project demonstrates a complete ETL workflow in Python, from raw data extraction to building a data mart and generating meaningful insights. The pipeline is modular, reusable, and easy to extend to new datasets.




