A practical portfolio of data engineering pipelines, orchestrated DAGs, and analytics notebooks. These projects demonstrate end-to-end ETL processes, real-time ingestion, data lake design, and Python-based transformations.
- Apache Airflow DAG orchestration
- Batch and streaming ETL pipelines
- Python & Pandas-based data wrangling
- Data validation and unit testing
- Jupyter notebooks with visual insights
Power companies in emerging markets struggle to track real-time grid performance.
Build an end-to-end pipeline that:
- Ingests data from simulated smart meters via AWS Kinesis
- Transforms with AWS Glue + Apache Hudi
- Loads into Redshift
- Visualizes results in Amazon QuickSight
- Stream real-time energy usage
- Aggregate usage by time, region, household
- Detect anomalies and outages
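The aggregation and detection goals above can be sketched in plain Python. This is only an illustration of the logic, not the Glue/Hudi job itself: the record layout, the `READINGS` sample, the function names, and the z-score threshold are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical smart-meter readings: (ISO timestamp, region, household id, kWh).
READINGS = [
    ("2024-01-01T10:05:00", "north", "h1", 1.2),
    ("2024-01-01T10:35:00", "north", "h2", 0.8),
    ("2024-01-01T10:15:00", "south", "h3", 1.0),
    ("2024-01-01T11:05:00", "north", "h1", 1.1),
    ("2024-01-01T11:20:00", "north", "h2", 0.0),
    ("2024-01-01T11:40:00", "south", "h3", 9.5),
]


def aggregate_by_hour_and_region(readings):
    """Sum kWh per (hour bucket, region)."""
    totals = defaultdict(float)
    for ts, region, _household, kwh in readings:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        totals[(hour.isoformat(), region)] += kwh
    return dict(totals)


def flag_readings(readings, z_threshold=2.0):
    """Label zero readings as possible outages and large deviations as anomalies."""
    values = [kwh for *_rest, kwh in readings]
    mu, sigma = mean(values), pstdev(values)
    flags = []
    for ts, _region, household, kwh in readings:
        if kwh == 0.0:
            flags.append((ts, household, "possible_outage"))
        elif sigma > 0 and abs(kwh - mu) / sigma > z_threshold:
            flags.append((ts, household, "anomaly"))
    return flags


if __name__ == "__main__":
    print(aggregate_by_hour_and_region(READINGS))
    print(flag_readings(READINGS))
```

A global z-score is the simplest possible detector; a production job would more likely compare each household against its own rolling baseline.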
Government data is available but not easily analyzable for citizens or journalists.
Create a public analytics dashboard:
- ETL pipelines in Apache Airflow
- Cleaned datasets in BigQuery
- Visualizations in Metabase
- Public search and filter frontend using Next.js
- Process and publish monthly updated datasets
- Publish visual data stories (health, education, environment)
- Enable CSV downloads and API access
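As a rough sketch of the clean-and-publish step behind the monthly updates and CSV downloads, the following uses only the Python standard library. The column names, the `RAW_CSV` sample, and both function names are hypothetical; in the actual project this logic would presumably run as a task inside an Airflow DAG and load into BigQuery.

```python
import csv
import io

# Hypothetical raw monthly extract; column names and values are assumptions.
RAW_CSV = """district,indicator,value
 north district ,school_enrollment,1200
South District,school_enrollment,
south district,clinic_visits,340
"""

FIELDS = ["district", "indicator", "value"]


def clean_rows(raw_text):
    """Trim whitespace, normalize district names, and drop rows with no value."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        value = (row["value"] or "").strip()
        if not value:
            continue  # skip rows with a missing measurement
        cleaned.append({
            "district": row["district"].strip().title(),
            "indicator": row["indicator"].strip(),
            "value": int(value),
        })
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows to CSV text suitable for a download endpoint."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(clean_rows(RAW_CSV)))
```

Keeping cleaning and serialization as separate functions means the same cleaned rows could feed both the CSV download and an API response.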
Election stakeholders need real-time sentiment insights from social media.
Stream political tweets and comments:
- Kafka or Kinesis Firehose for ingestion
- Spark Structured Streaming for processing
- S3 + PrestoDB for storage and querying
- Dashboard built with Apache Superset
- Classify sentiments: positive, neutral, negative
- Track by politician, region, or hashtag
- Surface trending concerns and flag hate speech
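The classification and per-hashtag tracking goals can be illustrated with a deliberately naive lexicon-based sketch. The word lists, sample posts, and function names are assumptions for the example; a real deployment would likely apply a trained model inside the Spark Structured Streaming job rather than keyword counts.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative lexicons (assumed); not a substitute for a trained classifier.
POSITIVE = {"good", "great", "hope", "support"}
NEGATIVE = {"bad", "corrupt", "fail", "angry"}

# Hypothetical sample posts.
POSTS = [
    "Great turnout today, real hope for change #election",
    "Another corrupt promise, voters are angry #election #economy",
    "Polling stations open at 8am #election",
]


def classify(text):
    """Return 'positive', 'negative', or 'neutral' from simple keyword counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_by_hashtag(posts):
    """Count sentiment labels for every hashtag that appears in the posts."""
    counts = defaultdict(Counter)
    for text in posts:
        label = classify(text)
        for tag in re.findall(r"#(\w+)", text.lower()):
            counts[tag][label] += 1
    return {tag: dict(c) for tag, c in counts.items()}


if __name__ == "__main__":
    print(sentiment_by_hashtag(POSTS))
```

The same grouping approach would extend to politician or region by swapping the hashtag extraction for the relevant key.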