Domain: Data Engineering | Data Platform | Infrastructure Engineering Architecture
Level: Beginner → Advanced
Eight production-quality projects demonstrating progressive expertise across the full spectrum of modern data engineering — from foundational batch pipelines to advanced platform engineering, data mesh, and enterprise DevOps.
Each project includes: architecture diagrams, production code, unit tests, deployment scripts, trade-off analysis, failure handling, and cost estimates.
BEGINNER                       INTERMEDIATE                   ADVANCED
────────────────────────────────────────────────────────────────────────
01-batch-etl-pipeline      →   03-streaming-pipeline      →   06-k8s-data-platform
02-data-warehouse-modeling     04-data-lake-medallion         07-data-mesh
                               05-infrastructure-as-code      08-cicd-data-pipelines
Daily sales data ingestion: CRM API + S3 → Redshift
- Extract: REST API pagination, CSV, S3 raw zone; watermark-based incremental loads
- Transform: Business rules, revenue calculations, SCD-ready metadata
- Validate: 15+ data quality rules across completeness/accuracy/consistency/timeliness
- Load: S3 Parquet (Hive-partitioned) + Redshift staging table upsert
- Orchestrate: Airflow DAG with branching, sensors, retry logic, Slack alerting
- Stack: Python · Pandas · Apache Airflow · Amazon S3 · Amazon Redshift
- Tests: 20+ pytest unit tests with full coverage
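The watermark-based incremental extract above can be sketched in a few lines. This is a hedged illustration, not the repo's actual code: `fetch_page` and the `updated_at` field are hypothetical stand-ins for the CRM API client, and the new watermark should only be persisted after the load commits.

```python
from datetime import datetime

def extract_incremental(fetch_page, watermark: datetime):
    """Pull all records updated after `watermark` from a paginated source.

    `fetch_page(since, page)` is assumed to return a list of dicts carrying
    an ISO-8601 `updated_at` timestamp, or an empty list when exhausted.
    """
    records, page = [], 1
    while True:
        batch = fetch_page(since=watermark, page=page)
        if not batch:
            break
        records.extend(batch)
        page += 1
    # Advance the watermark to the max updated_at seen; if nothing came back,
    # keep the old watermark so the next run re-checks the same window.
    new_watermark = max(
        (datetime.fromisoformat(r["updated_at"]) for r in records),
        default=watermark,
    )
    return records, new_watermark
```

Persisting `new_watermark` only after a successful Redshift load is what makes a failed run safely retryable.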
Star schema for retail analytics with SCD Type 2
- Schema: Fact tables (order line item grain), 5 dimension tables
- SCD Type 2: Customer, Product, Sales Rep — full point-in-time history
- Performance: DISTKEY/SORTKEY optimization, materialized views
- Transformations: dbt incremental models, snapshot tests
- Stack: PostgreSQL / Redshift · dbt Core · SQL
- Tests: dbt schema tests + custom SQL assertions
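The SCD Type 2 pattern above is implemented with dbt snapshots in the project itself; as a rough pandas sketch of the same idea, a changed attribute closes the current row and opens a new one, preserving point-in-time history. Column names (`customer_id`, `tier`, `valid_from`, `valid_to`, `is_current`) are illustrative, not the repo's schema.

```python
import pandas as pd

def scd2_apply(dim: pd.DataFrame, changes: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Apply a batch of attribute changes to an SCD2 dimension table."""
    dim = dim.copy()
    new_rows = []
    for _, chg in changes.iterrows():
        mask = (dim["customer_id"] == chg["customer_id"]) & dim["is_current"]
        current = dim.loc[mask]
        if not current.empty and current.iloc[0]["tier"] == chg["tier"]:
            continue  # attribute unchanged -> no new version
        # close the existing current row (no-op for brand-new customers)
        dim.loc[mask, ["valid_to", "is_current"]] = [as_of, False]
        # open a new current row effective as of this batch
        new_rows.append({"customer_id": chg["customer_id"], "tier": chg["tier"],
                         "valid_from": as_of, "valid_to": None, "is_current": True})
    return pd.concat([dim, pd.DataFrame(new_rows)], ignore_index=True)
```

A point-in-time join then filters on `valid_from <= date < valid_to`, which is exactly what dbt snapshots generate under the hood.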
Real-time fraud detection: 100+ TPS, sub-2s alert latency
- Ingest: Confluent Kafka (3-broker, RF=3, exactly-once)
- Process: Spark Structured Streaming — rule-based fraud scoring + velocity windows
- Sink: Delta Lake (Bronze) + Kafka fraud alerts topic
- Monitor: Prometheus metrics, Grafana dashboards, Kafka UI
- Stack: Confluent Platform 7.5 (Kafka) · Spark Structured Streaming 3.5 · Delta Lake · Docker
- Concepts: Exactly-once semantics, watermarking, backpressure, late data
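As a plain-Python illustration of the velocity-window rule above (the real pipeline runs it as a Spark Structured Streaming aggregation with watermarking), a per-card deque can play the role of the window state. Thresholds and field names here are made up for the sketch.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

class VelocityRule:
    """Flag a card that makes more than `max_txns` transactions per `window`."""

    def __init__(self, max_txns: int = 3, window: timedelta = timedelta(seconds=60)):
        self.max_txns = max_txns
        self.window = window
        self.events = defaultdict(deque)  # card_id -> event times inside the window

    def score(self, card_id: str, ts: datetime) -> bool:
        q = self.events[card_id]
        q.append(ts)
        # evict events that fell out of the window; in the real pipeline,
        # Spark's watermark decides how long to wait for late data instead
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_txns
```

The in-memory dict is the piece Spark replaces with fault-tolerant, checkpointed state so exactly-once output survives restarts.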
Bronze → Silver → Gold data lake for enterprise analytics
- Bronze: Raw ingestion (CSV/Parquet/Kafka), schema enforcement, time travel
- Silver: Deduplication (Delta MERGE), PII masking, quality scoring
- Gold: Customer 360, Daily Summary, ML Feature Store
- Stack: Apache Spark · Delta Lake · S3 · AWS Glue
- Features: ACID transactions, schema evolution, Z-ORDER, auto-compaction
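The Silver-layer dedup above uses Delta Lake `MERGE` so replays are idempotent; a pandas stand-in shows why keeping the latest row per key has that property. Column names (`order_id`, `ingested_at`) are illustrative.

```python
import pandas as pd

def dedupe_latest(df: pd.DataFrame, key: str = "order_id",
                  version_col: str = "ingested_at") -> pd.DataFrame:
    """Keep only the most recent version of each record.

    Re-running this over already-deduped data returns the same frame,
    mirroring the idempotence of a MERGE keyed on `key`.
    """
    return (df.sort_values(version_col)
              .drop_duplicates(subset=[key], keep="last")
              .reset_index(drop=True))
```

In the actual pipeline the same keys drive `MERGE INTO silver ... WHEN MATCHED THEN UPDATE`, with Delta's transaction log providing the ACID guarantees pandas cannot.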
Full AWS data platform provisioned via Terraform
- Modules: VPC (multi-AZ, NAT, flow logs), S3 (5-tier lake, KMS, lifecycle), IAM (least-privilege)
- Environments: dev / staging / prod with isolated state
- State: S3 backend + DynamoDB locking, KMS-encrypted
- CI/CD: GitHub Actions — validate → plan → apply dev → manual approve → apply prod
- Security: tfsec + Checkov scanning, OIDC authentication (no stored credentials)
Complete modern data stack on Amazon EKS
- Airflow: KubernetesExecutor, HA scheduler (2 replicas), KEDA autoscaling
- Spark: Spark Operator, Dynamic Allocation (2-20 executors), ResourceQuota
- Observability: Prometheus PrometheusRules, Grafana, AlertManager, Loki
- Security: IRSA, External Secrets Operator, RBAC, LimitRange
- Stack: Amazon EKS · Airflow 2.8 · Spark 3.5 · Helm · KEDA · Prometheus
Domain-oriented architecture with federated governance
- Domains: Sales, Finance, Operations — independent pipelines + data products
- Contracts: Versioned YAML contracts (schema + SLA + quality thresholds)
- Catalog: Federated catalog with discovery, lineage, and access control
- Governance: Global PII/retention policies, per-domain flexibility
- Stack: Python · Apache Spark · Delta Lake · AWS Glue Data Catalog · DataHub
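A data contract like the versioned YAML ones above boils down to a machine-checkable schema plus quality thresholds. The sketch below uses an inline dict with an assumed shape (`schema`, `quality.max_null_fraction`); the real contract format in the project may differ.

```python
import pandas as pd

# Hypothetical contract for a "sales.orders" data product.
CONTRACT = {
    "product": "sales.orders",
    "version": "1.2.0",
    "schema": {"order_id": "int64", "net_revenue": "float64"},
    "quality": {"max_null_fraction": {"net_revenue": 0.01}},
}

def validate_against_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return a list of contract violations (empty list means compliant)."""
    violations = []
    for col, dtype in contract["schema"].items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            violations.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col, limit in contract["quality"]["max_null_fraction"].items():
        if col in df.columns and df[col].isna().mean() > limit:
            violations.append(f"{col}: null fraction above {limit}")
    return violations
```

Running this check in the producing domain's pipeline, before publishing, is what lets consumers trust the catalog entry without inspecting the data themselves.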
Enterprise DevOps for data pipelines
- Tests: Unit (85%+ coverage) → Integration (Docker) → dbt → E2E (staging)
- Build: Docker image + Trivy vulnerability scan → ECR
- Deploy: Blue-green deployment (zero-downtime, instant rollback)
- Quality Gates: Coverage thresholds, security scan, E2E gates, manual prod approval
- Stack: GitHub Actions · Docker · Kubernetes · Helm · pytest · ruff · mypy
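The unit layer of the test pyramid above runs first because it needs no Docker or cloud resources. A representative (entirely hypothetical — the function and its rules are not from the repo) pytest-style unit test pair:

```python
def net_revenue(gross: float, discount_pct: float, tax_pct: float) -> float:
    """Discount is applied before tax; both rates are fractions in [0, 1]."""
    if not (0 <= discount_pct <= 1 and 0 <= tax_pct <= 1):
        raise ValueError("rates must be fractions between 0 and 1")
    return round(gross * (1 - discount_pct) * (1 + tax_pct), 2)

def test_net_revenue_applies_discount_before_tax():
    assert net_revenue(100.0, 0.10, 0.20) == 108.0

def test_net_revenue_rejects_percent_style_rates():
    try:
        net_revenue(100.0, 10, 0)  # 10 would mean 1000% -- almost certainly a bug
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Hundreds of tests at this layer run in seconds, which is why the coverage gate sits here rather than at the slower integration and E2E stages.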
LANGUAGES ORCHESTRATION STREAMING STORAGE INFRA
────────────────────────────────────────────────────────────────────────
Python 3.11 Apache Airflow Apache Kafka Amazon S3 Terraform
SQL (KubeExecutor) (Confluent) Delta Lake Kubernetes
HCL (Terraform) Spark Operator Spark Streaming Amazon Redshift Helm
YAML KEDA Kafka Streams PostgreSQL Docker
TESTING MONITORING SECURITY GOVERNANCE
────────────────────────────────────────────────────────────────────────
pytest Prometheus AWS IAM + IRSA Data Contracts
dbt tests Grafana KMS Encryption Data Catalog
mypy/ruff AlertManager tfsec/Checkov Lineage (Atlas)
E2E tests Loki Lake Formation SCD Type 2
| Pattern | Project(s) |
|---|---|
| Incremental load with watermarks | 01 |
| Staging table upsert (Redshift) | 01 |
| Star Schema + SCD Type 2 | 02 |
| Exactly-once Kafka semantics | 03 |
| Spark structured streaming + watermarks | 03 |
| Medallion Architecture (Bronze/Silver/Gold) | 04 |
| Delta Lake MERGE (idempotent upserts) | 04 |
| Time travel + schema evolution | 04 |
| Least-privilege IAM + IRSA | 05, 06 |
| Multi-environment Terraform + remote state | 05 |
| Blue-green zero-downtime deployment | 06, 08 |
| Kubernetes autoscaling (KEDA + HPA) | 06 |
| Data Mesh + Data Contracts | 07 |
| Federated governance + lineage | 07 |
| Test pyramid (unit → integration → E2E) | 08 |
# Python
python 3.11+
pip install pandas boto3 pyspark delta-spark apache-airflow psycopg2-binary
# Infrastructure
terraform >= 1.6
aws-cli >= 2
kubectl >= 1.28
helm >= 3.12
# Testing
pip install pytest pytest-cov mypy ruff dbt-redshift

# Quick start (Project 01)
cd 01-batch-etl-pipeline
pip install pandas pytest numpy
# Run tests (no external dependencies needed)
pytest tests/test_transformations.py -v
# Process sample data
python -c "
import pandas as pd
from src.transform import SalesDataTransformer
from src.validate import SalesDataValidator
config = {'source_system': 'crm', 'high_value_threshold': 10000, 'min_expected_rows': 5}
df = pd.read_csv('sample_data/sales_sample.csv', parse_dates=['order_date'])
t = SalesDataTransformer(config)
df = t.clean_and_standardize(df)
df = t.apply_business_rules(df)
print(df[['order_id', 'net_revenue', 'revenue_tier', 'is_high_value']].to_string())
"

This portfolio demonstrates end-to-end capability across:
| Capability | Evidence |
|---|---|
| Batch data engineering | Project 1: ETL pipeline with validation, retries, quarantine |
| Data warehouse design | Project 2: Star schema, SCD Type 2, dbt |
| Streaming / real-time | Project 3: Kafka + Spark, exactly-once, fraud detection |
| Data lake architecture | Project 4: Medallion, Delta Lake, ML features |
| Cloud infrastructure | Project 5: Terraform, multi-env, security |
| Platform engineering | Project 6: Kubernetes, autoscaling, observability |
| Enterprise architecture | Project 7: Data mesh, contracts, governance |
| DevOps for data | Project 8: CI/CD, blue-green, test pyramid |
Each project is independently deployable. See individual READMEs for setup instructions.