
Enterprise Data Platform & Infrastructure Portfolio

Domain: Data Engineering | Data Platform | Infrastructure Engineering
Architecture Level: Beginner → Advanced


Portfolio Overview

Eight production-quality projects demonstrating progressive expertise across the full spectrum of modern data engineering — from foundational batch pipelines to advanced platform engineering, data mesh, and enterprise DevOps.

Each project includes: architecture diagrams, production code, unit tests, deployment scripts, trade-off analysis, failure handling, and cost estimates.


Project Roadmap

```text
BEGINNER                      INTERMEDIATE                    ADVANCED
──────────────────────────────────────────────────────────────────────────
01-batch-etl-pipeline     →   03-streaming-pipeline       →   06-k8s-data-platform
02-data-warehouse-modeling    04-data-lake-medallion          07-data-mesh
                              05-infrastructure-as-code       08-cicd-data-pipelines
```

Projects at a Glance

🟢 Beginner Projects

01 · Batch ETL Pipeline — Daily sales data ingestion: CRM API + S3 → Redshift

  • Extract: REST API pagination, CSV, S3 raw zone; watermark-based incremental loads
  • Transform: Business rules, revenue calculations, SCD-ready metadata
  • Validate: 15+ data quality rules across completeness/accuracy/consistency/timeliness
  • Load: S3 Parquet (Hive-partitioned) + Redshift staging table upsert
  • Orchestrate: Airflow DAG with branching, sensors, retry logic, Slack alerting
  • Stack: Python · Pandas · Apache Airflow · Amazon S3 · Amazon Redshift
  • Tests: 20+ pytest unit tests with full coverage
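The watermark-based incremental extract can be sketched in a few lines of pandas. Function and column names here are illustrative, not the project's actual API:

```python
import pandas as pd

def incremental_extract(df: pd.DataFrame, watermark: pd.Timestamp):
    """Return only rows newer than the last watermark, plus the new watermark."""
    fresh = df[df["updated_at"] > watermark]
    # Advance the watermark only when new rows actually arrived
    new_watermark = fresh["updated_at"].max() if not fresh.empty else watermark
    return fresh, new_watermark

# Two loads over the same table: the second pass picks up nothing new.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
fresh, wm = incremental_extract(df, pd.Timestamp("2024-01-01"))
print(len(fresh), wm)  # rows 2 and 3 pass the filter
```

The returned watermark would be persisted (e.g. in an Airflow Variable or a control table) between DAG runs.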

02 · Data Warehouse Modeling — Star schema for retail analytics with SCD Type 2

  • Schema: Fact tables (order line item grain), 5 dimension tables
  • SCD Type 2: Customer, Product, Sales Rep — full point-in-time history
  • Performance: DISTKEY/SORTKEY optimization, materialized views
  • Transformations: dbt incremental models, snapshot tests
  • Stack: PostgreSQL / Redshift · dbt Core · SQL
  • Tests: dbt schema tests + custom SQL assertions
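In the project the point-in-time history comes from dbt snapshots; the SCD Type 2 mechanics they implement can be illustrated in simplified pandas form (all column names are illustrative):

```python
import pandas as pd

def scd2_apply(dim: pd.DataFrame, updates: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
    """Expire changed current rows and append new versions (SCD Type 2)."""
    merged = updates.merge(dim[dim["is_current"]], on="customer_id",
                           suffixes=("", "_cur"))
    changed_ids = merged.loc[merged["segment"] != merged["segment_cur"], "customer_id"]
    mask = dim["customer_id"].isin(changed_ids) & dim["is_current"]
    dim = dim.copy()
    dim.loc[mask, "valid_to"] = now        # close out the old version
    dim.loc[mask, "is_current"] = False
    new_rows = updates[updates["customer_id"].isin(changed_ids)].assign(
        valid_from=now, valid_to=pd.NaT, is_current=True
    )
    return pd.concat([dim, new_rows], ignore_index=True)

dim = pd.DataFrame({
    "customer_id": [1], "segment": ["bronze"],
    "valid_from": [pd.Timestamp("2024-01-01")], "valid_to": [pd.NaT],
    "is_current": [True],
})
updates = pd.DataFrame({"customer_id": [1], "segment": ["gold"]})
hist = scd2_apply(dim, updates, pd.Timestamp("2024-06-01"))
print(hist[["customer_id", "segment", "is_current"]])
```

Queries "as of" any date join on `valid_from`/`valid_to`, which is what makes the full point-in-time history possible.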

🟡 Intermediate Projects

03 · Streaming Pipeline — Real-time fraud detection: 100+ TPS, sub-2s alert latency

  • Ingest: Confluent Kafka (3-broker, RF=3, exactly-once)
  • Process: Spark Structured Streaming — rule-based fraud scoring + velocity windows
  • Sink: Delta Lake (Bronze) + Kafka fraud alerts topic
  • Monitor: Prometheus metrics, Grafana dashboards, Kafka UI
  • Stack: Confluent Platform 7.5 (Apache Kafka) · Spark Structured Streaming 3.5 · Delta Lake · Docker
  • Concepts: Exactly-once semantics, watermarking, backpressure, late data
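The scoring itself runs in Spark Structured Streaming; a plain-Python sketch of the velocity-window rule (thresholds and names are illustrative):

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

class VelocityScorer:
    """Flag a card when too many transactions land inside a sliding window."""
    def __init__(self, max_txns: int = 3, window: timedelta = timedelta(minutes=5)):
        self.max_txns = max_txns
        self.window = window
        self.history = defaultdict(deque)

    def score(self, card_id: str, ts: datetime) -> bool:
        q = self.history[card_id]
        q.append(ts)
        # Drop events that have slid out of the window (in the real job,
        # watermarking handles late-arriving events instead)
        while q and ts - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_txns

scorer = VelocityScorer()
base = datetime(2024, 1, 1, 12, 0)
flags = [scorer.score("card-1", base + timedelta(seconds=30 * i)) for i in range(5)]
print(flags)  # [False, False, False, True, True]
```

In Spark the equivalent is a `groupBy` over a time window keyed by card, with a watermark bounding how long state is retained.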

04 · Data Lake Medallion — Bronze → Silver → Gold data lake for enterprise analytics

  • Bronze: Raw ingestion (CSV/Parquet/Kafka), schema enforcement, time travel
  • Silver: Deduplication (Delta MERGE), PII masking, quality scoring
  • Gold: Customer 360, Daily Summary, ML Feature Store
  • Stack: Apache Spark · Delta Lake · S3 · AWS Glue
  • Features: ACID transactions, schema evolution, Z-ORDER, auto-compaction
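The Silver-layer dedup and masking logic, shown here in pandas for readability (the real pipeline does this with Delta MERGE on Spark; names and the masking policy are illustrative):

```python
import hashlib
import pandas as pd

def to_silver(bronze: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate on the business key and mask PII."""
    silver = (
        bronze.sort_values("ingested_at")
        .drop_duplicates(subset="order_id", keep="last")  # latest record wins
        .copy()
    )
    # One-way hash instead of raw email (illustrative masking policy)
    silver["email"] = silver["email"].map(
        lambda e: hashlib.sha256(e.encode()).hexdigest()[:12]
    )
    return silver

bronze = pd.DataFrame({
    "order_id": [1, 1, 2],
    "email": ["a@x.com", "a@x.com", "b@y.com"],
    "ingested_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
})
silver = to_silver(bronze)
print(len(silver))  # duplicates collapsed to one row per order_id
```

With Delta, the same idea becomes `MERGE INTO silver USING updates ON silver.order_id = updates.order_id`, which is what makes reprocessing idempotent.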

05 · Infrastructure as Code — Full AWS data platform provisioned via Terraform

  • Modules: VPC (multi-AZ, NAT, flow logs), S3 (5-tier lake, KMS, lifecycle), IAM (least-privilege)
  • Environments: dev / staging / prod with isolated state
  • State: S3 backend + DynamoDB locking, KMS-encrypted
  • CI/CD: GitHub Actions — validate → plan → apply dev → manual approve → apply prod
  • Security: tfsec + Checkov scanning, OIDC authentication (no stored credentials)

🔴 Advanced Projects

06 · Kubernetes Data Platform — Complete modern data stack on Amazon EKS

  • Airflow: KubernetesExecutor, HA scheduler (2 replicas), KEDA autoscaling
  • Spark: Spark Operator, Dynamic Allocation (2-20 executors), ResourceQuota
  • Observability: Prometheus Operator (PrometheusRule alerts), Grafana, AlertManager, Loki
  • Security: IRSA, External Secrets Operator, RBAC, LimitRange
  • Stack: Amazon EKS · Airflow 2.8 · Spark 3.5 · Helm · KEDA · Prometheus

07 · Data Mesh — Domain-oriented architecture with federated governance

  • Domains: Sales, Finance, Operations — independent pipelines + data products
  • Contracts: Versioned YAML contracts (schema + SLA + quality thresholds)
  • Catalog: Federated catalog with discovery, lineage, and access control
  • Governance: Global PII/retention policies, per-domain flexibility
  • Stack: Python · Apache Spark · Delta Lake · AWS Glue Data Catalog · DataHub
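Contracts are versioned YAML in the project; once parsed into a dict, enforcement can be sketched like this (field names and the threshold are illustrative):

```python
def validate_against_contract(rows: list, contract: dict) -> list:
    """Check a batch against a (parsed) data contract; return violations."""
    violations = []
    required = {f["name"] for f in contract["schema"]["fields"] if f.get("required")}
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            violations.append(f"row {i}: missing {sorted(missing)}")
    # Quality threshold from the contract, e.g. max share of rows with nulls
    null_rate = sum(1 for r in rows if any(v is None for v in r.values())) / max(len(rows), 1)
    if null_rate > contract["quality"]["max_null_rate"]:
        violations.append(f"null rate {null_rate:.2f} exceeds threshold")
    return violations

contract = {
    "schema": {"fields": [{"name": "order_id", "required": True},
                          {"name": "amount", "required": True}]},
    "quality": {"max_null_rate": 0.1},
}
rows = [{"order_id": 1, "amount": 9.5}, {"order_id": 2}]
print(validate_against_contract(rows, contract))
```

A producing domain runs this as a publish gate, so consumers can rely on the contract's schema and quality guarantees.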

08 · CI/CD for Data Pipelines — Enterprise DevOps for data pipelines

  • Tests: Unit (85%+ coverage) → Integration (Docker) → dbt → E2E (staging)
  • Build: Docker image + Trivy vulnerability scan → ECR
  • Deploy: Blue-green deployment (zero-downtime, instant rollback)
  • Quality Gates: Coverage thresholds, security scan, E2E gates, manual prod approval
  • Stack: GitHub Actions · Docker · Kubernetes · Helm · pytest · ruff · mypy
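The quality gates reduce to a handful of boolean checks evaluated in order; a minimal sketch (gate names and metrics are illustrative, mirroring the 85% coverage figure above):

```python
def run_gates(metrics: dict) -> list:
    """Evaluate each quality gate; return the names of the ones that failed."""
    gates = [
        ("coverage", lambda m: m["coverage"] >= 0.85),      # pytest --cov result
        ("critical_vulns", lambda m: m["critical_vulns"] == 0),  # Trivy scan
        ("e2e_passed", lambda m: m["e2e_passed"]),          # staging E2E run
    ]
    return [name for name, check in gates if not check(metrics)]

print(run_gates({"coverage": 0.91, "critical_vulns": 0, "e2e_passed": True}))  # []
print(run_gates({"coverage": 0.80, "critical_vulns": 2, "e2e_passed": True}))
```

In the workflow, a non-empty failure list blocks the deploy job; only the manual approval gate for prod sits outside this automated check.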

Technology Landscape

```text
LANGUAGES        ORCHESTRATION    STREAMING        STORAGE          INFRA
────────────────────────────────────────────────────────────────────────
Python 3.11      Apache Airflow   Apache Kafka     Amazon S3        Terraform
SQL              (KubeExecutor)   (Confluent)      Delta Lake       Kubernetes
HCL (Terraform)  Spark Operator   Spark Streaming  Amazon Redshift  Helm
YAML             KEDA             Kafka Streams    PostgreSQL       Docker

TESTING          MONITORING       SECURITY         GOVERNANCE
────────────────────────────────────────────────────────────────────────
pytest           Prometheus       AWS IAM + IRSA   Data Contracts
dbt tests        Grafana          KMS Encryption   Data Catalog
mypy/ruff        AlertManager     tfsec/Checkov    Lineage (DataHub)
E2E tests        Loki             Lake Formation   SCD Type 2
```

Architecture Patterns Demonstrated

| Pattern | Project(s) |
| ------- | ---------- |
| Incremental load with watermarks | 01 |
| Staging table upsert (Redshift) | 01 |
| Star Schema + SCD Type 2 | 02 |
| Exactly-once Kafka semantics | 03 |
| Spark Structured Streaming + watermarks | 03 |
| Medallion Architecture (Bronze/Silver/Gold) | 04 |
| Delta Lake MERGE (idempotent upserts) | 04 |
| Time travel + schema evolution | 04 |
| Least-privilege IAM + IRSA | 05, 06 |
| Multi-environment Terraform + remote state | 05 |
| Blue-green zero-downtime deployment | 06, 08 |
| Kubernetes autoscaling (KEDA + HPA) | 06 |
| Data Mesh + Data Contracts | 07 |
| Federated governance + lineage | 07 |
| Test pyramid (unit → integration → E2E) | 08 |

Getting Started

Prerequisites

```bash
# Python
python 3.11+
pip install pandas boto3 pyspark delta-spark apache-airflow psycopg2-binary

# Infrastructure
terraform >= 1.6
aws-cli >= 2
kubectl >= 1.28
helm >= 3.12

# Testing
pip install pytest pytest-cov mypy ruff dbt-redshift
```

Run Project 1 Locally (Quickest Start)

```bash
cd 01-batch-etl-pipeline
pip install pandas pytest numpy

# Run tests (no external dependencies needed)
pytest tests/test_transformations.py -v

# Process sample data
python -c "
import pandas as pd
from src.transform import SalesDataTransformer
from src.validate import SalesDataValidator

config = {'source_system': 'crm', 'high_value_threshold': 10000, 'min_expected_rows': 5}
df = pd.read_csv('sample_data/sales_sample.csv', parse_dates=['order_date'])
t = SalesDataTransformer(config)
df = t.clean_and_standardize(df)
df = t.apply_business_rules(df)
print(df[['order_id', 'net_revenue', 'revenue_tier', 'is_high_value']].to_string())
"
```

Portfolio Positioning

This portfolio demonstrates end-to-end capability across:

| Capability | Evidence |
| ---------- | -------- |
| Batch data engineering | Project 1: ETL pipeline with validation, retries, quarantine |
| Data warehouse design | Project 2: Star schema, SCD Type 2, dbt |
| Streaming / real-time | Project 3: Kafka + Spark, exactly-once, fraud detection |
| Data lake architecture | Project 4: Medallion, Delta Lake, ML features |
| Cloud infrastructure | Project 5: Terraform, multi-env, security |
| Platform engineering | Project 6: Kubernetes, autoscaling, observability |
| Enterprise architecture | Project 7: Data mesh, contracts, governance |
| DevOps for data | Project 8: CI/CD, blue-green, test pyramid |

Each project is independently deployable. See individual READMEs for setup instructions.
