A complete, locally runnable dbt project that demonstrates advanced SQL skills and modern data engineering best practices.

This project simulates a typical "Modern Data Stack" workflow for e-commerce/SaaS analytics: Raw Data → Staging → Intermediate → Marts

It shows hands-on data processing with:
- Data cleaning and standardization
- Complex SQL transformations (CTEs, window functions, aggregations)
- Data quality testing
- Dimensional modeling (facts & dimensions)
├── Raw Layer (Seeds)
│ ├── raw_customers.csv
│ ├── raw_orders.csv
│ └── raw_payments.csv
│
├── Staging Layer (Views)
│ ├── stg_customers → cleaning, standardization
│ ├── stg_orders → date formats, status normalization
│ └── stg_payments → payment method standardization
│
├── Intermediate Layer (Views)
│ └── int_customer_cohorts → Window Functions, Cohort Analysis
│
└── Marts Layer (Tables)
├── fct_orders → fact table for analytics
└── dim_customers → customer dimensions with metrics
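To give a flavor of how the layers connect, below is a minimal, hypothetical mart-style model that pulls from the staging layer via dbt's `ref()`. The column names and the join are illustrative assumptions, not the project's actual `fct_orders` logic.

```sql
-- Hypothetical sketch only; the real fct_orders model contains more business logic.
with orders as (
    select * from {{ ref('stg_orders') }}      -- cleaned orders from the staging layer
),

payments as (
    select * from {{ ref('stg_payments') }}    -- standardized payments
)

select
    orders.order_id,
    orders.customer_id,
    orders.order_date,
    sum(payments.amount) as total_paid         -- assumed column name
from orders
left join payments
    on orders.order_id = payments.order_id
group by 1, 2, 3
```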
SQL techniques demonstrated in the models:
- `CASE WHEN` for data cleaning
- String manipulation (`TRIM`, `SUBSTR`)
- Date format standardization
- NULL handling and default values
- Window functions (see the combined sketch after this list):
  - `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)`
  - `PERCENT_RANK() OVER (...)`
  - `AVG() OVER (PARTITION BY ...)`
  - `FIRST_VALUE() OVER (...)`
- Date/time functions:
  - `DATE_TRUNC('month', ...)`
  - `DATE_DIFF('day', ...)`
  - `EXTRACT(year FROM ...)`
- CTEs for structured logic
- Complex aggregations (`SUM`, `AVG`, `COUNT`, `MODE`)
- Multi-step JOINs
- Computed metrics (LTV, AOV, recency)
- Customer segmentation (RFM-inspired)
- Business logic (revenue recognition)
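As an illustration of how several of these functions work together, here is a minimal cohort-style sketch (DuckDB syntax); table and column names are assumptions and may differ from the actual `int_customer_cohorts` model.

```sql
-- Hypothetical sketch: derive a signup cohort and per-customer order metrics.
with orders as (
    select customer_id, order_id, order_date, order_total   -- assumed columns
    from stg_orders
)

select
    customer_id,
    order_id,
    date_trunc('month', order_date) as order_month,
    -- month of the customer's first order = cohort month
    first_value(date_trunc('month', order_date)) over (
        partition by customer_id
        order by order_date
    ) as cohort_month,
    -- sequence number of each order per customer
    row_number() over (
        partition by customer_id
        order by order_date
    ) as order_number,
    -- average order value per customer, attached to every row
    avg(order_total) over (partition by customer_id) as avg_order_value
from orders
```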
# Python 3.8+
python --version
# upgrade pip and install dbt-duckdb
pip install --upgrade pip
pip install dbt-duckdb
# in the project folder
python generate_data.py

This creates 3 CSV files in the seeds/ folder with intentional data errors:

- raw_customers.csv (500 customers)
- raw_orders.csv (~1500 orders)
- raw_payments.csv (~1400 payments)
Built-in data errors (see the cleaning sketch after this list):
- NULL values in various fields
- Inconsistent date formats (YYYY-MM-DD vs DD.MM.YYYY)
- Negative amounts
- Missing payments for some orders
- Payment/order amount discrepancies
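For illustration, here is a minimal staging-style cleaning sketch (DuckDB syntax) showing how such errors can be handled; the column names and the exact rules are assumptions, not the project's actual `stg_orders` logic.

```sql
-- Hypothetical sketch: normalize mixed date formats and guard against bad amounts.
select
    order_id,
    customer_id,
    case
        when order_date like '____-__-__' then cast(order_date as date)                        -- YYYY-MM-DD
        when order_date like '__.__.____' then cast(strptime(order_date, '%d.%m.%Y') as date)  -- DD.MM.YYYY
        else null                                                                              -- unknown format
    end as order_date,
    case
        when amount is null then 0        -- assumed default for missing amounts
        when amount < 0 then null         -- treat negative amounts as invalid
        else amount
    end as amount
from raw_orders
```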
The project uses profiles.yml in the project folder (local DuckDB):
ecommerce_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'data/ecommerce_analytics.duckdb'

# 1. Load seeds (raw data)
dbt seed --profiles-dir .
# 2. Build all models
dbt run --profiles-dir .
# 3. Run tests
dbt test --profiles-dir .
# 4. Generate documentation
dbt docs generate --profiles-dir .
dbt docs serve --profiles-dir .

Example queries:

-- Top 10 customers by lifetime value
SELECT
customer_id,
full_name,
total_lifetime_value,
total_orders,
customer_segment,
value_segment
FROM marts.dim_customers
ORDER BY total_lifetime_value DESC
LIMIT 10;

-- Cohort overview: cohort size, average order value, active customers
SELECT
cohort_month,
cohort_size,
AVG(total_order_value) as avg_cohort_value,
COUNT(CASE WHEN customer_segment != 'Never Purchased' THEN 1 END) as active_customers
FROM intermediate.int_customer_cohorts
GROUP BY cohort_month, cohort_size
ORDER BY cohort_month;

-- High-value customers at risk of churning
SELECT
customer_id,
full_name,
total_lifetime_value,
days_since_last_order,
recency_segment
FROM marts.dim_customers
WHERE recency_segment IN ('At Risk (90-180 days)', 'Churned (180+ days)')
AND total_lifetime_value > 500
ORDER BY total_lifetime_value DESC;

-- Monthly revenue and order metrics (successful payments only)
SELECT
order_date_month,
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(*) as total_orders,
SUM(recognized_revenue) as total_revenue,
AVG(recognized_revenue) as avg_order_value
FROM marts.fct_orders
WHERE has_successful_payment = true
GROUP BY order_date_month
ORDER BY order_date_month;

-- Revenue and payment discrepancies by payment method
SELECT
payment_method,
COUNT(*) as total_orders,
SUM(recognized_revenue) as total_revenue,
AVG(recognized_revenue) as avg_revenue,
COUNT(CASE WHEN has_payment_discrepancy THEN 1 END) as discrepancies
FROM marts.fct_orders
WHERE payment_method IS NOT NULL
GROUP BY payment_method
ORDER BY total_revenue DESC;

-- Customer journey: signup, first order, and lifetime value
SELECT
c.customer_id,
c.full_name,
c.signup_date,
c.first_order_date,
c.days_to_first_order,
c.total_orders,
c.total_lifetime_value,
c.avg_orders_per_month,
c.customer_segment
FROM marts.dim_customers c
WHERE c.has_purchased = true
ORDER BY c.total_lifetime_value DESC
LIMIT 20;

The project includes extensive dbt tests:
- Unique/not null tests: primary key validation
- Relationships tests: foreign key integrity
- Accepted values tests: status and category validation (see the schema.yml sketch after this list)
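For illustration, a minimal `schema.yml` sketch of how such tests are typically declared in dbt; the model, columns, and accepted values shown here are assumptions, not the project's actual test definitions.

```yaml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique          # primary key must be unique
          - not_null        # and present on every row
      - name: customer_id
        tests:
          - relationships:  # foreign key integrity against the customers model
              to: ref('stg_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['completed', 'pending', 'cancelled']   # values assumed for illustration
```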
# Run all tests
dbt test --profiles-dir .
# Staging tests only
dbt test --select staging --profiles-dir .
# Test a single model
dbt test --select dim_customers --profiles-dir .

Project structure:

dbt_portfolio_project/
├── README.md # this file
├── generate_data.py # Python script for data generation
├── dbt_project.yml # dbt project configuration
├── profiles.yml # DuckDB connection
│
├── seeds/ # Raw Data (CSV)
│ ├── raw_customers.csv
│ ├── raw_orders.csv
│ └── raw_payments.csv
│
├── models/
│ ├── staging/ # data cleaning
│ │ ├── schema.yml
│ │ ├── stg_customers.sql
│ │ ├── stg_orders.sql
│ │ └── stg_payments.sql
│ │
│ ├── intermediate/ # advanced transformations
│ │ ├── schema.yml
│ │ └── int_customer_cohorts.sql
│ │
│ └── marts/ # Analytics-ready Tables
│ ├── schema.yml
│ ├── fct_orders.sql
│ └── dim_customers.sql
│
└── data/ # DuckDB file (generated)
└── ecommerce_analytics.duckdb
This project demonstrates:
- Modern data stack knowledge: dbt, SQL, dimensional modeling
- SQL expertise: window functions, CTEs, complex aggregations
- Data quality: testing, validation, error handling
- Business understanding: e-commerce metrics (LTV, AOV, RFM)
- Best practices: modular code structure, documentation, reproducibility
Possible extensions:
- Visualization: connect a BI tool (Metabase, Tableau, Power BI)
- Macros: write reusable dbt macros
- Incremental models: optimize for large data volumes (see the sketch after this list)
- Snapshots: track historical data (SCD Type 2)
- Advanced analytics: churn prediction, customer clustering
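As a pointer for the incremental-models extension, here is a minimal sketch of dbt's standard incremental pattern; the selected columns and the `order_date` filter are assumptions.

```sql
-- Hypothetical incremental model: only new rows are processed on subsequent runs.
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    customer_id,
    order_date,
    order_total            -- assumed column
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already in the target table
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```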
This project was created for portfolio and learning purposes.

Built with: dbt + DuckDB + Python
Author: Merlin Mechler
Date: 2025