A production-ready predictive maintenance system for manufacturing equipment that predicts failures before they occur, targeting a 35-45% reduction in unplanned downtime and 25-30% lower maintenance costs.
```
[Factory Floor]
 ├─ IoT Sensors (vibration, temp, pressure, power)
 └─ Edge Device (data aggregation, initial filtering)
        ↓
[Kafka Topic: raw_sensor_data]
        ↓
[Stream Processor - Apache Flink / Python Consumer]
 ├─ Data validation
 ├─ Feature engineering (rolling stats, FFT)
 └─ Writes to → TimescaleDB (time-series) + S3 (raw backup)
        ↓
[ML Pipeline]
 ├─ Offline Training (daily/weekly): LSTM + Random Forest
 ├─ Model Registry (MLflow)
 └─ Batch feature computation
        ↓
[Real-time Inference Service - FastAPI]
 ├─ Loads latest model from MLflow
 ├─ Consumes from Kafka: processed_features
 └─ Predicts: RUL (Remaining Useful Life), Anomaly Score
        ↓
[Alert System]
 ├─ Threshold-based triggers (RUL < 48hrs)
 └─ Sends → Email / Slack / SMS
        ↓
[Dashboard - Streamlit/Grafana]
 └─ Real-time monitoring, historical trends, equipment health
```
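As a concrete illustration of the first hop in this flow, here is a minimal sketch of publishing one reading to the `raw_sensor_data` topic with kafka-python; the payload field names are assumptions for illustration, not the project's exact schema.

```python
import json
import time

from kafka import KafkaProducer  # kafka-python

# Illustrative payload; field names are assumptions, not the project's actual schema.
reading = {
    "equipment_id": "turbofan_engine_001",
    "timestamp": time.time(),
    "vibration": 0.42,      # mm/s RMS
    "temperature": 642.1,   # degrees C
    "pressure": 554.3,      # kPa
    "power": 38.7,          # kW
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw_sensor_data", value=reading)
producer.flush()
```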
| Component | Technology | Justification |
|---|---|---|
| Message Queue | Apache Kafka | Industry standard for IoT streaming; 100k+ events/sec; fault-tolerant |
| Stream Processing | Python (kafka-python) | Simpler than Flink for demo; production-ready; easier ML integration |
| Time-series DB | TimescaleDB | SQL + time-series optimizations; better than InfluxDB for relational context |
| Object Storage | MinIO (local S3) | Production S3 patterns without AWS costs |
| ML Framework | PyTorch (LSTM) + scikit-learn | LSTM for sequences; RF for baseline; interview-friendly |
| Model Registry | MLflow | Tracks experiments, versions models, serves via REST API |
| Inference API | FastAPI | Fast, async, auto-generated docs; production standard |
| Monitoring | Grafana + Streamlit | Grafana = ops teams; Streamlit = custom dashboards |
| Orchestration | Airflow | Schedules retraining, feature backfills |
| Containerization | Docker + Docker Compose | Local dev; Kubernetes for production-scale deployment |
- Module 1: Data Generator (`data_generator/`) - ✅ COMPLETED
  - Simulates sensor data with realistic degradation patterns
  - Publishes to Kafka `raw_sensor_data` topic
  - Supports turbofan engines, pumps, compressors
  - Injects failures: linear, exponential, step, oscillating patterns (see the sketch below)
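A minimal sketch of the kind of degradation curve such failure injection can produce; the function name, parameters, and noise level are illustrative assumptions, not the generator's actual API.

```python
import numpy as np

def degradation(cycles: int, pattern: str = "exponential") -> np.ndarray:
    """Return a 0..1 health-loss curve over `cycles` steps (illustrative, not the real API)."""
    t = np.linspace(0.0, 1.0, cycles)
    if pattern == "linear":
        curve = t
    elif pattern == "exponential":
        curve = (np.exp(3 * t) - 1) / (np.exp(3) - 1)
    elif pattern == "step":
        curve = np.where(t < 0.7, 0.1 * t, 0.8 + 0.2 * t)  # sudden jump at 70% of life
    elif pattern == "oscillating":
        curve = t + 0.05 * np.sin(20 * np.pi * t)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    noise = np.random.normal(0.0, 0.01, cycles)  # simulated sensor noise
    return np.clip(curve + noise, 0.0, 1.0)

# Example: drift a temperature channel toward failure over 200 cycles
temperature = 600 + 50 * degradation(200, "exponential")
```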
- Module 2: Kafka Infrastructure (`infra/kafka/`) - ✅ COMPLETED
  - Docker Compose with Kafka, Zookeeper, TimescaleDB, MinIO, Redis
  - 7 Kafka topics with retention/compression policies
  - Consumer group configurations
  - Health check and management scripts
  - TimescaleDB schema with hypertables and continuous aggregates
- Module 3: Stream Processor (`stream_processor/`) - ✅ COMPLETED
  - Real-time Kafka consumer with batch processing
  - Feature engineering pipeline (time-domain, frequency-domain; see the sketch below)
  - TimescaleDB writer with connection pooling
  - Configurable buffer sizes and write intervals
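A minimal sketch of the kind of time-domain and frequency-domain features such a pipeline computes per sensor window; the window size and feature names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def window_features(values: pd.Series, window: int = 64) -> dict:
    """Compute illustrative features over the most recent `window` samples."""
    recent = values.tail(window).to_numpy()

    # Time-domain: rolling statistics
    feats = {
        "mean": float(recent.mean()),
        "std": float(recent.std()),
        "rms": float(np.sqrt(np.mean(recent ** 2))),
        "peak_to_peak": float(recent.max() - recent.min()),
    }

    # Frequency-domain: dominant FFT component (ignoring the DC term)
    spectrum = np.abs(np.fft.rfft(recent - recent.mean()))
    feats["dominant_freq_bin"] = int(np.argmax(spectrum[1:]) + 1)
    feats["spectral_energy"] = float(np.sum(spectrum ** 2))
    return feats

# Example: a noisy vibration signal
vibration = pd.Series(np.sin(np.linspace(0, 40 * np.pi, 512)) + 0.1 * np.random.randn(512))
print(window_features(vibration))
```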
- Module 4: Feature Store (`feature_store/`) - ✅ COMPLETED
  - Time-series features (rolling statistics, lag features)
  - Frequency-domain features (FFT, spectral analysis)
  - Label generation for RUL prediction (see the sketch below)
  - Feature versioning and metadata tracking
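A minimal sketch of RUL label generation for run-to-failure data. The piecewise-linear cap of 125 cycles is a common C-MAPSS convention and an assumption here, not necessarily this project's setting.

```python
import pandas as pd

def add_rul_labels(df: pd.DataFrame, cap: int = 125) -> pd.DataFrame:
    """Label each row with RUL = last observed cycle of the unit - current cycle (illustrative)."""
    max_cycle = df.groupby("unit_id")["cycle"].transform("max")
    rul = (max_cycle - df["cycle"]).clip(upper=cap)  # cap gives a piecewise-linear target
    return df.assign(rul=rul)

# Example: two engines, run to failure at cycles 3 and 2
df = pd.DataFrame({"unit_id": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})
print(add_rul_labels(df))  # rul column: 2, 1, 0, 1, 0
```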
- Module 5: Training Pipeline (`ml_pipeline/train/`) - ✅ COMPLETED
  - LSTM model for sequence prediction (see the sketch below)
  - Random Forest baseline model
  - Hyperparameter tuning with Optuna
  - Cross-validation framework
  - MLflow experiment tracking
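A minimal sketch of an LSTM regressor for RUL prediction in PyTorch; layer sizes and the 30-cycle window are illustrative assumptions, not the project's exact architecture.

```python
import torch
import torch.nn as nn

class RULLSTMRegressor(nn.Module):
    """Map a window of sensor readings (batch, seq_len, n_features) to one RUL value (illustrative sizes)."""

    def __init__(self, n_features: int = 21, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # (batch, seq_len, hidden_size)
        return self.head(out[:, -1])   # use the last timestep -> (batch, 1)

# Example: batch of 8 windows, 30 cycles each, 21 sensors (as in C-MAPSS)
model = RULLSTMRegressor()
x = torch.randn(8, 30, 21)
pred = model(x)                                            # (8, 1) predicted RUL
loss = nn.MSELoss()(pred.squeeze(), torch.rand(8) * 125.0)  # dummy targets
loss.backward()
```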
- Module 6: Model Evaluation (`ml_pipeline/evaluate/`) - ✅ COMPLETED
  - Comprehensive metrics (MAE, RMSE, R², MAPE)
  - Model card generation
  - Evaluation visualizations (8 types)
  - Backtesting framework
- Module 7: Inference API (`inference_service/`) - ✅ COMPLETED
  - FastAPI service with async endpoints (see the sketch below)
  - Model manager with versioning
  - Real-time predictions (<50 ms latency)
  - Batch prediction support
  - Health monitoring endpoints
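A minimal sketch of what such a prediction endpoint can look like; the route names, request fields, and placeholder logic are illustrative assumptions, not the service's actual API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Predictive Maintenance Inference (sketch)")

# Illustrative request/response shapes; not the project's real schema.
class PredictionRequest(BaseModel):
    equipment_id: str
    features: list[float]  # one engineered feature vector

class PredictionResponse(BaseModel):
    equipment_id: str
    rul_cycles: float
    anomaly_score: float

@app.get("/health")
async def health() -> dict:
    return {"status": "ok"}

@app.post("/predict", response_model=PredictionResponse)
async def predict(req: PredictionRequest) -> PredictionResponse:
    # Placeholder logic; a real service would load and run the registered model here.
    rul = max(0.0, 125.0 - sum(abs(f) for f in req.features))
    return PredictionResponse(equipment_id=req.equipment_id, rul_cycles=rul, anomaly_score=0.1)

# Run with, e.g.: uvicorn sketch:app --port 8000  (module name is illustrative)
```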
- Module 8: Alert Engine (`alerting/`) - ✅ COMPLETED
  - Rule-based alerting system (11+ built-in rules; see the sketch below)
  - Multi-channel notifications (email, Slack, webhook, database)
  - Alert suppression and aggregation
  - Configurable thresholds and priorities
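A minimal sketch of a threshold rule in the spirit of the "RUL < 48 hrs" trigger from the architecture diagram; the class and field names are illustrative assumptions, not the alert engine's actual interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThresholdRule:
    """Fire an alert when a predicted metric drops below a threshold (illustrative rule shape)."""
    name: str
    metric: str
    threshold: float
    severity: str = "critical"

    def evaluate(self, prediction: dict) -> Optional[dict]:
        value = prediction.get(self.metric)
        if value is not None and value < self.threshold:
            return {
                "rule": self.name,
                "severity": self.severity,
                "equipment_id": prediction.get("equipment_id"),
                "message": f"{self.metric}={value:.1f} below threshold {self.threshold}",
            }
        return None

rul_rule = ThresholdRule(name="rul_low", metric="rul_hours", threshold=48.0)
alert = rul_rule.evaluate({"equipment_id": "pump_007", "rul_hours": 36.5})
print(alert)  # a real engine would route this to email/Slack/webhook channels
```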
- Module 9: Dashboard (`dashboard/`) - ✅ COMPLETED
  - Streamlit interactive dashboard
  - Grafana monitoring dashboards
  - Real-time equipment health visualization
  - Historical trend analysis
  - Alert management UI
- Module 10: Retraining Pipeline (`ml_pipeline/retrain/`) - ✅ COMPLETED
  - Automated drift detection (data & concept drift; see the sketch below)
  - Model comparison framework
  - Safe deployment with rollback
  - Scheduled retraining workflow
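A minimal sketch of data-drift detection using a per-feature two-sample Kolmogorov-Smirnov test; using scipy this way is an assumption for illustration, and the project's actual drift method may differ.

```python
import numpy as np
from scipy import stats

def detect_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> dict:
    """Flag features whose recent distribution differs from the reference window (columns = features)."""
    drifted = {}
    for col in range(reference.shape[1]):
        stat, p_value = stats.ks_2samp(reference[:, col], current[:, col])
        if p_value < alpha:
            drifted[col] = {"ks_stat": float(stat), "p_value": float(p_value)}
    return drifted

# Example: feature 0 keeps its distribution, feature 1 shifts upward
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(1000, 2))
cur = np.column_stack([rng.normal(0.0, 1.0, 1000), rng.normal(0.5, 1.0, 1000)])
print(detect_drift(ref, cur))  # drift reported for feature 1 -> candidate trigger for retraining
```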
- Python 3.9+
- Docker & Docker Compose
- 8GB RAM minimum (16GB recommended)
- Clone the repository:

  ```bash
  cd predictive-maintenance
  ```

- Start infrastructure services:

  ```bash
  cd infra/kafka
  ./scripts/start-infra.sh   # Linux/Mac
  # or
  .\scripts\start-infra.bat  # Windows

  # This starts: Kafka, Zookeeper, TimescaleDB, MinIO, Redis
  ```

- Run the Data Streamer (C-MAPSS real data):

  ```bash
  cd data_loader
  pip install -r requirements.txt

  # Stream NASA C-MAPSS turbofan engine data
  python kafka_streamer.py --dataset FD001 --train --rate 1.0

  # Or stream a single engine for testing
  python kafka_streamer.py --dataset FD001 --train --engine 1 --rate 10.0
  ```

- Or run the Synthetic Data Generator:

  ```bash
  cd data_generator
  pip install -r requirements.txt
  python main.py --num-equipment 10 --equipment-type turbofan_engine
  ```

- Run the Stream Processor:

  ```bash
  cd stream_processor
  pip install -r requirements.txt
  python main.py
  ```

- Train models (on C-MAPSS data):

  ```bash
  cd ml_pipeline/train
  pip install -r requirements.txt

  # Train on the NASA C-MAPSS dataset
  python train_pipeline.py --config config/training_config.yaml --dataset cmapss

  # Evaluate with the NASA PHM08 scoring function
  cd ../evaluate
  python evaluator.py --model-path ../../models/latest_model.pkl --dataset cmapss
  ```

- Start the Inference API:

  ```bash
  cd inference_service
  pip install -r requirements.txt
  uvicorn api.main:app --host 0.0.0.0 --port 8000
  ```

- Start the Dashboard:

  ```bash
  cd dashboard/streamlit_app
  pip install -r requirements.txt
  streamlit run app.py
  ```

Run without Kafka/TimescaleDB:

```bash
cd data_generator
python main.py --mock --num-equipment 5 --equipment-type turbofan_engine
```

- Kafka UI (Kafdrop): http://localhost:9000
- MinIO Console: http://localhost:9001 (minioadmin/minioadmin)
- TimescaleDB: localhost:5432 (pmuser/pmpassword)
- Inference API: http://localhost:8000/docs
- Streamlit Dashboard: http://localhost:8501
- Grafana: http://localhost:3000 (admin/admin)
The project now supports the NASA C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) Turbofan Engine Degradation Dataset:
- 100 training engines (run-to-failure trajectories)
- 100 test engines (with true RUL labels)
- 21 sensor measurements per cycle
- 3 operational settings (altitude, throttle, Mach number)
Dataset Location: `archive/CMaps/` (included in repository)

Files:
- `train_FD001.txt` - Training data (complete degradation)
- `test_FD001.txt` - Test data (stops before failure)
- `RUL_FD001.txt` - True remaining useful life for test engines

Usage:

```python
from data_loader import CMAPSSLoader

loader = CMAPSSLoader(dataset_path="archive/CMaps", dataset_id="FD001")
train_df = loader.load_train_data()  # With RUL labels
test_df = loader.load_test_data()    # With true RUL
```

See data_loader/README.md for complete documentation.
The system also includes a synthetic data generator for additional testing:
- Configurable equipment types: Turbofan engines, pumps, compressors
- Failure patterns: Linear, exponential, step, oscillating degradation
- Realistic sensor noise and operating conditions
- NASA C-MAPSS: Turbofan engine degradation dataset (included in `/archive/CMaps/`)
  - FD001: 100 train + 100 test engines, 1 operating condition, 1 fault mode
  - 21 sensor measurements per operational cycle
  - Run-to-failure trajectories with true RUL labels
- Synthetic Data: Custom generated sensor data with realistic failure patterns
- Temperature: Inlet, outlet, various stages
- Pressure: Fan inlet, HPC outlet, bypass duct
- Vibration: Bearing, shaft imbalance
- Flow: Air flow, fuel flow
- Power: Motor current, consumption
- ✅ Real-time streaming with Kafka (100k+ events/sec)
- ✅ Time-series forecasting with LSTM networks
- ✅ Anomaly detection using Isolation Forest & Autoencoders
- ✅ Survival analysis for RUL prediction
- ✅ Feature engineering (FFT, rolling statistics, frequency domain)
- ✅ Model versioning with MLflow
- ✅ REST API with FastAPI for real-time inference
- ✅ Production monitoring with Grafana dashboards
- Cost Reduction: 25-30% maintenance cost savings
- Downtime Prevention: 35-45% reduction in unplanned downtime
- Equipment Lifespan: 20% extension through optimized maintenance
- Industry Scale: addresses an industry problem estimated at $50B annually
- Precision: Target >90% (minimize false alarms)
- Recall: Target >85% (catch real failures)
- F1-Score: Balanced metric
- MAE: Mean Absolute Error (cycles)
- RMSE: Root Mean Square Error
- Early/Late Predictions: Tracking lead time accuracy
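Since model evaluation references the NASA PHM08 scoring function (see the training step in Quick Start), here is a minimal sketch of that asymmetric score. It penalizes late predictions (overestimated RUL) more heavily than early ones, which is why tracking early/late lead time matters.

```python
import numpy as np

def phm08_score(rul_true: np.ndarray, rul_pred: np.ndarray) -> float:
    """NASA PHM08 score: for d = predicted - true RUL, late predictions (d >= 0) cost
    exp(d/10) - 1 and early predictions (d < 0) cost exp(-d/13) - 1. Lower is better."""
    d = np.asarray(rul_pred, dtype=float) - np.asarray(rul_true, dtype=float)
    penalties = np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)
    return float(np.sum(penalties))

# Example: predicting 10 cycles late hurts more than predicting 10 cycles early
print(phm08_score(np.array([50.0]), np.array([60.0])))  # ~1.72 (late)
print(phm08_score(np.array([50.0]), np.array([40.0])))  # ~1.16 (early)
```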
- Phase 1: Data Foundation
  - Data Generator with failure injection
  - Kafka infrastructure setup (Kafka, TimescaleDB, MinIO, Redis)
  - Stream processor with feature engineering
- Phase 2: ML Pipeline
  - Feature store implementation
  - LSTM + Random Forest training pipeline
  - Model evaluation & backtesting framework
- Phase 3: Inference & Monitoring
  - FastAPI inference service
  - Alert engine with multi-channel notifications
  - Streamlit + Grafana dashboards
- Phase 4: MLOps & Automation
  - Automated retraining with drift detection
  - Model comparison and safe deployment
  - Rollback capabilities
All 10 modules completed! 🎉
- Data Generator: data_generator/README.md
- Stream Processor: stream_processor/README.md
- Feature Store: feature_store/README.md
- Training Pipeline: ml_pipeline/train/README.md
- Model Evaluation: ml_pipeline/evaluate/README.md
- Retraining Pipeline: ml_pipeline/retrain/README.md
- Inference API: inference_service/README.md
- Alert Engine: alerting/README.md
- Dashboard: dashboard/README.md
- Kafka Setup: infra/kafka/README.md
- Architecture Overview (this file)
- API Documentation: http://localhost:8000/docs (when running)
- Deployment Guide: See individual module READMEs
This is a portfolio project demonstrating production ML systems. Each module is designed to be:
- Interview-friendly: Clear code, well-documented
- Production-ready: Error handling, logging, monitoring
- Scalable: Designed for distributed deployment
Educational/Portfolio Project - 2026
- Saxena, A., et al. "Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation" (PHM08)
- Kafka Streams Documentation
- LSTM for Time Series Forecasting
- Survival Analysis in Predictive Maintenance
Built for demonstrating end-to-end ML systems engineering skills