Project Nova is a comprehensive credit scoring system designed specifically for gig economy workers, providing fair and transparent risk assessments without relying on traditional credit history. The system uses advanced unsupervised machine learning techniques combined with a multi-stage fairness framework to ensure equitable treatment across demographic groups.
- Unsupervised Learning Pipeline: Uses PCA dimensionality reduction and K-Means clustering for risk assessment
- Comprehensive Fairness Framework: Multi-stage bias detection and mitigation across protected attributes
- Behavioral Pattern Analysis: Leverages platform engagement, financial stability, and reliability metrics
- Interactive Web Application: Streamlit-based interface for easy data upload and analysis
- Production-Ready Architecture: Modular design with comprehensive validation and monitoring
```
grabathon/
├── main_notebook_restructured.ipynb  # Complete analysis notebook
├── streamlit_app_refactored.py       # Main Streamlit application
├── nova_unlabeled.csv                # Demo dataset (50,000 records)
├── requirements.txt                  # Python dependencies
└── src/                              # Source code modules
    ├── __init__.py
    ├── pipeline.py                   # Data processing pipeline
    ├── feature_engineering.py        # Feature creation and selection
    ├── modeling.py                   # Unsupervised ML models
    ├── fairness.py                   # Bias detection and mitigation
    ├── feature_importance.py         # Model interpretability
    ├── visualizations.py             # Plotting functions
    ├── streamlit_utils.py            # UI utilities
    ├── qa_checks.py                  # Quality assurance
    └── utils.py                      # General utilities
```
- Python 3.8+
- pip or conda package manager
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd grabathon
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Run the Streamlit application

   ```bash
   streamlit run streamlit_app_refactored.py
   ```

4. Access the application
   - Open your browser to http://localhost:8501
   - The application will load with a demo dataset of 1,000 randomly selected records
The Streamlit application provides a 6-step pipeline:
- Demo Dataset: Load 1,000 random records from the full dataset
- Custom Upload: Upload your own CSV/Excel files
- Data Validation: Automatic schema validation and quality checks
- Data Overview: Comprehensive statistics and data quality metrics
- Constraint Validation: Check for logical inconsistencies
- Missing Value Analysis: Identify and report data completeness
- Visualization: Correlation heatmaps and distribution plots
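As an illustration of these checks, here is a minimal validation sketch. The column names and thresholds are assumptions for the demo schema, not the actual `pipeline.py` API:

```python
from typing import List

import pandas as pd

# Hypothetical schema rules (illustrative only)
REQUIRED_COLUMNS = {"partner_id", "age", "weekly_income", "trips_completed_weekly"}
RANGES = {"age": (18, 100)}

def validate_dataframe(df: pd.DataFrame) -> List[str]:
    """Return a list of human-readable validation issues (empty means clean)."""
    issues = []
    # Schema check: every required column must be present
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"Missing columns: {sorted(missing)}")
    # Constraint check: values must fall inside their expected ranges
    for col, (lo, hi) in RANGES.items():
        if col in df.columns:
            n_bad = int(((df[col] < lo) | (df[col] > hi)).sum())
            if n_bad:
                issues.append(f"{col}: {n_bad} values outside [{lo}, {hi}]")
    # Completeness check, mirroring the >90% non-null requirement
    for col in df.columns:
        frac_missing = df[col].isna().mean()
        if frac_missing > 0.10:
            issues.append(f"{col}: {frac_missing:.0%} missing")
    return issues
```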
- Financial Health Features: Income stability, wallet balance patterns
- Platform Activity Features: Trip completion rates, surge acceptance
- Tenure & Experience Features: Registration duration, activity consistency
- Risk Indicators: Payment flags, low balance frequency
- Advanced Features: Trend analysis, volatility metrics
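A sketch of how a few of these features could be derived from the raw columns. The column names follow the demo dataset schema; the formulas and the low-balance threshold are illustrative, not the exact `feature_engineering.py` logic:

```python
import numpy as np
import pandas as pd

def add_behavioral_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a few example financial-health, risk, and activity features."""
    out = df.copy()
    balance_cols = [f"wallet_balance_d{i}" for i in range(1, 8)]
    balances = out[balance_cols]
    # Financial health: mean daily balance and its relative volatility
    out["balance_mean"] = balances.mean(axis=1)
    out["balance_volatility"] = balances.std(axis=1) / (balances.mean(axis=1).abs() + 1e-9)
    # Risk indicator: share of days the wallet was below a low-balance threshold
    out["low_balance_freq"] = (balances < 10.0).mean(axis=1)
    # Platform activity: surge acceptance rate (0 when no surge trips were offered)
    offered = out["surge_trips_offered"]
    rate = out["surge_trips_accepted"] / offered.replace(0, np.nan)
    out["surge_accept_rate"] = rate.fillna(0.0)
    return out
```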
- PCA Transformation: Dimensionality reduction (95% variance retention)
- K-Means Clustering: Optimal cluster discovery using multiple metrics
- Isolation Forest: Anomaly detection for risk assessment
- Ensemble Risk Calculation: Weighted combination of risk probabilities
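The modeling stage above can be sketched roughly as follows. The 60/40 blend weights and the min-max normalisation are assumptions for illustration, not the exact `modeling.py` implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def ensemble_risk(X, n_clusters=4, contamination=0.10, w_cluster=0.6, w_anomaly=0.4):
    """PCA at 95% variance, K-Means distance risk, Isolation Forest anomaly
    risk, blended into one risk probability per record."""
    X_scaled = StandardScaler().fit_transform(X)
    X_pca = PCA(n_components=0.95).fit_transform(X_scaled)  # retain 95% variance
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_pca)
    # Cluster risk: distance to the nearest centroid, min-max normalised
    dist = km.transform(X_pca).min(axis=1)
    cluster_risk = (dist - dist.min()) / (np.ptp(dist) + 1e-9)
    # Anomaly risk: score_samples is higher for inliers, so negate it
    iso = IsolationForest(contamination=contamination, random_state=42).fit(X_pca)
    anomaly = -iso.score_samples(X_pca)
    anomaly_risk = (anomaly - anomaly.min()) / (np.ptp(anomaly) + 1e-9)
    return w_cluster * cluster_risk + w_anomaly * anomaly_risk
```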
- Demographic Parity: Equal treatment across protected groups
- Statistical Parity: Fair prediction rates
- Bias Detection: Comprehensive statistical testing
- Mitigation Framework: Multi-stage bias correction
- NovaScore Generation: 300-850 scale credit scores
- Risk Categorization: High/Medium/Low/Very Low Risk
- Feature Importance: Model interpretability analysis
- Download Results: Export complete dataset with scores
```python
from src.pipeline import load_and_validate_data
from src.feature_engineering import engineer_features, prepare_modeling_data
from src.modeling import (
    apply_pca_transformation,
    train_unsupervised_models,
    calculate_risk_probabilities,
    calculate_novascore,
)
from src.fairness import assess_fairness_comprehensive, apply_bias_mitigation_comprehensive

# Load and validate data
df = load_and_validate_data(use_demo=True)

# Engineer features
df_engineered = engineer_features(df)
X_scaled_df, demographic_info, scaler = prepare_modeling_data(df_engineered)

# Apply PCA transformation
X_pca_df, pca = apply_pca_transformation(X_scaled_df)

# Train models and calculate scores
models = train_unsupervised_models(X_pca_df, demographic_info)
risk_probs = calculate_risk_probabilities(X_pca_df, models)
novascore_results = calculate_novascore(risk_probs)

# Assess fairness
fairness_results, overall_fair = assess_fairness_comprehensive(novascore_results, demographic_info)

# Apply bias mitigation if needed
if not overall_fair:
    final_results = apply_bias_mitigation_comprehensive(
        X_scaled_df, demographic_info, novascore_results, fairness_results, overall_fair
    )
```

- Feature Engineering: Create 20+ engineered features from raw data
- Dimensionality Reduction: PCA with 95% variance retention
- Unsupervised Clustering: K-Means with optimal cluster selection
- Anomaly Detection: Isolation Forest for outlier identification
- Risk Probability: Weighted ensemble of clustering and anomaly scores
- Score Conversion: Linear transformation to 300-850 scale
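The linear conversion from risk probability to the 300-850 scale can be sketched as below; the risk-band thresholds are assumed for illustration:

```python
def risk_to_novascore(risk_prob: float) -> int:
    """Map a risk probability in [0, 1] to the 300-850 credit-score scale.
    Higher risk means a lower score: score = 850 - risk * (850 - 300)."""
    risk_prob = min(max(risk_prob, 0.0), 1.0)  # clamp to [0, 1]
    return round(850 - risk_prob * (850 - 300))

def risk_category(score: int) -> str:
    """Illustrative risk banding; the real cutoffs are configuration choices."""
    if score >= 750:
        return "Very Low Risk"
    if score >= 650:
        return "Low Risk"
    if score >= 500:
        return "Medium Risk"
    return "High Risk"
```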
- Disparate Impact Removal: Adjust features to reduce correlation with protected attributes
- Data Balancing: Ensure representative samples across demographic groups
- Fairness-Constrained Clustering: Penalize unfair cluster assignments
- Protected Attribute Awareness: Include fairness metrics in model optimization
- Score Calibration: Adjust final scores to achieve demographic parity
- Threshold Optimization: Set decision boundaries that minimize bias
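One simple form the score-calibration stage could take is a group mean-shift; this is a hypothetical sketch, not the project's actual mitigation code, and a production calibrator would need to preserve rank order and validate the adjustment:

```python
import pandas as pd

def calibrate_scores(scores: pd.Series, groups: pd.Series) -> pd.Series:
    """Shift each group's scores so every group mean equals the overall mean
    (one simple move toward demographic parity), then clip to 300-850."""
    overall_mean = scores.mean()
    adjusted = scores.astype(float).copy()
    for _, idx in scores.groupby(groups).groups.items():
        adjusted.loc[idx] += overall_mean - scores.loc[idx].mean()
    return adjusted.clip(300, 850)
```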
- Financial Health (3 features): Income stability, wallet balance patterns
- Platform Activity (5 features): Trip completion, surge acceptance, reliability
- Tenure & Experience (4 features): Registration duration, activity consistency
- Risk Indicators (4 features): Payment flags, financial stress indicators
- Customer Satisfaction (3 features): Rating consistency, service quality
- Silhouette Score: 0.4+ (good cluster separation)
- Davies-Bouldin Score: <1.0 (compact clusters)
- PCA Variance Retention: 95% (minimal information loss)
- Anomaly Detection Rate: 10% (configurable contamination)
- Score Range: 300-850 (industry standard)
- Risk Distribution: Balanced across categories
- Feature Interpretability: Clear business meaning
- Processing Speed: <30 seconds for 1,000 records
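The two cluster-quality metrics above come straight from scikit-learn; this snippet uses synthetic, well-separated data purely to show how the thresholds are read:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic well-separated clusters, for demonstration only
X, _ = make_blobs(
    n_samples=500, centers=[[0, 0], [6, 6], [0, 6], [6, 0]],
    cluster_std=0.5, random_state=42,
)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
sil = silhouette_score(X, labels)      # target per the list above: 0.4+
dbi = davies_bouldin_score(X, labels)  # target: below 1.0
```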
- Demographic Parity: four-fifths (80%) rule compliance
- Statistical Parity: <10% difference in prediction rates
- Distribution Fairness: No significant group differences
- Bias Mitigation: Multi-stage correction framework
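The two parity checks can be expressed compactly; this is a sketch, and `fairness.py`'s actual statistical tests are more comprehensive:

```python
def passes_four_fifths_rule(approval_rates: dict) -> bool:
    """Demographic parity via the four-fifths (80%) rule: the lowest group's
    positive-prediction rate must be at least 80% of the highest group's."""
    rates = list(approval_rates.values())
    return min(rates) >= 0.8 * max(rates)

def statistical_parity_gap(approval_rates: dict) -> float:
    """Absolute gap between the highest and lowest group rates
    (flagged above when it exceeds 10 percentage points)."""
    rates = list(approval_rates.values())
    return max(rates) - min(rates)
```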
- `partner_id`: Unique identifier
- `age`: Partner age (18-100)
- `gender`: Male/Female/Other
- `region`: Geographic region
- `education_level`: Education attainment
- `partner_type`: Driver/Merchant/Other
- `registration_date`: Platform join date
- `trips_completed_weekly`: Weekly trip completion count
- `customer_rating_avg_all_time`: Average customer rating
- `weekly_income`: Weekly earnings
- `wallet_balance_d1` to `wallet_balance_d7`: Daily wallet balances
- `surge_trips_offered`/`surge_trips_accepted`: Surge trip metrics
- `upsell_attempts`/`upsell_successes`: Business performance metrics
- Completeness: >90% non-null values
- Consistency: No logical contradictions
- Validity: Values within expected ranges
- Freshness: Data within 30 days
- Data Pipeline (`pipeline.py`)
  - Data loading and validation
  - Schema checking and quality assessment
  - Comprehensive error handling
- Feature Engineering (`feature_engineering.py`)
  - Automated feature creation
  - Feature selection and scaling
  - Missing value imputation
- Modeling Engine (`modeling.py`)
  - PCA transformation
  - K-Means clustering
  - Isolation Forest anomaly detection
  - NovaScore calculation
- Fairness Framework (`fairness.py`)
  - Bias detection algorithms
  - Multi-stage mitigation
  - Statistical testing
- Visualization Suite (`visualizations.py`)
  - Interactive plots with Plotly
  - Statistical visualizations
  - Business intelligence charts
```
streamlit>=1.28.0    # Web application framework
pandas>=1.5.0        # Data manipulation
numpy>=1.24.0        # Numerical computing
matplotlib>=3.6.0    # Static plotting
seaborn>=0.12.0      # Statistical visualization
plotly>=5.15.0       # Interactive plotting
scikit-learn>=1.3.0  # Machine learning
scipy>=1.10.0        # Scientific computing
```
```bash
# Install dependencies
pip install -r requirements.txt

# Run application
streamlit run streamlit_app_refactored.py
```

- Container Setup: Use Docker for consistent environments
- Web Server: Deploy with nginx + gunicorn
- Database Integration: Connect to PostgreSQL/MySQL
- Monitoring: Implement logging and performance metrics
- Scaling: Use Kubernetes for horizontal scaling
```python
# Example API endpoint (illustrative sketch; app, PartnerData, and
# novascore_pipeline are assumed to be defined elsewhere)
@app.post("/calculate-novascore")
async def calculate_novascore(data: PartnerData):
    # Process data through pipeline
    scores, categories = novascore_pipeline(data)
    return {"novascore": scores, "risk_category": categories}
```

The included nova_unlabeled.csv contains 50,000 simulated partner records with:
- Demographics: Age, gender, region, education
- Platform Activity: Trips, ratings, income patterns
- Financial Data: Wallet balances, cashouts, payment history
- Behavioral Patterns: Surge acceptance, reliability metrics
- Data Encryption: All data encrypted in transit and at rest
- Access Control: Role-based permissions
- Audit Logging: Complete activity tracking
- GDPR Compliance: Data anonymization and right to deletion
- Model Explainability: Transparent decision-making process
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, issues, or contributions:
- Issues: Use GitHub Issues for bug reports
- Discussions: Use GitHub Discussions for questions
- Documentation: Check the notebook for detailed methodology
- Real-time Scoring: API for live score calculation
- Model Monitoring: Drift detection and retraining
- Advanced Fairness: Intersectional bias analysis
- Multi-platform Support: Extend to other gig economy platforms
- Mobile App: Native mobile interface
- A/B Testing: Framework for model comparison
Project Nova - Enabling fair credit access for the gig economy workforce through transparent, unbiased machine learning.