Krish8955/Grabathon

Project Nova: Equitable Credit Scoring for Gig Economy Workers

🚀 Overview

Project Nova is a comprehensive credit scoring system designed specifically for gig economy workers, providing fair and transparent risk assessments without relying on traditional credit history. The system uses advanced unsupervised machine learning techniques combined with a multi-stage fairness framework to ensure equitable treatment across demographic groups.

🎯 Key Features

  • Unsupervised Learning Pipeline: Uses PCA dimensionality reduction and K-Means clustering for risk assessment
  • Comprehensive Fairness Framework: Multi-stage bias detection and mitigation across protected attributes
  • Behavioral Pattern Analysis: Leverages platform engagement, financial stability, and reliability metrics
  • Interactive Web Application: Streamlit-based interface for easy data upload and analysis
  • Production-Ready Architecture: Modular design with comprehensive validation and monitoring

📊 Project Structure

grabathon/
├── main_notebook_restructured.ipynb    # Complete analysis notebook
├── streamlit_app_refactored.py         # Main Streamlit application
├── nova_unlabeled.csv                  # Demo dataset (50,000 records)
├── requirements.txt                    # Python dependencies
└── src/                               # Source code modules
    ├── __init__.py
    ├── pipeline.py                     # Data processing pipeline
    ├── feature_engineering.py          # Feature creation and selection
    ├── modeling.py                     # Unsupervised ML models
    ├── fairness.py                     # Bias detection and mitigation
    ├── feature_importance.py           # Model interpretability
    ├── visualizations.py               # Plotting functions
    ├── streamlit_utils.py              # UI utilities
    ├── qa_checks.py                    # Quality assurance
    └── utils.py                        # General utilities

🔧 Installation & Setup

Prerequisites

  • Python 3.8+
  • pip or conda package manager

Installation Steps

  1. Clone the repository

    git clone <repository-url>
    cd grabathon
  2. Install dependencies

    pip install -r requirements.txt
  3. Run the Streamlit application

    streamlit run streamlit_app_refactored.py
  4. Access the application

    • Open your browser to http://localhost:8501
    • The application will load with a demo dataset of 1,000 randomly selected records

🎮 Usage Guide

Web Application Workflow

The Streamlit application provides a 6-step pipeline:

1. Data Upload 📁

  • Demo Dataset: Load 1,000 random records from the full dataset
  • Custom Upload: Upload your own CSV/Excel files
  • Data Validation: Automatic schema validation and quality checks

2. Quality Assurance 🔍

  • Data Overview: Comprehensive statistics and data quality metrics
  • Constraint Validation: Check for logical inconsistencies
  • Missing Value Analysis: Identify and report data completeness
  • Visualization: Correlation heatmaps and distribution plots

3. Feature Engineering 🔧

  • Financial Health Features: Income stability, wallet balance patterns
  • Platform Activity Features: Trip completion rates, surge acceptance
  • Tenure & Experience Features: Registration duration, activity consistency
  • Risk Indicators: Payment flags, low balance frequency
  • Advanced Features: Trend analysis, volatility metrics

4. Model Training 🤖

  • PCA Transformation: Dimensionality reduction (95% variance retention)
  • K-Means Clustering: Optimal cluster discovery using multiple metrics
  • Isolation Forest: Anomaly detection for risk assessment
  • Ensemble Risk Calculation: Weighted combination of risk probabilities
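
The training stage above can be sketched with scikit-learn directly, on random stand-in data. The repository's own wrappers live in src/modeling.py; the ensemble weights (0.6/0.4) and cluster count here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))  # stand-in for the scaled feature matrix

# PCA keeping 95% of variance (n_components as a float selects the
# smallest number of components reaching that cumulative variance)
pca = PCA(n_components=0.95, random_state=42)
X_pca = pca.fit_transform(X)

# K-Means clustering on the reduced space (cluster count is illustrative)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_pca)

# Isolation Forest with 10% contamination for anomaly scores
iso = IsolationForest(contamination=0.10, random_state=42).fit(X_pca)
anomaly_score = -iso.score_samples(X_pca)  # higher = more anomalous

# Weighted ensemble risk in [0, 1] (weights are assumptions)
dist = kmeans.transform(X_pca).min(axis=1)  # distance to nearest centroid
risk = 0.6 * (dist / dist.max()) + 0.4 * (anomaly_score / anomaly_score.max())
```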

5. Fairness Analysis ⚖️

  • Demographic Parity: Comparable score and outcome rates across protected groups
  • Statistical Parity: Bounded differences in prediction rates between groups
  • Bias Detection: Statistical hypothesis testing for group-level differences
  • Mitigation Framework: Multi-stage bias correction

6. Results 📊

  • NovaScore Generation: 300-850 scale credit scores
  • Risk Categorization: High/Medium/Low/Very Low Risk
  • Feature Importance: Model interpretability analysis
  • Download Results: Export complete dataset with scores
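
The four risk categories can be sketched as score bands; the cutoffs below are illustrative assumptions, not the repository's exact thresholds.

```python
def risk_category(novascore: float) -> str:
    """Band a 300-850 NovaScore into the four categories above.
    Cutoffs are illustrative assumptions."""
    if novascore < 500:
        return "High Risk"
    if novascore < 640:
        return "Medium Risk"
    if novascore < 740:
        return "Low Risk"
    return "Very Low Risk"
```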

Programmatic Usage

from src.pipeline import load_and_validate_data
from src.feature_engineering import engineer_features, prepare_modeling_data
from src.modeling import apply_pca_transformation, train_unsupervised_models, calculate_risk_probabilities, calculate_novascore
from src.fairness import assess_fairness_comprehensive, apply_bias_mitigation_comprehensive

# Load and validate data
df = load_and_validate_data(use_demo=True)

# Engineer features
df_engineered = engineer_features(df)
X_scaled_df, demographic_info, scaler = prepare_modeling_data(df_engineered)

# Apply PCA transformation
X_pca_df, pca = apply_pca_transformation(X_scaled_df)

# Train models and calculate scores
models = train_unsupervised_models(X_pca_df, demographic_info)
risk_probs = calculate_risk_probabilities(X_pca_df, models)
novascore_results = calculate_novascore(risk_probs)

# Assess fairness
fairness_results, overall_fair = assess_fairness_comprehensive(novascore_results, demographic_info)

# Apply bias mitigation if needed
if not overall_fair:
    final_results = apply_bias_mitigation_comprehensive(
        X_scaled_df, demographic_info, novascore_results, fairness_results, overall_fair
    )

🧠 Methodology

NovaScore Algorithm

  1. Feature Engineering: Create 20+ engineered features from raw data
  2. Dimensionality Reduction: PCA with 95% variance retention
  3. Unsupervised Clustering: K-Means with optimal cluster selection
  4. Anomaly Detection: Isolation Forest for outlier identification
  5. Risk Probability: Weighted ensemble of clustering and anomaly scores
  6. Score Conversion: Linear transformation to 300-850 scale
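
Step 6, the linear conversion to the 300-850 scale, can be sketched as follows (higher risk maps to a lower score; the exact mapping in src/modeling.py may differ):

```python
import numpy as np

def risk_to_novascore(risk_prob: np.ndarray) -> np.ndarray:
    """Linearly map a risk probability in [0, 1] onto 300-850,
    with risk 0 -> 850 and risk 1 -> 300."""
    risk_prob = np.clip(risk_prob, 0.0, 1.0)
    return 300 + (1.0 - risk_prob) * (850 - 300)

scores = risk_to_novascore(np.array([0.0, 0.5, 1.0]))  # -> [850., 575., 300.]
```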

Fairness Framework

Pre-processing

  • Disparate Impact Removal: Adjust features to reduce correlation with protected attributes
  • Data Balancing: Ensure representative samples across demographic groups

In-processing

  • Fairness-Constrained Clustering: Penalize unfair cluster assignments
  • Protected Attribute Awareness: Include fairness metrics in model optimization

Post-processing

  • Score Calibration: Adjust final scores to achieve demographic parity
  • Threshold Optimization: Set decision boundaries that minimize bias
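
One simple form of post-processing score calibration is a group-wise mean shift, which equalizes group-level average scores. This is a sketch of the idea, not necessarily the method implemented in src/fairness.py.

```python
import pandas as pd

def calibrate_scores(scores: pd.Series, groups: pd.Series) -> pd.Series:
    """Shift each group's scores so every group mean matches the
    overall mean (one simple route to demographic parity)."""
    overall = scores.mean()
    return scores - scores.groupby(groups).transform("mean") + overall

scores = pd.Series([700.0, 720.0, 600.0, 620.0])
groups = pd.Series(["A", "A", "B", "B"])
adjusted = calibrate_scores(scores, groups)  # both group means become 660
```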

Feature Categories

  • Financial Health (3 features): Income stability, wallet balance patterns
  • Platform Activity (5 features): Trip completion, surge acceptance, reliability
  • Tenure & Experience (4 features): Registration duration, activity consistency
  • Risk Indicators (4 features): Payment flags, financial stress indicators
  • Customer Satisfaction (3 features): Rating consistency, service quality

📈 Model Performance

Technical Metrics

  • Silhouette Score: 0.4+ (good cluster separation)
  • Davies-Bouldin Score: <1.0 (compact clusters)
  • PCA Variance Retention: 95% (minimal information loss)
  • Anomaly Detection Rate: 10% (configurable contamination)

Business Metrics

  • Score Range: 300-850 (industry standard)
  • Risk Distribution: Balanced across categories
  • Feature Interpretability: Clear business meaning
  • Processing Speed: <30 seconds for 1,000 records

Fairness Metrics

  • Demographic Parity: Four-fifths (80%) rule compliance
  • Statistical Parity: <10% difference in prediction rates
  • Distribution Fairness: No significant group differences
  • Bias Mitigation: Multi-stage correction framework
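
The four-fifths rule check reduces to a ratio of group-level approval rates; a ratio of at least 0.8 passes. The column names below are illustrative, not the repository's schema.

```python
import pandas as pd

def disparate_impact_ratio(df: pd.DataFrame, group_col: str, approved_col: str) -> float:
    """Four-fifths rule: lowest group approval rate divided by the
    highest; >= 0.8 is conventionally considered compliant."""
    rates = df.groupby(group_col)[approved_col].mean()
    return rates.min() / rates.max()

demo = pd.DataFrame({
    "gender": ["Male"] * 50 + ["Female"] * 50,
    "approved": [1] * 40 + [0] * 10 + [1] * 36 + [0] * 14,
})
ratio = disparate_impact_ratio(demo, "gender", "approved")  # 0.72 / 0.80 = 0.9
```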

🔍 Data Requirements

Required Columns

  • partner_id: Unique identifier
  • age: Partner age (18-100)
  • gender: Male/Female/Other
  • region: Geographic region
  • education_level: Education attainment
  • partner_type: Driver/Merchant/Other
  • registration_date: Platform join date

Optional Columns

  • trips_completed_weekly: Weekly trip completion count
  • customer_rating_avg_all_time: Average customer rating
  • weekly_income: Weekly earnings
  • wallet_balance_d1 to wallet_balance_d7: Daily wallet balances
  • surge_trips_offered/surge_trips_accepted: Surge trip metrics
  • upsell_attempts/upsell_successes: Business performance metrics

Data Quality Standards

  • Completeness: >90% non-null values
  • Consistency: No logical contradictions
  • Validity: Values within expected ranges
  • Freshness: Data within 30 days
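
A minimal validation sketch against the requirements and standards above; the full logic lives in src/pipeline.py and src/qa_checks.py, and the function below is illustrative only.

```python
import pandas as pd

REQUIRED = ["partner_id", "age", "gender", "region",
            "education_level", "partner_type", "registration_date"]

def validate(df: pd.DataFrame) -> list:
    """Return a list of quality issues (empty list = passes)."""
    issues = []
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        issues.append("missing columns: %s" % missing)
    if "age" in df.columns and not df["age"].between(18, 100).all():
        issues.append("age outside the 18-100 range")
    completeness = 1.0 - df.isna().mean().mean()  # overall non-null share
    if completeness < 0.90:
        issues.append("completeness below 90%")
    return issues
```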

🛠️ Technical Architecture

Core Components

  1. Data Pipeline (pipeline.py)

    • Data loading and validation
    • Schema checking and quality assessment
    • Comprehensive error handling
  2. Feature Engineering (feature_engineering.py)

    • Automated feature creation
    • Feature selection and scaling
    • Missing value imputation
  3. Modeling Engine (modeling.py)

    • PCA transformation
    • K-Means clustering
    • Isolation Forest anomaly detection
    • NovaScore calculation
  4. Fairness Framework (fairness.py)

    • Bias detection algorithms
    • Multi-stage mitigation
    • Statistical testing
  5. Visualization Suite (visualizations.py)

    • Interactive plots with Plotly
    • Statistical visualizations
    • Business intelligence charts

Dependencies

streamlit>=1.28.0          # Web application framework
pandas>=1.5.0              # Data manipulation
numpy>=1.24.0              # Numerical computing
matplotlib>=3.6.0          # Static plotting
seaborn>=0.12.0            # Statistical visualization
plotly>=5.15.0             # Interactive plotting
scikit-learn>=1.3.0        # Machine learning
scipy>=1.10.0              # Scientific computing

🚀 Deployment

Local Development

# Install dependencies
pip install -r requirements.txt

# Run application
streamlit run streamlit_app_refactored.py

Production Deployment

  1. Container Setup: Use Docker for consistent environments
  2. Web Server: Run the Streamlit server behind an nginx reverse proxy (Streamlit is not WSGI-based, so gunicorn does not apply)
  3. Database Integration: Connect to PostgreSQL/MySQL
  4. Monitoring: Implement logging and performance metrics
  5. Scaling: Use Kubernetes for horizontal scaling

API Integration

# Example FastAPI endpoint (FastAPI, PartnerData, and novascore_pipeline
# are illustrative, not part of this repository)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PartnerData(BaseModel):
    partner_id: str

@app.post("/calculate-novascore")
async def calculate_novascore(data: PartnerData):
    scores, categories = novascore_pipeline(data)  # run the scoring pipeline
    return {"novascore": scores, "risk_category": categories}

📊 Demo Dataset

The included nova_unlabeled.csv contains 50,000 simulated partner records with:

  • Demographics: Age, gender, region, education
  • Platform Activity: Trips, ratings, income patterns
  • Financial Data: Wallet balances, cashouts, payment history
  • Behavioral Patterns: Surge acceptance, reliability metrics

🔒 Security & Privacy

  • Data Encryption: All data encrypted in transit and at rest
  • Access Control: Role-based permissions
  • Audit Logging: Complete activity tracking
  • GDPR Compliance: Data anonymization and right to deletion
  • Model Explainability: Transparent decision-making process

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

For questions, issues, or contributions:

  • Issues: Use GitHub Issues for bug reports
  • Discussions: Use GitHub Discussions for questions
  • Documentation: Check the notebook for detailed methodology

🎯 Future Enhancements

  • Real-time Scoring: API for live score calculation
  • Model Monitoring: Drift detection and retraining
  • Advanced Fairness: Intersectional bias analysis
  • Multi-platform Support: Extend to other gig economy platforms
  • Mobile App: Native mobile interface
  • A/B Testing: Framework for model comparison

Project Nova - Enabling fair credit access for the gig economy workforce through transparent, unbiased machine learning.

About

For the hackathon of Grab
