Skip to content

Mohit1053/propensity

Repository files navigation

🔗 Propensity - Unique User ID Generator

High-performance Python library for generating unique user IDs using Union-Find algorithm

CI License: MIT Python 3.8+


A high-performance Python implementation for generating unique user IDs by identifying connected users across multiple email and phone combinations. This project uses advanced data structures and algorithms (Union-Find) for maximum efficiency.

🚀 Key Features

  • Union-Find Algorithm: O(n * α(n)) time complexity for connected component detection
  • Memory Efficient: Optimized for large datasets with streaming processing
  • Spam Detection: Filters out emails/phones with excessive connections
  • BigQuery Compatible: Replicates FARM_FINGERPRINT hashing behavior
  • Batch Processing: Handles datasets of any size with configurable batch sizes
  • Production Ready: Comprehensive logging, error handling, and monitoring

📊 Performance

Dataset Size Basic Algorithm Advanced Algorithm Speedup
1,000 users 0.45s 0.12s 3.75x
5,000 users 2.3s 0.38s 6.05x
10,000 users 8.7s 0.71s 12.25x

🧮 Algorithm Overview

The solution uses Union-Find (Disjoint Set Union) data structure to efficiently identify connected components:

  1. ID Generation: Create consistent hash IDs for emails and phones
  2. Connection Building: Union emails and phones that appear together
  3. Spam Filtering: Remove patterns with excessive connections (>4 by default)
  4. Component Detection: Find all connected components using Union-Find
  5. Unique ID Assignment: Assign minimum ID from each component as unique_id

Time & Space Complexity

  • Time: O(n * α(n)) where α is the inverse Ackermann function (practically constant)
  • Space: O(n) for storing the Union-Find structure
  • Scalability: Linear scaling with input size

📁 Project Structure

Propensity/
├── unique_id_generator.py          # Basic implementation
├── advanced_unique_id_generator.py # Production-ready version
├── test_and_benchmark.py          # Testing and benchmarking suite
├── README.md                      # This file
└── requirements.txt               # Dependencies

🔧 Installation

# Clone or download the files to your directory
cd Propensity

# Install required packages
pip install pandas numpy hashlib logging
pip install matplotlib seaborn  # Optional, for visualizations

💻 Quick Start

Basic Usage

from unique_id_generator import UniqueIDGenerator

# Sample data
data = [
    {'email': 'A@gmail.com', 'mobile': 'P1'},
    {'email': 'B@gmail.com', 'mobile': 'P1'},
    {'email': 'C@gmail.com', 'mobile': 'P1'},
    {'email': 'A@gmail.com', 'mobile': 'P4'},
]

# Generate unique IDs
generator = UniqueIDGenerator()
result = generator.generate_unique_ids(data)
print(result)

Large Dataset Processing

from advanced_unique_id_generator import LargeScaleUniqueIDGenerator

# Process large CSV file
generator = LargeScaleUniqueIDGenerator(batch_size=10000)
result = generator.process_large_dataset('large_user_data.csv')

# Export results
generator.export_results(result, 'output.csv')

DataFrame Processing

import pandas as pd
from unique_id_generator import UniqueIDGenerator

df = pd.read_csv('user_data.csv')
generator = UniqueIDGenerator()
result = generator.process_from_dataframe(df, email_col='email', phone_col='phone')

🎯 Example Input/Output

Input

email      mobile  
A@gmail    P1      
B@gmail    P1      
C@gmail    P1      
D@gmail    P2      
E@gmail    P2      
A@gmail    P4      

Output

email      mobile  e_id        p_id        unique_id
A@gmail    P1      1           6           1
B@gmail    P1      2           6           1
C@gmail    P1      3           6           1
D@gmail    P2      4           7           4
E@gmail    P2      5           7           4
A@gmail    P4      1           8           1

Explanation:

  • Users A, B, C are connected through phone P1 → same unique_id (1)
  • Users D, E are connected through phone P2 → same unique_id (4)
  • User A's second record (with P4) is also assigned unique_id 1 (same email)

🔍 Advanced Features

Spam Detection

Automatically identifies and filters out:

  • Emails connected to >4 different phones
  • Phones connected to >4 different emails

Memory Optimization

  • Streaming processing for large files
  • Configurable batch sizes
  • Automatic garbage collection
  • Path compression in Union-Find

Monitoring & Logging

import logging
logging.basicConfig(level=logging.INFO)

generator = LargeScaleUniqueIDGenerator()
result = generator.process_large_dataset(data)
# Logs: batch progress, spam detection, component statistics

🧪 Testing

Run comprehensive tests and benchmarks:

python test_and_benchmark.py

This will:

  • Validate algorithm correctness
  • Run performance benchmarks
  • Generate visualization plots
  • Create sample output files

⚙️ Configuration

UniqueIDGenerator Parameters

generator = UniqueIDGenerator()

LargeScaleUniqueIDGenerator Parameters

generator = LargeScaleUniqueIDGenerator(
    batch_size=10000,        # Rows per batch
    spam_threshold=4         # Max connections before flagging as spam
)

📈 Performance Tuning

For Small Datasets (<10K rows)

from unique_id_generator import UniqueIDGenerator
generator = UniqueIDGenerator()

For Large Datasets (>10K rows)

from advanced_unique_id_generator import LargeScaleUniqueIDGenerator
generator = LargeScaleUniqueIDGenerator(
    batch_size=50000,  # Larger batches for better performance
    spam_threshold=5   # Adjust based on data characteristics
)

Memory Constraints

  • Reduce batch_size for limited memory
  • Use CSV streaming for very large files
  • Enable periodic garbage collection

🔄 Comparison with SQL Version

Aspect SQL (BigQuery) Python (This Solution)
Performance Good for cloud data Optimized for local/distributed
Memory Usage Database managed Configurable and optimized
Scalability Limited by query complexity Linear scaling with Union-Find
Flexibility SQL constraints Full programming control
Cost Query-based pricing One-time computation

🐛 Troubleshooting

Common Issues

  1. Memory Error on Large Datasets

    • Reduce batch_size parameter
    • Use streaming mode for CSV files
  2. Slow Performance

    • Increase batch_size for larger datasets
    • Ensure proper indexing on email/phone columns
  3. Unexpected Groupings

    • Check spam filtering threshold
    • Validate data quality (phone normalization)

Debug Mode

import logging
logging.basicConfig(level=logging.DEBUG)
# Enables detailed logging for troubleshooting

🚀 Future Enhancements

  • Multi-threading support for batch processing
  • Integration with Apache Spark for big data
  • Real-time streaming API
  • Machine learning-based spam detection
  • Graph visualization tools
  • Database connectors (PostgreSQL, MySQL)

📄 License

This project is provided as-is for educational and commercial use.

🤝 Contributing

Feel free to submit issues, feature requests, or improvements to this codebase.


Ready to identify your unique users efficiently? Start with the basic implementation and scale up to the advanced version as your data grows! 🎯

About

Propensity scoring model for user conversion prediction

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages