This project transforms raw ecommerce transaction data into actionable customer intelligence. By moving beyond standard reporting, we identified a critical misalignment in the company's loyalty program and built a Predictive "Whale" Engine.
Using a Random Forest Classifier, we scored every customer on their propensity to be a High-Value Shopper (Top 25%), creating a targeted list of 112 VIPs who represent $46,365 in immediate revenue risk.
Our analysis revealed a counter-intuitive trend: Non-Members are more valuable than Loyalty Members.
-
Insight: Loyalty members use discounts 54% of the time (vs 49% for non-members) and have a lower average basket size.
-
Business Impact: The current program incentivizes "discount seeking" rather than high-value purchasing.
| Metric | Loyalty Members | Non-Members |
| --- | --- | --- |
| Avg Lifetime Value (CLV) | $1,797 | $2,020 🏆 |
| Discount Usage Freq | 54% | 49% |
| Avg Spend per Order | $261 | $288 |
We isolated the elite 10% of the customer base to understand their habits.
-
Demographics: Predominantly Female, aged 30-35.
-
Top Categories: Baby Products, Clothing, and Electronics.
-
Shopping Channel: They are Omnichannel shoppers (engaging both Online and Offline) rather than single-channel users.
-
Age vs. Spend: There is zero correlation (R² = 0.0003) between age and spending power. A 20-year-old is just as likely to spend $500 as a 50-year-old.
-
The Device Gap: Myth busted. Mobile users spend nearly identical amounts per cart as Desktop users.
We engineered a machine learning pipeline to predict high-value potential based on demographic traits, allowing marketing to target users before they make a purchase.
-
Model: Random Forest Classifier (
n_estimators=100) -
Target: Top 25% of Spenders (Threshold: >$418/order)
-
Features:
Age,Gender,Income_Level,Education_Level -
Performance: Strong positive correlation (R² = 0.81) between predicted propensity scores and actual spending.
The project includes a Plotly Interactive Scatterplot allowing stakeholders to hover over customers to see real-time predictive scores.
(Note: Correlation between Actual Spend and Predicted Whale Propensity)
Instead of a static CSV, the final output of this project is a structured JSON-style Intelligence Profile for every customer. This structure is designed to be fed directly into a CRM or Marketing API.
Sample Entry:
"37-611-6911": {
"Profile": {
"Age": 22,
"Gender": "Female",
"Location": "Évry"
},
"Metrics": {
"Top_Cat": "Gardening & Outdoors",
"Spend": 333.80,
"Loyalty_Level": 5
},
"Predictions": {
"Whale_Propensity": 0.892,
"Recommended_Action": "VIP Outreach"
}
}
-
Cleaning: Removed currency symbols (
$), handled nulls, and converted data types. -
Feature Engineering: Created
CLV_Score(Spend × Frequency) and 5-Year Age Bins. -
Encoding: Applied One-Hot Encoding to categorical variables (Gender, Education).
-
Pandas: Data manipulation and aggregation.
-
Seaborn: Static statistical visualizations (Boxplots, Heatmaps).
-
Plotly Express: Interactive, hover-able web charts.
-
Scikit-Learn: Random Forest implementation.
- Install Requirements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
- Launch the Notebook
jupyter notebook Consumer_Behavior_Project.ipynb
- Download the CSV Used in This Project
This can be found here (Kaggle account may be required)
Based on the data, the following business actions are recommended:
-
Restructure Loyalty: Move from "Points per Visit" to "Points per Dollar" to reverse the negative CLV trend.
-
Target the 30-35 Demographic: Shift ad spend for Baby Products/Electronics to this high-conversion age group.
-
Deploy the "Consumer Dict": Integrate the JSON output into the email marketing platform to trigger automatic "VIP Outreach" emails when a user's Propensity Score > 0.6.
Author: Daniel Ohebshalom