This project is a classification model. By segmenting wines into five distinct quality categories, from "Low" to "Very High", it provides actionable insights for inventory pricing and quality control.
- Source: UCI Machine Learning Repository - Wine Quality Dataset
- Composition: 6,497 observations of Red and White wine.
- Features: 11 physicochemical inputs (pH, Alcohol, Citric Acid, etc.).
- Target: Quality category (Engineered from raw scores).
- Strategic Binning: Transformed raw 0-10 scores into 5 business-relevant tiers: `Low`, `Medium Low`, `Medium High`, `High`, and `Very High`.
- Multicollinearity Management: Dropped `Residual Sugar` to prevent redundancy with `Density`, ensuring a cleaner feature set.
- Label Encoding: Converted categorical tiers into numerical labels for model compatibility.
- Critical Discovery: Identified initial 100% accuracy as Data Leakage (the model was "cheating" by seeing the original quality score).
- The Fix: Stripped all target-related data, forcing the model to rely solely on pure physicochemical benchmarks.
- Normalization: Applied `MinMaxScaler` and `StandardScaler` to handle varying feature magnitudes (e.g., Chlorides vs. Total Sulfur Dioxide).
- Class Balancing (SMOTE): Addressed the rarity of "Very High" quality wines (imbalanced classes) by synthetically oversampling minority classes using SMOTE (Synthetic Minority Over-sampling Technique). This expanded the training set from 1,279 to 2,970 samples.
We utilized an ensemble-first approach to compare how different architectures handled the chemical complexity of each wine type. To ensure the model generalizes to unseen data, we implemented 5-Fold Stratified Cross-Validation, maintaining class ratios in every fold and ensuring a consistent performance margin.
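The validation scheme can be sketched as below; the dataset here is synthetic, so the scores are illustrative and not the project's reported accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic multi-class data standing in for the wine features.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)

# 5-fold stratified CV keeps the class ratios identical in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```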
| Model | Result (Accuracy) | Strategic Insight |
|---|---|---|
| Random Forest (Tuned) | 68.2% | Top Performer: Effectively mapped complex chemical variances. |
| KNN Classifier | 58.5% | Stability: Established a solid, low-variance baseline. |
| SMOTE Impact | Balanced | Fairness: Improved prediction on rare premium tiers. |
| Model | Result (Accuracy) | Strategic Insight |
|---|---|---|
| Random Forest (Tuned) | 70.4% | Top Performer: High predictability in acidity/sugar balance. |
| KNN Classifier | 57.4% | Reliable: Effective for high-volume automated sorting. |
| SMOTE Impact | Robust | Depth: Handled massive sample increases without overfitting. |
A key challenge in this dataset was the heavy concentration of "mid-range" wines (Quality 5 and 6). To address this, we implemented SMOTE (Synthetic Minority Over-sampling Technique) instead of undersampling.
- Why SMOTE?: We chose this to preserve the rich chemical information within the majority classes. Undersampling would have resulted in a significant loss of data.
- The Result: This approach allowed us to effectively boost the F1-score for the minority classes (high-quality and low-quality wines), ensuring the model identifies "Premium" wines rather than just guessing the most frequent class.
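The per-class F1 comparison described above can be computed as in this toy example; the labels and predictions are illustrative only, not project results.

```python
from sklearn.metrics import f1_score

# Toy labels: 0 = low tier, 1 = mid tiers, 2 = "Very High" (illustrative).
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1]

# average=None returns one F1 score per class, exposing minority-class
# performance that a single accuracy number would hide.
per_class = f1_score(y_true, y_pred, average=None)
print(per_class)
```

Tracking these per-class scores before and after SMOTE is what shows whether the model truly identifies rare tiers rather than defaulting to the most frequent class.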
- Python 3.x
- Pandas & NumPy: Data manipulation and matrix operations.
- Scikit-Learn: Core ML library for scaling, splitting, and modeling.
- Imbalanced-Learn (SMOTE): For handling minority class distribution.
- Matplotlib & Seaborn: For feature importance and correlation heatmaps.
- Ensemble Methods: Random Forest, AdaBoost, Gradient Boosting, Bagging.
- Proximity-Based: K-Nearest Neighbors (KNN).
- Optimization: Hyperparameter tuning (max_depth, n_estimators), Stratified K-Fold, and SMOTE.
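As a sketch of the tuning step, a grid search over `max_depth` and `n_estimators` under stratified 5-fold CV might look like this; the grid values and data are assumptions, not the project's actual search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic multi-class data standing in for the wine features.
X, y = make_classification(n_samples=200, n_features=11, n_informative=4,
                           n_classes=3, random_state=0)

# Grid-search the two tuned hyperparameters under stratified 5-fold CV.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [4, 8, None], "n_estimators": [50, 100]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```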