🍷 Wine Quality Prediction


📌 Project Overview

This project builds a multi-class classification model for wine quality. By segmenting wines into five distinct quality tiers, from "Low" to "Very High", the model provides actionable insights for inventory pricing and quality control.


📊 The Data

Two datasets are modeled separately: red wine and white wine. Each record describes a wine by its physicochemical measurements (e.g., acidity, residual sugar, chlorides, density, sulfur dioxide) together with a 0-10 sensory quality score, which serves as the basis for the target tiers.
🛠️ Technical Workflow

1. Data Cleaning & Feature Engineering

  • Strategic Binning: Transformed raw 0-10 scores into 5 business-relevant tiers: Low, Medium Low, Medium High, High, and Very High.
  • Multicollinearity Management: Dropped Residual Sugar to prevent redundancy with Density, ensuring a cleaner feature set.
  • Label Encoding: Converted categorical tiers into numerical labels for model compatibility.
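The binning, feature-dropping, and encoding steps above can be sketched as follows. The exact cut points are an assumption (the README does not state them), and the tiny DataFrame is illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def bin_quality(scores: pd.Series) -> pd.Series:
    """Map raw 0-10 quality scores into five business tiers (assumed cut points)."""
    return pd.cut(
        scores,
        bins=[0, 4, 5, 6, 7, 10],  # assumed tier boundaries
        labels=["Low", "Medium Low", "Medium High", "High", "Very High"],
        include_lowest=True,
    )

# Illustrative sample rows, not the real dataset
df = pd.DataFrame({
    "quality": [3, 5, 6, 7, 8],
    "residual sugar": [1.9, 2.6, 2.3, 1.2, 2.0],
    "density": [0.9978, 0.9968, 0.9970, 0.9946, 0.9962],
})

df["tier"] = bin_quality(df["quality"])
df = df.drop(columns=["residual sugar"])  # multicollinear with density

# Encode the categorical tiers as integer labels for the models
le = LabelEncoder()
df["tier_label"] = le.fit_transform(df["tier"])
```

Note that `LabelEncoder` assigns labels alphabetically, so the integer codes do not follow the tier order; ordinal mapping is an alternative if order matters to the model.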

2. The "Leakage" Resolution

  • Critical Discovery: Identified initial 100% accuracy as Data Leakage (the model was "cheating" by seeing the original quality score).
  • The Fix: Stripped all target-related data, forcing the model to rely solely on pure physicochemical benchmarks.
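A minimal sketch of the fix, assuming the target-related columns are named `quality` and `tier` (illustrative names): every column derived from the target must leave the feature matrix before training.

```python
import pandas as pd

# Illustrative frame: two physicochemical features plus two target-related columns
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "alcohol": [9.4, 9.8],
    "quality": [5, 6],                        # original score -> leakage source
    "tier": ["Medium Low", "Medium High"],    # derived label -> also leakage
})

y = df["tier"]
X = df.drop(columns=["quality", "tier"])  # strip all target-related columns
```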

3. Advanced Preprocessing

  • Normalization: Applied MinMaxScaler and StandardScaler to handle varying feature magnitudes (e.g., Chlorides vs. Total Sulfur Dioxide).
  • Class Balancing (SMOTE): Addressed the rarity of "Very High" quality wines (imbalanced classes) by synthetically oversampling minority classes using SMOTE (Synthetic Minority Over-sampling Technique). This expanded the training set from 1,279 to 2,970 samples.

🔬 Modeling & Performance

We utilized an ensemble-first approach to compare how different architectures handled the chemical complexity of each wine type. To confirm the models generalize to unseen data, we implemented 5-Fold Stratified Cross-Validation, maintaining class ratios across folds and keeping performance within a consistent margin ($\pm 5\%$).
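A sketch of this validation scheme, run on a synthetic stand-in dataset rather than the wine data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class dataset standing in for the wine features
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)

# Stratified splits preserve the class ratios in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The fold-to-fold standard deviation printed here is what the consistency margin above refers to.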

🍷 Red Wine Final Results

| Model | Result (Accuracy) | Strategic Insight |
| --- | --- | --- |
| Random Forest (Tuned) | 68.2% | Top Performer: Effectively mapped complex chemical variances. |
| KNN Classifier | 58.5% | Stability: Established a solid, low-variance baseline. |
| SMOTE Impact | Balanced | Fairness: Improved prediction on rare premium tiers. |

🥂 White Wine Final Results

| Model | Result (Accuracy) | Strategic Insight |
| --- | --- | --- |
| Random Forest (Tuned) | 70.4% | Top Performer: High predictability in acidity/sugar balance. |
| KNN Classifier | 57.4% | Reliable: Effective for high-volume automated sorting. |
| SMOTE Impact | Robust | Depth: Handled massive sample increases without overfitting. |

⚖️ Handling Class Imbalance (SMOTE)

A key challenge in this dataset was the heavy concentration of "mid-range" wines (Quality 5 and 6). To address this, we implemented SMOTE (Synthetic Minority Over-sampling Technique) instead of undersampling.

  • Why SMOTE?: We chose this to preserve the rich chemical information within the majority classes. Undersampling would have resulted in a significant loss of data.
  • The Result: This approach allowed us to effectively boost the F1-score for the minority classes (high-quality and low-quality wines), ensuring the model identifies "Premium" wines rather than just guessing the most frequent class.
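The F1 effect described above can be illustrated with a toy example. The labels and predictions here are hypothetical, chosen only to contrast a model that guesses the frequent class with one that also recovers the rare tiers; macro-averaged F1 weights every class equally, so it exposes the difference:

```python
from sklearn.metrics import f1_score

# Class 0 is the frequent "mid-range" tier; 1 and 2 are rare tiers
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred_majority = [0] * 8                    # always guess the frequent class
y_pred_balanced = [0, 0, 0, 1, 1, 1, 2, 2]   # hypothetical post-SMOTE model

print(f1_score(y_true, y_pred_majority, average="macro"))
print(f1_score(y_true, y_pred_balanced, average="macro"))
```

Despite the majority-guesser's decent raw accuracy, its macro F1 collapses because the rare classes score zero, which is exactly the failure mode SMOTE targets.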

🚀 Tech Stack

Languages & Libraries

  • Python 3.x
  • Pandas & NumPy: Data manipulation and matrix operations.
  • Scikit-Learn: Core ML library for scaling, splitting, and modeling.
  • Imbalanced-Learn (SMOTE): For handling minority class distribution.
  • Matplotlib & Seaborn: For feature importance and correlation heatmaps.

Machine Learning Techniques

  • Ensemble Methods: Random Forest, AdaBoost, Gradient Boosting, Bagging.
  • Proximity-Based: K-Nearest Neighbors (KNN).
  • Optimization: Hyperparameter tuning (max_depth, n_estimators), Stratified K-Fold, and SMOTE.
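The tuning step can be sketched with scikit-learn's `GridSearchCV` over the parameters named above; the grid values and dataset here are illustrative, not the ones actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in dataset
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           random_state=0)

# Illustrative grid over the parameters mentioned in the README
grid = {"max_depth": [5, 10], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_)
```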

The presentation is available here.

Trello dashboard is available here.
