🍷 Wine Quality Prediction


📌 Project Overview

This project builds a classification model that segments wines into five distinct quality tiers, from "Low" to "Very High", providing actionable insight for inventory pricing and quality control.


📊 The Data

The dataset covers red and white wines, each described by physicochemical measurements (e.g., acidity, density, chlorides, total sulfur dioxide) alongside a sensory quality score on a 0-10 scale.

🛠️ Technical Workflow

1. Data Cleaning & Feature Engineering

  • Strategic Binning: Transformed raw 0-10 scores into 5 business-relevant tiers: Low, Medium Low, Medium High, High, and Very High.
  • Multicollinearity Management: Dropped Residual Sugar to prevent redundancy with Density, ensuring a cleaner feature set.
  • Label Encoding: Converted categorical tiers into numerical labels for model compatibility.
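The binning and encoding steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the cut points and the sample scores are assumptions, chosen only to show the mechanics.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of raw 0-10 quality scores
df = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8]})

# Bin raw scores into the five business tiers (cut points are illustrative)
tiers = ["Low", "Medium Low", "Medium High", "High", "Very High"]
df["quality_tier"] = pd.cut(df["quality"], bins=[0, 4, 5, 6, 7, 10], labels=tiers)

# Encode the categorical tiers as integer labels for model compatibility
# (note: LabelEncoder assigns codes alphabetically, not by tier order)
df["quality_label"] = LabelEncoder().fit_transform(df["quality_tier"])
```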

2. The "Leakage" Resolution

  • Critical Discovery: Identified initial 100% accuracy as Data Leakage (the model was "cheating" by seeing the original quality score).
  • The Fix: Stripped all target-related data, forcing the model to rely solely on pure physicochemical benchmarks.
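A minimal sketch of the leakage fix, with an invented three-row frame for illustration: because the engineered tier is a deterministic function of the raw `quality` score, any model that sees that column scores near 100%, so every target-derived column must be dropped from the feature matrix.

```python
import pandas as pd

# Hypothetical frame: physicochemical features plus BOTH target-related columns
df = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 6.7],
    "density": [0.9978, 0.9968, 0.9959],
    "quality": [5, 5, 7],          # raw score the tiers were derived from -> leaks
    "quality_label": [1, 1, 3],    # engineered target
})

# The fix: features contain ONLY pure physicochemical benchmarks
X = df.drop(columns=["quality", "quality_label"])
y = df["quality_label"]
```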

3. Advanced Preprocessing

  • Normalization: Applied MinMaxScaler and StandardScaler to handle varying feature magnitudes (e.g., Chlorides vs. Total Sulfur Dioxide).
  • Class Balancing (SMOTE): Addressed the rarity of "Very High" quality wines (imbalanced classes) by synthetically oversampling minority classes using SMOTE (Synthetic Minority Over-sampling Technique). This expanded the training set from 1,279 to 2,970 samples.

🔬 Modeling & Performance

We took an ensemble-first approach to compare how different architectures handled the chemical complexity of each wine type. To verify performance on unseen data, we implemented 5-Fold Stratified Cross-Validation, which maintains class ratios in every fold and kept performance within a consistent margin of roughly ±5%.
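The validation scheme above can be sketched as follows; a synthetic three-class dataset stands in for the wine features, and all sizes and seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the wine feature matrix, with imbalanced tiers
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           n_clusters_per_class=1, weights=[0.5, 0.3, 0.2],
                           random_state=0)

# 5-fold stratified CV keeps each fold's class ratios close to the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```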

🍷 Red Wine Final Results

| Model | Result (Accuracy) | Strategic Insight |
| --- | --- | --- |
| Random Forest (Tuned) | 68.2% | Top Performer: Effectively mapped complex chemical variances. |
| KNN Classifier | 58.5% | Stability: Established a solid, low-variance baseline. |
| SMOTE Impact | Balanced | Fairness: Improved prediction on rare premium tiers. |

🥂 White Wine Final Results

| Model | Result (Accuracy) | Strategic Insight |
| --- | --- | --- |
| Random Forest (Tuned) | 70.4% | Top Performer: High predictability in acidity/sugar balance. |
| KNN Classifier | 57.4% | Reliable: Effective for high-volume automated sorting. |
| SMOTE Impact | Robust | Depth: Handled massive sample increases without overfitting. |

⚖️ Handling Class Imbalance (SMOTE)

A key challenge in this dataset was the heavy concentration of "mid-range" wines (Quality 5 and 6). To address this, we implemented SMOTE (Synthetic Minority Over-sampling Technique) instead of undersampling.

  • Why SMOTE?: We chose this to preserve the rich chemical information within the majority classes. Undersampling would have resulted in a significant loss of data.
  • The Result: This approach allowed us to effectively boost the F1-score for the minority classes (high-quality and low-quality wines), ensuring the model identifies "Premium" wines rather than just guessing the most frequent class.
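Per-class F1 is what reveals whether the model actually identifies rare tiers rather than guessing the majority class. A minimal sketch of that check, with invented predictions over three tiers (tier 2 playing the rare "Very High" role):

```python
from sklearn.metrics import f1_score

# Hypothetical labels: tier 0 is the frequent "mid-range", tier 2 is rare
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 1]

# average=None returns one F1 score PER class, exposing minority-class quality
# that a single accuracy number would hide
per_class = f1_score(y_true, y_pred, average=None)
print(per_class)
```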

🚀 Tech Stack

Languages & Libraries

  • Python 3.x
  • Pandas & NumPy: Data manipulation and matrix operations.
  • Scikit-Learn: Core ML library for scaling, splitting, and modeling.
  • Imbalanced-Learn (SMOTE): For handling minority class distribution.
  • Matplotlib & Seaborn: For feature importance and correlation heatmaps.

Machine Learning Techniques

  • Ensemble Methods: Random Forest, AdaBoost, Gradient Boosting, Bagging.
  • Proximity/Instance-Based: K-Nearest Neighbors (KNN).
  • Optimization: Hyperparameter tuning (max_depth, n_estimators), Stratified K-Fold, and SMOTE.
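A sketch of tuning the two Random Forest parameters named above via grid search; the grid values, dataset, and seeds are illustrative assumptions, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical stand-in dataset
X, y = make_classification(n_samples=200, random_state=0)

# Exhaustive search over max_depth and n_estimators, scored with
# stratified 5-fold CV so every candidate sees balanced folds
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)
```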

The presentation is available here.

Trello dashboard is available here.
