🍷 Wine Quality Prediction


📌 Project Overview

This project builds a classification model that segments wines into five distinct quality tiers, from "Low" to "Very High", providing actionable insight for inventory pricing and quality control.


📊 The Data

The dataset covers red and white wines, each described by physicochemical measurements (e.g., acidity, density, chlorides, total sulfur dioxide) alongside a sensory quality score on a 0-10 scale.

🛠️ Technical Workflow

1. Data Cleaning & Feature Engineering

  • Strategic Binning: Transformed raw 0-10 scores into 5 business-relevant tiers: Low, Medium Low, Medium High, High, and Very High.
  • Multicollinearity Management: Dropped Residual Sugar to prevent redundancy with Density, ensuring a cleaner feature set.
  • Label Encoding: Converted categorical tiers into numerical labels for model compatibility.
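The binning and encoding steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the cut points and the sample scores are assumptions, chosen only to show the mechanics.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of raw 0-10 quality scores
df = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8]})

# Bin raw scores into the five business tiers (cut points are illustrative)
tiers = ["Low", "Medium Low", "Medium High", "High", "Very High"]
df["quality_tier"] = pd.cut(df["quality"], bins=[0, 4, 5, 6, 7, 10], labels=tiers)

# Encode the categorical tiers as integer labels for model compatibility
# (note: LabelEncoder assigns codes alphabetically, not by tier order)
df["quality_label"] = LabelEncoder().fit_transform(df["quality_tier"])
```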

2. The "Leakage" Resolution

  • Critical Discovery: Identified initial 100% accuracy as Data Leakage (the model was "cheating" by seeing the original quality score).
  • The Fix: Stripped all target-related data, forcing the model to rely solely on pure physicochemical benchmarks.
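A minimal sketch of the leakage fix, with an invented three-row frame for illustration: because the engineered tier is a deterministic function of the raw `quality` score, any model that sees that column scores near 100%, so every target-derived column must be dropped from the feature matrix.

```python
import pandas as pd

# Hypothetical frame: physicochemical features plus BOTH target-related columns
df = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 6.7],
    "density": [0.9978, 0.9968, 0.9959],
    "quality": [5, 5, 7],          # raw score the tiers were derived from -> leaks
    "quality_label": [1, 1, 3],    # engineered target
})

# The fix: features contain ONLY pure physicochemical benchmarks
X = df.drop(columns=["quality", "quality_label"])
y = df["quality_label"]
```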

3. Advanced Preprocessing

  • Normalization: Applied MinMaxScaler and StandardScaler to handle varying feature magnitudes (e.g., Chlorides vs. Total Sulfur Dioxide).
  • Class Balancing (SMOTE): Addressed the rarity of "Very High" quality wines (imbalanced classes) by synthetically oversampling minority classes using SMOTE (Synthetic Minority Over-sampling Technique). This expanded the training set from 1,279 to 2,970 samples.

🔬 Modeling & Performance

We took an ensemble-first approach to compare how different architectures handled the chemical complexity of each wine type. To verify performance on unseen data, we implemented 5-Fold Stratified Cross-Validation, which maintains class ratios in every fold and kept performance within a consistent margin of roughly ±5%.
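The validation scheme above can be sketched as follows; a synthetic three-class dataset stands in for the wine features, and all sizes and seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the wine feature matrix, with imbalanced tiers
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           n_clusters_per_class=1, weights=[0.5, 0.3, 0.2],
                           random_state=0)

# 5-fold stratified CV keeps each fold's class ratios close to the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```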

🍷 Red Wine Final Results

| Model | Result (Accuracy) | Strategic Insight |
| --- | --- | --- |
| Random Forest (Tuned) | 68.2% | Top Performer: Effectively mapped complex chemical variances. |
| KNN Classifier | 58.5% | Stability: Established a solid, low-variance baseline. |
| SMOTE Impact | Balanced | Fairness: Improved prediction on rare premium tiers. |

🥂 White Wine Final Results

| Model | Result (Accuracy) | Strategic Insight |
| --- | --- | --- |
| Random Forest (Tuned) | 70.4% | Top Performer: High predictability in acidity/sugar balance. |
| KNN Classifier | 57.4% | Reliable: Effective for high-volume automated sorting. |
| SMOTE Impact | Robust | Depth: Handled massive sample increases without overfitting. |

⚖️ Handling Class Imbalance (SMOTE)

A key challenge in this dataset was the heavy concentration of "mid-range" wines (Quality 5 and 6). To address this, we implemented SMOTE (Synthetic Minority Over-sampling Technique) instead of undersampling.

  • Why SMOTE?: We chose this to preserve the rich chemical information within the majority classes. Undersampling would have resulted in a significant loss of data.
  • The Result: This approach allowed us to effectively boost the F1-score for the minority classes (high-quality and low-quality wines), ensuring the model identifies "Premium" wines rather than just guessing the most frequent class.
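Per-class F1 is what reveals whether the model actually identifies rare tiers rather than guessing the majority class. A minimal sketch of that check, with invented predictions over three tiers (tier 2 playing the rare "Very High" role):

```python
from sklearn.metrics import f1_score

# Hypothetical labels: tier 0 is the frequent "mid-range", tier 2 is rare
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 1]

# average=None returns one F1 score PER class, exposing minority-class quality
# that a single accuracy number would hide
per_class = f1_score(y_true, y_pred, average=None)
print(per_class)
```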

🚀 Tech Stack

Languages & Libraries

  • Python 3.x
  • Pandas & NumPy: Data manipulation and matrix operations.
  • Scikit-Learn: Core ML library for scaling, splitting, and modeling.
  • Imbalanced-Learn (SMOTE): For handling minority class distribution.
  • Matplotlib & Seaborn: For feature importance and correlation heatmaps.

Machine Learning Techniques

  • Ensemble Methods: Random Forest, AdaBoost, Gradient Boosting, Bagging.
  • Proximity/Instance-Based: K-Nearest Neighbors (KNN).
  • Optimization: Hyperparameter tuning (max_depth, n_estimators), Stratified K-Fold, and SMOTE.
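A sketch of tuning the two Random Forest parameters named above via grid search; the grid values, dataset, and seeds are illustrative assumptions, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical stand-in dataset
X, y = make_classification(n_samples=200, random_state=0)

# Exhaustive search over max_depth and n_estimators, scored with
# stratified 5-fold CV so every candidate sees balanced folds
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)
```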

The presentation is available here.

Trello dashboard is available here.
