To help navigation, this README includes:
- Introduction
- Why Explainable Pricing Matters
- Dataset Source + Owner Explanation
- Project Philosophy & Design Goals
- High-Level Summary of the System
- Architecture
- Data Pipeline
- Feature Engineering
- Model Training
- Value Decomposition Theory
- Mathematical Formulation
- Example: Decomposed Pricing Story
- Command-Line Interface (CLI Guide)
- Streamlit App Walkthrough (With Screenshots)
- Explainability Tools
- Dataset Insights
- Future Enhancements
- Real Business Use Cases
Most machine learning projects stop at prediction. This one does not.
Car-Value-Decoding-Engine goes beyond predicting car prices; it explains them.
It behaves like a professional human appraiser, breaking a car’s predicted price into meaningful and interpretable components:
- How much the brand adds
- How much age subtracts
- How mileage influences resale value
- How engine size affects base value
- How condition affects desirability
- How fuel type shifts market expectation
- How transmission affects demand
Instead of offering a mysterious ML-generated price, it offers:
“This price makes sense because each part contributes fairly and logically.”
This is what true explainable machine learning looks like.
Car pricing is not random; it is a function of measurable and emotional components. But real-world ML pricing models often behave like opaque black boxes, making them:
- hard to trust
- hard to understand
- hard to debug
- hard to justify
This project solves that by making predictions transparent, interpretable, and auditable.
For businesses, explainability helps:
- build consumer trust
- improve regulatory acceptance
- assist negotiation
- support fairness & compliance
- enable strategic decisions
For developers, explainability helps:
- verify model logic
- detect data bias
- validate assumptions
- discover feature interactions
- avoid model hallucinations
The dataset comes from Abdullah Meo (Kaggle):
https://www.kaggle.com/datasets/abdullahmeo/car-price-pridiction
- It contains realistic automotive attributes
- It reflects genuine market patterns
- It includes both numeric and categorical data
- It features non-linear interactions (perfect for Random Forests)
- It enables creation of interpretable pricing logic
Every attribute connects directly to a real-world pricing factor:
| Dataset Column | Real-World Meaning |
|---|---|
| Brand | Reputation, luxury factor |
| Model | Design variant, trims |
| Engine Size | Performance & spec level |
| Mileage | Wear and tear |
| Year | Depreciation factor |
| Condition | Market-readiness |
| Transmission | Market demand |
| Fuel Type | Running cost perception |
This allows a rich, insightful ML system.
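As a quick sketch of the schema (column names taken from the table above; the real CSV headers may differ slightly), the numeric/categorical split looks like this in pandas:

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's schema (values invented for illustration)
sample = pd.DataFrame({
    "Brand": ["Audi", "Toyota"],
    "Model": ["A4", "Corolla"],
    "Engine Size": [2.0, 1.6],
    "Mileage": [82_000, 45_000],
    "Year": [2017, 2020],
    "Condition": ["Good", "Excellent"],
    "Transmission": ["Automatic", "Manual"],
    "Fuel Type": ["Petrol", "Hybrid"],
    "Price": [21_500, 17_800],
})

numeric_cols = sample.select_dtypes("number").columns.tolist()
categorical_cols = sample.select_dtypes("object").columns.tolist()
print(numeric_cols)      # numeric features + target
print(categorical_cols)  # categorical features
```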
This project follows five guiding principles:
1. Explainability first: a pricing model must explain itself, not just output numbers.
2. Simulation over black boxes: instead of predicting blindly, the model simulates:
   - brand uplift
   - mileage penalty
   - aging depreciation
   - spec-based value
3. Modularity: every part of the system (data, model, decomposition, and UI) is separated.
4. Multiple audiences: designed to be used by:
   - dealerships
   - pricing analysts
   - buyers/sellers
   - researchers
5. Industry realism: architecture mirrors real ML systems used in industry.
The system has four major components:
- Data pipeline (`data_prep.py`, `features.py`): cleans raw input, engineers features, handles missingness, and creates baseline rows.
- Model training (`train_model.py`): trains a Random Forest with encoded categorical variables.
- Value decomposition (`value_decomposition.py`): simulates feature-group replacement to compute contributions.
- Streamlit app (`app/app.py`): lets users experiment with values and view explanations.
```
┌────────────────────┐
│  Raw Kaggle Data   │
└───────┬────────────┘
        │ data_prep.py
        ▼
┌────────────────────┐
│  Cleaned Dataset   │
│ +Engineered Fields │
└───────┬────────────┘
        │ features.py
        ▼
┌────────────────────┐
│   Preprocessing    │
│  (Scaling + OHE)   │
└───────┬────────────┘
        │ train_model.py
        ▼
┌────────────────────┐
│   Trained Model    │
│  + Baseline Stats  │
└───────┬────────────┘
        │ value_decomposition.py
        ▼
┌────────────────────┐
│   Decomposition    │
│ Explanation Engine │
└───────┬────────────┘
        │ app/app.py
        ▼
┌────────────────────┐
│   Streamlit App    │
│ (Decoder + Tools)  │
└────────────────────┘
```
Each piece is independently testable and replaceable.
The pipeline handles messy real-world data gracefully.
Loads the raw CSV from `data/raw/car_price_prediction.csv`.
Trims whitespace, normalizes values.
Ensures Year, Mileage, Engine Size, and Price are valid numbers, and filters out impossible values (negative mileage, zero-age cars, etc.).
Derives car age, a human-friendly depreciation measure.
Computes mileage per year, which normalizes mileage intensity.
Replaces missing values with "Unknown" to avoid model crashes.
- Car age
- Engine size
- Mileage
- Mileage intensity (km per year)
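A minimal sketch of the two engineered fields above (assuming a fixed current year for illustration; the project may derive it dynamically):

```python
import pandas as pd

CURRENT_YEAR = 2024  # assumption; the real pipeline may compute this at runtime

df = pd.DataFrame({"Year": [2017, 2020, 2024], "Mileage": [82_000, 45_000, 500]})

# Car age: human-friendly depreciation measure
df["car_age"] = CURRENT_YEAR - df["Year"]

# Mileage intensity: km per year, guarding against division by zero for new cars
df["mileage_per_year"] = df["Mileage"] / df["car_age"].clip(lower=1)

print(df)
```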
Why numeric scaling matters: Random Forests don’t require scaling, but scaling helps decomposition consistency.
- Brand
- Fuel type
- Transmission
- Condition
OneHotEncoder ensures that each category becomes its own dimension.
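A sketch of how this preprocessing might be wired up with scikit-learn's `ColumnTransformer` (the feature names here are illustrative, not necessarily the project's exact column names):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["car_age", "engine_size", "mileage"]
categorical = ["brand", "fuel_type", "transmission", "condition"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),                            # scaled numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # one column per category
])

df = pd.DataFrame({
    "car_age": [3, 7], "engine_size": [2.0, 1.6], "mileage": [40_000, 120_000],
    "brand": ["Audi", "Toyota"], "fuel_type": ["Petrol", "Diesel"],
    "transmission": ["Automatic", "Manual"], "condition": ["Good", "Fair"],
})
X = preprocess.fit_transform(df)
print(X.shape)  # 3 scaled numerics + one dimension per observed category
```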
Why a Random Forest? Because it offers:
- robustness
- nonlinearity
- feature interaction learning
- stability under perturbations (important for decomposition)
- simplicity of deployment
The model pipeline:
- Preprocess numerics & categoricals
- Fit Random Forest
- Save model and training stats
- Store baseline feature row for decomposition
- Save metrics (MAE, RMSE, R²)
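The steps above could look roughly like this (synthetic data stands in for the real dataset; column names and hyperparameters are assumptions, not the project's exact settings):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with a known pricing structure
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "car_age": rng.integers(0, 15, n),
    "mileage": rng.integers(0, 200_000, n),
    "brand": rng.choice(["Audi", "Toyota", "Ford"], n),
})
price = 35_000 - 1_200 * df["car_age"] - 0.04 * df["mileage"] \
        + np.where(df["brand"] == "Audi", 5_000, 0) + rng.normal(0, 500, n)

# Preprocess numerics & categoricals, then fit the Random Forest
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["car_age", "mileage"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ])),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
])
model.fit(df, price)

pred = model.predict(df)
print(f"MAE: {mean_absolute_error(price, pred):,.0f}")
print(f"R²:  {r2_score(price, pred):.3f}")
```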
This module (value_decomposition.py) is the heart of the project.
It solves the hardest ML problem:
"Given a prediction, how do we determine how much each factor contributed?"
SHAP is powerful, but:
- difficult to explain to non-experts
- computationally expensive
- unstable across model changes
- not grouped by human-friendly categories
Our method is:
- deterministic
- stable
- grouped by meaningful components
- mathematically sound
- domain-aligned
Let:
- X_base = baseline feature vector (with X_group0 = X_base)
- X_groupi = X_group(i−1) with feature group i replaced by the target car's values
- f() = trained model
- price_base = f(X_base)
- price_groupi = f(X_groupi)
Contribution of group i:
C_i = price_groupi − price_group(i−1)
Final predicted price (the contributions telescope):
P = price_base + Σ C_i
This ensures explainability is:
- additive
- human-readable
- consistent
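Here is a self-contained sketch of the sequential group-replacement idea on a toy model (the features and groupings are illustrative, not the project's actual ones). Because the contributions telescope, they sum exactly to the gap between the baseline price and the final prediction:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the trained pricing model (real features/groups will differ)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "car_age": rng.integers(0, 15, 200),
    "mileage": rng.integers(0, 200_000, 200),
    "engine_size": rng.choice([1.2, 1.6, 2.0, 3.0], 200),
})
y = 40_000 - 1_500 * X["car_age"] - 0.05 * X["mileage"] + 4_000 * X["engine_size"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Human-friendly feature groups (illustrative grouping)
groups = {"age": ["car_age"], "usage": ["mileage"], "spec": ["engine_size"]}

def decompose(model, baseline_row, target_row, groups):
    """Replace each group in the baseline with the target car's values, one
    group at a time; each step's price change is that group's contribution C_i."""
    current = baseline_row.copy()
    prev_price = model.predict(current.to_frame().T)[0]   # price_base
    contributions = {}
    for name, cols in groups.items():
        current[cols] = target_row[cols]
        price = model.predict(current.to_frame().T)[0]
        contributions[name] = price - prev_price          # C_i = price_i - price_(i-1)
        prev_price = price
    return contributions

baseline = X.median()   # the "average car" baseline row
target = X.iloc[0]
contribs = decompose(model, baseline, target, groups)
price_base = model.predict(baseline.to_frame().T)[0]
final_price = model.predict(target.to_frame().T)[0]
print(contribs)
print(price_base + sum(contribs.values()), "==", final_price)  # additivity
```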
Imagine a car predicted at $44,210.
The model might say:
- +472 because it's Audi (brand premium)
- –914 because it's older
- +13,378 because of its mileage pattern relative to the dataset distribution
- –15,497 because automatic transmission is penalized in this dataset
- +255 because of good condition
- +1,657 from engine size
This creates a transparent story, not just a number.
- `python -m src.cli prepare-data`: cleans and caches the dataset.
- `python -m src.cli train`: trains the ML engine.
- `python -m src.cli evaluate`: computes metrics and saves them.
- `python -m src.cli decode-car --index n`: prints a detailed decomposition story.
- Lets you simulate any car configuration
- Predicts the price
- Shows value breakdown
- Visualizes contributions
- Outputs explainable JSON
Great for:
- buyers
- sellers
- ML explainability demos
- pricing analysts
- JSON export: outputs the full machine-readable breakdown.
- Color-coded contributions: interpretation of value increases and penalties.
- Car comparison: compare specs, final predictions, and contribution profiles.
- Dataset visualizations: visual dataset understanding.
- Feature importance: shows the global influence of features.
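As a sketch of what a global feature-importance readout looks like with a Random Forest (the data and feature names below are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Price depends strongly on column 0, weakly on column 1, not at all on column 2
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1 and reveal which features drive predictions globally
for name, imp in zip(["car_age", "mileage", "engine_size"], rf.feature_importances_):
    print(f"{name:12s} {imp:.3f}")
```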
Bias detection: shows whether the model systematically favors or penalizes:
- transmissions
- fuel types
- conditions
- brands
This ensures fairness in pricing systems.
The dataset exhibits some interesting behaviors. For example, higher mileage does not always lower the predicted price. Why?
- Some high-mileage cars belong to luxury brands
- Mileage clusters may correlate with engine size
- Market bias encoded in source data
Automatic transmissions carry a price penalty relative to manuals in this dataset. Possible reasons:
- Region prefers manual cars
- Dataset bias
- Engine-transmission pairs not uniformly distributed
Understanding these patterns is vital when applying ML to economics.
- Dealerships: price-appraisal accuracy plus explainability builds trust.
- Sellers: show a transparent pricing breakdown to buyers.
- Appraisers: use decomposition to justify assessments.
- Researchers: study economic patterns in vehicle markets.
- Pricing analysts: validate pricing strategies using feature importances.
This decomposition framework can also be used for:
- home pricing engines
- insurance risk scoring
- credit scoring
- e-commerce price optimization
- medical diagnosis attribution
- HR salary intelligence
Anywhere you need:
- prediction + explanation
- transparency + trust