This project implements a multimodal machine learning system for large-scale product classification by combining image features, text features, and categorical metadata using LightGBM.
It was developed for the Amazon ML Challenge and focuses on building a production-style pipeline for handling heterogeneous e-commerce data.
Real-world product data often contains multiple modalities:
- Product images
- Text descriptions and titles
- Structured categorical metadata
This project fuses all three modalities into a single supervised learning model to improve prediction performance and robustness.
The pipeline consists of:
- Pretrained CNN backbone (EfficientNet-B0 baseline; EfficientNet-B3 in the final model)
- `include_top=False` with Global Average Pooling
- Output: fixed-length image embeddings (1536-D)
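Global Average Pooling is what turns the backbone's spatial feature map into a fixed-length embedding. A minimal numpy sketch (the random array stands in for real CNN activations; 1536 channels matches EfficientNet-B3's final feature map):

```python
import numpy as np

# Stand-in for a CNN feature map of shape (height, width, channels).
# EfficientNet-B3 with include_top=False produces 1536 channels.
feature_map = np.random.rand(10, 10, 1536)

# Global Average Pooling: average over the spatial dimensions,
# leaving one value per channel -> a fixed-length embedding.
embedding = feature_map.mean(axis=(0, 1))

print(embedding.shape)  # (1536,)
```

Because the spatial dimensions are averaged away, images of different resolutions still yield embeddings of the same length.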
- NLP vectorization: TF-IDF with (1, 2) n-grams, or embedding-based alternatives
- Preprocessing: cleaning, tokenization
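The text branch can be sketched with scikit-learn's `TfidfVectorizer` using the unigram + bigram range noted above (the sample titles are illustrative, not from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative product titles (not from the actual dataset).
titles = [
    "stainless steel water bottle 1l",
    "insulated steel water bottle",
    "cotton t-shirt blue large",
]

# Unigrams and bigrams, lowercased during preprocessing.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
text_vectors = vectorizer.fit_transform(titles)

print(text_vectors.shape)  # (3, n_ngram_features)
```

The output is a sparse matrix with one row per product, ready to be fused with the other modalities.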
- Native categorical features
- Preprocessing: skew correction and missing/null-value imputation
- Handled directly using LightGBM categorical support
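LightGBM consumes categorical columns natively when they carry the pandas `category` dtype, and skewed numeric metadata can be log-transformed first. A sketch with made-up columns (`brand`, `weight_g` are assumptions, not the dataset's actual fields):

```python
import numpy as np
import pandas as pd

# Illustrative metadata frame (column names are assumptions).
df = pd.DataFrame({
    "brand": ["acme", "globex", None, "acme"],
    "weight_g": [120.0, 15000.0, 300.0, np.nan],
})

# Null handling: fill categorical NAs with a sentinel level,
# then mark the column as categorical for LightGBM.
df["brand"] = df["brand"].fillna("unknown").astype("category")

# Skew correction: impute, then log1p to compress the heavy right tail.
df["weight_g"] = np.log1p(df["weight_g"].fillna(df["weight_g"].median()))

print(df.dtypes)
```

With `category`-typed columns in the DataFrame, LightGBM's scikit-learn API picks them up automatically via its default `categorical_feature='auto'` setting.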
- Concatenation of:
- Image embeddings
- Text vectors
- Categorical features
- LightGBM classifier / regressor
- Optimized for large-scale tabular + multimodal data
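The fusion step is plain horizontal concatenation of the per-modality feature blocks. A numpy sketch with toy dimensions (real image embeddings are 1536-D and the text block far wider):

```python
import numpy as np

n_products = 4

# Toy per-modality feature blocks (widths are illustrative).
image_embeddings = np.random.rand(n_products, 8)              # CNN embeddings
text_vectors = np.random.rand(n_products, 6)                  # densified text features
categorical_codes = np.random.randint(0, 3, (n_products, 2))  # encoded metadata

# Fused design matrix: one row per product, all modalities side by side.
X = np.hstack([image_embeddings, text_vectors, categorical_codes])

print(X.shape)  # (4, 16)
```

Each row of `X` is a single product's complete feature vector, which is what the LightGBM model trains on.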
LightGBM was selected due to:
- Native categorical feature handling
- High performance on tabular data
- Low memory footprint
- Fast training and inference
- Strong performance on large datasets
The original dataset was provided as part of the Amazon ML Challenge and contains:
- Product images
- Text fields (title, description, etc.)
- Categorical metadata
- Target labels
Note: The dataset is not included in this repository due to size constraints and competition licensing restrictions.
- Preprocess images and extract CNN embeddings
- Preprocess text and generate feature vectors
- Encode categorical metadata
- Align features by product ID
- Train LightGBM model
- Evaluate using validation split
- Export trained model
- Make predictions on test set
- Submit predictions to competition platform
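The "align features by product ID" step above amounts to inner-joining the per-modality tables on a shared key. A pandas sketch (column names such as `product_id`, `img_f0` are assumptions for illustration):

```python
import pandas as pd

# Illustrative per-modality tables keyed by product ID.
image_feats = pd.DataFrame({"product_id": [1, 2, 3], "img_f0": [0.1, 0.5, 0.9]})
text_feats = pd.DataFrame({"product_id": [2, 3, 1], "txt_f0": [0.7, 0.2, 0.4]})
labels = pd.DataFrame({"product_id": [1, 2, 3], "price": [9.99, 4.5, 20.0]})

# Inner joins keep only products present in every modality,
# guaranteeing row-aligned feature blocks.
train = (
    image_feats
    .merge(text_feats, on="product_id", how="inner")
    .merge(labels, on="product_id", how="inner")
)

print(train)
```

Inner joins are the safe default here: a product missing any one modality is dropped rather than silently filled with misaligned features.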
The model was evaluated using:
- Mean Absolute Error (MAE) on log-transformed targets
- Inverse transformation applied during evaluation: `mae_price = np.exp(mae_log) - 1`

This ensures predictions are evaluated in the original price scale.
- Achieved MAE (log-transformed): 0.517287
- Corresponding MAE (price scale): 0.67747
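The back-transform of the reported log-scale error can be reproduced directly (values taken from the results above; the `exp(x) - 1` form is the inverse of a `log1p` target transform):

```python
import numpy as np

mae_log = 0.517287               # reported MAE on log-transformed targets
mae_price = np.exp(mae_log) - 1  # inverse of log1p maps it back to the price scale

print(round(mae_price, 5))  # 0.67747
```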
- Dataset not publicly distributable
- Training requires high memory for feature matrices
- CNN backbone frozen (no fine-tuning)
- Evaluation limited to challenge metrics
Two CNN backbones were evaluated for image feature extraction:
- EfficientNet-B0 (baseline)
  - Faster and lightweight
  - Used in early experiments
- EfficientNet-B3 (final model)
  - Higher representational capacity
  - Better feature quality for downstream LightGBM fusion
  - Selected as the final image encoder
The final system uses EfficientNet-B3 embeddings for training and inference.
- End-to-end fine-tuning of image backbone
- Transformer-based text embeddings
- Model ensembling
- Online inference API
- Feature store integration
- Hyperparameter optimization
This project is released under the Apache 2.0 License.
Developed by AJ.