Estimate the physical height and width of a product from nothing but a user-taken image.
Can we predict a product’s real-world dimensions without any measurement input — using only user-taken photos?
This project explores how much physical insight can be extracted from visual-only signals, especially in noisy, real-world e-commerce scenarios.
- Scraped 24,000+ user-uploaded images from an e-commerce platform.
- Covered 520 product categories, ~60 images per product.
- Images were noisy and inconsistent, as expected from user content.
✅ A custom visual filtering pipeline automatically retained only clean product views, removing the need for manual curation.
Bounding box annotations were not available, so I experimented with several zero-shot / low-shot object detection approaches:
- VGG16-based Visual Outlier Detection
- YOLOv8
- CLIP + SAM
- ✅ GroundingDINO (selected: best performance)
This allowed the pipeline to isolate only the product in each photo, which significantly improved downstream predictions.
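The isolation step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the detector has already returned pixel-space boxes in `[x0, y0, x1, y1]` format with one confidence score each. The helper name `crop_best_box`, the score threshold, and the padding ratio are all hypothetical.

```python
import numpy as np

def crop_best_box(image, boxes, scores, min_score=0.3, pad=0.05):
    """Crop the highest-scoring detection from an H x W x C image.

    boxes:  (N, 4) array of [x0, y0, x1, y1] pixel coordinates.
    scores: (N,) detector confidences.
    Returns None when no box reaches min_score, so the photo can be dropped.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0 or scores.max() < min_score:
        return None

    x0, y0, x1, y1 = boxes[int(scores.argmax())]
    h, w = image.shape[:2]

    # Pad the box slightly so the product edges are not clipped,
    # then clamp the result to the image bounds.
    dx, dy = pad * (x1 - x0), pad * (y1 - y0)
    x0 = int(max(0, np.floor(x0 - dx)))
    y0 = int(max(0, np.floor(y0 - dy)))
    x1 = int(min(w, np.ceil(x1 + dx)))
    y1 = int(min(h, np.ceil(y1 + dy)))
    return image[y0:y1, x0:x1]
```

Returning `None` for low-confidence photos means the same function can also act as a filter: images where no product is found are simply skipped.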
Beyond raw image input, each segmented product was used to compute 12 visual-statistical features, including:
- Aspect ratio
- Normalized area
- Rectangularity
- Center offset
- Foreground-background contrast
- and more
These structured features act as a helpful inductive bias alongside image-based learning.
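To make the named features concrete, here is a hedged sketch that computes five of them from a grayscale image and a boolean product mask. The exact definitions used in the project are not stated, so the formulas below are common conventions chosen for illustration (for example, aspect ratio as width over height, and rectangularity as mask area over bounding-box area). The function name `mask_features` is hypothetical.

```python
import numpy as np

def mask_features(gray, mask):
    """Compute shape and contrast features for one segmented product.

    gray: (H, W) grayscale image with values in [0, 255].
    mask: (H, W) boolean array, True where the product is.
    """
    gray = np.asarray(gray, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")

    # Tight bounding box around the mask.
    box_h = ys.max() - ys.min() + 1
    box_w = xs.max() - xs.min() + 1
    area = ys.size

    # Distance from the mask centroid to the image center,
    # normalized by the image diagonal.
    offset = np.hypot(xs.mean() - (w - 1) / 2, ys.mean() - (h - 1) / 2)
    offset /= np.hypot(w, h)

    # Absolute difference in mean brightness between product and background.
    fg_mean = gray[mask].mean()
    background = gray[~mask]
    bg_mean = background.mean() if background.size else fg_mean

    return {
        "aspect_ratio": box_w / box_h,
        "normalized_area": area / (h * w),
        "rectangularity": area / (box_h * box_w),
        "center_offset": float(offset),
        "fg_bg_contrast": abs(fg_mean - bg_mean) / 255.0,
    }
```

All five values are independent of image resolution, so they can be concatenated into one vector without per-image rescaling.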
Input: Cropped product image + 12D feature vector
Output: Real-world height & width (float regression)
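One way to wire this input/output contract together is a small fusion module: the backbone embeds the image, a small MLP embeds the 12 features, and a regression head maps the concatenation to two numbers. This is a sketch under assumptions, not the project's architecture. The class name `SizeRegressor`, the layer widths, and the tiny convolutional stand-in backbone are all hypothetical. In practice the backbone would be a pretrained model, for instance one created with `timm.create_model(..., num_classes=0)` so that it returns pooled features.

```python
import torch
from torch import nn

class SizeRegressor(nn.Module):
    """Fuse an image embedding with a hand-crafted feature vector."""

    def __init__(self, backbone, embed_dim, n_features=12, hidden=128):
        super().__init__()
        self.backbone = backbone  # maps (B, 3, H, W) -> (B, embed_dim)
        self.feature_mlp = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(embed_dim + 32, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs: [height, width]
        )

    def forward(self, image, features):
        fused = torch.cat(
            [self.backbone(image), self.feature_mlp(features)], dim=1
        )
        return self.head(fused)

# Tiny stand-in backbone so the sketch runs without downloading weights.
tiny_backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

model = SizeRegressor(tiny_backbone, embed_dim=8)
pred = model(torch.randn(4, 3, 64, 64), torch.randn(4, 12))  # shape (4, 2)
```

Because the backbone is passed in as an argument, the same wrapper can be reused to compare different backbones as long as `embed_dim` matches each backbone's output width.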
- ResNet50
- EfficientNetB3
- ConvNeXt
- ✅ Swin Transformer (best performer)
After identifying Swin as the top model, I applied a two-phase training strategy:
- Frozen backbone: Only the regression head was trained initially.
- Unfrozen fine-tuning: Full model was then fine-tuned end-to-end.
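Training the head first keeps large early gradients from a randomly initialized head from disturbing the pretrained backbone, which improved stability and reduced overfitting in early training.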
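The two phases can be sketched in PyTorch by toggling `requires_grad` on the backbone parameters. This is an illustrative toy under assumptions: the linear stand-in backbone, random data, L1 loss, learning rates, and step counts are all hypothetical, and the image-only model is used just to keep the example short.

```python
import torch
from torch import nn

def set_backbone_trainable(backbone, trainable):
    """Freeze or unfreeze every parameter of the backbone."""
    for param in backbone.parameters():
        param.requires_grad = trainable

def train_steps(model, optimizer, steps):
    """Run a few optimization steps on random stand-in data."""
    loss_fn = nn.L1Loss()
    for _ in range(steps):
        images = torch.randn(8, 3, 8, 8)
        targets = torch.rand(8, 2)
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()

# Stand-in for a pretrained backbone plus a fresh regression head.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.ReLU())
head = nn.Linear(16, 2)
model = nn.Sequential(backbone, head)

# Phase 1: frozen backbone, only the head is optimized.
set_backbone_trainable(backbone, False)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
weights_before = backbone[1].weight.detach().clone()
train_steps(model, optimizer, steps=5)
frozen_unchanged = torch.equal(weights_before, backbone[1].weight)

# Phase 2: unfreeze everything and fine-tune end-to-end at a lower rate.
set_backbone_trainable(backbone, True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
train_steps(model, optimizer, steps=5)
```

A fresh optimizer is created for phase 2 so that it tracks the newly trainable backbone parameters. The much smaller learning rate is a common choice for fine-tuning pretrained weights.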
- Native support for research-centric models: CLIP, SAM, GroundingDINO
- Dynamic computation graphs
- Easier experimentation and debugging
- Rapid prototyping with Hugging Face, timm, and segmentation libraries
This pipeline combines object detection, visual feature engineering, and deep regression to estimate real-world product sizes from visual signals only.
- 🛍️ E-commerce auto-tagging (dimensions, volume, proportions)
- 📦 Packaging optimization (logistics, shipping cost estimation)
- 📱 Mobile apps (dimensions from a photo, DIY tools)
- 🧾 Metadata generation for large-scale product databases
🧪 Live Demo (Hugging Face Space):
👉 Launch Demo
📘 Full Notebook (Kaggle):
📎 View on Kaggle
💻 Source Code (GitHub):
💾 GitHub Repository
Open to collaboration and feedback!
Feel free to reach out via GitHub or connect on LinkedIn.