[Discussion] SAHI-Aware Training as a Complement to Compact Architecture for Small Object Detection #7
Context
Great work on EdgeCrafter — the distillation approach for compact ViTs is exactly the right direction
for edge deployment. I wanted to raise a discussion about a complementary strategy that I think
aligns well with the goals of this project.
The Problem
The field keeps pushing toward heavier architectures to solve small object detection —
more parameters, more FLOPs, more complex attention mechanisms. The underlying assumption
is that the network needs more capacity to "see" what the input resolution is hiding.
I think the framing is wrong. The real problem is often upstream: too much information is
lost before it even reaches the network. When you resize a 2K aerial frame to 640×640,
a pedestrian that was 20px tall becomes 5px — not because the model is too small,
but because the preprocessing discarded the spatial information.
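As a back-of-envelope check of that claim (assuming a 2560 px-wide "2K" frame; the helper name is mine, and the numbers are illustrative):

```python
def resized_height(obj_px: int, src_px: int, dst_px: int) -> float:
    """Object height in pixels after uniformly resizing the long side src_px -> dst_px."""
    return obj_px * dst_px / src_px

# A 20 px pedestrian in a 2560 px-wide frame resized to a 640 px network input:
print(resized_height(20, 2560, 640))  # -> 5.0 px, roughly one receptive-field cell
```

No amount of model capacity recovers those pixels once the resize has thrown them away.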
The Proposal: SAHI-Aware Training
SAHI (Slicing Aided Hyper Inference) addresses this by slicing
the image into overlapping tiles before inference, so objects always appear at an adequate scale
relative to the network input. The key insight I want to raise here is:
SAHI should not be just an inference trick — it should be a training strategy.
If you train a model on pre-sliced images (e.g., 448×448 tiles from a 2K frame),
the network learns features on objects at the right scale. At inference, you slice the
same way and merge detections via NMS. The result:
- A smaller input size → faster inference per slice
- A lighter model → fewer parameters needed, because the network isn't fighting information loss
- Higher AP on dense small-object scenarios
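The slice-and-merge loop described above can be sketched roughly as follows. Tile size and overlap follow the values in this post; the function names and the plain greedy NMS are my own simplification for illustration, not the GreedyNMM implementation from the plugins:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score

def slice_coords(img_w: int, img_h: int, tile: int = 448, overlap: float = 0.2):
    """Top-left corners of overlapping tiles covering the full frame (SAHI-style)."""
    step = int(tile * (1 - overlap))
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    # Snap a final tile to each border so the frame is fully covered.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y) for y in ys for x in xs]

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_detections(per_tile_dets: List[List[Box]],
                     offsets: List[Tuple[int, int]],
                     iou_thr: float = 0.5) -> List[Box]:
    """Shift tile-local boxes into frame coordinates, then greedy NMS across tiles."""
    dets: List[Box] = []
    for (ox, oy), boxes in zip(offsets, per_tile_dets):
        for x1, y1, x2, y2, s in boxes:
            dets.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, s))
    dets.sort(key=lambda d: d[4], reverse=True)
    kept: List[Box] = []
    for d in dets:
        if all(iou(d, k) < iou_thr for k in kept):
            kept.append(d)
    return kept
```

Training on tiles produced by the same `slice_coords` geometry is what keeps the train/inference distributions matched, which is the point of doing SAHI at training time rather than only at inference.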
I validated this on VisDrone with a YOLOv9-based model:
| Model | Training | Epochs | mAP@0.5 |
|---|---|---|---|
| GELAN-C (full-640) | Full-frame | 140 | 0.485 |
| GELAN-C (sliced-448) | Sliced tiles (fine-tuned) | 40 | 0.859 |
⚠️ Both models were evaluated with SAHI at inference. The sliced model was fine-tuned from the full-frame checkpoint and stopped at epoch 40, so it was not fully converged; the gap would likely widen with full training.
ECDet-S achieves 51.7 AP on COCO at only 10M params — impressive. But COCO objects are
well-sized relative to the input. The real challenge is datasets like VisDrone/UAVDT where
objects are systematically tiny relative to the frame.
The hypothesis: a SAHI-trained ECDet-S on VisDrone-sliced data would outperform
a much heavier model trained on full frames, while staying well within edge compute budgets.
The balance point isn't "minimum parameters for a given AP on full-frame input" — it's
"minimum parameters for a given AP when input information is preserved via slicing".
That's a fundamentally different optimization target, and it systematically favors compact models.
Related Work
I built native GStreamer/DeepStream plugins that implement SAHI for real-time inference
(pre/post-process plugins with GPU-accelerated slicing and GreedyNMM merge):
https://github.com/levipereira/deepstream-sahi
The training side is documented in the Training Guide.
Happy to discuss or share training configs if useful.