[Discussion] SAHI-Aware Training as a Complement to Compact Architecture for Small Object Detection #7
Context
Great work on EdgeCrafter — the distillation approach for compact ViTs is exactly the right direction
for edge deployment. I wanted to raise a discussion about a complementary strategy that I think
aligns well with the goals of this project.
The Problem
The field keeps pushing toward heavier architectures to solve small object detection —
more parameters, more FLOPs, more complex attention mechanisms. The underlying assumption
is that the network needs more capacity to "see" what the input resolution is hiding.
I think the framing is wrong. The real problem is often upstream: too much information is
lost before it even reaches the network. When you resize a 2K aerial frame to 640×640,
a pedestrian that was 20px tall becomes 5px — not because the model is too small,
but because the preprocessing discarded the spatial information.
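As a back-of-envelope check of that claim (assuming a 2560 px-wide "2K" frame; the helper name is mine, and the numbers are illustrative):

```python
def resized_height(obj_px: int, src_px: int, dst_px: int) -> float:
    """Object height in pixels after uniformly resizing the long side src_px -> dst_px."""
    return obj_px * dst_px / src_px

# A 20 px pedestrian in a 2560 px-wide frame resized to a 640 px network input:
print(resized_height(20, 2560, 640))  # -> 5.0 px, roughly one receptive-field cell
```

No amount of model capacity recovers those pixels once the resize has thrown them away.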
The Proposal: SAHI-Aware Training
SAHI (Slicing Aided Hyper Inference) addresses this by slicing
the image into overlapping tiles before inference, so objects always appear at an adequate scale
relative to the network input. The key insight I want to raise here is:
SAHI should not be just an inference trick — it should be a training strategy.
If you train a model on pre-sliced images (e.g., 448×448 tiles from a 2K frame),
the network learns features on objects at the right scale. At inference, you slice the
same way and merge detections via NMS. The result:
- A smaller input size → faster inference per slice
- A lighter model → fewer parameters needed, because the network isn't fighting information loss
- Higher AP on dense small-object scenarios
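The slice-and-merge loop described above can be sketched roughly as follows. Tile size and overlap follow the values in this post; the function names and the plain greedy NMS are my own simplification for illustration, not the GreedyNMM implementation from the plugins:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score

def slice_coords(img_w: int, img_h: int, tile: int = 448, overlap: float = 0.2):
    """Top-left corners of overlapping tiles covering the full frame (SAHI-style)."""
    step = int(tile * (1 - overlap))
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    # Snap a final tile to each border so the frame is fully covered.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y) for y in ys for x in xs]

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_detections(per_tile_dets: List[List[Box]],
                     offsets: List[Tuple[int, int]],
                     iou_thr: float = 0.5) -> List[Box]:
    """Shift tile-local boxes into frame coordinates, then greedy NMS across tiles."""
    dets: List[Box] = []
    for (ox, oy), boxes in zip(offsets, per_tile_dets):
        for x1, y1, x2, y2, s in boxes:
            dets.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, s))
    dets.sort(key=lambda d: d[4], reverse=True)
    kept: List[Box] = []
    for d in dets:
        if all(iou(d, k) < iou_thr for k in kept):
            kept.append(d)
    return kept
```

Training on tiles produced by the same `slice_coords` geometry is what keeps the train/inference distributions matched, which is the point of doing SAHI at training time rather than only at inference.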
I validated this on VisDrone with a YOLOv9-based model:
| Model | Training | Epochs | mAP@0.5 |
|---|---|---|---|
| GELAN-C (full-640) | Full-frame | 140 | 0.485 |
| GELAN-C (sliced-448) | Sliced tiles (fine-tuned) | 40 | 0.859 |
⚠️ Both models were evaluated with SAHI at inference. The sliced model was fine-tuned from the full-frame checkpoint and stopped at epoch 40, so it was not fully converged; the gap would likely widen with full training.
ECDet-S achieves 51.7 AP on COCO at only 10M params — impressive. But COCO objects are
well-sized relative to the input. The real challenge is datasets like VisDrone/UAVDT where
objects are systematically tiny relative to the frame.
The hypothesis: a SAHI-trained ECDet-S on VisDrone-sliced data would outperform
a much heavier model trained on full frames, while staying well within edge compute budgets.
The balance point isn't "minimum parameters for a given AP on full-frame input" — it's
"minimum parameters for a given AP when input information is preserved via slicing".
That's a fundamentally different optimization target, and it systematically favors compact models.
Related Work
I built native GStreamer/DeepStream plugins that implement SAHI for real-time inference
(pre/post-process plugins with GPU-accelerated slicing and GreedyNMM merge):
https://github.com/levipereira/deepstream-sahi
The training side is documented in the Training Guide.
Happy to discuss or share training configs if useful.