Performance discussion: Pre-resizing in uint8 before ToTensor for large input images #740

@GivenAward

Description

Hello,

First of all, thank you for releasing SAM2 and the excellent implementation.

While profiling preprocessing performance for large-resolution inputs, we observed a performance difference related to the ordering of ToTensor and Resize in the preprocessing pipeline.

Current implementation (SAM2-style) in sam2_image_predictor.py:

ToTensor → Resize → Normalize

where Resize and Normalize are wrapped in a TorchScripted nn.Sequential, and forward_batch applies them inside a Python loop.

Alternative approach:

PIL Resize (uint8) → ToTensor → Normalize

Test Environment

  • Input resolutions tested:
    • 1920×1080
    • 3840×2160
    • 4096×2048
  • Batch size: 8
  • Target resolution: 1024×1024
  • Device: CUDA
  • Preprocessing performed on CPU
  • PyTorch: 2.8.0+cu126, torchvision: 0.23.0+cu126

Benchmark Results

3840×2160 → 1024×1024 (batch=8)

| Pipeline | Avg Time | Python Peak Memory |
|---|---|---|
| PIL Resize → ToTensor → Normalize | 441 ms | ~6 MB |
| ToTensor → Torch Resize → Normalize (current style) | 492 ms | ~48 MB |

4096×2048 → 1024×1024 (batch=8)

| Pipeline | Avg Time | Python Peak Memory |
|---|---|---|
| PIL Resize → ToTensor → Normalize | 418 ms | ~6 MB |
| ToTensor → Torch Resize → Normalize (current style) | 495 ms | ~48 MB |

For smaller inputs (1920×1080), the current SAM2-style pipeline was slightly faster, but for large inputs (4K-level), pre-resizing in uint8 provided:

  • Lower CPU peak memory
  • Faster overall preprocessing time

GPU peak memory remained similar in our setup because resizing was done on CPU before transferring the batch to CUDA.
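To make the numbers reproducible, a harness along these lines can measure per-batch wall time and Python-level peak memory. This is a hedged sketch of a measurement setup, not the exact script behind the tables above; `time.perf_counter` and `tracemalloc` are standard-library tools, and the warmup/iteration counts are arbitrary choices.

```python
# Sketch of a timing/memory harness (assumption: not the original benchmark
# script). Wall time via time.perf_counter, Python peak memory via tracemalloc.
import time
import tracemalloc

def profile(fn, batch, warmup=2, iters=10):
    """Return (avg ms per batch, Python peak memory in bytes) for fn over batch."""
    for _ in range(warmup):              # warm caches / lazy initialization
        for item in batch:
            fn(item)
    tracemalloc.start()
    t0 = time.perf_counter()
    for _ in range(iters):
        for item in batch:
            fn(item)
    elapsed_ms = (time.perf_counter() - t0) * 1000 / iters
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed_ms, peak_bytes

# Usage with a trivial stand-in workload; in practice fn would be one of the
# two preprocessing pipelines and batch a list of 8 PIL images.
avg_ms, peak = profile(lambda x: x * 2, batch=list(range(8)))
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, which is why it captures the intermediate float32 buffers created on the Python side but not GPU memory.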

Question

We understand that placing ToTensor before Resize may simplify TorchScript compatibility for the transform module.

  • Could you clarify whether the current ordering is primarily motivated by TorchScript constraints or deployment consistency?
  • Would it make sense to optionally allow resizing in uint8 before tensor conversion for large-resolution CPU preprocessing scenarios?

We would appreciate any insights on the design rationale.

Thank you again for your work.
