Performance discussion: Pre-resizing in uint8 before ToTensor for large input images #740

@GivenAward

Description

Hello,

First of all, thank you for releasing SAM2 and the excellent implementation.

While profiling preprocessing performance for large-resolution inputs, we observed a performance difference related to the ordering of ToTensor and Resize in the preprocessing pipeline.

Current implementation (SAM2-style) in sam2_image_predictor.py:

ToTensor → Resize → Normalize

where Resize and Normalize are wrapped in a TorchScripted nn.Sequential, and forward_batch applies them inside a Python loop.

Alternative approach:

PIL Resize (uint8) → ToTensor → Normalize

Test Environment

  • Input resolutions tested:
    • 1920×1080
    • 3840×2160
    • 4096×2048
  • Batch size: 8
  • Target resolution: 1024×1024
  • Device: CUDA
  • Preprocessing performed on CPU
  • PyTorch: 2.8.0+cu126, torchvision: 0.23.0+cu126

Benchmark Results

3840×2160 → 1024×1024 (batch=8)

| Pipeline | Avg Time | Python Peak Memory |
|---|---|---|
| PIL Resize → ToTensor → Normalize | 441 ms | ~6 MB |
| ToTensor → Torch Resize → Normalize (current style) | 492 ms | ~48 MB |

4096×2048 → 1024×1024 (batch=8)

| Pipeline | Avg Time | Python Peak Memory |
|---|---|---|
| PIL Resize → ToTensor → Normalize | 418 ms | ~6 MB |
| ToTensor → Torch Resize → Normalize (current style) | 495 ms | ~48 MB |

For smaller inputs (1920×1080), the current SAM2-style pipeline was slightly faster, but for large inputs (4K-level), pre-resizing in uint8 provided:

  • Lower CPU peak memory
  • Faster overall preprocessing time

GPU peak memory remained similar in our setup because resizing was done on CPU before transferring the batch to CUDA.
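To make the numbers reproducible, a harness along these lines can measure per-batch wall time and Python-level peak memory. This is a hedged sketch of a measurement setup, not the exact script behind the tables above; `time.perf_counter` and `tracemalloc` are standard-library tools, and the warmup/iteration counts are arbitrary choices.

```python
# Sketch of a timing/memory harness (assumption: not the original benchmark
# script). Wall time via time.perf_counter, Python peak memory via tracemalloc.
import time
import tracemalloc

def profile(fn, batch, warmup=2, iters=10):
    """Return (avg ms per batch, Python peak memory in bytes) for fn over batch."""
    for _ in range(warmup):              # warm caches / lazy initialization
        for item in batch:
            fn(item)
    tracemalloc.start()
    t0 = time.perf_counter()
    for _ in range(iters):
        for item in batch:
            fn(item)
    elapsed_ms = (time.perf_counter() - t0) * 1000 / iters
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed_ms, peak_bytes

# Usage with a trivial stand-in workload; in practice fn would be one of the
# two preprocessing pipelines and batch a list of 8 PIL images.
avg_ms, peak = profile(lambda x: x * 2, batch=list(range(8)))
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, which is why it captures the intermediate float32 buffers created on the Python side but not GPU memory.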

Question

We understand that placing ToTensor before Resize may simplify TorchScript compatibility for the transform module.

  • Could you clarify whether the current ordering is primarily motivated by TorchScript constraints or deployment consistency?
  • Would it make sense to optionally allow resizing in uint8 before tensor conversion for large-resolution CPU preprocessing scenarios?

We would appreciate any insights on the design rationale.

Thank you again for your work.
