profile: GPU occupancy of batched SVD on small D matrices #24

@hunter-heidenreich

Description

Goal

Profile GPU utilization of torch.linalg.svd on batched small-D inputs (D=2, 3, 5) to find the effective batch size B at which GPU occupancy reaches a useful threshold, and document this as a usage recommendation.

Motivation

Batched SVD on (B, 3, 3) tensors is known to be GPU-occupancy-bound for small B: each 3×3 SVD problem is far too small to fill a warp (32 lanes), so most GPU threads sit idle. As a result, for small B the GPU may actually be slower than the CPU for the SVD step. Users deploying this library for per-sample alignment (B=1) or small-batch inference may unknowingly be running on a suboptimal device.
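The small-B penalty is easy to sanity-check with a wall-clock micro-benchmark before any profiling. A minimal sketch (the `time_svd` helper is illustrative, not part of the library, and absolute numbers depend on hardware):

```python
import time
import torch

def time_svd(B, D, device, iters=20):
    """Median wall time (seconds) of torch.linalg.svd on a (B, D, D) batch."""
    x = torch.randn(B, D, D, device=device)
    torch.linalg.svd(x)  # warm-up: exclude lazy init / first-launch cost
    if device == "cuda":
        torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        torch.linalg.svd(x)
        if device == "cuda":
            torch.cuda.synchronize()  # SVD launches async; wait for the kernel
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

print(f"CPU, B=1, D=3: {time_svd(1, 3, 'cpu') * 1e6:.1f} us")
if torch.cuda.is_available():
    print(f"GPU, B=1, D=3: {time_svd(1, 3, 'cuda') * 1e6:.1f} us")
```

At B=1 the GPU time is typically dominated by kernel-launch overhead, which is exactly the effect this issue wants quantified.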

Experimental Design

Using the PyTorch profiler (or NVIDIA Nsight):

  1. Profile torch.linalg.svd on (B, 3, 3) for B in [1, 4, 16, 64, 256, 1024, 4096, 16384]
  2. Record: SM utilization (%), memory bandwidth utilization, kernel duration
  3. Compare against the same operation on CPU (device='cpu') -- find the GPU/CPU crossover in wall time
  4. Repeat for D=5 and D=10 to show how occupancy improves with matrix size
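Steps 1 and 2 can be driven from the built-in torch.profiler. A sketch (assumption: the PyTorch profiler reports kernel durations, while SM-utilization and memory-bandwidth counters must come from Nsight Compute instead):

```python
import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

for B in [1, 4, 16, 64, 256, 1024, 4096, 16384]:
    x = torch.randn(B, 3, 3, device=device)
    torch.linalg.svd(x)  # warm-up so the profile excludes one-time costs
    with profile(activities=activities) as prof:
        torch.linalg.svd(x)
    table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
    print(f"--- B={B} ---")
    print(table)
```

For the hardware-counter part of step 2, wrap the same loop with Nsight Compute (`ncu`), since SM utilization is not exposed through torch.profiler's summary tables.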

Expected Deliverables

  • Plot: GPU SM utilization vs. B for each D
  • Plot: GPU vs. CPU wall time for SVD vs. B -- mark crossover point
  • Written threshold recommendation (e.g. "GPU is beneficial for B > ~256 with D=3")
  • Documentation PR adding a "Performance Notes" section with this guidance
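For the wall-time crossover deliverable, one possible measurement loop. This is a sketch only: `median_time` and `crossover_batch` are hypothetical helpers, and the returned B will vary with the specific CPU/GPU pair:

```python
import time
import torch

def median_time(fn, iters=20):
    """Median wall time (seconds) of calling fn()."""
    ts = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        ts.append(time.perf_counter() - t0)
    return sorted(ts)[len(ts) // 2]

def crossover_batch(D=3, batches=(1, 4, 16, 64, 256, 1024, 4096, 16384)):
    """Smallest B at which the GPU beats the CPU, or None if it never does."""
    for B in batches:
        x_cpu = torch.randn(B, D, D)
        cpu_t = median_time(lambda: torch.linalg.svd(x_cpu))

        x_gpu = x_cpu.cuda()
        torch.linalg.svd(x_gpu)  # warm-up
        torch.cuda.synchronize()

        def gpu_run():
            torch.linalg.svd(x_gpu)
            torch.cuda.synchronize()  # include the full async kernel time

        gpu_t = median_time(gpu_run)
        if gpu_t < cpu_t:
            return B
    return None

if torch.cuda.is_available():
    print("GPU/CPU crossover for D=3: B =", crossover_batch())
```

The returned B, swept over D, is the number that should back the written threshold recommendation.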

Notes


    Labels

    benchmark -- Performance measurement or profiling
    framework:pytorch -- PyTorch-specific issue
    moderate -- Moderate impact, fix when possible
    performance -- Runtime performance improvement
