Open
Labels
benchmark (Performance measurement or profiling), framework:pytorch (PyTorch-specific issue), moderate (Moderate impact, fix when possible), performance (Runtime performance improvement)
Description
Goal
Profile GPU utilization of torch.linalg.svd on batched small-D inputs (D = 3, 5, 10) to find the effective batch size B at which GPU occupancy reaches a useful threshold, and document this as a usage recommendation.
Motivation
Batched SVD on (B, 3, 3) tensors is known to be GPU-occupancy-bound for small B: each 3×3 SVD problem is too small to fill even a single warp (32 lanes), so most GPU threads sit idle. As a result, for small B the GPU may actually be slower than the CPU for the SVD step. Users deploying this library for per-sample alignment (B=1) or small-batch inference may be unknowingly running on a suboptimal device.
Experimental Design
Using the PyTorch profiler (or NVIDIA Nsight):
- Profile `torch.linalg.svd` on `(B, 3, 3)` inputs for B in [1, 4, 16, 64, 256, 1024, 4096, 16384]
- Record: SM utilization (%), memory bandwidth utilization, kernel duration
- Compare against the same operation on CPU (`device='cpu'`) and find the GPU/CPU crossover in wall time
- Repeat for D=5 and D=10 to show how occupancy improves with matrix size
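The profiling sweep could be sketched roughly as below, using `torch.profiler`. This is a minimal sketch, not the final methodology: the `profile_batched_svd` helper name is hypothetical, and it aggregates total self CPU time as a coarse proxy (for the SM-utilization and bandwidth numbers you would still read Nsight output or the profiler's CUDA kernel rows).

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_batched_svd(B: int, D: int, device: str = "cpu") -> float:
    """Run torch.linalg.svd on a (B, D, D) batch under the profiler and
    return total self CPU time in microseconds (a rough proxy; on CUDA,
    inspect the CUDA kernel rows of prof.key_averages() instead)."""
    x = torch.randn(B, D, D, device=device)
    for _ in range(3):               # warm-up: exclude one-time setup costs
        torch.linalg.svd(x)
    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)
        torch.cuda.synchronize()
    with profile(activities=activities) as prof:
        torch.linalg.svd(x)
        if device == "cuda":         # make sure the kernels actually finished
            torch.cuda.synchronize()
    return sum(evt.self_cpu_time_total for evt in prof.key_averages())

if __name__ == "__main__":
    for B in [1, 4, 16, 64, 256]:
        t = profile_batched_svd(B, 3)
        print(f"B={B:5d}  total self time: {t:.0f} us")
```

The sweep would then be repeated with `device="cuda"` and the larger B values from the grid above.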
Expected Deliverables
- Plot: GPU SM utilization vs. B for each D
- Plot: GPU vs. CPU wall time for SVD vs. B -- mark crossover point
- Written threshold recommendation (e.g. "GPU is beneficial for B > ~256 with D=3")
- Documentation PR adding a "Performance Notes" section with this guidance
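For the wall-time crossover deliverable, a plain wall-clock sweep is enough; the kernel-level profiler is only needed for the utilization plots. A hedged sketch (the `svd_wall_time` and `find_crossover` helper names are hypothetical, and the batch grid mirrors the one in the experimental design):

```python
import time
import torch

def svd_wall_time(B: int, D: int, device: str, iters: int = 10) -> float:
    """Median wall time (seconds) of torch.linalg.svd on a (B, D, D) batch."""
    x = torch.randn(B, D, D, device=device)
    torch.linalg.svd(x)              # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        torch.linalg.svd(x)
        if device == "cuda":         # wait for async kernels before stopping the clock
            torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

def find_crossover(D: int, batches=(1, 4, 16, 64, 256, 1024, 4096, 16384)):
    """Smallest B at which the GPU beats the CPU, or None if it never wins.
    Requires a CUDA device."""
    for B in batches:
        if svd_wall_time(B, D, "cuda") < svd_wall_time(B, D, "cpu"):
            return B
    return None
```

Synchronizing before reading the clock matters on CUDA; without it the timings measure only the asynchronous kernel launch, which would make the GPU look artificially fast at every B.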
Notes
- This is PyTorch-focused (CUDA profiling), but findings apply conceptually to JAX/TF as well
- MLX is excluded (CPU-only SVD); covered by separate issue #22: "benchmark: quantify MLX CPU round-trip overhead from stream=mx.cpu SVD"