Open
Labels
- benchmark: Performance measurement or profiling
- framework:mlx: MLX-specific issue
- moderate: Moderate impact, fix when possible
- performance: Runtime performance improvement
Description
Goal
Quantify the wall-clock overhead introduced by forcing SVD to the CPU in mlx/kabsch_svd_nd.py and mlx/horn_quat_3d.py.
Background
Both MLX modules pin SVD to the CPU:
# mlx/kabsch_svd_nd.py:11
U, S, Vt = mx.linalg.svd(A, stream=mx.cpu)
This is necessary because MLX's GPU backend does not implement SVD. However, when the surrounding computation runs on the GPU, each call forces a synchronization point and a CPU dispatch. Unified memory on Apple Silicon does not remove this stream-switching overhead. The magnitude of the penalty is currently unknown.
Experimental Design
Isolate and measure three things:
1. SVD step alone: time mx.linalg.svd(H, stream=mx.cpu) for varying B (on shapes [B, 3, 3])
2. Full kabsch call: total wall-clock time including SVD
3. Hypothetical GPU SVD: substitute a no-op (or identity) in place of SVD to measure the non-SVD portion
From (1) and (2), compute the fraction of total time spent in the forced-CPU SVD step as a function of B.
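A minimal sketch of how measurement (1) and the fraction could be timed. Because MLX evaluates lazily, the timed region has to force evaluation (here via `mx.eval`), otherwise only graph construction is measured. The helper names `time_call`, `svd_fraction`, and `bench_mlx_svd`, and the warmup and repeat counts, are assumptions for illustration and not part of the repository.

```python
import statistics
import time


def time_call(fn, sync=lambda out: out, warmup=3, repeats=20):
    """Median wall-clock seconds for fn(), forcing evaluation via sync(result)."""
    for _ in range(warmup):
        sync(fn())  # warm caches and one-time setup costs before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        sync(fn())  # the sync call is inside the timed region on purpose
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def svd_fraction(t_svd, t_full):
    """Fraction of the full call spent in SVD: measurement (1) / measurement (2)."""
    return t_svd / t_full


def bench_mlx_svd(batch_sizes=(1, 16, 256, 4096)):
    """Measurement (1): latency of the CPU-pinned SVD for each batch size B."""
    import mlx.core as mx  # imported lazily so the helpers above work without MLX

    results = {}
    for b in batch_sizes:
        H = mx.random.normal((b, 3, 3))
        mx.eval(H)  # materialize the input so its creation is not timed
        results[b] = time_call(
            lambda: mx.linalg.svd(H, stream=mx.cpu),
            sync=lambda out: mx.eval(*out),
        )
    return results
```

The full `kabsch` call for measurement (2) can be passed to `time_call` in the same way, again with `mx.eval` on its outputs as the `sync` argument.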
Expected Deliverables
- Plot: SVD fraction of total call time vs. B
- Absolute numbers: SVD latency for B = [1, 16, 256, 4096]
- Written assessment: at what B does the CPU SVD become the clear bottleneck?
- Recommendation for the docs on the known limitation and any workarounds (e.g. batching strategy)
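One possible way to turn the raw timings into the deliverables above is sketched here. The 0.5 threshold for "clear bottleneck" is an assumption chosen for illustration; the issue does not define one. The function names are hypothetical as well.

```python
def latency_table(svd_seconds):
    """Format per-B SVD latencies (given in seconds) as a plain-text table in ms."""
    lines = [f"{'B':>6}  {'SVD latency (ms)':>18}"]
    for b in sorted(svd_seconds):
        lines.append(f"{b:>6}  {svd_seconds[b] * 1e3:>18.3f}")
    return "\n".join(lines)


def first_bottleneck_batch(fractions, threshold=0.5):
    """Smallest B whose SVD fraction reaches the threshold, or None if none does.

    `fractions` maps B to (SVD time / full call time). The default threshold
    is an assumed cutoff, not one taken from the issue.
    """
    for b in sorted(fractions):
        if fractions[b] >= threshold:
            return b
    return None
```

Both functions take dictionaries keyed by B, so they can consume the measured timings directly once the benchmark has been run on M-series hardware.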
Notes
- This benchmark should be run on Apple Silicon (M-series) hardware where MLX is the intended target
- If/when MLX adds GPU SVD support, this benchmark will serve as the baseline for measuring the improvement