benchmark: quantify MLX CPU round-trip overhead from stream=mx.cpu SVD #22

@hunter-heidenreich

Description

Goal

Quantify the wall-clock overhead introduced by forcing SVD to the CPU in mlx/kabsch_svd_nd.py and mlx/horn_quat_3d.py.

Background

Both MLX modules pin SVD to the CPU:

# mlx/kabsch_svd_nd.py:11
U, S, Vt = mx.linalg.svd(A, stream=mx.cpu)

This is necessary because MLX's GPU backend does not implement SVD. However, even on Apple Silicon's unified memory, where no physical copy is required, switching to the CPU stream forces a synchronization point and a CPU dispatch on every call for inputs otherwise resident on the GPU stream. The magnitude of this penalty is currently unknown.

Experimental Design

Isolate and measure three things:

  1. SVD step alone: time mx.linalg.svd(H, stream=mx.cpu) in isolation for inputs of shape [B, 3, 3] across a range of batch sizes B
  2. Full kabsch call: total wall-clock time including SVD
  3. Hypothetical GPU SVD: substitute a no-op (or identity) in place of SVD to measure the non-SVD portion

From (1) and (2), compute the fraction of total time spent in the forced-CPU SVD step as a function of B.
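A minimal harness for steps (1) and (3) could look like the sketch below. It uses NumPy as a portable stand-in so it runs anywhere; on the target M-series hardware, replace np.linalg.svd with mx.linalg.svd(H, stream=mx.cpu) and wrap each timed call in mx.eval to force evaluation (MLX is lazy, so untimed graph construction would otherwise make the numbers meaningless). The identity_svd helper is a hypothetical stand-in introduced here, not part of the repo.

```python
import time
import numpy as np

def median_time(fn, repeats=20, warmup=3):
    """Median wall-clock seconds of fn() over `repeats` timed runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[len(samples) // 2]

def identity_svd(H):
    """Stand-in for a hypothetical GPU SVD: returns identity factors of the
    right shapes, so only dispatch/allocation cost remains (step 3)."""
    B = H.shape[0]
    eye = np.broadcast_to(np.eye(3), (B, 3, 3))
    return eye, np.ones((B, 3)), eye

rng = np.random.default_rng(0)
for B in (1, 16, 256, 4096):
    H = rng.normal(size=(B, 3, 3))
    t_svd = median_time(lambda: np.linalg.svd(H))   # step (1): SVD alone
    t_noop = median_time(lambda: identity_svd(H))   # step (3): non-SVD floor
    frac = t_svd / (t_svd + t_noop)                 # proxy for SVD fraction
    print(f"B={B:5d}  svd={t_svd * 1e6:9.1f} us  fraction={frac:.2f}")
```

Step (2), the full kabsch call, would be timed the same way with median_time wrapped around the real entry point in mlx/kabsch_svd_nd.py.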

Expected Deliverables

  • Plot: SVD fraction of total call time vs. B
  • Absolute numbers: SVD latency for B = [1, 16, 256, 4096]
  • Written assessment: at what B does the CPU SVD become the clear bottleneck?
  • Recommendation for the docs on the known limitation and any workarounds (e.g. batching strategy)
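For the batching workaround, the docs recommendation could be illustrated with a sketch like this (again with NumPy as a stand-in; on device, both functions would call mx.linalg.svd with stream=mx.cpu). The point is that one batched dispatch over [B, 3, 3] pays the forced CPU round trip once, while a per-matrix loop pays it B times:

```python
import numpy as np

def svd_per_call(matrices):
    # One SVD dispatch per 3x3 matrix: under stream=mx.cpu, each call
    # would pay its own GPU->CPU synchronization.
    return [np.linalg.svd(H) for H in matrices]

def svd_batched(matrices):
    # Stack into [B, 3, 3] and dispatch once: the forced CPU round trip
    # is paid a single time and amortized over the whole batch.
    return np.linalg.svd(np.stack(matrices))
```

Whether this pays off below some B is exactly what the benchmark should establish.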

Notes

  • This benchmark should be run on Apple Silicon (M-series) hardware where MLX is the intended target
  • If/when MLX adds GPU SVD support, this benchmark will serve as the baseline for measuring the improvement

Metadata

Labels: benchmark (performance measurement or profiling), framework:mlx (MLX-specific issue), moderate (moderate impact, fix when possible), performance (runtime performance improvement)
