Labels: benchmark (Performance measurement or profiling), moderate (Moderate impact, fix when possible)
Description
Goal
Measure alignment throughput (pairs/sec) as a function of batch size B across all five frameworks (NumPy, PyTorch, JAX, TensorFlow, MLX), for both the `kabsch` and `horn` solvers.
Motivation
The library's GPU efficiency is heavily dependent on B. For small B, kernel launch overhead dominates; for large B, batched SVD on tiny (3×3) matrices becomes GPU-occupancy-bound. There are no documented recommendations for users about where these crossover points are.
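To make the batched-SVD bottleneck concrete, here is a minimal NumPy sketch of a batched Kabsch solve (not the library's actual implementation; `batched_kabsch` is an illustrative name). The SVD runs on B tiny (3×3) cross-covariance matrices, which is exactly the step whose per-matrix overhead and GPU occupancy characteristics drive the crossover behavior described above:

```python
import numpy as np

def batched_kabsch(P, Q):
    """Rotations R of shape (B, 3, 3) best aligning P onto Q, each of shape (B, N, 3)."""
    Pc = P - P.mean(axis=1, keepdims=True)   # center each point cloud
    Qc = Q - Q.mean(axis=1, keepdims=True)
    H = Pc.transpose(0, 2, 1) @ Qc           # (B, 3, 3) cross-covariances
    U, _, Vt = np.linalg.svd(H)              # batched SVD over B tiny matrices
    # Reflection fix: force det(R) = +1 by flipping the last singular vector
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    Vt[:, -1, :] *= d[:, None]
    return Vt.transpose(0, 2, 1) @ U.transpose(0, 2, 1)   # R = V @ diag(1, 1, d) @ U^T
```

At small B the fixed cost of launching the SVD kernel is amortized over few matrices; at large B the work per matrix is too small to keep a GPU occupied, which is the regime this benchmark is meant to map out.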
Experimental Design
- Fix D=3, N=100
- Sweep B over [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 4096, 16384]
- For autodiff frameworks: test both eager and JIT-compiled modes (`torch.compile`, `jax.jit`, `tf.function`)
- Report: wall-clock time per call (median over 100 runs after warmup), throughput in pairs/sec
- Device: CPU and GPU variants where applicable
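The timing protocol above (warmup, then median over 100 runs, converted to pairs/sec) could be sketched framework-agnostically as follows; `time_call` and `throughput` are hypothetical helper names, and `fn` is any callable that aligns one batch:

```python
import time
import statistics

def time_call(fn, *args, warmup=5, runs=100):
    """Median wall-clock seconds per call, after warmup."""
    for _ in range(warmup):
        fn(*args)                 # warmup absorbs JIT compilation / cache effects
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def throughput(fn, B, *args, **kwargs):
    """Pairs aligned per second at batch size B."""
    return B / time_call(fn, *args, **kwargs)
```

Note that for GPU frameworks the timed callable must synchronize the device (e.g. block until results are materialized) or the measured times will reflect only kernel launch, not execution.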
Expected Deliverables
- A plot: throughput vs. B (log-log), one curve per framework per mode
- Identification of the B threshold where GPU overtakes CPU for each framework
- A brief prose summary suitable for a README "Performance Tips" section
- A script under `benchmarks/` committed to the repo so results are reproducible
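For the "B threshold where GPU overtakes CPU" deliverable, one simple definition is the smallest swept batch size at which the GPU throughput curve first exceeds the CPU curve. A sketch (the helper name and its monotone-crossover assumption are mine, not the repo's):

```python
def crossover_B(batch_sizes, cpu_pairs_per_sec, gpu_pairs_per_sec):
    """Smallest batch size where GPU throughput beats CPU, or None if it never does.

    Assumes the three sequences are aligned and batch_sizes is ascending.
    """
    for B, cpu_tp, gpu_tp in zip(batch_sizes, cpu_pairs_per_sec, gpu_pairs_per_sec):
        if gpu_tp > cpu_tp:
            return B
    return None
```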
Open Questions
- Does MLX (Apple Silicon) show a different crossover point than CUDA frameworks?
- Does PyTorch's `torch.compile` show meaningful speedup over eager mode for this workload?