benchmark: throughput vs. batch size (B) across all frameworks #20

@hunter-heidenreich

Description

Goal

Measure alignment throughput (pairs/sec) as a function of batch size B across all five frameworks (NumPy, PyTorch, JAX, TensorFlow, MLX) for both kabsch and horn.
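For reference, the kabsch operation being timed can be sketched in NumPy. This is a minimal illustration of the batched workload (B pairs of N×D point sets, SVD on tiny D×D matrices), not the library's actual API; the name `kabsch_rmsd` and the (B, N, D) layout are assumptions here:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Batched Kabsch RMSD for paired point sets of shape (B, N, D)."""
    # Center both point clouds (removes translation).
    Pc = P - P.mean(axis=1, keepdims=True)
    Qc = Q - Q.mean(axis=1, keepdims=True)
    # Cross-covariance, then batched SVD on tiny (D, D) matrices.
    H = np.einsum("bni,bnj->bij", Pc, Qc)
    U, _, Vt = np.linalg.svd(H)
    # Correct improper rotations (reflections) via the determinant sign.
    d = np.sign(np.linalg.det(np.einsum("bij,bjk->bik", U, Vt)))
    U[:, :, -1] *= d[:, None]
    R = np.einsum("bij,bjk->bik", U, Vt)
    # Rotate P onto Q and compute per-pair RMSD.
    P_rot = np.einsum("bni,bij->bnj", Pc, R)
    return np.sqrt(((P_rot - Qc) ** 2).sum(axis=(1, 2)) / P.shape[1])
```

The batched SVD on (3, 3) matrices is exactly the step the motivation below identifies as occupancy-bound at large B.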

Motivation

The library's GPU efficiency is heavily dependent on B. For small B, kernel launch overhead dominates; for large B, batched SVD on tiny (3×3) matrices becomes GPU-occupancy-bound. There are no documented recommendations for users about where these crossover points are.

Experimental Design

  • Fix D=3, N=100
  • Sweep B over [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 4096, 16384]
  • For autodiff frameworks: test both eager and JIT-compiled modes (torch.compile, jax.jit, tf.function)
  • Report: wall-clock time per call (median over 100 runs after warmup), throughput in pairs/sec
  • Device: CPU and GPU variants where applicable
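The timing protocol above (median over 100 runs after warmup, throughput in pairs/sec) could be implemented roughly as follows. The `bench` helper is hypothetical, and the workload is a stand-in; on GPU frameworks you would additionally need a device synchronization (e.g. `torch.cuda.synchronize()`) before each clock read:

```python
import statistics
import time

import numpy as np

def bench(fn, warmup=5, runs=100):
    """Median wall-clock seconds per call to fn, after warmup calls."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# Example sweep over batch size B with fixed D=3, N=100.
rng = np.random.default_rng(0)
for B in [1, 32, 1024]:
    P = rng.standard_normal((B, 100, 3))
    t = bench(lambda: P.mean(axis=1), warmup=2, runs=20)  # stand-in workload
    print(f"B={B:5d}  {B / t:.3e} pairs/sec")
```

For JIT-compiled modes, the warmup calls also absorb compilation time, so the median reflects steady-state cost.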

Expected Deliverables

  • A plot: throughput vs. B (log-log), one curve per framework per mode
  • Identification of the B threshold where GPU overtakes CPU for each framework
  • A brief prose summary suitable for a README "Performance Tips" section
  • A benchmarks/ script committed to the repo so results are reproducible
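The GPU-vs-CPU crossover in the second deliverable can be read off the measured curves with a small helper. This is illustrative only; it assumes the sweep produces parallel arrays of batch sizes and throughputs per device:

```python
import numpy as np

def crossover_batch(batch_sizes, cpu_tput, gpu_tput):
    """Smallest B at which GPU throughput exceeds CPU throughput.

    Returns None if the GPU never overtakes the CPU in the sweep.
    """
    batch_sizes = np.asarray(batch_sizes)
    faster = np.asarray(gpu_tput) > np.asarray(cpu_tput)
    if not faster.any():
        return None
    return int(batch_sizes[np.argmax(faster)])
```

A table of these thresholds per framework per mode would slot directly into the README "Performance Tips" summary.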

Open Questions

  • Does MLX (Apple Silicon) show a different crossover point than CUDA frameworks?
  • Does PyTorch torch.compile show meaningful speedup over eager for this workload?

Metadata

Labels

benchmark — Performance measurement or profiling
moderate — Moderate impact, fix when possible