Labels: benchmark (Performance measurement or profiling), moderate (Moderate impact, fix when possible)
Description
Goal
Measure alignment throughput (pairs/sec) as a function of batch size B across all five frameworks (NumPy, PyTorch, JAX, TensorFlow, MLX), for both the `kabsch` and `horn` solvers.
Motivation
The library's GPU efficiency is heavily dependent on B. For small B, kernel launch overhead dominates; for large B, batched SVD on tiny (3×3) matrices becomes GPU-occupancy-bound. There are no documented recommendations for users about where these crossover points are.
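To make the batched-SVD bottleneck concrete, here is a minimal NumPy sketch of a batched Kabsch solve (not the library's actual implementation; `batched_kabsch` is an illustrative name). The SVD runs on B tiny (3×3) cross-covariance matrices, which is exactly the step whose per-matrix overhead and GPU occupancy characteristics drive the crossover behavior described above:

```python
import numpy as np

def batched_kabsch(P, Q):
    """Rotations R of shape (B, 3, 3) best aligning P onto Q, each of shape (B, N, 3)."""
    Pc = P - P.mean(axis=1, keepdims=True)   # center each point cloud
    Qc = Q - Q.mean(axis=1, keepdims=True)
    H = Pc.transpose(0, 2, 1) @ Qc           # (B, 3, 3) cross-covariances
    U, _, Vt = np.linalg.svd(H)              # batched SVD over B tiny matrices
    # Reflection fix: force det(R) = +1 by flipping the last singular vector
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    Vt[:, -1, :] *= d[:, None]
    return Vt.transpose(0, 2, 1) @ U.transpose(0, 2, 1)   # R = V @ diag(1, 1, d) @ U^T
```

At small B the fixed cost of launching the SVD kernel is amortized over few matrices; at large B the work per matrix is too small to keep a GPU occupied, which is the regime this benchmark is meant to map out.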
Experimental Design
- Fix D=3, N=100
- Sweep B over [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 4096, 16384]
- For autodiff frameworks: test both eager and JIT-compiled modes (`torch.compile`, `jax.jit`, `tf.function`)
- Report: wall-clock time per call (median over 100 runs after warmup), throughput in pairs/sec
- Device: CPU and GPU variants where applicable
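The timing protocol above (warmup, then median over 100 runs, converted to pairs/sec) could be sketched framework-agnostically as follows; `time_call` and `throughput` are hypothetical helper names, and `fn` is any callable that aligns one batch:

```python
import time
import statistics

def time_call(fn, *args, warmup=5, runs=100):
    """Median wall-clock seconds per call, after warmup."""
    for _ in range(warmup):
        fn(*args)                 # warmup absorbs JIT compilation / cache effects
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def throughput(fn, B, *args, **kwargs):
    """Pairs aligned per second at batch size B."""
    return B / time_call(fn, *args, **kwargs)
```

Note that for GPU frameworks the timed callable must synchronize the device (e.g. block until results are materialized) or the measured times will reflect only kernel launch, not execution.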
Expected Deliverables
- A plot: throughput vs. B (log-log), one curve per framework per mode
- Identification of the B threshold where GPU overtakes CPU for each framework
- A brief prose summary suitable for a README "Performance Tips" section
- A script under `benchmarks/` committed to the repo so results are reproducible
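For the "B threshold where GPU overtakes CPU" deliverable, one simple definition is the smallest swept batch size at which the GPU throughput curve first exceeds the CPU curve. A sketch (the helper name and its monotone-crossover assumption are mine, not the repo's):

```python
def crossover_B(batch_sizes, cpu_pairs_per_sec, gpu_pairs_per_sec):
    """Smallest batch size where GPU throughput beats CPU, or None if it never does.

    Assumes the three sequences are aligned and batch_sizes is ascending.
    """
    for B, cpu_tp, gpu_tp in zip(batch_sizes, cpu_pairs_per_sec, gpu_pairs_per_sec):
        if gpu_tp > cpu_tp:
            return B
    return None
```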
Open Questions
- Does MLX (Apple Silicon) show a different crossover point than CUDA frameworks?
- Does PyTorch's `torch.compile` show meaningful speedup over eager mode for this workload?