# Performance Benchmarks

## Methodology

### Test Environment

- **Python**: 3.11
- **GPU**: NVIDIA GeForce RTX 4090
- **CPU**: AMD Ryzen Threadripper PRO 7975WX 32-Cores
- **OS**: Linux 6.12.10-76061203-generic

### What We Measure

**Kernel Performance Only**: All benchmarks measure only the core delay-and-sum beamforming kernel execution time. We explicitly **exclude**:
- CPU↔GPU memory transfers (these depend on PCIe bandwidth and motherboard, and pre-/post-processing steps may keep data on the GPU anyway)
- JIT compilation (amortized over multiple function calls)
- Data format reshaping (varies by scanner or file format)
- Pre-processing and post-processing steps (bit-unpacking, demodulation, clutter filtering, etc.; varies by ultrasound sequence)

The benchmark focuses on the most compute- and memory-bandwidth-intensive part of beamforming rather than system-specific overheads.
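
The measurement approach can be sketched with a minimal timing harness. This is a plain-Python illustration: `beamform_kernel` and `median_runtime` are hypothetical names (not mach's API), and on a real GPU each timed call would also need a device synchronization before reading the clock.

```python
import statistics
import time

def beamform_kernel(data):
    # Hypothetical stand-in for the GPU kernel; any deterministic work suffices.
    return sum(x * x for x in data)

def median_runtime(fn, arg, warmup=3, repeats=20):
    """Median wall-clock runtime of fn(arg), excluding warm-up calls.

    Warm-up iterations absorb one-time costs (JIT compilation, cache
    population) so they are not attributed to the kernel, mirroring
    the exclusions listed above.
    """
    for _ in range(warmup):
        fn(arg)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

print(f"median: {median_runtime(beamform_kernel, list(range(10_000))) * 1e3:.3f} ms")
```

Reporting the median (rather than the mean) makes the figure robust to scheduling noise and occasional slow outliers.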

### Benchmark Dataset

We use PyMUST's [rotating-disk Doppler dataset](https://github.com/creatis-ULTIM/PyMUST/blob/170ba68/examples/rotatingDisk_real.ipynb) as our primary benchmark:

- **128 receive elements** (L7-4 linear array)
- **63,001 voxels** (25 mm × 25 mm grid, 0.1 mm spacing)
- **32 frames** (temporal ensemble)

This represents a realistic ultrafast imaging workload, although it is a small microbenchmark compared to the 3D, high-channel-count datasets we're actually interested in.

## Performance Results

| Implementation | Median Runtime | Points/Second | "Mach factor" |
|----------------|----------------|---------------|---------------|
| **mach (GPU)** | **0.3 ms** | **8.56 × 10¹¹** | **5.0×** |
| Speed of sound (35 mm) | 1.5 ms | — | 1× |
| vbeam (JAX/GPU) | 2.8 ms | 9.04 × 10¹⁰ | 0.5× |
| PyMUST (CPU) | 298.4 ms | 8.64 × 10⁸ | 0.005× |

### What does "beamforming at the speed of sound" even mean?

The **speed of sound** ("Mach 1") represents the theoretical minimum time required for ultrasound waves to travel to the deepest imaging point and back, multiplied by the number of frames in the dataset.
This is specific to each imaging scenario. For our benchmark with PyMUST's [rotating-disk Doppler dataset](https://github.com/creatis-ULTIM/PyMUST/blob/170ba68/examples/rotatingDisk_real.ipynb):

- **Maximum imaging depth**: 35 mm
- **Speed of sound in rotating disk**: 1,480 m/s
- **Round-trip time**: 2 × 0.035 m ÷ 1,480 m/s = 47.3 μs per frame
- **Total for 32 frames**: 47.3 μs × 32 = **1.5 ms**
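
The arithmetic above can be checked in a few lines (the 0.3 ms runtime is taken from the results table):

```python
# Physical parameters of the rotating-disk benchmark
depth_m = 0.035          # maximum imaging depth: 35 mm
speed_of_sound = 1480.0  # m/s in the rotating disk
n_frames = 32

round_trip_s = 2 * depth_m / speed_of_sound  # acoustic round trip per frame
mach1_s = round_trip_s * n_frames            # "Mach 1" budget for the ensemble

mach_runtime_s = 0.3e-3                      # mach's median runtime (results table)
mach_factor = mach1_s / mach_runtime_s

print(f"round trip:    {round_trip_s * 1e6:.1f} us/frame")  # 47.3 us/frame
print(f"Mach 1 budget: {mach1_s * 1e3:.2f} ms")             # 1.51 ms
print(f"Mach factor:   {mach_factor:.1f}x")                 # 5.0x
```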

mach's processing time depends on various factors (see [Computational Complexity](#computational-complexity)),
so the "Mach 5" description is specific to this benchmark.

## Benchmark Reproduction

All benchmarks can be reproduced using the included test suite:

```bash
# Run full benchmark suite
make benchmark

# Generate performance plots
uv run --group compare tests/plot_benchmark.py --output assets/benchmark-doppler_disk.svg
uv run --group compare tests/plot_benchmark.py --points-per-second --output assets/benchmark-doppler_disk_pps.svg
```

The benchmark job in our CI pipeline ([`test_gpu.yml`](https://github.com/Forest-Neurotech/mach/blob/main/.github/workflows/test_gpu.yml)) automatically runs these benchmarks across commits, providing continuous performance monitoring.

## CUDA Optimizations

mach optimizes GPU memory access patterns to improve performance. For those interested in learning more about CUDA optimization, excellent resources include:

- [CUDA Crash Course](https://github.com/CoffeeBeforeArch/cuda_programming/) by CoffeeBeforeArch
- [How CUDA Programming Works](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41487/) - CUDA architect presentation on CUDA best practices
- [Optimizing Parallel Reduction in CUDA](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf) - NVIDIA example of optimizing a different algorithm

### Key Optimizations in mach

#### 1. **Coalesced Memory Access**
- Channel data organized as `[n_receive_elements, n_samples, n_frames]`
- Frames dimension is contiguous for [coalesced access](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/) to reduce global memory reads
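
The layout requirement can be illustrated with NumPy (array names are illustrative; the point is only the axis order and C-contiguity):

```python
import numpy as np

n_elements, n_samples, n_frames = 128, 1024, 32

# C-ordered array: the last axis (frames) is contiguous in memory, so the
# threads of a warp reading consecutive frames of one (element, sample)
# pair hit consecutive addresses — a coalesced global-memory read.
channel_data = np.zeros((n_elements, n_samples, n_frames), dtype=np.complex64)
assert channel_data.flags["C_CONTIGUOUS"]

# Stride (in elements, not bytes) between consecutive frames is 1:
frame_stride = channel_data.strides[-1] // channel_data.itemsize
print(frame_stride)  # → 1

# A frames-first view obtained by transposing is NOT contiguous:
transposed = channel_data.transpose(2, 0, 1)
assert not transposed.flags["C_CONTIGUOUS"]
```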

#### 2. **Pre-computed Transmit Wavefront Arrivals**
- Pre-compute transmit arrival times to amortize delay calculation across repeated kernel calls
- However, the full transmit+receive delay matrix cannot be pre-computed (as PyMUST does) for large datasets: it would require `transmits × voxels × channels × size(float)` bytes of memory

#### 3. **Shared Memory for Delay Tables**
- Transmit-to-receive delays computed only once per voxel (not 32× for 32 frames)
- Delay and apodization tables cached in shared memory
- Reused across all frames for each voxel

### Memory Bandwidth and Compute Utilization

*[Stub: Detailed analysis of memory bandwidth utilization and compute efficiency using Nsight Compute profiling results]*

*[Analysis of achieved bandwidth and compute utilization to be added]*

## Scaling Performance

### Computational Complexity

The beamforming algorithm scales as:
```
O(n_voxels × n_elements × n_frames)
```

For the PyMUST dataset: `63,001 voxels × 128 elements × 32 frames ≈ 2.6 × 10⁸` points
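
Plugging in the benchmark dimensions gives a quick sanity check of the throughput figures in the results table:

```python
n_voxels, n_elements, n_frames = 63_001, 128, 32

points = n_voxels * n_elements * n_frames
print(f"{points:.2e} points")  # 2.58e+08

# Implied throughput at the 0.3 ms median runtime (rounded, so this lands
# slightly above the table's 8.56 × 10¹¹ figure):
runtime_s = 0.3e-3
print(f"{points / runtime_s:.2e} points/second")
```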

### GPU Memory (VRAM) Usage

mach allocates GPU memory only for input arguments and the output result; no intermediate arrays are created during computation, simplifying memory management.

Total GPU memory usage scales as:
```
O(n_voxels × n_frames + n_elements × n_samples × n_frames)
```

Where:
- **First term** (`n_voxels × n_frames`): Output array size—dominates for large imaging grids
- **Second term** (`n_elements × n_samples × n_frames`): Input data size—dominates for high-channel-count systems

#### Memory Usage Example

Here is an example functional ultrasound imaging (fUSI) workload:
- **Imaging grid**: 100 × 100 × 100 voxels (1M points)
- **Temporal frames**: 200 frames
- **Matrix probe**: 1024 elements
- **Samples per channel**: 100
- **Data type**: `complex64` (8 bytes per sample)

**Memory breakdown:**
```
channel_data:       (1024, 100, 200) → 164 MB
rx_coords_m:        (1024, 3)        → 12 KB
scan_coords_m:      (1M, 3)          → 12 MB
tx_wave_arrivals_s: (1M,)            → 4 MB
out:                (1M, 200)        → 1.6 GB
                                       ─────────
Total GPU memory:                      ~1.78 GB
```

In this example, the output array (`out`) represents 90% of memory usage, demonstrating how large imaging grids dominate memory requirements for volumetric datasets.
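
The breakdown can be reproduced with a short estimator (array names follow the breakdown above; sizes use decimal MB/GB, matching the table, and coordinates/arrival times are assumed `float32`):

```python
ITEMSIZE = {"complex64": 8, "float32": 4}

def nbytes(shape, dtype="float32"):
    """Size in bytes of an array with the given shape and dtype."""
    n = 1
    for dim in shape:
        n *= dim
    return n * ITEMSIZE[dtype]

n_voxels, n_frames = 100 * 100 * 100, 200
n_elements, n_samples = 1024, 100

arrays = {
    "channel_data": nbytes((n_elements, n_samples, n_frames), "complex64"),
    "rx_coords_m": nbytes((n_elements, 3)),
    "scan_coords_m": nbytes((n_voxels, 3)),
    "tx_wave_arrivals_s": nbytes((n_voxels,)),
    "out": nbytes((n_voxels, n_frames), "complex64"),
}

total = sum(arrays.values())
print(f"total: {total / 1e9:.2f} GB")                # 1.78 GB
print(f"out fraction: {arrays['out'] / total:.0%}")  # 90%
```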

### Performance Scaling with Dataset Size

*[Figure placeholder: Performance scaling with number of voxels]*

*[Figure placeholder: Performance scaling with number of receive elements]*

*[Figure placeholder: Performance scaling with ensemble size (frames)]*

## Large-Scale Dataset Performance

### Practical Workloads

*[Stub: Performance analysis on larger, more realistic datasets]*

Typical functional ultrasound imaging (fUSI) datasets we're targeting:
- **1024+ receive elements** (high-density arrays)
- **1M+ voxels** (volumetric or high-resolution imaging)
- **100+ frames** (longer temporal windows)
- **Real-time processing** requirements

*[Figure placeholder: Performance scaling results]*

### Memory Scaling Characteristics

*[Stub: Analysis of memory usage patterns and optimization strategies for large datasets]*

## Performance Optimization Guide

### For Maximum Throughput

1. **Keep data on GPU**: Use CuPy/JAX arrays to avoid CPU↔GPU transfers
2. **Use sufficient ensemble size**: Use ≥16 frames for `complex64` or ≥32 frames for `float32` to fully coalesce reads to global memory
3. **Ensure contiguous frame dimension**: The kernel requires frame-contiguous memory layouts
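
As a sketch, the first and third tips can be combined into a small preparation helper. Everything here is illustrative, not mach API: `prepare_channel_data` and its warning threshold are assumptions, and `xp` can be swapped for CuPy to keep arrays on the GPU.

```python
import numpy as np

def prepare_channel_data(host_data, xp=np):
    """Illustrative helper (not mach API): move data once and fix layout.

    Pass xp=cupy to place the array on the GPU; with xp=numpy this
    just demonstrates the layout checks.
    """
    data = xp.asarray(host_data)  # with CuPy, a single host-to-device copy
    # Tip 3: the trailing (frames) axis must be contiguous for coalesced reads.
    if not data.flags.c_contiguous:
        data = xp.ascontiguousarray(data)
    # Tip 2: small ensembles cannot fully coalesce global-memory reads.
    min_frames = 16 if data.dtype == xp.complex64 else 32
    if data.shape[-1] < min_frames:
        print(f"warning: fewer than {min_frames} frames may under-utilize bandwidth")
    return data

# Usage: a transposed view is not contiguous until prepared.
raw = np.zeros((128, 1024, 32), dtype=np.complex64).transpose(1, 0, 2)
ready = prepare_channel_data(raw)
assert ready.flags.c_contiguous and ready.shape == (1024, 128, 32)
```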