Commit ee9c1fa

authored
[doc] Add benchmark and profiling notes (#26)
* More updates * Add placeholder * Add benchmark.rst page * Fix lint * Add an explanation about the mach factor * Add a memory-usage example --------- Co-authored-by: Charles Guan <3221512+charlesincharge@users.noreply.github.com>
1 parent 6cf35b6 commit ee9c1fa

File tree

4 files changed

+191
-1
lines changed


BENCHMARKS.md

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
# Performance Benchmarks

## Methodology

### Test Environment

- **Python**: 3.11
- **GPU**: NVIDIA GeForce RTX 4090
- **CPU**: AMD Ryzen Threadripper PRO 7975WX (32 cores)
- **OS**: Linux 6.12.10-76061203-generic

### What We Measure

**Kernel Performance Only**: All benchmarks measure only the core delay-and-sum beamforming kernel execution time. We explicitly **exclude**:

- CPU↔GPU memory transfers (these depend on PCIe bandwidth, motherboard, etc., and pre-/post-processing steps may keep data on the GPU anyway)
- JIT compilation (amortized over multiple function calls)
- Data format reshaping (varies by scanner or file format)
- Pre-processing and post-processing steps (bit-unpacking, demodulation, clutter filtering, etc.; varies by ultrasound sequence)

The benchmark focuses on the most compute-/memory-bandwidth-intensive part of beamforming rather than system-specific overheads.
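This kernel-only methodology can be sketched in a few lines. The harness below is an illustrative stand-in (not mach's benchmark code): warm-up calls amortize JIT compilation, and inputs are allocated once up front so data movement stays outside the timed region. On a real GPU you would additionally synchronize the device around each timed call.

```python
import time
import numpy as np

def time_kernel(kernel, *args, warmup=3, reps=20):
    """Median wall-clock time of the kernel body only.
    Warm-up calls amortize JIT compilation; inputs are prepared once
    beforehand so host<->device transfers are excluded. On a GPU,
    also synchronize the device before reading the timer."""
    for _ in range(warmup):
        kernel(*args)
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel(*args)
        samples.append(time.perf_counter() - t0)
    return float(np.median(samples))

# CPU stand-in for the beamforming kernel: a reduction over the
# elements axis of a frames-contiguous [elements, samples, frames] array
data = np.random.rand(128, 1000, 32).astype(np.float32)
median_s = time_kernel(lambda x: x.sum(axis=0), data)
```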

### Benchmark Dataset

We use PyMUST's [rotating-disk Doppler dataset](https://github.com/creatis-ULTIM/PyMUST/blob/170ba68/examples/rotatingDisk_real.ipynb) as our primary benchmark:

- **128 receive elements** (L7-4 linear array)
- **63,001 voxels** (25 mm × 25 mm grid, 0.1 mm spacing)
- **32 frames** (temporal ensemble)

This represents a realistic ultrafast imaging workload, although it is a small microbenchmark compared to the 3D, high-channel-count datasets we're actually interested in.

## Performance Results

![Benchmark Results](assets/benchmark-doppler_disk.svg)

| Implementation | Median Runtime | Points/Second | "Mach factor" |
|----------------|----------------|-----------------|---------------|
| **mach (GPU)** | **0.3 ms** | **8.56 × 10¹¹** | **5.0×** |
| Speed-of-sound (35 mm) | 1.5 ms | — | 1.0× |
| vbeam (JAX/GPU) | 2.8 ms | 9.04 × 10¹⁰ | 0.5× |
| PyMUST (CPU) | 298.4 ms | 8.64 × 10⁸ | 0.005× |

### What does "beamforming at the speed of sound" even mean?

The **speed-of-sound** baseline ("Mach 1") represents the theoretical minimum time required for ultrasound waves to travel to the deepest imaging point and back, multiplied by the number of frames in the dataset.
This is specific to each imaging scenario. For our benchmark with PyMUST's [rotating-disk Doppler dataset](https://github.com/creatis-ULTIM/PyMUST/blob/170ba68/examples/rotatingDisk_real.ipynb):

- **Maximum imaging depth**: 35 mm
- **Speed of sound in rotating disk**: 1,480 m/s
- **Round-trip time**: 2 × 0.035 m ÷ 1,480 m/s = 47.3 μs per frame
- **Total for 32 frames**: 47.3 μs × 32 = **1.5 ms**
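
The arithmetic above can be checked directly (the 0.3 ms runtime is mach's median from the results table):

```python
depth_m = 0.035              # maximum imaging depth
c_m_per_s = 1480.0           # speed of sound in the rotating disk
n_frames = 32
measured_runtime_s = 0.3e-3  # mach's median runtime

round_trip_s = 2 * depth_m / c_m_per_s  # per-frame round trip
budget_s = round_trip_s * n_frames      # "Mach 1" budget for the ensemble
mach_factor = budget_s / measured_runtime_s

print(f"{round_trip_s * 1e6:.1f} µs, {budget_s * 1e3:.2f} ms, Mach {mach_factor:.1f}")
# → 47.3 µs, 1.51 ms, Mach 5.0
```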

mach's processing time depends on various factors (see [Computational Complexity](#computational-complexity)),
so the "Mach 5" description is specific to this benchmark.

## Benchmark Reproduction

All benchmarks can be reproduced using the included test suite:

```bash
# Run full benchmark suite
make benchmark

# Generate performance plots
uv run --group compare tests/plot_benchmark.py --output assets/benchmark-doppler_disk.svg
uv run --group compare tests/plot_benchmark.py --points-per-second --output assets/benchmark-doppler_disk_pps.svg
```

The benchmark job in our CI pipeline ([`test_gpu.yml`](https://github.com/Forest-Neurotech/mach/blob/main/.github/workflows/test_gpu.yml)) automatically runs these benchmarks across different commits, providing continuous performance monitoring.

## CUDA Optimizations

mach optimizes GPU memory access patterns to improve performance. For those interested in learning more about CUDA optimization, excellent resources include:

- [CUDA Crash Course](https://github.com/CoffeeBeforeArch/cuda_programming/) by CoffeeBeforeArch
- [How CUDA Programming Works](https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41487/) - a CUDA architect's presentation on CUDA best practices
- [Optimizing Parallel Reduction in CUDA](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf) - NVIDIA's example of optimizing a different algorithm

### Key Optimizations in mach

#### 1. **Coalesced Memory Access**
- Channel data organized as `[n_receive_elements, n_samples, n_frames]`
- Frames dimension is contiguous for [coalesced access](https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/) to reduce global memory reads

#### 2. **Pre-computed Transmit Wavefront Arrivals**
- Pre-compute transmit arrival times to amortize delay calculation across repeated kernel calls
- However, the full transmit+receive delay matrix (which PyMUST pre-computes) cannot be pre-computed for large datasets: it would require `transmits × voxels × channels × size(float)` bytes of memory
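
A back-of-envelope calculation shows why. The acquisition sizes below are illustrative fUSI-scale numbers, not the benchmark dataset:

```python
# Illustrative large acquisition -- not the benchmark dataset
n_transmits = 16
n_voxels = 1_000_000
n_channels = 1024
float_bytes = 4

# Full transmit+receive delay matrix (PyMUST-style precomputation):
full_matrix_bytes = n_transmits * n_voxels * n_channels * float_bytes
print(full_matrix_bytes / 1e9)  # ~66 GB: exceeds most GPUs' VRAM

# Precomputed transmit arrivals only (one float per transmit per voxel):
arrival_bytes = n_transmits * n_voxels * float_bytes
print(arrival_bytes / 1e6)  # 64 MB
```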

#### 3. **Shared Memory for Delay Tables**
- Transmit-to-receive delays computed only once per voxel (not 32× for 32 frames)
- Delay and apodization tables cached in shared memory
- Reused across all frames for each voxel

### Memory Bandwidth and Compute Utilization

*[Stub: Detailed analysis of memory bandwidth utilization and compute efficiency using Nsight Compute profiling results]*

*[Analysis of achieved bandwidth and compute utilization to be added]*

## Scaling Performance

### Computational Complexity

The beamforming algorithm scales as:
```
O(n_voxels × n_elements × n_frames)
```

For the PyMUST dataset: `63,001 voxels × 128 elements × 32 frames ≈ 2.6 × 10⁸` points
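
The point count and the implied throughput can be verified in a couple of lines (the table's 8.56 × 10¹¹ figure uses the unrounded runtime):

```python
n_voxels, n_elements, n_frames = 63_001, 128, 32
points = n_voxels * n_elements * n_frames
print(points)  # 258052096, i.e. ~2.6e8

# Implied throughput at the rounded 0.3 ms median runtime:
points_per_second = points / 0.3e-3
print(f"{points_per_second:.2e}")  # 8.60e+11
```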

### GPU Memory (VRAM) Usage

mach allocates GPU memory only for input arguments and the output result; no intermediate arrays are created during computation, simplifying memory management.

Total GPU memory usage scales as:
```
O(n_voxels × n_frames + n_elements × n_samples × n_frames)
```

Where:
- **First term** (`n_voxels × n_frames`): Output array size—dominates for large imaging grids
- **Second term** (`n_elements × n_samples × n_frames`): Input data size—dominates for high-channel-count systems

#### Memory Usage Example

Here is an example functional ultrasound imaging (fUSI) workload:
- **Imaging grid**: 100×100×100 voxels (1M points)
- **Temporal frames**: 200 frames
- **Matrix probe**: 1024 elements
- **Samples per channel**: 100
- **Data type**: `complex64` (8 bytes per sample)

**Memory breakdown:**
```
channel_data:       (1024, 100, 200) → 164 MB
rx_coords_m:        (1024, 3)        → 12 KB
scan_coords_m:      (1M, 3)          → 12 MB
tx_wave_arrivals_s: (1M,)            → 4 MB
out:                (1M, 200)        → 1.6 GB
                                      ─────────
Total GPU memory:                     ~1.78 GB
```

In this example, the output array (`out`) represents 90% of memory usage, demonstrating how large imaging grids dominate memory requirements for volumetric datasets.
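
The breakdown can be reproduced directly from the array shapes and element sizes:

```python
COMPLEX64, FLOAT32 = 8, 4   # bytes per element
n_elements, n_samples, n_frames = 1024, 100, 200
n_voxels = 100 * 100 * 100  # 1M imaging points

sizes = {
    "channel_data":       n_elements * n_samples * n_frames * COMPLEX64,
    "rx_coords_m":        n_elements * 3 * FLOAT32,
    "scan_coords_m":      n_voxels * 3 * FLOAT32,
    "tx_wave_arrivals_s": n_voxels * FLOAT32,
    "out":                n_voxels * n_frames * COMPLEX64,
}
total = sum(sizes.values())
print(round(total / 1e9, 2))           # 1.78 (GB)
print(round(sizes["out"] / total, 2))  # 0.9 -- the output dominates
```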

### Performance Scaling with Dataset Size

*[Figure placeholder: Performance scaling with number of voxels]*

*[Figure placeholder: Performance scaling with number of receive elements]*

*[Figure placeholder: Performance scaling with ensemble size (frames)]*

## Large-Scale Dataset Performance

### Practical Workloads

*[Stub: Performance analysis on larger, more realistic datasets]*

Typical functional ultrasound imaging (fUSI) datasets we're targeting:
- **1024+ receive elements** (high-density arrays)
- **1M+ voxels** (volumetric or high-resolution imaging)
- **100+ frames** (longer temporal windows)
- **Real-time processing** requirements

*[Figure placeholder: Performance scaling results]*

### Memory Scaling Characteristics

*[Stub: Analysis of memory usage patterns and optimization strategies for large datasets]*

## Performance Optimization Guide

### For Maximum Throughput

1. **Keep data on GPU**: Use CuPy/JAX arrays to avoid CPU↔GPU transfers
2. **Use sufficient ensemble size**: Use ≥16 frames for complex64 or ≥32 frames for float32 to fully coalesce reads to global memory
3. **Ensure contiguous frame dimension**: The kernel requires frame-contiguous memory layouts.
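
Point 3 can be checked (and fixed) before calling the kernel. The helper below is illustrative, not mach API: the frames axis is unit-stride exactly when the last stride equals the item size.

```python
import numpy as np

def ensure_frame_contiguous(arr):
    """Illustrative helper (not mach API): copy only when the last
    (frames) axis is not unit-stride."""
    if arr.strides[-1] != arr.itemsize:
        arr = np.ascontiguousarray(arr)
    return arr

# A [frames, samples, elements] cube transposed into the expected
# [elements, samples, frames] order is a non-contiguous view:
view = np.zeros((32, 1000, 128), dtype=np.complex64).transpose(2, 1, 0)
print(view.strides[-1] == view.itemsize)    # False

fixed = ensure_frame_contiguous(view)
print(fixed.strides[-1] == fixed.itemsize)  # True
print(fixed.shape)                          # (128, 1000, 32)
```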

README.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ An ultrafast CUDA-accelerated ultrasound beamformer for Python users. Developed

 ![Benchmark Results](assets/benchmark-doppler_disk.svg)

-_Beamforming PyMUST's [rotating-disk Doppler dataset](https://github.com/creatis-ULTIM/PyMUST/blob/170ba68/examples/rotatingDisk_real.ipynb) at 0.86 trillion points per second (**5**x the speed of sound)._
+_[Benchmark](https://github.com/Forest-Neurotech/mach/blob/main/BENCHMARKS.md): Beamforming PyMUST's [rotating-disk Doppler dataset](https://github.com/creatis-ULTIM/PyMUST/blob/170ba68/examples/rotatingDisk_real.ipynb) at 0.86 trillion points per second ([**5**x the speed of sound](https://github.com/Forest-Neurotech/mach/blob/main/BENCHMARKS.md))._

 > **⚠️ Alpha Release**
 >

docs/benchmarks.rst

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
Benchmarks
==========

.. include:: ../BENCHMARKS.md
   :parser: myst_parser.sphinx_

docs/index.rst

Lines changed: 6 additions & 0 deletions
@@ -14,6 +14,12 @@ mach

    examples/index

+.. toctree::
+   :maxdepth: 1
+   :caption: Performance:
+
+   benchmarks
+
 .. toctree::
    :maxdepth: 2
    :caption: API Reference:
