Commit f5d59ff

Docs: B200 results (pre register-only kernels)
1 parent a5f4a67 commit f5d59ff

1 file changed: +6 -5 lines changed
README.md

Lines changed: 6 additions & 5 deletions
@@ -245,6 +245,7 @@ Each of those has two flavors - with linear and affine gap penalties, also known
 | `stringzillas::LevenshteinDistancesUtf8` on 16x SPR | 38'954 MCUPS | 103'500 MCUPS |
 | `stringzillas::LevenshteinDistances` on RTX6000 | __32'030 MCUPS__ | __901'990 MCUPS__ |
 | `stringzillas::LevenshteinDistances` on H100 | __31'913 MCUPS__ | __925'890 MCUPS__ |
+| `stringzillas::LevenshteinDistances` on B200 | __32'960 MCUPS__ | __998'620 MCUPS__ |
 | `stringzillas::LevenshteinDistances` on 384x GNR | __114'190 MCUPS__ | __3'084'270 MCUPS__ |
 | `stringzillas::LevenshteinDistancesUtf8` on 384x GNR | __103'590 MCUPS__ | __2'938'320 MCUPS__ |
 | | | |
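
For readers comparing the rows in this hunk: MCUPS is commonly read as millions of cell updates per second, where each cell is one entry of the dynamic-programming matrix. The sketch below shows one way such a figure could be derived from a timed batch. It assumes every pair contributes `len_a * len_b` cells; the benchmark's exact counting convention is not shown in this diff, and the helper name `mcups` and the 0.02 s timing are hypothetical.

```bash
# Hypothetical helper, not part of StringZilla or StringWars.
# Usage: mcups <pairs> <len_a> <len_b> <seconds>
mcups() {
  # cells = pairs * len_a * len_b; MCUPS = cells / seconds / 1e6
  awk -v n="$1" -v a="$2" -v b="$3" -v t="$4" \
    'BEGIN { printf "%.0f\n", (n * a * b) / t / 1e6 }'
}

# A batch of 65536 pairs of 100-byte strings finishing in an assumed 0.02 s:
mcups 65536 100 100 0.02   # prints 32768
```

The batch size and string length mirror `STRINGWARS_BATCH=65536` and the `acgt_100.txt` dataset used in the profiling command in the next hunk; the timing is made up for illustration only.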
@@ -475,13 +476,13 @@ In case you are profiling the some of the internal kernels of mentioned librarie
 Such as using `ncu` for NVIDIA GPUs to evaluate the register usage and occupancy of the CUDA kernels used in StringZilla's Levenshtein distance calculation:

 ```bash
-ncu \
-  --metrics launch__registers_per_thread,launch__occupancy_per_block_size \
+/usr/local/cuda/bin/ncu \
+  --metrics launch__registers_per_thread,launch__occupancy_per_block_size,sm__warps_active.avg.pct_of_peak_sustained_active,sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed,dram__bytes.sum \
   --target-processes all \
   --kernel-name "levenshtein_on_each_cuda_thread" \
-  bash -c 'STRINGWARS_DATASET=acgt_100.txt STRINGWARS_BATCH=65536 \
-  STRINGWARS_TOKENS=lines STRINGWARS_FILTER="uniform/stringzillas::LevenshteinDistances\(1xGPU" \
-  cargo criterion --features "cuda bench_similarities" bench_similarities --jobs 1'
+  --launch-skip 5 \
+  --launch-count 1 \
+  bash -c 'STRINGWARS_DATASET=acgt_100.txt STRINGWARS_BATCH=65536 STRINGWARS_TOKENS=lines STRINGWARS_FILTER="uniform/stringzillas::LevenshteinDistances\(1xGPU\)" cargo criterion --features "cuda bench_similarities" bench_similarities --jobs 1'
 ```

 Using `perf` on Linux to analyze the CPU-side performance of SIMD-accelerated substring search:
