Commit a5f4a67

Docs: Profiling with ncu & perf
1 parent 19ce836 commit a5f4a67

2 files changed: +25 -0 lines changed

.vscode/settings.json

Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,7 @@
 "bytesum",
 "cityhash",
 "corasick",
+"CUDA",
 "CUDF",
 "Dataframe",
 "foldhash",
@@ -26,8 +27,10 @@
 "stringtape",
 "stringwars",
 "stringzilla",
+"stringzillas",
 "strstr",
 "tfidf",
+"Vardanian",
 "Wunsch"
 ]
 }

README.md

Lines changed: 22 additions & 0 deletions
@@ -468,3 +468,25 @@ wget --no-clobber -O acgt_100k.txt https://huggingface.co/datasets/ashvardanian/
 wget --no-clobber -O acgt_1m.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_1m.txt?download=true
 wget --no-clobber -O acgt_10m.txt https://huggingface.co/datasets/ashvardanian/StringWars/resolve/main/acgt_10m.txt?download=true
 ```
+
+## Deep Profiling
+
+If you are profiling some of the internal kernels of the libraries mentioned above, here are a few example commands to get you started.
+For example, use `ncu` on NVIDIA GPUs to evaluate the register usage and occupancy of the CUDA kernels used in StringZilla's Levenshtein distance calculation:
+
+```bash
+ncu \
+  --metrics launch__registers_per_thread,launch__occupancy_per_block_size \
+  --target-processes all \
+  --kernel-name "levenshtein_on_each_cuda_thread" \
+  bash -c 'STRINGWARS_DATASET=acgt_100.txt STRINGWARS_BATCH=65536 \
+  STRINGWARS_TOKENS=lines STRINGWARS_FILTER="uniform/stringzillas::LevenshteinDistances\(1xGPU" \
+  cargo criterion --features "cuda bench_similarities" bench_similarities --jobs 1'
+```
+
+Or use `perf` on Linux to analyze the CPU-side performance of SIMD-accelerated substring search:
+
+```bash
+perf record -e cpu-clock -g -o perf.data -- cargo criterion --features "bench_similarities" bench_similarities --jobs 1
+perf report -i perf.data
+```
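
The `STRINGWARS_FILTER` value in the `ncu` command above escapes the opening parenthesis as `\(`. Assuming the benchmark harness treats the filter as a regular expression (an assumption, not confirmed by this commit), the escape makes the pattern match a literal `(` rather than open a group. A minimal sketch that checks this behavior with `grep -E`, using made-up benchmark IDs for illustration only:

```bash
#!/usr/bin/env bash
# Sketch: check which benchmark IDs the filter from the ncu example selects.
# Assumption: STRINGWARS_FILTER is applied as a regular expression.

filter='uniform/stringzillas::LevenshteinDistances\(1xGPU'

# Hypothetical benchmark IDs, invented for this example.
gpu_id='uniform/stringzillas::LevenshteinDistances(1xGPU)'
cpu_id='uniform/stringzillas::LevenshteinDistances(1xCPU)'

matches_filter() {
  # In extended regex syntax, "\(" stands for a literal "(".
  printf '%s\n' "$1" | grep -E -q -- "$filter"
}

if matches_filter "$gpu_id"; then gpu_result=match; else gpu_result=skip; fi
if matches_filter "$cpu_id"; then cpu_result=match; else cpu_result=skip; fi

echo "GPU: $gpu_result, CPU: $cpu_result"
```

Under these assumptions only the GPU-style ID is selected, which keeps the `ncu` run focused on the kernel named in `--kernel-name`.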
