Commit 13dca2a
Vectorize load instructions in dmmv f16 CUDA kernel (ggml-org#9816)
* Vectorize load instructions in dmmv f16 CUDA kernel
Replaces scalar load instructions with vector load instructions, which
substantially improves performance on HBM-equipped NVIDIA GPUs. For example,
it gives a 1.27x overall speedup for Meta-Llama-3-8B-Instruct-F16 inference
evaluation at batch size 1 (BS1) on an H100 SXM 80GB HBM3. On GDDR GPUs the
speedup is slight (1.01x).
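The diff body is not reproduced on this page, so the following is only a minimal sketch of the general technique the message describes, not the code from `dmmv.cu`. It assumes a plain row-major f16 matrix times an f32 vector, one 32-thread block per row, `ncols` divisible by 8, and 16-byte-aligned rows. The kernel and helper names (`dot_rows_scalar`, `dot_rows_vec8`, `max_abs_diff_scalar_vs_vec8`) are hypothetical.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Sum the partial results of one 32-thread warp; lane 0 holds the total.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, offset);
    }
    return v;
}

// Scalar loads: each thread reads one half (2 bytes) per iteration.
__global__ void dot_rows_scalar(const half * x, const float * y, float * dst, int ncols) {
    const half * row = x + (size_t) blockIdx.x * ncols;
    float sum = 0.0f;
    for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
        sum += __half2float(row[col]) * y[col];
    }
    sum = warp_sum(sum);
    if (threadIdx.x == 0) dst[blockIdx.x] = sum;
}

// Vector loads: each thread reads 8 halves (16 bytes) with one load.
// Assumes ncols % 8 == 0 and that every row starts on a 16-byte boundary.
__global__ void dot_rows_vec8(const half * x, const float * y, float * dst, int ncols) {
    const float4 * row = reinterpret_cast<const float4 *>(x + (size_t) blockIdx.x * ncols);
    float sum = 0.0f;
    for (int i = threadIdx.x; i < ncols / 8; i += blockDim.x) {
        const float4 packed = row[i];  // single 128-bit load
        const half2 * h2 = reinterpret_cast<const half2 *>(&packed);
#pragma unroll
        for (int k = 0; k < 4; ++k) {
            const float2 f = __half22float2(h2[k]);
            sum += f.x * y[8 * i + 2 * k] + f.y * y[8 * i + 2 * k + 1];
        }
    }
    sum = warp_sum(sum);
    if (threadIdx.x == 0) dst[blockIdx.x] = sum;
}

// Hypothetical host helper: runs both kernels on the same data and returns
// the largest absolute difference between their outputs (INFINITY on error).
float max_abs_diff_scalar_vs_vec8(int nrows, int ncols) {
    const size_t n = (size_t) nrows * ncols;
    std::vector<half>  hx(n);
    std::vector<float> hy(ncols);
    // Small multiples of 0.25 and 0.5 keep every partial sum exact in f32.
    for (size_t i = 0; i < n; ++i)  hx[i] = __float2half(0.25f * (float) (i % 7));
    for (int j = 0; j < ncols; ++j) hy[j] = 0.5f * (float) (j % 5);

    half * dx = nullptr; float * dy = nullptr; float * d0 = nullptr; float * d1 = nullptr;
    cudaMalloc(&dx, n * sizeof(half));
    cudaMalloc(&dy, ncols * sizeof(float));
    cudaMalloc(&d0, nrows * sizeof(float));
    cudaMalloc(&d1, nrows * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), ncols * sizeof(float), cudaMemcpyHostToDevice);

    dot_rows_scalar<<<nrows, 32>>>(dx, dy, d0, ncols);
    dot_rows_vec8<<<nrows, 32>>>(dx, dy, d1, ncols);
    const cudaError_t err = cudaDeviceSynchronize();

    std::vector<float> r0(nrows), r1(nrows);
    cudaMemcpy(r0.data(), d0, nrows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(r1.data(), d1, nrows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy); cudaFree(d0); cudaFree(d1);
    if (err != cudaSuccess) return INFINITY;

    float max_diff = 0.0f;
    for (int r = 0; r < nrows; ++r) max_diff = fmaxf(max_diff, fabsf(r0[r] - r1[r]));
    return max_diff;
}
```

Calling `max_abs_diff_scalar_vs_vec8(64, 1024)` from a host program compiled with `nvcc` checks that both kernels agree. The vectorized kernel issues one eighth as many load instructions for the matrix, which is the kind of change that matters most when the kernel is limited by memory throughput.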
* addressed comment
* Update ggml/src/ggml-cuda/dmmv.cu
Co-authored-by: Johannes Gäßler <[email protected]>
---------
Co-authored-by: Johannes Gäßler <[email protected]>
1 parent d4c19c0 commit 13dca2a
1 file changed, +25 -9 lines changed
[Diff body not preserved. Hunk 1: original lines 416-425, new lines 416-426 (3 lines removed, 4 added). Hunk 2: original lines 476-488, new lines 477-504 (6 lines removed, 21 added).]