Vulkan perfetto #15859

walidbr · 2025-09-07T21:04:56Z

This uses Perfetto in-process profiling and produces a Perfetto binary trace by the end of inference. This is very useful for visualising how the inference is handled.

Build:

`cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_VULKAN=ON`

`cmake --build build-vk -j8`

Run:

`GGML_VK_PERF_SILENT=1 GGML_VK_PERF_LOGGER=1 LLAMA_PERFETTO_TRACE=./out.pftrace build-vk/bin/llama-cli -m model.gguf`

Test: Tested on M4 Mac

In detail, this patch does the following:

1. Includes the `LlamaPerfetto.h` header file, which contains the definitions for the Perfetto-related functions and variables used in this example.
2. Calls `llama_perfetto_start()` to start tracing at the beginning of the conversation.
3. Calls `llama_perfetto_stop_flush()` to stop tracing and flush the trace after each generation.
4. Calls `llama_perfetto_trace_begin_with_text()` to begin a Perfetto event with a text description of the current evaluation.
5. Calls `llama_perfetto_trace_end()` to end the event after each evaluation.
6. Calls `llama_perfetto_counter_tokens_per_s()` to update the Perfetto tokens-per-second counter during idle periods.
7. Calls `llama_perfetto_emit_gpu_timeline()` to emit GPU timeline slices into Perfetto.
8. Calls `llama_perfetto_print_gpu_stats()` to print GPU statistics at idle periods.
9. Calls `llama_perfetto_flush_dump_stats()` to flush and dump the Perfetto trace stats to a file at idle periods.
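The call sequence listed above could be sketched roughly as follows. This is only an illustration: the functions below are hypothetical stand-ins that share the names from the description, their signatures are guesses (the real declarations live in `LlamaPerfetto.h`), and `run_generation` and `g_events` are invented for the sketch. The stats-printing and stats-dumping calls are left out for brevity, and a single generation is shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of calls so the flow can be inspected (illustration only).
static std::vector<std::string> g_events;

// Hypothetical stand-ins: names come from the PR description, signatures are assumed.
void llama_perfetto_start()      { g_events.push_back("start"); }
void llama_perfetto_stop_flush() { g_events.push_back("stop_flush"); }
void llama_perfetto_trace_begin_with_text(const std::string & text) {
    g_events.push_back("begin:" + text);
}
void llama_perfetto_trace_end()  { g_events.push_back("end"); }
void llama_perfetto_counter_tokens_per_s(double tokens_per_s) {
    assert(tokens_per_s >= 0.0); // a rate cannot be negative
    g_events.push_back("tokens_per_s");
}
void llama_perfetto_emit_gpu_timeline() { g_events.push_back("gpu_timeline"); }

// One generation: start tracing, wrap each evaluation in a begin/end slice,
// then update the counter, emit GPU slices, and stop and flush the trace.
void run_generation(int n_evals) {
    llama_perfetto_start();
    for (int i = 0; i < n_evals; ++i) {
        llama_perfetto_trace_begin_with_text("eval " + std::to_string(i));
        // ... the model evaluation would run here ...
        llama_perfetto_trace_end();
    }
    llama_perfetto_counter_tokens_per_s(static_cast<double>(n_evals)); // placeholder value
    llama_perfetto_emit_gpu_timeline();
    llama_perfetto_stop_flush();
}
```

Pairing every begin call with an end call inside the loop keeps the slices well nested, which is what a timeline viewer needs to draw them correctly.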

…#15679) This commit removes the portability_enumeration_ext variable from the ggml_vk_instance_portability_enumeration_ext_available function as it is initialized to false but never modified, making it redundant.

* vulkan: mul_mat_id coopmat2 optimizations

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat. Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned.

* Also add a path for BN/4 - worth a couple more percent

* sampling : optimize sorting using bucket sort in more places ggml-ci
* sampling : do not sort in dist sampler ggml-ci
* sampling : avoid heap allocations for sort buffers ggml-ci
* common : add option to sort sampling candidates by probability ggml-ci
* sampling : revert the change for preserving sort buffers
* sampling : use std::copy instead of memcpy
* sampling : clarify purpose of partial sort helpers ggml-ci
* cont : remove wrong comment [no ci]
* common : update comment

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>

* CANN: fix RoPE cache issue on multi-device

RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors. This commit records the first layer of each device to avoid the above issues.

* CANN: Optimize first-layer detection method
* CANN: Remove trailing whitespace
* CANN: Only cache the data that can be determined as unchanged through the parameters.
* CANN: Update function comment

…ml-org#15690) * CUDA: fix build error from ambiguous __half conversions in conv2d

Building conv2d with half precision failed because `__half` defines multiple implicit conversion operators (to float, int, short, etc.), causing ambiguous overload resolution when multiplying with float.

Introduce a templated `to_float` helper that explicitly converts `__half` via `__half2float`, while passing through float unchanged. Use this helper in conv2d accumulation to ensure unambiguous and correct promotion to float.

Fixes some build errors with half-precision kernels on CUDA.

ggml-ci

* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half->float conversion
* CUDA: Add missing convert.cuh header
* CUDA: remove unnecessary extension in ggml_cuda_cast
* CUDA: Address review comment, remove second type template argument

) * ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops

This commit adds support for the TRANSPOSE and RESHAPE operations in the ggml webgpu backend.

Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>

…gml-org#14903) * vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants
* vulkan: use subgroup operations for quantize_q8_1 shader
* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader
* vulkan: use q8_1_x4 blocks in mul_mmq shader
* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec
* vulkan: tune mul_mat_vecq performance for Intel
* vulkan: fix quantizing issue when tensor is not divisible by 128
* vulkan: adapt integer dot mmv to mmv small m optimization (ggml-org#15355)
* vulkan: allow all subgroup modes for mmv and mmvq
* vulkan: use prealloc intermediate reuse for mmvq path
* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090
* vulkan: adapt mmv quantize_y path to conditional sync logic
* vulkan: disable q8_0 mmvq on Nvidia
* vulkan: enable q8_0 on Nvidia pre-turing
* fix prealloc sync condition
* fix llvmpipe subgroup 8 issue

…rg#15115) * Added sve implementation for vec_dot_fp16 Kernel
* removed white spaces
* Added comment
* removed white spaces
* changed GGML_F16x_VEC_FMA for code consistency
* Update vec.h

---------

Co-authored-by: vithulep <[email protected]>

* SVE support for exponential functions

Add const notation to variable pg

* Update ggml/src/ggml-cpu/vec.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Add const

---------

Co-authored-by: Georgi Gerganov <[email protected]>

…gml-org#15827) * server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>

0cc4m · 2025-09-08T07:31:54Z

What are you doing? Why open and close immediately, twice?

walidbr · 2025-09-09T11:44:55Z

I'm sorry, this was not intended. Please ignore and delete this patch.

walidbr and others added 30 commits September 7, 2025 03:38

Merge branch 'master' into vulkan_perfetto

b0a9c06

server : removed obsolete doc (ggml-org#15670)

14ff930

completing a4090d1

CANN: FIx compiler warnings (ggml-org#15661)

0bcbeaa

Signed-off-by: noemotiovon <[email protected]>

vulkan: Skip syncing for prealloc_y when it is reused (ggml-org#15544)

a5d75e1

CUDA: use FP32 arithmetic for conv2d (ggml-org#15683)

22506f6

llama: use FA + max. GPU layers by default (ggml-org#15434)

878fe00

* llama: use max. GPU layers by default, auto -fa
* ggml-backend: abort instead of segfault

Update build.md to remove MSVC arm64 notes (ggml-org#15684)

d68c62c

Removed information about MSVC compiler limitations for arm64 builds.

ggml: update kleidiai to v1.13.0 (ggml-org#15663)

42e986f

vulkan: clamp matmul and FA results to the max finite value (ggml-org#15652)

26f086c

* vulkan: clamp matmul and FA results to the max finite value
* only clamp for fp16
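As a host-side illustration of what clamping to the largest finite fp16 value means, here is a minimal C++ sketch. It is not the shader code from this commit; `FP16_MAX_FINITE` and `clamp_to_fp16_range` are hypothetical names, and passing NaN through unchanged is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cmath>

// Largest finite value representable in IEEE 754 half precision (fp16).
constexpr float FP16_MAX_FINITE = 65504.0f;

// Clamp an fp32 result into the finite fp16 range so that storing it as fp16
// saturates instead of overflowing to +/-infinity.
// NaN is returned unchanged (assumption of this sketch).
inline float clamp_to_fp16_range(float x) {
    if (std::isnan(x)) {
        return x;
    }
    return std::max(-FP16_MAX_FINITE, std::min(x, FP16_MAX_FINITE));
}
```

Values already inside the range are returned as they are, so the clamp only affects results that would otherwise overflow.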

vulkan: Allow fallback to sysmem memory when vidmem is full (ggml-org#15649)

e35560d

* vulkan: Allow fallback to sysmem memory when vidmem is full
* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK

vulkan: handle large sizes for get_rows (ggml-org#15686)

e16026b

ci : explicitly set fa off or on (ggml-org#15692)

c369ca1

llama : separate compute buffer reserve from fattn check (ggml-org#15696)

44da51d

Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.

llama : fix fattn reserve call n_seqs parameter (ggml-org#15699)

c86af26

ggml-ci

metal : fix checks for available FA kernels (ggml-org#15700)

7c540fc

* metal : fix checks for available FA kernels ggml-ci
* cont : fix comment [no ci]

server : enable /slots by default and make it secure (ggml-org#15630)

47bd99d

* server : enable /slots by default and make it secure ggml-ci
* server : fix tests to pass `--no-slots` when necessary
* server : extend /props with info about enabled endpoints

CANN: Optimize MUL_MAT_ID (ggml-org#15658)

fad54d2

docs : add Hunyuan to models section (ggml-org#15707)

883ad6d

Signed-off-by: Jie Fu <[email protected]>

convert : remove redundant code (ggml-org#15708)

31ee6d0

Signed-off-by: Jie Fu <[email protected]>

ngxson and others added 6 commits September 7, 2025 21:55

server : speed up tests (ggml-org#15836)

8084099

* server : speed up tests
* clean up
* restore timeout_seconds in some places
* flake8
* explicit offline

kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (ggml-org#15817)

958f133

CUDA: faster tile FA (Pascal/AMD), headsize 256 (ggml-org#15769)

6b4a425

Added vulkan build workflow, x86

43d92b5

walidbr requested review from 0cc4m, JohannesGaessler, ggerganov and ngxson as code owners September 7, 2025 21:04

walidbr closed this Sep 7, 2025

walidbr deleted the vulkan_perfetto branch September 7, 2025 21:07

walidbr restored the vulkan_perfetto branch September 7, 2025 21:10

walidbr deleted the vulkan_perfetto branch September 7, 2025 21:12


30 participants
