Vulkan perfetto #15859
Closed
Conversation
This uses Perfetto in-process profiling and produces a Perfetto binary trace by the end of the inference. It is very useful for visualising how the inference is handled.

Build:
- `cmake -S . -B build-vk -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_VULKAN=ON`
- `cmake --build build-vk -j8`

Run:
- `GGML_VK_PERF_SILENT=1 GGML_VK_PERF_LOGGER=1 LLAMA_PERFETTO_TRACE=./out.pftrace build-vk/bin/llama-cli -m model.gguf`

Test: tested on an M4 Mac.

In detail this patch does the following:
1. Includes the `LlamaPerfetto.h` header file, which contains the definitions for the Perfetto-related functions and variables used in this example.
2. Calls `llama_perfetto_start()` to start tracing at the beginning of the conversation.
3. Calls `llama_perfetto_stop_flush()` to stop tracing and flush the trace after each generation.
4. Calls `llama_perfetto_trace_begin_with_text()` to begin a Perfetto event with a text description of the current evaluation.
5. Calls `llama_perfetto_trace_end()` to end the event after each evaluation.
6. Calls `llama_perfetto_counter_tokens_per_s()` to update the Perfetto tokens-per-second counter during idle periods.
7. Calls `llama_perfetto_emit_gpu_timeline()` to emit GPU timeline slices into Perfetto.
8. Calls `llama_perfetto_print_gpu_stats()` to print GPU statistics during idle periods.
9. Calls `llama_perfetto_flush_dump_stats()` to flush and dump the Perfetto trace stats to a file during idle periods.
Signed-off-by: noemotiovon <[email protected]>
* llama: use max. GPU layers by default, auto -fa
* ggml-backend: abort instead of segfault
Removed information about MSVC compiler limitations for arm64 builds.
…#15652)
* vulkan: clamp matmul and FA results to the max finite value
* only clamp for fp16
…#15649)
* vulkan: Allow fallback to sysmem memory when vidmem is full
* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
…#15679) This commit removes the portability_enumeration_ext variable from the ggml_vk_instance_portability_enumeration_ext_available function as it is initialized to false but never modified, making it redundant.
* vulkan: mul_mat_id coopmat2 optimizations

  Add a path for when the tile fits in BN/2, similar to what we have for mul_mat. Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned.
* Also add a path for BN/4 - worth a couple more percent
* metal : fix checks for available FA kernels ggml-ci
* cont : fix comment [no ci]
* server : enable /slots by default and make it secure ggml-ci
* server : fix tests to pass `--no-slots` when necessary
* server : extend /props with info about enabled endpoints
* sampling : optimize sorting using bucket sort in more places ggml-ci
* sampling : do not sort in dist sampler ggml-ci
* sampling : avoid heap allocations for sort buffers ggml-ci
* common : add option to sort sampling candidates by probability ggml-ci
* sampling : revert the change for preserving sort buffers
* sampling : use std::copy instead of memcpy
* sampling : clarify purpose of partial sort helpers ggml-ci
* cont : remove wrong comment [no ci]
* common : update comment

Co-authored-by: Johannes Gäßler <[email protected]>
* CANN: fix RoPE cache issue on multi-device

  RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors. This commit records the first layer of each device to avoid the above issues.
* CANN: Optimize first-layer detection method
* CANN: Remove trailing whitespace
* CANN: Only cache the data that can be determined as unchanged through the parameters.
* CANN: Update function comment
…ml-org#15690)
* CUDA: fix build error from ambiguous __half conversions in conv2d

  Building conv2d with half precision failed because `__half` defines multiple implicit conversion operators (to float, int, short, etc.), causing ambiguous overload resolution when multiplying with float. Introduce a templated `to_float` helper that explicitly converts `__half` via `__half2float`, while passing through float unchanged. Use this helper in conv2d accumulation to ensure unambiguous and correct promotion to float. Fixes some build errors with half-precision kernels on CUDA. ggml-ci
* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half->float conversion
* CUDA: Add missing convert.cuh header
* CUDA: remove unnecessary extension in ggml_cuda_cast
* CUDA: Address review comment, remove second type template argument
Signed-off-by: Jie Fu <[email protected]>
)
* ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops

  This commit adds support for the TRANSPOSE and RESHAPE operations in the ggml webgpu backend.

Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…gml-org#14903)
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants
* vulkan: use subgroup operations for quantize_q8_1 shader
* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader
* vulkan: use q8_1_x4 blocks in mul_mmq shader
* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec
* vulkan: tune mul_mat_vecq performance for Intel
* vulkan: fix quantizing issue when tensor is not divisible by 128
* vulkan: adapt integer dot mmv to mmv small m optimization (ggml-org#15355)
* vulkan: allow all subgroup modes for mmv and mmvq
* vulkan: use prealloc intermediate reuse for mmvq path
* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090
* vulkan: adapt mmv quantize_y path to conditional sync logic
* vulkan: disable q8_0 mmvq on Nvidia
* vulkan: enable q8_0 on Nvidia pre-turing
* fix prealloc sync condition
* fix llvmpipe subgroup 8 issue
Signed-off-by: Jie Fu <[email protected]>
…rg#15115)
* Added sve implementation for vec_dot_fp16 Kernel
* removed white spaces
* Added comment
* removed white spaces
* changed GGML_F16x_VEC_FMA for code consistency
* Update vec.h

Co-authored-by: vithulep <[email protected]>
* SVE support for exponential functions

  Add const notation to variable pg
* Update ggml/src/ggml-cpu/vec.cpp
* Add const

Co-authored-by: Georgi Gerganov <[email protected]>
…gml-org#15827)
* server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us

Co-authored-by: Georgi Gerganov <[email protected]>
* server : speed up tests
* clean up
* restore timeout_seconds in some places
* flake8
* explicit offline
What are you doing? Why open and close immediately, twice?

I'm sorry, this was not intended. Please ignore and delete this patch.
Labels: Apple Metal, Ascend NPU, devops, documentation, examples, ggml, Nvidia GPU, OpenCL, python, script, server, SYCL, testing, Vulkan