
Conversation

@pockers21
Contributor

When the model processes multiple batches during inference, the output buffer is sized from the first batch: if that batch has a length of 1000, space is reserved for 1000 rows of logits (vocab_size floats each) and embeddings (embd_size floats each). After the output buffer is allocated, a full-range memset clears it. When subsequent inference data (e.g., the second batch) is shorter than 1000, the memset still clears the entire 1000-row reservation, resulting in significant performance overhead.
This PR introduces dynamic length adjustment so that the memset range is computed from the actual length of the current batch, minimizing the unnecessary work.
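For readers unfamiliar with this path, here is a minimal, self-contained illustration of the overhead described above. The sizes and variable names are made up for the example and are not the actual llama.cpp symbols:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // Hypothetical sizes, for illustration only.
    const size_t n_vocab       = 32000;  // logits per output row
    const size_t n_embd        = 4096;   // embedding values per output row
    const size_t reserved_rows = 1000;   // reservation made for the first (largest) batch
    const size_t n_outputs     = 8;      // rows actually produced by the current batch
    const size_t row_floats    = n_vocab + n_embd;

    std::vector<float> buf(reserved_rows * row_floats);

    // Behaviour described above: the full reservation is cleared on every decode.
    memset(buf.data(), 0, reserved_rows * row_floats * sizeof(float));

    // Idea of this PR: clear only the rows the current batch will actually use.
    memset(buf.data(), 0, n_outputs * row_floats * sizeof(float));

    printf("cleared %zu rows instead of %zu reserved rows\n", n_outputs, reserved_rows);
    return 0;
}
```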

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 18, 2025
```cpp
void ggml_backend_set_size(ggml_backend_buffer_t buffer, size_t cur_size) {
    buffer->size = cur_size;
}
```
@ngxson
Collaborator

I think this will cause a memory leak rather than actually saving memory.

And worse, if cur_size is bigger than the actual capacity of the buffer, you now have a buffer overflow.
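To make the concern concrete, here is a hypothetical misuse sketch. `ggml_backend_set_size` is the function added by this diff; the other two calls are existing ggml API, and the exact clear behaviour is backend-specific, so treat this as an assumed illustration rather than a verified code path:

```cpp
// buft is assumed to be some valid ggml_backend_buffer_type_t.
ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, 4096); // real capacity: 4 KiB
ggml_backend_set_size(buf, 1u << 20);  // buffer->size now claims 1 MiB, but nothing was reallocated
ggml_backend_buffer_clear(buf, 0);     // clears buffer->size bytes, i.e. writes far past the 4 KiB allocation
```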

@pockers21
Contributor Author
Apr 18, 2025

> I think this will cause a memory leak rather than actually saving memory.
>
> And worse, if cur_size is bigger than the actual capacity of the buffer, you now have a buffer overflow.

1. I will carefully review for potential memory leaks.
2. If the buffer is unallocated or the new size exceeds the previous size, it should allocate a new buffer of the new size. With this code:
`if (!buf_output || prev_size < new_size) { buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size)); }`
it shouldn't cause out-of-bounds issues, correct?
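For concreteness, a sketch of the guarded path being proposed here, using the names from this thread (buf_output, buft, prev_size, new_size) together with the ggml_backend_set_size helper added by this PR; this is a sketch of the argument, not code verified against llama-context.cpp:

```cpp
if (!buf_output || prev_size < new_size) {
    // buffer missing or too small: allocate a fresh one of the required size
    buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size));
} else {
    // buffer already large enough: keep the allocation, only shrink the logical
    // size so that the subsequent clear touches fewer bytes
    ggml_backend_set_size(buf_output.get(), new_size);
}
```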

@slaren
Member

This is not a good change to ggml-backend. I don't know why the output buffer needs to be cleared; just remove the call to ggml_backend_buffer_clear.

@pockers21
Contributor Author

Hi @ngxson and @slaren ,

Thank you both for taking the time to review this PR and providing valuable feedback!

@slaren , your comment about the necessity of the ggml_backend_buffer_clear call itself is a great point. I've taken another look at the code in llama-context.cpp where this buffer is used.

It appears that the memory regions pointed to by logits and embd are indeed intended to be fully populated by the results of the subsequent graph computation (ggml_graph_compute). If that's the case, clearing them beforehand is redundant and introduces exactly the unnecessary overhead my initial PR description aimed to reduce.

Therefore, I tend to agree with your suggestion: removing the ggml_backend_buffer_clear(buf_output.get(), ...). This would simplify the code and directly address the performance concern without needing changes in ggml-backend or introducing the potential risks @ngxson rightly pointed out regarding dynamic size management.

My plan is to update this PR to simply remove that line. Does this approach sound reasonable to you, or are there any specific reasons why this buffer needs to be cleared that I might be missing?
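For reference, a rough sketch of what that update could look like, based on the names used earlier in this thread; the exact surrounding code in llama-context.cpp may differ:

```cpp
if (!buf_output || prev_size < new_size) {
    buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size));
}
// Previously the whole reservation was cleared here on every call:
//     ggml_backend_buffer_clear(buf_output.get(), 0);
// Since the graph computation fully populates the logits/embd regions that are
// read back, the clear would simply be removed rather than resized.
```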

Thanks again for your guidance!

rgerganov and others added 25 commits April 28, 2025 16:57
Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org

Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like ggml-org#12465
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-defined ID (fixed)

* fix prompt_modified

* rm redundant data member
* SYCL: refactor move to a separate file

* Fix binbcast

* Remove duplicates

* fix include formatting

* fix typo
* server : use std::move whenever possible

* use r-value ref

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* make task creation scoped

* restore std::move

* fix task_id not set correctly

* apply changes from suggestion

Co-authored-by: ggerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This restores the behavior from ggml-org#491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (ggml-org#1040).

This also actually implements ggml-org#587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."

Fixes ggml-org#12949
…ml-org#13011)

* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include
* convert : experimental support for `--mmproj` flag

* fix bad ctrl+f replace

* fix style

* split into subclasses TextModel and VisionModel

* rename Mode --> ModelBase

* small fix

* correct CLIP_VISION arch name (because existing GGUF already use it)

* Apply suggestions from code review

Co-authored-by: compilade <[email protected]>

* fix Mistral3Model

* fix typo

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: compilade <[email protected]>
…li` (ggml-org#13012)

* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…g#12871)

* ggml : add SSE 4.2 variant for CPUs without AVX

* ggml : add x64 base ABI variant
* llava : update documentations

* fix typo
* metal : add memory pool for temp allocs (wip) [no ci]

* cont : free buffers from the heap

* cont : resize heap [no ci]

* cont : refactor heap [no ci]

* cont : heap for each cmd buffer [no ci]

* cont : fix free

* wip

* cont : fix alignment [no ci]

* cont : not working .. [no ci]

* cont : heap allocation now works [no ci]

* cont : use MTLHeapTypePlacement

ggml-ci

* metal : use dynamic MTLHeap allocations

ggml-ci

* metal : add comments

* metal : disable softmax use of mem_pool

ggml-ci

* metal : final touches
* security : add note about RPC functionality

* security : add note about llama-server
* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs
…13021)

* append mult-eos,half-rope,bos to GLM4-0414

* remove unset var
* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script
* Sigint rework in mtmd vision example

* Applied suggestions on mtmd-cli PR

* Forgot to invert one of the conditions

* Update examples/llava/mtmd-cli.cpp

* Removed redundant exit check

---------

Co-authored-by: pl752 <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
ngxson and others added 14 commits April 28, 2025 16:57
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
… conversion APIs (ggml-org#13107)

* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32
* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused
* common : add common_remote_get_content

* support max size and timeout

* add tests
* implement vision model architecture, gguf converter

* handle window attention inputs

* add debug utils

* fix few incorrect tensor memory layout

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove rarely used `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
* fix wrong template in GLM4-0414

* fix spaces

* no bos token since it is already in the template

* moved the chatgml4 check to higher priority

* restored template for old GLM models

* moved the GLM4 template check in the correct place with correct check
* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
@pockers21
Contributor Author

This PR's history became messy after rebasing onto the latest master/main. I've created a new, clean pull request at #13152 to replace this one. Closing this PR in favor of the new one. Thanks!

@pockers21 closed this on Apr 28, 2025
@github-actions bot added the documentation, script, testing, Nvidia GPU, Vulkan, examples, python, devops, server, SYCL, and Apple Metal labels on Apr 28, 2025
@pockers21 deleted the feature-dynamic-memset-range branch on October 15, 2025