
Conversation

@pockers21
Contributor

When the model processes multiple batches during inference, the output buffer is sized from the first batch: if that batch has a length of 1000, space is reserved for 1000 rows of logits (vocab_size floats each) and embeddings (embd_size floats each). After the output buffer is allocated, a full-range memset clears it. When subsequent inference data (e.g., the second batch) is shorter than 1000, the memset still clears the entire 1000-row reservation, resulting in significant performance overhead.
This PR introduces dynamic length adjustment so that the memset range is computed from the actual length of the current batch, minimizing the unnecessary work.
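For readers unfamiliar with this path, here is a minimal, self-contained illustration of the overhead described above. The sizes and variable names are made up for the example and are not the actual llama.cpp symbols:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    // Hypothetical sizes, for illustration only.
    const size_t n_vocab       = 32000;  // logits per output row
    const size_t n_embd        = 4096;   // embedding values per output row
    const size_t reserved_rows = 1000;   // reservation made for the first (largest) batch
    const size_t n_outputs     = 8;      // rows actually produced by the current batch
    const size_t row_floats    = n_vocab + n_embd;

    std::vector<float> buf(reserved_rows * row_floats);

    // Behaviour described above: the full reservation is cleared on every decode.
    memset(buf.data(), 0, reserved_rows * row_floats * sizeof(float));

    // Idea of this PR: clear only the rows the current batch will actually use.
    memset(buf.data(), 0, n_outputs * row_floats * sizeof(float));

    printf("cleared %zu rows instead of %zu reserved rows\n", n_outputs, reserved_rows);
    return 0;
}
```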

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Apr 18, 2025
```cpp
void ggml_backend_set_size(ggml_backend_buffer_t buffer, size_t cur_size) {
    buffer->size = cur_size;
}
```
@ngxson
Collaborator

I think this will cause a memory leak rather than actually saving memory.

And worse, if cur_size is bigger than the actual capacity of the buffer, you now have a buffer overflow.
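To make the concern concrete, here is a hypothetical misuse sketch. `ggml_backend_set_size` is the function added by this diff; the other two calls are existing ggml API, and the exact clear behaviour is backend-specific, so treat this as an assumed illustration rather than a verified code path:

```cpp
// buft is assumed to be some valid ggml_backend_buffer_type_t.
ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, 4096); // real capacity: 4 KiB
ggml_backend_set_size(buf, 1u << 20);  // buffer->size now claims 1 MiB, but nothing was reallocated
ggml_backend_buffer_clear(buf, 0);     // clears buffer->size bytes, i.e. writes far past the 4 KiB allocation
```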

@pockers21
Contributor Author
Apr 18, 2025

> I think this will cause a memory leak rather than actually saving memory.
>
> And worse, if cur_size is bigger than the actual capacity of the buffer, you now have a buffer overflow.

1. I will carefully review for potential memory leaks.
2. If the buffer is unallocated or the new size exceeds the previous size, it should allocate a new buffer of the new size. With this code:
`if (!buf_output || prev_size < new_size) { buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size)); }`
it shouldn't cause out-of-bounds issues, correct?
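For concreteness, a sketch of the guarded path being proposed here, using the names from this thread (buf_output, buft, prev_size, new_size) together with the ggml_backend_set_size helper added by this PR; this is a sketch of the argument, not code verified against llama-context.cpp:

```cpp
if (!buf_output || prev_size < new_size) {
    // buffer missing or too small: allocate a fresh one of the required size
    buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size));
} else {
    // buffer already large enough: keep the allocation, only shrink the logical
    // size so that the subsequent clear touches fewer bytes
    ggml_backend_set_size(buf_output.get(), new_size);
}
```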

@slaren
Member

This is not a good change to ggml-backend. I don't know why the output buffer needs to be cleared; just remove the call to ggml_backend_buffer_clear.

@pockers21
Contributor Author

Hi @ngxson and @slaren ,

Thank you both for taking the time to review this PR and providing valuable feedback!

@slaren , your comment about the necessity of the ggml_backend_buffer_clear call itself is a great point. I've taken another look at the code in llama-context.cpp where this buffer is used.

It appears that the memory regions pointed to by logits and embd are indeed intended to be fully populated by the results of the subsequent graph computation (ggml_graph_compute). If that's the case, clearing them beforehand is redundant and introduces exactly the unnecessary overhead my initial PR description aimed to reduce.

Therefore, I tend to agree with your suggestion: removing the ggml_backend_buffer_clear(buf_output.get(), ...). This would simplify the code and directly address the performance concern without needing changes in ggml-backend or introducing the potential risks @ngxson rightly pointed out regarding dynamic size management.

My plan is to update this PR to simply remove that line. Does this approach sound reasonable to you, or are there any specific reasons why this buffer needs to be cleared that I might be missing?
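For reference, a rough sketch of what that update could look like, based on the names used earlier in this thread; the exact surrounding code in llama-context.cpp may differ:

```cpp
if (!buf_output || prev_size < new_size) {
    buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size));
}
// Previously the whole reservation was cleared here on every call:
//     ggml_backend_buffer_clear(buf_output.get(), 0);
// Since the graph computation fully populates the logits/embd regions that are
// read back, the clear would simply be removed rather than resized.
```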

Thanks again for your guidance!

rgerganov and others added 25 commits April 28, 2025 16:57
Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org

Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like ggml-org#12465
* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-defined ID (fixed)

* fix prompt_modified

* rm redundant data member
* SYCL: refactor move to a separate file

* Fix binbcast

* Remove duplicates

* fix include formatting

* fix typo
* server : use std::move whenever possible

* use r-value ref

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* make task creation scoped

* restore std::move

* fix task_id not set correctly

* apply changes from suggestion

Co-authored-by: ggerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This restores the behavior from ggml-org#491. This does not affect Ctrl+D's ability to
terminate --multiline-input lines (ggml-org#1040).

This also actually implements ggml-org#587: "If the user wants the text to end in a
newline, this should be accomplished by explicitly adding a newline by using
\ followed by return, then returning control by pressing return again."

Fixes ggml-org#12949
…ml-org#13011)

* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include
* convert : experimental support for `--mmproj` flag

* fix bad ctrl+f replace

* fix style

* split into subclasses TextModel and VisionModel

* rename Mode --> ModelBase

* small fix

* correct CLIP_VISION arch name (because existing GGUF already use it)

* Apply suggestions from code review

Co-authored-by: compilade <[email protected]>

* fix Mistral3Model

* fix typo

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: compilade <[email protected]>
…li` (ggml-org#13012)

* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
…g#12871)

* ggml : add SSE 4.2 variant for CPUs without AVX

* ggml : add x64 base ABI variant
* llava : update documentations

* fix typo
* metal : add memory pool for temp allocs (wip) [no ci]

* cont : free buffers from the heap

* cont : resize heap [no ci]

* cont : refactor heap [no ci]

* cont : heap for each cmd buffer [no ci]

* cont : fix free

* wip

* cont : fix alignment [no ci]

* cont : not working .. [no ci]

* cont : heap allocation now works [no ci]

* cont : use MTLHeapTypePlacement

ggml-ci

* metal : use dynamic MTLHeap allocations

ggml-ci

* metal : add comments

* metal : disable softmax use of mem_pool

ggml-ci

* metal : final touches
* security : add note about RPC functionality

* security : add note about llama-server
* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs
…13021)

* append mult-eos,half-rope,bos to GLM4-0414

* remove unset var
* add pixtral text model (vision is wip)

* cgraph ok, just missing 2D RoPE

* fix bad rebase

* first working version

* fix problem with img_break token

* support dynamic image size

* update docs

* update test script
* Sigint rework in mtmd vision example

* Applied suggestions on mtmd-cli PR

* Forgot to invert one of the conditions

* Update examples/llava/mtmd-cli.cpp

* Removed redundant exit check

---------

Co-authored-by: pl752 <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>
ngxson and others added 14 commits April 28, 2025 16:57
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
… conversion APIs (ggml-org#13107)

* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion

* move fp converter to ggml-cpu

* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32
* clip : improve projector naming

* no more kv has_llava_projector

* rm unused kv

* rm more unused
* common : add common_remote_get_content

* support max size and timeout

* add tests
* implement vision model architecture, gguf converter

* handle window attention inputs

* add debug utils

* fix few incorrect tensor memory layout

* move position id remap out of ggml to avoid int32 cuda operations

* cleaning up

* ignore transformers Qwen2_5_xxx type check

* remove rarely used `qwen2vl-cli` debug functions

* remove commented-out code blocks

* fix attn weight scaling after rebase

* add `PROJECTOR_TYPE_QWEN2_5_VL`

* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`

* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`

* remove `attn_window_size` from gguf

* fix model conversion

* clean up

* fix merging problem

* add test

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
* fix wrong template in GLM4-0414

* fix spaces

* no bos token since it is already in the template

* moved the chatgml4 check to higher priority

* restored template for old GLM models

* moved the GLM4 template check in the correct place with correct check
* Add --override-tensors option to llama-bench

* Correct llama-bench --override-tensors to --override-tensor

* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix.

* Make new llama-bench util functions static to fix Ubuntu CI

* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
@pockers21
Contributor Author

This PR's history became messy after rebasing onto the latest master/main. I've created a new, clean pull request at #13152 to replace this one. Closing this PR in favor of the new one. Thanks!

@pockers21 closed this on Apr 28, 2025
@github-actions bot added the documentation, script, testing, Nvidia GPU, Vulkan, examples, python, devops, server, SYCL, and Apple Metal labels on Apr 28, 2025
@pockers21 deleted the feature-dynamic-memset-range branch on October 15, 2025