make memset range dynamic #13002
Conversation
ggml/src/ggml-backend.cpp (Outdated)

```cpp
};

void ggml_backend_set_size(ggml_backend_buffer_t buffer, size_t cur_size) {
    buffer->size = cur_size;
}
```
I think this will cause a memory leak rather than actually saving memory.
And worse, if `cur_size` is bigger than the actual capacity of the buffer, you now have a buffer overflow.
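A minimal sketch of that concern, assuming a hypothetical `toy_buffer` with an explicit `capacity` field (the real `ggml_backend_buffer` tracks only `size`, which is exactly why the unchecked setter is risky):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical buffer that tracks both the reported size and the real
// allocation; ggml_backend_buffer itself has no capacity field.
struct toy_buffer {
    size_t size;     // bytes the rest of the code believes are usable
    size_t capacity; // bytes actually allocated
};

// Unchecked setter, mirroring the proposed patch: if cur_size exceeds
// capacity, any later memset/read over `size` bytes runs past the allocation.
void toy_set_size_unchecked(toy_buffer & buf, size_t cur_size) {
    buf.size = cur_size;
}

// Guarded variant: never report more space than was actually allocated.
void toy_set_size_checked(toy_buffer & buf, size_t cur_size) {
    assert(cur_size <= buf.capacity && "cur_size exceeds allocated capacity");
    buf.size = cur_size;
}
```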
1. I will carefully review for potential memory leaks.
2. If the buffer is unallocated or the new size exceeds the previous size, it allocates a new buffer of the new size. With this code:

```cpp
if (!buf_output || prev_size < new_size) {
    buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size));
}
```

it shouldn't cause out-of-bounds issues, correct?
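For reference, a self-contained sketch of that grow-only pattern; `ggml_backend_buft_alloc_buffer`, `ggml_backend_buffer_get_size`, and `ggml_backend_buffer_free` are real ggml APIs, while `buffer_ptr` and `ensure_output_buffer` are assumed wrappers, not llama.cpp's actual code:

```cpp
#include <memory>
#include "ggml-backend.h"

// Deleter so a unique_ptr can own a ggml backend buffer (assumption:
// llama.cpp wraps buffers with a similar smart pointer).
struct buffer_deleter {
    void operator()(ggml_backend_buffer_t buf) const { ggml_backend_buffer_free(buf); }
};
using buffer_ptr = std::unique_ptr<ggml_backend_buffer, buffer_deleter>;

// Reallocate only when growing; reuse the existing allocation otherwise.
void ensure_output_buffer(buffer_ptr & buf_output,
                          ggml_backend_buffer_type_t buft,
                          size_t new_size) {
    const size_t prev_size = buf_output ? ggml_backend_buffer_get_size(buf_output.get()) : 0;
    if (!buf_output || prev_size < new_size) {
        buf_output.reset(ggml_backend_buft_alloc_buffer(buft, new_size));
    }
    // Writes must still be bounded by new_size, not by the (possibly larger)
    // allocation, or the overflow the review warns about reappears.
}
```

The guard makes reallocation grow-only; the review's remaining concern is whether all later accesses stay bounded by `new_size` rather than by the larger allocation.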
This is not a good change to ggml-backend. I don't know why the output buffer needs to be cleared; just remove the call to `ggml_backend_buffer_clear`.
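In code terms the suggestion is a deletion, not a narrower range; a sketch under assumed context (`prepare_for_next_batch` and its call site are hypothetical, `ggml_backend_buffer_clear` is the real ggml API):

```cpp
#include "ggml-backend.h"

// Hypothetical reuse path for the output buffer between decode calls.
void prepare_for_next_batch(ggml_backend_buffer_t buf_output) {
    // Previously the whole buffer (or, with this PR, a sub-range) was zeroed:
    //   ggml_backend_buffer_clear(buf_output, 0);
    // The review suggests dropping the clear entirely: the regions read after
    // decoding are fully overwritten first, so zeroing them is redundant.
    (void) buf_output; // nothing left to do here
}
```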
Thank you both for taking the time to review this PR and for providing valuable feedback!

@slaren, your comment about the necessity of the `ggml_backend_buffer_clear` call made me take a closer look. It appears that the memory regions consumed from the output buffer are fully overwritten during inference before they are read. Therefore, I tend to agree with your suggestion: removing the `ggml_backend_buffer_clear` call altogether.

My plan is to update this PR to simply remove that line. Does this approach sound reasonable to you, or are there specific reasons why this buffer needs to be cleared that I might be missing? Thanks again for your guidance!
Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org. Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like ggml-org#12465.
* mtmd : add more API around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-defined ID (fixed)
* fix prompt_modified
* rm redundant data member

* SYCL: refactor move to a separate file
* Fix binbcast
* Remove duplicates
* fix include formatting
* fix typo

* server : use std::move whenever possible
* use r-value ref
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <[email protected]>
* make task creation scoped
* restore std::move
* fix task_id not set correctly
* apply changes from suggestion
Co-authored-by: ggerganov <[email protected]>
---------
Co-authored-by: Georgi Gerganov <[email protected]>
This restores the behavior from ggml-org#491. This does not affect Ctrl+D's ability to terminate --multiline-input lines (ggml-org#1040). This also actually implements ggml-org#587: "If the user wants the text to end in a newline, this should be accomplished by explicitly adding a newline by using \ followed by return, then returning control by pressing return again." Fixes ggml-org#12949
…ml-org#13011)
* clip : refactor, add `image_manipulation` and `llava_uhd`
* refactor llava-1.6 preprocessing
* simplify logic for llava-1.5
* missing include

* convert : experimental support for `--mmproj` flag
* fix bad ctrl+f replace
* fix style
* split into subclasses TextModel and VisionModel
* rename Mode --> ModelBase
* small fix
* correct CLIP_VISION arch name (because existing GGUF already use it)
* Apply suggestions from code review
Co-authored-by: compilade <[email protected]>
* fix Mistral3Model
* fix typo
Co-authored-by: compilade <[email protected]>
---------
Co-authored-by: compilade <[email protected]>

…li` (ggml-org#13012)
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`
* support for minicpmv
* remove cpp files of llava and minicpmv
* update hot topics
* mtmd : add not supported msg for qwen2vl
* Update examples/llava/mtmd.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
---------
Co-authored-by: Georgi Gerganov <[email protected]>

…g#12871)
* ggml : add SSE 4.2 variant for CPUs without AVX
* ggml : add x64 base ABI variant
* llava : update documentation
* fix typo

* metal : add memory pool for temp allocs (wip) [no ci]
* cont : free buffers from the heap
* cont : resize heap [no ci]
* cont : refactor heap [no ci]
* cont : heap for each cmd buffer [no ci]
* cont : fix free
* wip
* cont : fix alignment [no ci]
* cont : not working .. [no ci]
* cont : heap allocation now works [no ci]
* cont : use MTLHeapTypePlacement
ggml-ci
* metal : use dynamic MTLHeap allocations
ggml-ci
* metal : add comments
* metal : disable softmax use of mem_pool
ggml-ci
* metal : final touches

* security : add note about RPC functionality
* security : add note about llama-server

* mtmd : support SmolVLM (version 1 and 2)
* correct chat template
* fix n_patches
* scale_factor is an int
* add more models to test

* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
* fix logic for RoPE support, CUDA graphs
…13021)
* append mult-eos,half-rope,bos to GLM4-0414
* remove unset var

* add pixtral text model (vision is wip)
* cgraph ok, just missing 2D RoPE
* fix bad rebase
* first working version
* fix problem with img_break token
* support dynamic image size
* update docs
* update test script

* Sigint rework in mtmd vision example
* Applied suggestions on mtmd-cli PR
* Forgot to invert one of the conditions
* Update examples/llava/mtmd-cli.cpp
* Removed redundant exit check
---------
Co-authored-by: pl752 <[email protected]>
Co-authored-by: Xuan-Son Nguyen <[email protected]>

* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO

* Force FP32 compute in cuBLAS GEMM
* Revert "Force FP32 compute in cuBLAS GEMM"
This reverts commit 6efd872.
* Force F32 compute in GLM4 ffn down
* Edit comment to clarify issue
Co-authored-by: Johannes Gäßler <[email protected]>
---------
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Richard Lyons <[email protected]>
… conversion APIs (ggml-org#13107)
* ggml: dynamic x86_64 feature detection for FP32 <-> FP16/BF16 conversion
* move fp converter to ggml-cpu
* Switch ggml_compute_forward_get_rows_f16/bf16 to new ggml_cpu_fp16/bf16_to_fp32

* clip : improve projector naming
* no more kv has_llava_projector
* rm unused kv
* rm more unused

* common : add common_remote_get_content
* support max size and timeout
* add tests

* implement vision model architecture, GGUF converter
* handle window attention inputs
* add debug utils
* fix a few incorrect tensor memory layouts
* move position id remap out of ggml to avoid int32 cuda operations
* cleaning up
* ignore transformers Qwen2_5_xxx type check
* remove rarely used `qwen2vl-cli` debug functions
* remove commented-out code blocks
* fix attn weight scaling after rebase
* add `PROJECTOR_TYPE_QWEN2_5_VL`
* remove `KEY_USE_GLU_MLP`, `KEY_USE_RMS_NORM`
* replace `KEY_FULLATTN_BLK_IDX` with `KEY_WIN_ATTN_PATTERN`
* remove `attn_window_size` from gguf
* fix model conversion
* clean up
* fix merging problem
* add test
---------
Co-authored-by: Xuan Son Nguyen <[email protected]>
…qwen2vl_merger migration was incomplete (ggml-org#13133)
Signed-off-by: Xiaodong Ye <[email protected]>
* fix wrong template in GLM4-0414
* fix spaces
* no bos token since it is already in the template
* moved the chatgml4 check to higher priority
* restored template for old GLM models
* moved the GLM4 template check to the correct place with the correct check

* Add --override-tensors option to llama-bench
* Correct llama-bench --override-tensors to --override-tensor
* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix
* Make new llama-bench util functions static to fix Ubuntu CI
* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
This PR's history became messy after rebasing onto the latest master/main. I've created a new, clean pull request at #13152 to replace this one. Closing this PR in favor of the new one. Thanks!
When the model processes multiple batches during inference, if the first batch has a length of 1000, the output buffer size is calculated as (1000 × vocab_size / embd_size). After allocating the output buffer, the system performs a memset over the full range. However, when a subsequent batch (e.g., the second one) is shorter than 1000, the memset still clears the entire 1000-length range, resulting in significant performance overhead.

This PR introduces dynamic length adjustment, updating the memset parameters based on the actual batch length and thereby eliminating the unnecessary work.
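A minimal sketch of the idea, under assumptions: `out_buf`, `capacity`, `n_tokens`, and `bytes_per_token` are hypothetical names standing in for the PR's actual variables, not its real code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Clear only the range the current batch will occupy, instead of the whole
// allocation, which was sized for the largest batch seen so far.
void clear_output_range(uint8_t * out_buf, size_t capacity,
                        size_t n_tokens, size_t bytes_per_token) {
    const size_t used = n_tokens * bytes_per_token;
    if (used > capacity) {
        return; // caller must grow the buffer first (see the realloc guard above)
    }
    // before: memset(out_buf, 0, capacity); // pays for the largest batch
    // after:  only the bytes this batch will actually produce are cleared
    memset(out_buf, 0, used);
}
```

Whether any clear is needed at all is the open question from the review above: if every byte read downstream is first written by the current decode, even the narrowed memset can be removed, which is the direction the review suggested.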