forked from ggml-org/llama.cpp
Cherry pick 12447 #9
Merged
          Conversation
  
    
metal: use dequantize_q templates

Co-authored-by: Georgi Gerganov <[email protected]>
It's useful to be able to get this from the library layer, as it's a key parameter of the model (e.g. to figure out how much KV cache memory is needed).
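As a rough back-of-the-envelope illustration of that use case (not the library's own accounting; every hyperparameter value below is made up), the KV cache footprint follows directly from the model parameters:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Placeholder hyperparameters -- in practice these would come from the
    // model metadata exposed by the library layer.
    const int64_t n_layer   = 32;   // transformer layers
    const int64_t n_head_kv = 8;    // KV heads (after GQA)
    const int64_t head_dim  = 128;  // dimension per head
    const int64_t n_ctx     = 8192; // context length to reserve
    const int64_t elt_size  = 2;    // bytes per element for an f16 K/V cache

    // K and V each hold n_layer * n_ctx * (n_head_kv * head_dim) elements.
    const int64_t kv_bytes = 2 * n_layer * n_ctx * n_head_kv * head_dim * elt_size;
    printf("estimated KV cache: %.2f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```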
* Add example docs for granite vision

Signed-off-by: Alex-Brooks <[email protected]>
* vulkan: implement GGML_OP_ROPE_BACK
* vulkan: implement GGML_OP_RMS_NORM_BACK
* vulkan: implement GGML_OP_SILU_BACK
* vulkan: implement GGML_OP_SOFTMAX_BACK
* ggml-cpu: Fix build with sve
* ggml-cpu: Remove unused variable in sve q3_k vec dot

Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Judd <[email protected]>
Looks like a copy/paste bug from qx_needs_dequant.
Signed-off-by: kerthcet <[email protected]>
Currently self.byte_order is never used. Actually use it to byteswap read data, to allow reading big-endian files on little-endian systems and vice versa. Now it's possible to convert a little-endian model into a big-endian model and back on a little-endian system.
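A minimal, generic sketch of the rule this commit applies (not the gguf-py code itself, which does the equivalent with numpy): if the file's byte order differs from the host's, swap each scalar after reading.

```cpp
#include <cstdint>
#include <cstdio>

// Swap the byte order of a 32-bit value.
static uint32_t byteswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

int main() {
    const bool file_is_big_endian = true;   // would be read from the file header
    const bool host_is_big_endian = false;  // would be detected at runtime

    const uint32_t raw = 0x00000100u; // value as read, without any swapping
    const uint32_t val = (file_is_big_endian != host_is_big_endian) ? byteswap32(raw) : raw;
    printf("value = %u\n", val); // 65536 when a swap was needed
    return 0;
}
```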
* Refactor gguf scripts to improve metadata handling
  Added contents method to ReaderField class
  Added endianess property to GGUFReader class
* update scripts
* fix import
* remove unused import
* attempt to work around flake and pyright errors
* second attempt
* give up, ignore type
* bump version
* apply newbyteorder fixes
* add struct for FFI bindgen
* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Fix dependencies between ggml and backends
  ggml backends link only to ggml-base and ggml links to all backends.
* Fix installation of ggml backends
  Set up GNUInstallDirs before setting the installation directory of ggml backends.
* vulkan: improve im2col performance
* faster dequant for old quants
* don't use unpack for iq4_nl
* vec2 unpack for q8
Remove unused header file that causes compilation failure on ARM platform with GCC 13.
…rg#12064)
* Added SVE Support for Q2_K Quantized Models
* Use 4-space indentation in the switch cases
* removed comments lines
* Remove the loop
  Retain the curly braces for better understanding of the code
* Remove the comment line added for the q3_k_q8_k kernel

Co-authored-by: vithulep <[email protected]>
…ns (ggml-org#11595)
* vulkan: implement specialized MMV kernels for IQ2 quantizations
* vulkan: add MMV kernels for IQ3 quants
* vulkan: Increase MMV batch size and unroll IQ LUT setup
* vulkan: fix init_iq_shmem for WG sizes larger than tables
* vulkan: common batch size for all I-quants
Signed-off-by: Alex-Brooks <[email protected]>
…2108)
* Added Phi-4-mini-instruct support
* Update regex per ngxson
* Change the vocab base to Xenova/gpt-4o
* fix conversion update script
* no need to check longrope
* minor style fix
* fix python style

Co-authored-by: Nicholas Sparks <[email protected]>
* Upgrade init_tensor API to return a ggml_status
  To prepare for an 'abort-free' ggml (ggml not aborting on OOMs but returning an OOM status), as agreed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status.
* misc fixes

Co-authored-by: slaren <[email protected]>
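From the caller's side, the new contract looks roughly like the sketch below. `my_backend_init_tensor` is a hypothetical stand-in for code that forwards the status of the upgraded hooks; only the `ggml_status` values come from `ggml.h`.

```cpp
#include "ggml.h"
#include <cstdio>

// Hypothetical wrapper standing in for a call into the upgraded
// init_tensor()/view_init() path; it forwards the backend's status.
static ggml_status my_backend_init_tensor(void) {
    // ... real code would call the backend hook here ...
    return GGML_STATUS_SUCCESS;
}

int main() {
    const ggml_status st = my_backend_init_tensor();
    if (st != GGML_STATUS_SUCCESS) {
        // With the abort-free direction, an OOM surfaces as a status the
        // caller can handle instead of terminating the process.
        if (st == GGML_STATUS_ALLOC_FAILED) {
            fprintf(stderr, "tensor init failed: allocation failed (OOM)\n");
        } else {
            fprintf(stderr, "tensor init failed: status %d\n", (int) st);
        }
        return 1;
    }
    return 0;
}
```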
* convert : fix Norway problem when parsing YAML (unquoted values such as NO being parsed as the boolean false)
* Update gguf-py/gguf/metadata.py
* add newline at correct place
* fix typos and improve menu text clarity
* rename variable trimedValue to trimmedValue
* add updated index.html.gz
* rebuild

Co-authored-by: Xuan Son Nguyen <[email protected]>
CUDA 12.8 added the option to specify stronger compression for binaries, so we now default to "size".
…versation mode (ggml-org#12131)
* Add --system-prompt parameter
* use user defined system prompt
* clarify
* add warning
* clarify

Co-authored-by: Xuan-Son Nguyen <[email protected]>
…) (ggml-org#12132)
* Update outdated message
* wording

Co-authored-by: Xuan-Son Nguyen <[email protected]>
* Use jinja chat template system prompt by default
* faster conditional order
* remove nested ternary

Co-authored-by: Xuan Son Nguyen <[email protected]>
* vulkan: subgroup size test
* Vulkan: Add device architecture enum and logic to recognize AMD generations
* vulkan: use new architecture logic to specify subgroup size
* Initial vulkan subgroup size tuning for RDNA3
* vulkan: commonize RDNA subgroup tuning
* vulkan: override subgroup size if required_subgroup_size = 0
* vulkan: disable warp 32 for RDNA3
* vulkan: fine tuned RDNA1 subgroup sizes
* vulkan: adjusted subgroup size map
* vulkan: fixed RDNA2 subgroup map

Co-authored-by: 0cc4m <[email protected]>
It's already found by FindVulkan.cmake in the parent CMakeLists
* Enable CUDA Graph on CTK < 12.x
  The cudaGraphExecUpdate API was changed in 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support with CTK < 12.x by using the older API when CTK < 12.x.
* Fix compilation errors with MUSA
* Disable CUDA Graph for MUSA
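The signature change being worked around, as a version-gated sketch (the pattern only; the handles are placeholders):

```cpp
#include <cuda_runtime.h>

// cudaGraphExecUpdate() changed its signature in CUDA 12.x, so builds against
// older toolkits need the pre-12 form of the call.
static cudaError_t update_graph_exec(cudaGraphExec_t graph_exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
    cudaGraphExecUpdateResultInfo result_info;
    return cudaGraphExecUpdate(graph_exec, graph, &result_info);
#else
    cudaGraphNode_t           error_node = nullptr;
    cudaGraphExecUpdateResult update_result;
    return cudaGraphExecUpdate(graph_exec, graph, &error_node, &update_result);
#endif
}
```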
* ggml: Add op l2_norm
* ggml: Add op rwkv_wkv7
* llama: Add support for RWKV7 and ARWKV7 models
* llama: fix inference with RWKV6Qwen2
* llama: add more (a)rwkv7 variants in size
* Apply code-format changes
* fix MUSA build
* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <[email protected]>
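For reference, the new `l2_norm` op is plain L2 normalization; a standalone sketch of the math (the epsilon handling here is an assumption, not ggml's exact choice):

```cpp
#include <cmath>
#include <cstdio>

// L2-normalize a vector in place: x[i] <- x[i] / sqrt(sum_j x[j]^2 + eps).
static void l2_norm(float * x, int n, float eps = 1e-12f) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += x[i] * x[i];
    const float scale = 1.0f / std::sqrt(sum + eps);
    for (int i = 0; i < n; ++i) x[i] *= scale;
}

int main() {
    float v[4] = {3.0f, 0.0f, 4.0f, 0.0f};
    l2_norm(v, 4);
    printf("%.3f %.3f %.3f %.3f\n", v[0], v[1], v[2], v[3]); // 0.600 0.000 0.800 0.000
    return 0;
}
```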
…ion and driver issues (ggml-org#12434)
…g#12447)
* context : always use non-causal attention for encoder graphs
  ggml-ci
* context : move the change to llama_context::encode()
  ggml-ci
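To make the cherry-picked change concrete: an encoder attends bidirectionally, so its graph gets a mask with no causal cut-off, while a decoder masks out future positions. An illustrative mask builder using the additive (-inf) convention, not the llama.cpp code path itself:

```cpp
#include <cstdio>
#include <limits>
#include <vector>

// Build an additive attention mask: 0 where attention is allowed, -inf where
// it is blocked. Non-causal (encoder) masks allow every pair of positions;
// causal (decoder) masks block attention to future positions.
static std::vector<float> build_mask(int n_tokens, bool causal) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    std::vector<float> mask(n_tokens * n_tokens, 0.0f);
    if (causal) {
        for (int i = 0; i < n_tokens; ++i) {
            for (int j = i + 1; j < n_tokens; ++j) {
                mask[i * n_tokens + j] = neg_inf; // query i cannot see token j > i
            }
        }
    }
    return mask;
}

int main() {
    const auto enc = build_mask(4, /*causal=*/false); // all zeros: full bidirectional attention
    const auto dec = build_mask(4, /*causal=*/true);  // upper triangle blocked
    printf("encoder mask[0][3] = %f, decoder mask[0][3] = %f\n", enc[3], dec[3]);
    return 0;
}
```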
  
Labels: android, Apple Metal, build, devops, documentation, examples, ggml, Nvidia GPU, python, script, server, SYCL, testing, Vulkan
add: rm other backend CI.