sampling: Integrate Top-nσ into main sampling chain (and add it to the server) #13264
          
     Merged
      
      
    Conversation
  
    
CISC reviewed May 5, 2025

CISC approved these changes May 5, 2025
      
        
      
      
  
    
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request on May 6, 2025
* origin/master: (27 commits)
  llama : fix build_ffn without gate (ggml-org#13336)
  CUDA: fix bad asserts for partial offload (ggml-org#13337)
  convert : qwen2/3moe : set yarn metadata if present (ggml-org#13331)
  CUDA: fix --split-mode row for MMQ (ggml-org#13323)
  gguf-py : avoid requiring pyside6 for other scripts (ggml-org#13036)
  CUDA: fix logic for clearing padding with -ngl 0 (ggml-org#13320)
  sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (ggml-org#13264)
  server : Webui - change setText command from parent window to also send the message. (ggml-org#13309)
  mtmd : rename llava directory to mtmd (ggml-org#13311)
  clip : fix confused naming ffn_up and ffn_down (ggml-org#13290)
  convert : bailingmoe : set yarn metadata if present (ggml-org#13312)
  SYCL: Disable mul_mat kernels for noncontiguous tensor b (ggml-org#13308)
  mtmd : add C public API (ggml-org#13184)
  rpc : use backend registry, support dl backends (ggml-org#13304)
  ggml : activate s390x simd for Q3_K (ggml-org#13301)
  llava/mtmd : fixes to fully support dl backends (ggml-org#13303)
  llama : build windows releases with dl backends (ggml-org#13220)
  CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)
  CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)
  vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
  ...
Please update documentation (server README).
    
      
        
      
      
  
  
  
      
    
Top-nσ support was added in #11223, where it was implemented as a special case that ignored samplers other than `top_k` and `temperature` when `top_n_sigma` was present. Following #11896 (comment), this PR integrates the sampler into the main sampling chain. This removes the special-case handling and makes it possible to combine `top_n_sigma` with other sampling methods like `min_p`. I used #11896 as a starting point, so this PR also makes `top_n_sigma` available in llama-server.
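
For illustration, here is a minimal sketch of what chaining `top_n_sigma` with other samplers looks like through llama.cpp's public sampler-chain API; the parameter values and the exact ordering below are illustrative assumptions, not the project's defaults:

```cpp
#include "llama.h"

// Sketch: a sampler chain that combines top-n-sigma with min-p and
// temperature, instead of treating top-n-sigma as a standalone special case.
struct llama_sampler * make_chain(void) {
    struct llama_sampler * chain =
        llama_sampler_chain_init(llama_sampler_chain_default_params());

    // illustrative values, not defaults
    llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(1.0f));        // top_n_sigma = 1.0
    llama_sampler_chain_add(chain, llama_sampler_init_min_p(0.05f, 1));          // min_p, keep >= 1 token
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));               // temperature
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED)); // final draw

    return chain;
}
```

A token would then be drawn with `llama_sampler_sample(chain, ctx, -1)` and the chain released with `llama_sampler_free(chain)`.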
Verification

I have tested it with llama-server and it seems to work. Below are the top probabilities after `My name is` with `top_n_sigma=1` (left) and `top_n_sigma=5` (right).
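
As context for why the two settings differ: top-nσ keeps only tokens whose logit lies within `n` standard deviations of the maximum logit, so a larger `top_n_sigma` lets more candidates through. A rough standalone sketch of that rule, assuming the standard deviation is taken over the full logit vector (this is not the actual code behind `llama_sampler_init_top_n_sigma`):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Mask out every logit more than n standard deviations below the maximum;
// only the surviving tokens reach softmax and the final draw.
static void top_n_sigma_filter(std::vector<float> & logits, float n) {
    if (logits.empty()) {
        return;
    }

    float max_l = logits[0];
    float mean  = 0.0f;
    for (float l : logits) {
        max_l = std::max(max_l, l);
        mean += l;
    }
    mean /= (float) logits.size();

    float var = 0.0f;
    for (float l : logits) {
        var += (l - mean) * (l - mean);
    }
    const float sigma = std::sqrt(var / (float) logits.size());

    // larger n -> lower threshold -> more tokens survive
    const float threshold = max_l - n * sigma;
    for (float & l : logits) {
        if (l < threshold) {
            l = -INFINITY;
        }
    }
}
```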