Conversation

ngxson (Collaborator) commented Jan 20, 2025

ngxson requested a review from ggerganov January 20, 2025 13:14
github-actions bot added the python (python script changes) label Jan 20, 2025
ngxson merged commit ec7f3ac into ggml-org:master Jan 20, 2025
48 checks passed
ngxson (Collaborator, Author) commented Jan 20, 2025

cc @bartowski1182, you can now make GGUF quants :D
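
For anyone who wants to make their own quants, a rough sketch of the usual workflow with llama.cpp's bundled tools (the model directory name and the build path below are placeholders, not taken from this PR):

```sh
# Convert the Hugging Face checkpoint to GGUF (F16 as an intermediate).
python convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Qwen-14B \
    --outfile DeepSeek-R1-Distill-Qwen-14B-F16.gguf --outtype f16

# Quantize the F16 GGUF down to a smaller type such as Q4_K_M.
./build/bin/llama-quantize DeepSeek-R1-Distill-Qwen-14B-F16.gguf \
    DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf Q4_K_M
```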

prusnak (Contributor) commented Jan 20, 2025

Are similar changes needed to support DeepSeek-R1-Distill-Llama-*, or is no change needed?

ngxson (Collaborator, Author) commented Jan 20, 2025

@prusnak I don't have time to try, but there are already many GGUFs for that model on the HF hub. Can you try?

prusnak (Contributor) commented Jan 20, 2025

> @prusnak I don't have time to try, but there are already many GGUFs for that model on the HF hub. Can you try?

I just tried DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf from https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF and it works on current master. 👍
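
For reference, one way to reproduce that quick check (repo and file names taken from the comment above; huggingface-cli is assumed to be installed via huggingface_hub):

```sh
# Download the quantized model from the Hugging Face repo linked above.
huggingface-cli download unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF \
    DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf --local-dir .

# Run a short generation with llama-cli to confirm the model loads and responds.
./build/bin/llama-cli -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
    -p "Hello, who are you?" -n 64
```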

Animaxx added a commit to Animaxx/llama.cpp that referenced this pull request Jan 20, 2025
wakamex commented Jan 20, 2025

my llama-server hangs with @bartowski1182's distill, while llama-cli works fine
Edit: false alarm, I had a broken build

./build/bin/llama-server -m DeepSeek-R1-Distill-Qwen-14B-Q6_K_L.gguf --port 8083 -v
...
srv  add_waiting_: add task 2 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 2/1, front = 0
que    start_loop: processing task, id = 2
slot get_availabl: id  0 | task 0 | selected slot by lru, t_last = 338721759369
slot        reset: id  0 | task 0 | 
slot launch_slot_: id  0 | task 2 | launching slot : {"id":0,"id_task":2,"n_ctx":4096,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":4096,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"t_max_predict_ms":-1,"n_indent":0,"response_fields":[],"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":5,"speculative.p_min":0.8999999761581421,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<|begin▁of▁sentence|>You are a helpful assistant.\n\n<|User|>hi<|Assistant|>","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
slot launch_slot_: id  0 | task 2 | processing task
srv  cancel_tasks: cancel task, id_task = 2
srv  remove_waiti: remove task 2 from waiting list. current waiting = 1 (before remove)
que          post: new task, id = 3/1, front = 1
request: POST /v1/chat/completions 127.0.0.1 200
request:  {"messages":[{"role":"system","content":"You are a helpful assistant."},{"id":1737405941029,"role":"user","content":"hi"}],"stream":true,"cache_prompt":true,"samplers":"edkypmxt","temperature":0.8,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"typical_p":1,"xtc_probability":0,"xtc_threshold":0.1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"max_tokens":-1,"timings_per_token":false}
response: 
srv  remove_waiti: remove task 2 from waiting list. current waiting = 0 (before remove)
que    start_loop: processing task, id = 3
slot      release: id  0 | task 2 | stop processing: n_past = 0, truncated = 0
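
The request shown in the log is an OpenAI-compatible chat completion call; a curl sketch like the one below reproduces it against the server started above, assuming the default host and the port 8083 from that command:

```sh
# Minimal chat completion request against llama-server's OpenAI-compatible endpoint.
curl http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "hi"}
        ],
        "stream": false
      }'
```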

bartowski1182 (Contributor) commented
I saw a similar (though reversed) issue with lmstudio, where the model sends one response and then crashes in the chat, but the server works fine 🤔

anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025

* llama : add support for Deepseek-R1-Qwen distill model
* coding style

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025

* llama : add support for Deepseek-R1-Qwen distill model
* coding style

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025

* llama : add support for Deepseek-R1-Qwen distill model
* coding style

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025

* llama : add support for Deepseek-R1-Qwen distill model
* coding style