Replies: 1 comment
Could be related to #7969.
Hello all:
I wanted to upgrade my Mistral-7B-Instruct-v0.2 to v0.3. So I:
- Upgraded to llama.cpp b3263 (I'm on an A10G GPU in AWS, RHEL 9, driver 550.54.15, CUDA 12.4)
- Built it with CUDA support, no errors
- Untarred the Mistral-7B-Instruct-v0.3 model
- Converted it to an F16 GGUF
- Quantized that down to Q6_K
No errors or warnings anywhere. So far so good (the exact commands are sketched below).
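For reference, the build, conversion, and quantization commands looked roughly like this; flag spellings are from memory, and both the CUDA define and the converter's filename changed around this tag:

```
# Build llama.cpp b3263 with CUDA. Around this tag the CMake define changed;
# slightly older checkouts want -DLLAMA_CUDA=ON instead of -DGGML_CUDA=ON.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# HF -> F16 GGUF; the script is convert-hf-to-gguf.py on slightly
# older trees, convert_hf_to_gguf.py on newer ones.
python convert_hf_to_gguf.py ./Mistral-7B-Instruct-v0.3 \
  --outtype f16 --outfile Mistral-7B-Instruct-v0.3_F16.gguf

# F16 -> Q6_K
./build/bin/llama-quantize Mistral-7B-Instruct-v0.3_F16.gguf \
  Mistral-7B-Instruct-v0.3_Q6_K.gguf Q6_K
```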
Ran it with:
% llama-server -n 2000 -ngl 33 -m Mistral-7B-Instruct-v0.3_Q6_K.gguf
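For anyone reproducing: the client side is nothing exotic, just the OpenAI-compatible chat endpoint on the server's default port (8080 assumed here; the prompt is only an example):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user",   "content": "Say hello in one sentence."}
    ]
  }'
```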
And when I connect my client using the OpenAI API, I get lots of repetition: the model repeats the system prompt and its own response several times with subtle variations, and sometimes it seems to get into a loop and never breaks out. --repeat-penalty N seems to have no observable effect.
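To take my client's sampling settings out of the picture, the server's native /completion endpoint accepts llama.cpp sampler fields directly; hitting it with a hand-built prompt and a deliberately heavy penalty should make any penalty effect obvious (values below are just probes, not recommendations):

```
# A hand-built [INST] prompt also bypasses the server-side chat template.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "[INST] Say hello in one sentence. [/INST]",
    "n_predict": 128,
    "repeat_penalty": 1.3,
    "repeat_last_n": 256
  }'
```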
b3263 runs the older Mistral-7B-Instruct-v0.2_Q6_K.gguf seemingly fine, so it appears to be something funny with the new model, but I'm at a loss to narrow it down. Maybe it's the new tokenizer. Maybe v0.3 Instruct doesn't like the chat template llama-server applies on the OpenAI endpoint. Maybe the HF-to-GGUF converter or the quantizer scrambled the model.
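One way to poke at the template theory is to override whatever template the GGUF metadata carries with a built-in one; I believe the llama2 alias matches Mistral's [INST] ... [/INST] convention, though I'm not certain b3263 ships a dedicated mistral alias:

```
# Force a built-in chat template instead of the one embedded in the GGUF.
llama-server -n 2000 -ngl 33 --chat-template llama2 \
  -m Mistral-7B-Instruct-v0.3_Q6_K.gguf
```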
Suggestions, Questions, Commiseration?