How to use serve-model with llama.cpp and cuda? #325

ebudmada · 2024-02-17T19:03:34Z

ebudmada
Feb 17, 2024

Hello, when trying to use the model in serve-model, the model always gets reloaded in cpu instead using the model loaded in gpu.

i am able to load my model in local using all 3 gpu with code:
query_string = '''
"Q: In one word, What is the captital of {country}? \n"
"A: [CAPITAL] \n"
"Q: What is the main sight in {CAPITAL}? \n"
"A: [ANSWER]" where (len(TOKENS(CAPITAL)) < 10)
and (len(TOKENS(ANSWER)) < 200) and STOPS_AT(CAPITAL, '\n')
and STOPS_AT(ANSWER, '\n')
'''
print(lmql.run_sync(query_string,
country="united kingdom",
model = lmql.model("local:llama.cpp:/home/ebudmada/llama.cpp/phind-codellama-34b-v2.Q8_0.gguf",
cuda=True,
n_ctx=512,
n_gpu_layers=-1,
tokenizer = 'Phind/Phind-CodeLlama-34B-v2')).variables)

but i am unable to use the serve-model instance: loading using terminal: lmql serve-model llama.cpp:/home/ebudmada/llama.cpp/phind-codellama-34b-v2.Q8_0.gguf --cuda --n_ctx=512 n_gpu_layer=-1 --trust_remote_code True
[Serving LMTP endpoint on ws://localhost:8080/]
I see the gpu ram being used a little (276mb each 3x)
but when i try to use the model with my python script, it is always loading a new model in cpu. : i try a lot of things, but essentially using:
print(lmql.run_sync(query_string,
country="united kingdom",
model = lmql.model("llama.cpp:/home/ebudmada/llama.cpp/phind-codellama-34b-v2.Q8_0.gguf",
cuda=True,
n_ctx=512,
n_gpu_layers=-1,
endpoint="localhost:8080",
tokenizer = 'Phind/Phind-CodeLlama-34B-v2')).variables)

can anyone help me? Thank you

sashokbg · 2024-03-22T09:05:47Z

sashokbg
Mar 22, 2024

This sounds like a llama.cpp problem, not lmql related.

Make sure you install llama cpp with the following command:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install ninja llama-cpp-python --force-reinstall --no-cache-dir

0 replies

ebudmada · 2024-03-23T16:37:08Z

ebudmada
Mar 23, 2024
Author

Yes turn out it is a llama-cpp-python problem. Because it is working with llama.cpp. Maybe they will fix in future release, they are always far behind llama.cpp

1 reply

sashokbg Mar 24, 2024

Hello @ebudmada I am actually running LMQL with llama cpp python on my NVIDIA locally with no issues.

I run my server like so:

lmql serve-model llama.cpp:/home/alexander/Games2/lmql/models/mistral-7b-instruct-v0.2.Q5_K_S.gguf \
  --cuda \
  --port 9999 \
  --n_ctx 4096 \
  --n_gpu_layers 35

and I installed llama-cpp-python using the command I sent in my previous post

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How to use serve-model with llama.cpp and cuda? #325

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

How to use serve-model with llama.cpp and cuda? #325

Uh oh!

ebudmada Feb 17, 2024

Replies: 2 comments · 1 reply

Uh oh!

sashokbg Mar 22, 2024

Uh oh!

ebudmada Mar 23, 2024 Author

Uh oh!

sashokbg Mar 24, 2024

ebudmada
Feb 17, 2024

Replies: 2 comments 1 reply

sashokbg
Mar 22, 2024

ebudmada
Mar 23, 2024
Author