Hi.
I've noticed a big degradation in inference speed when using the web server vs. "just using the library".
Whenever I run my script, I get around 80 tokens/sec. When using the server, I get around 40 tokens/sec for the exact same prompt, samplers, and LLM config.
OS: Linux
GPU: Nvidia RTX 3090
Library versions: I see this on versions 0.2.75 through the latest. I installed with CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_FORCE_MMQ=on" FORCE_CMAKE=1 pip install llama-cpp-python[server]==<version>, making sure pip's cache was not used. I also tried the prebuilt wheels.
I am using the same prompt every time, at temperature 0, and making repeated identical calls (in case the problem was with the cache).
Request:
POST http://127.0.0.1:8080/v1/completions
{
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\r\n\r\nYou are a helpful AI assistant that rewrites every input as if it were written bu a pirate. You a lenghtly and thorough, rewriting every bit of the text.<|eot_id|><|start_header_id|>user<|end_header_id|>\r\n\r\n<<<---MY VERY BIG TEXT--->>><|eot_id|><|start_header_id|>assistant<|end_header_id|>\r\n\r\n",
    "max_tokens": 2048,
    "temperature": 0,
    "seed": 123,
    "stream": false
}
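To get a server-side number that is comparable with the script, the request can be timed end to end and divided into the token count the server reports. This is only a sketch: it assumes the server is listening at the address shown above and that the response carries an OpenAI-style "usage" object with "completion_tokens". The function names and the placeholder prompt are mine, not part of the library.

```python
import json
import time
import urllib.request

# Assumption: the server from the request above is listening here.
URL = "http://127.0.0.1:8080/v1/completions"


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time for the whole request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def time_request(payload: dict, url: str = URL) -> float:
    """POST one completion request and return end-to-end tokens/sec."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    elapsed = time.perf_counter() - start
    # Assumption: token counts are reported under "usage" in the response.
    return tokens_per_second(result["usage"]["completion_tokens"], elapsed)


def run_server_benchmark(prompt: str, repeats: int = 3) -> None:
    """Send the same request several times and print each rate."""
    payload = {
        "prompt": prompt,  # substitute the full prompt from the request
        "max_tokens": 2048,
        "temperature": 0,
        "seed": 123,
        "stream": False,
    }
    for i in range(repeats):
        print(f"call {i}: {time_request(payload):.1f} tokens/sec")
```

Calling run_server_benchmark("...") with the real prompt prints one rate per request. Note that this wall-clock figure includes prompt processing and HTTP overhead, so the script should be timed the same way before comparing the two.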
My script:
import llama_cpp
PROMPT = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant that rewrites every input as if it were written bu a pirate. You a lenghtly and thorough, rewriting every bit of the text.<|eot_id|><|start_header_id|>user<|end_header_id|>
<<<---MY VERY BIG TEXT--->>><|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
new_param = {
    "model_path": "../MODELS/Meta-Llama-3-8B-Instruct-Q8_0.gguf",
    "n_ctx": 8192,
    "n_threads": 12,
    "n_threads_batch": 12,
    "n_batch": 512,
    "use_mmap": True,
    "use_mlock": False,
    "mul_mat_q": True,
    "numa": False,
    "n_gpu_layers": 33,
    "rope_freq_base": 500000,
    "tensor_split": None,
    "rope_freq_scale": 1.0,
    "offload_kqv": True,
    "split_mode": 1,
    "flash_attn": True,
    "cache": True,
}
llm = llama_cpp.Llama(**new_param)
cache = llama_cpp.LlamaRAMCache(capacity_bytes=2 << 30)
llm.set_cache(cache)
copy_of_server_default_sampler = {
    'suffix': None,
    'max_tokens': 2048,
    'temperature': 0.0,
    'top_p': 0.95,
    'min_p': 0.05,
    'echo': False,
    'stop': None,
    'stream': False,
    'logprobs': None,
    'presence_penalty': 0.0,
    'frequency_penalty': 0.0,
    'logit_bias': None,
    'seed': 123,
    'model': None,
    'top_k': 40,
    'repeat_penalty': 1.1,
    'mirostat_mode': 0,
    'mirostat_tau': 5.0,
    'mirostat_eta': 0.1,
    'grammar': None,
}
def make_completion():
    output = llm(
        PROMPT,
        **copy_of_server_default_sampler,
    )
    print(output)
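For the script side, the tokens/sec figure can be taken from repeated identical calls, so that a cold first call and later cached calls show up separately. The helper below is a hypothetical sketch (benchmark is my name, not a library function). It assumes the completion dict returned by llm(...) with stream=False carries a "usage" entry with "completion_tokens".

```python
import time
from typing import Callable, List


def benchmark(run_completion: Callable[[], dict], repeats: int = 3) -> List[float]:
    """Return tokens/sec for each of `repeats` identical calls."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        output = run_completion()
        elapsed = time.perf_counter() - start
        # Assumption: the generated token count is reported under "usage".
        rates.append(output["usage"]["completion_tokens"] / elapsed)
    return rates
```

With the script above, this would be called as benchmark(lambda: llm(PROMPT, **copy_of_server_default_sampler)). The first entry includes any one-time costs; the later entries show the steady-state rate.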
Extra 1
I also tried to time the call to this function: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L742
(I did it by editing the file directly in my venv, and I used time.time(). I know it's unreliable, but the results were consistent.)
When running the script, each call took around 0.00077 seconds. With the server, it took around 0.00085 seconds. That is worse, but nowhere near double the time 🤷
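The same measurement can be taken without editing files in the venv by wrapping the method at runtime and using time.perf_counter(), which is a monotonic, high-resolution clock. This is a minimal sketch: timed is a hypothetical helper, and Stepper with its step method is only a stand-in for whichever object and method the link above points at.

```python
import functools
import time
from typing import Callable, List


def timed(func: Callable, durations: List[float]) -> Callable:
    """Wrap func so every call appends its wall-clock duration to durations."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            durations.append(time.perf_counter() - start)
    return wrapper


class Stepper:
    """Stand-in for the real object; `step` is a placeholder method name."""

    def step(self, x: int) -> int:
        return x + 1


durations: List[float] = []
stepper = Stepper()
# Replace the bound method on this one instance with the timed version.
stepper.step = timed(stepper.step, durations)
for i in range(1000):
    stepper.step(i)
print(f"mean call time: {sum(durations) / len(durations):.8f} s")
```

For scale: 80 tokens/sec is 12.5 ms per token and 40 tokens/sec is 25 ms per token, so the gap to explain is about 12.5 ms per token. If the timed function runs once per generated token (an assumption), the measured difference of 0.08 ms per call accounts for well under 1% of that gap.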
Extra 2
Using nvtop, I also noticed that the script loads my GPU much more heavily, taking it to around 90% utilization, while the server only reaches around 50%.
With all of the above said, is this performance loss caused by some misconfiguration on my part? If so, how can I improve it?
Or is it something inherent to the implementation of the server? (I could see the multithreading/multi-user implementation having something to do with this.)
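To turn the nvtop observation into numbers that can be posted, GPU utilization can be sampled while each variant is generating. The sketch below shells out to nvidia-smi with its utilization.gpu query. The function names are hypothetical, and it reads only the first GPU, which is an assumption that fits a single RTX 3090.

```python
import subprocess
import time
from typing import List

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_output: str) -> List[int]:
    """Parse one utilization percentage per GPU from the query output."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]


def sample_gpu_utilization(samples: int = 10, interval_seconds: float = 0.5) -> List[int]:
    """Poll nvidia-smi and return the first GPU's utilization readings."""
    readings = []
    for _ in range(samples):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        readings.append(parse_utilization(result.stdout)[0])  # first GPU only
        time.sleep(interval_seconds)
    return readings
```

Running sample_gpu_utilization() in a second terminal during a generation gives a list of percentages; averaging it for the script and for the server makes the 90% vs. 50% comparison reproducible.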
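One way to probe the threading hypothesis in isolation is to run the identical call once on the main thread and once on a worker thread, with no web server involved. This is a rough experiment sketch, not a description of how the server is implemented; the function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def time_call(work: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to work()."""
    start = time.perf_counter()
    work()
    return time.perf_counter() - start


def compare_main_vs_worker(work: Callable[[], object]) -> Dict[str, float]:
    """Time the same call on the main thread, then on one worker thread."""
    main_seconds = time_call(work)
    with ThreadPoolExecutor(max_workers=1) as pool:
        worker_seconds = pool.submit(time_call, work).result()
    return {"main_thread": main_seconds, "worker_thread": worker_seconds}
```

With the script above, this would be compare_main_vs_worker(lambda: llm(PROMPT, **copy_of_server_default_sampler)). If both timings match, running in a worker thread by itself is not the cause, and the difference would have to come from something else the server does or from its configuration.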
Thanks in advance 🙏