Replies: 2 comments
- With CPU, a slower inference time is expected, but you can enable parallel requests by setting the environment variable appropriately (see Line 72 in 39a6b56). However, that would probably not work very well with CPUs.
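  For reference, a minimal sketch of what that could look like in the container's environment file. The variable names here are assumptions based on LocalAI's example `.env` around that release, so verify them against the line linked above:

  ```shell
  # Allow the LocalAI API to handle requests concurrently
  # (variable names assumed from the example .env; check Line 72 in 39a6b56)
  PARALLEL_REQUESTS=true

  # Number of parallel llama.cpp slots to reserve per loaded model (assumed name)
  LLAMACPP_PARALLEL=2

  # Threads handed to the backend for each inference
  THREADS=14
  ```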
- Hello @mudler, sorry to bring this up on an old thread, but it is not clear to me whether … Thanks!
I have deployed the latest LocalAI container (v2.7.0) on a 2 x Xeon 2680V4 (56 threads in total) with 198 GB of RAM, but from what I can tell, requests hitting the /v1/chat/completions endpoint are being processed one after the other, not in parallel.
I believe there are enough resources on my system to process these requests in parallel.
I am using the openhermes-2.5-mistral-7b.Q8_0.gguf model.
Getting a response for the curl example below takes about 20+ seconds. Is this good or bad for the system I have?
Thanks
PS: I can't install a GPU on this system as it is a 1U unit.
Curl:
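(The exact command wasn't captured here; it was a request along these lines, with the model name from above and the prompt, port, and parameters assumed:)

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openhermes-2.5-mistral-7b.Q8_0.gguf",
    "messages": [{"role": "user", "content": "How are you?"}],
    "temperature": 0.7
  }'
```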
Docker-compose:
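(The original compose file also wasn't captured; below is a minimal sketch of such a deployment, where the image tag, port mapping, and environment values are assumptions:)

```yaml
version: "3.6"
services:
  api:
    image: quay.io/go-skynet/local-ai:v2.7.0   # tag assumed from the version mentioned above
    ports:
      - "8080:8080"
    environment:
      - THREADS=28                # roughly half of the 56 hardware threads; tune as needed
      - PARALLEL_REQUESTS=true    # see the reply above; name assumed from the example .env
      - LLAMACPP_PARALLEL=2
    volumes:
      - ./models:/models          # directory holding openhermes-2.5-mistral-7b.Q8_0.gguf
```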
LocalAI logs: