server: bench: minor fixes #10765
Conversation
- support OpenAI streaming standard output with `[DONE]\n\n` (see the sketch after this list)
- export k6 raw results in CSV
- fix too many TCP idle connections in tcp_wait
- add metric: time to emit first token
- fix when Prometheus is not started
- wait for the server to be ready before starting the bench
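For illustration only, here is a minimal sketch of how a client could consume the OpenAI-style SSE stream, stop on the `[DONE]` terminator, and record the time to the first emitted token. The endpoint, payload handling, and use of `requests` are assumptions for the sketch; the actual bench uses the k6/xk6-sse script.

```python
import json
import time

import requests  # assumed HTTP client for this sketch; the real bench uses k6/xk6-sse


def stream_completion(base_url: str, payload: dict) -> dict:
    """POST a streaming chat completion and measure time to the first token."""
    start = time.perf_counter()
    ttft = None
    chunks = 0
    with requests.post(f"{base_url}/v1/chat/completions",
                       json={**payload, "stream": True},
                       stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw or not raw.startswith("data: "):
                continue  # skip blank SSE lines
            data = raw[len("data: "):]
            if data == "[DONE]":  # OpenAI streaming terminator
                break
            chunk = json.loads(data)
            choices = chunk.get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                if ttft is None:
                    ttft = time.perf_counter() - start  # time to emit first token
                chunks += 1
    return {"time_to_first_token_s": ttft, "content_chunks": chunks}
```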
I don't have the hw to test, but LGTM.
Though, I'm looking forward to migrating to a Python solution like Locust (as mentioned in the PR description). That could simplify the installation process a lot, while giving much more flexibility for the script (ideally, we would only need a single bench.py script in the future that can do everything at once).
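As a rough sketch of what such a Locust-based replacement could look like (the endpoint, payload, and class name here are assumptions, not part of this PR):

```python
from locust import HttpUser, task, between


class ChatCompletionUser(HttpUser):
    """Hypothetical Locust user hitting the llama.cpp server OpenAI-compatible endpoint."""
    wait_time = between(0.5, 2.0)

    @task
    def chat_completion(self):
        # Non-streaming request for simplicity; a real bench would stream
        # and record time-to-first-token, as the k6 script does today.
        self.client.post("/v1/chat/completions", json={
            "model": "phi-2",
            "messages": [{"role": "user", "content": "Hello"}],
            "max_tokens": 256,
        })
```

This would be run with something like `locust -f locustfile.py --host http://localhost:8080` (host and file name assumed).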
| "model": model, | ||
| "stream": true, | ||
| "stream_options": { | ||
| "include_usage": true, // False to be supported in llama.cpp server |
Not sure what you mean here, but in llama.cpp we ignore include_usage and always include the usage info.
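For reference, in the OpenAI streaming format the usage information arrives as a final chunk (typically with an empty `choices` list) just before `[DONE]`. A client could pick it up along these lines; this is a sketch following the OpenAI spec, not verified against llama.cpp's exact output:

```python
import json


def extract_usage(sse_data_payloads):
    """Return the usage dict from an iterable of SSE 'data:' payload strings."""
    usage = None
    for data in sse_data_payloads:
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # The usage-bearing chunk carries a "usage" object with
        # prompt/completion/total token counts.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```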
* server/bench: support OpenAI streaming standard output with `[DONE]\n\n`; export k6 raw results in CSV; fix too many TCP idle connections in tcp_wait; add metric: time to emit first token
* server/bench: fix when Prometheus not started; wait for server to be ready before starting bench
Context
After a nice exchange with @ngxson, this is a minor change to the current server bench framework in order to refresh it a bit. The longer-term target would be to replace k6/xk6-sse with something Python-based, such as Locust (to be assessed).
Changes
Tests (phi2 on RTX 3050)
LLAMA_SERVER_BIN_PATH=../../../cmake-build-debug/bin/llama-server python bench.py \
    --runner-label local \
    --name local \
    --branch `git rev-parse --abbrev-ref HEAD` \
    --commit `git rev-parse HEAD` \
    --scenario script.js \
    --duration 5m \
    --hf-repo ggml-org/models \
    --hf-file phi-2/ggml-model-q4_0.gguf \
    --model-path-prefix models \
    --parallel 4 \
    -ngl 33 \
    --batch-size 2048 \
    --ubatch-size 256 \
    --ctx-size 4096 \
    --n-prompts 200 \
    --max-prompt-tokens 256 \
    --max-tokens 256

Results: