Where are the current speed and performance bottlenecks? #1142
Unanswered
JennieGao-njust asked this question in Q&A
Replies: 1 comment
- Assuming the software stack stays the same: in addition, RAM and VRAM capacity determine which models can be loaded — larger models require more RAM and VRAM (solutions other than KTransformers consume VRAM exclusively). For multi-GPU setups, please refer to the following tutorial:
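The sizing rule of thumb above (larger models need more RAM/VRAM) can be made concrete. A minimal sketch, assuming DeepSeek-V3's ~671B parameters and an average of roughly 2.7 bits per weight for the UD-Q2_K_XL quantization — both figures are my assumptions, not values confirmed in this thread:

```python
# Illustrative sketch (not from this thread): estimating the weight-storage
# footprint of a quantized model. The parameter count and bits-per-weight
# figures below are assumptions for illustration only.

def model_footprint_gib(n_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: parameters * bits / 8 bytes."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 2**30

# DeepSeek-V3 has ~671B parameters; assume UD-Q2_K_XL averages ~2.7 bits/weight.
print(round(model_footprint_gib(671, 2.7), 1))  # ~210.9 GiB for weights alone
```

This counts only the weights; the KV cache and activations add on top, which is why offloading schemes like KTransformers split the budget between RAM and VRAM.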
- Machine configuration: 500 GB RAM, 500 GB disk; model: UD-Q2_K_XL

CPU:
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       52 bits physical, 57 bits virtual
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Gold 6462C
Stepping:            8
```
Kernel version:

```
Linux Richco12 5.4.0-204-generic #224 SMP Thu Dec 5 13:38:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```
Memory usage: [screenshot]
GPU: L20, with its utilization: [screenshot]
Launch command:

```shell
python ktransformers/server/main.py \
  --port 10002 \
  --model_path /mnt/work/models/deepseek-ai/DeepSeek-V3-0324-GGUF/DeepSeek-V3 \
  --gguf_path /mnt/work/models/deepseek-ai/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/ \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 131072 \
  --chunk_size 128 \
  --max_batch_size 4 \
  --backend_type balance_serve \
  --cpu_infer 32
```
Single-request decode: 11 tokens/s. With two concurrent requests:
Request 1: Decode Speed = 6.07 tokens/s
Request 0: Decode Speed = 7.97 tokens/s
With the above configuration, changing the launch-command parameters has no noticeable effect on speed. Is there a multi-GPU setup that could further improve concurrent throughput? GPU memory utilization is currently around 35%, and over 100 GB of RAM remains free.
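A side note on the decode numbers quoted above: although per-request speed drops under concurrency, the combined throughput still exceeds the single-request rate. A minimal check, using only the figures reported in this post:

```python
# Decode speeds reported in this post (tokens/s).
single_request = 11.0
two_concurrent = [6.07, 7.97]

aggregate = sum(two_concurrent)
print(round(aggregate, 2))  # 14.04 tokens/s combined, vs 11.0 for one request
```

So batching is paying off in aggregate; the question is whether spreading work across additional GPUs can lift the per-request rate as well.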