Skip to content

Latest commit

 

History

History
798 lines (753 loc) · 60.5 KB

File metadata and controls

798 lines (753 loc) · 60.5 KB

SGLang

  • max_prefill_tokens: Token budget of how many tokens to accept in one prefill batch. The actual number is the max of this parameter and the context_length.
  • max_total_tokens: The maximum number of tokens that can be stored into the KV cache. Use mainly for debugging.
  • context_length: The number of tokens our model can process including the input. Note that extending the default might lead to strange behavior.
  • To enable multi-GPU tensor parallelism, add --tp 2 , To enable multi-GPU data parallelism, add --dp 2.
  • For my small LLM Model test, use --dp 2

https://docs.sglang.ai/backend/server_arguments.html

python -m sglang.launch_server --model-path /home/vllm/llm/Qwen7B-awq --context-length 8192 --dp 2

// A snapshot of when test finished the power dropped to 53W
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     17126      C   sglang::scheduler_DP0_TP0                   20916MiB |
|    1   N/A  N/A     17127      C   sglang::scheduler_DP1_TP0                   20916MiB |
+-----------------------------------------------------------------------------------------+
Sun Apr 27 14:44:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     On  |   00000000:00:07.0 Off |                    0 |
|  0%   31C    P0             53W /  150W |   20922MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     On  |   00000000:00:08.0 Off |                    0 |
|  0%   31C    P0             55W /  150W |   20922MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+


Test completed in 34.77 seconds
        SGLang Stress Test Results - 2025-04-27 13:26:25        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 34.77 seconds                   │
│ Min Response Time          │ 2.04 seconds                    │
│ Max Response Time          │ 16.34 seconds                   │
│ Average Response Time      │ 3.47 seconds                    │
│ Median Response Time       │ 2.05 seconds                    │
│ P90 Response Time          │ 16.33 seconds                   │
│ P95 Response Time          │ 16.34 seconds                   │
│ P99 Response Time          │ 16.34 seconds                   │
│ Std Dev Response Time      │ 4.29 seconds                    │
│ Theoretical Max Throughput │ 18.36 requests/second           │
│ Actual Throughput          │ 8.63 requests/second            │
│ Total Generated Tokens     │ 31238                           │
│ Tokens Per Second          │ 898.49                          │
│ Avg Tokens Per Request     │ 104.13                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 20.49 seconds
        SGLang Stress Test Results - 2025-04-27 13:27:05        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 20.49 seconds                   │
│ Min Response Time          │ 2.04 seconds                    │
│ Max Response Time          │ 2.05 seconds                    │
│ Average Response Time      │ 2.05 seconds                    │
│ Median Response Time       │ 2.05 seconds                    │
│ P90 Response Time          │ 2.05 seconds                    │
│ P95 Response Time          │ 2.05 seconds                    │
│ P99 Response Time          │ 2.05 seconds                    │
│ Std Dev Response Time      │ 0.00 seconds                    │
│ Theoretical Max Throughput │ 146.27 requests/second          │
│ Actual Throughput          │ 14.64 requests/second           │
│ Total Generated Tokens     │ 31255                           │
│ Tokens Per Second          │ 1525.57                         │
│ Avg Tokens Per Request     │ 104.18                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 20.45 seconds
        SGLang Stress Test Results - 2025-04-27 13:28:14        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 20.45 seconds                   │
│ Min Response Time          │ 2.04 seconds                    │
│ Max Response Time          │ 2.05 seconds                    │
│ Average Response Time      │ 2.04 seconds                    │
│ Median Response Time       │ 2.04 seconds                    │
│ P90 Response Time          │ 2.05 seconds                    │
│ P95 Response Time          │ 2.05 seconds                    │
│ P99 Response Time          │ 2.05 seconds                    │
│ Std Dev Response Time      │ 0.00 seconds                    │
│ Theoretical Max Throughput │ 146.38 requests/second          │
│ Actual Throughput          │ 14.67 requests/second           │
│ Total Generated Tokens     │ 31276                           │
│ Tokens Per Second          │ 1529.23                         │
│ Avg Tokens Per Request     │ 104.25                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘


Test completed in 32.07 seconds
        SGLang Stress Test Results - 2025-04-27 13:42:50        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 32.07 seconds                   │
│ Min Response Time          │ 1.77 seconds                    │
│ Max Response Time          │ 7.87 seconds                    │
│ Average Response Time      │ 2.96 seconds                    │
│ Median Response Time       │ 2.73 seconds                    │
│ P90 Response Time          │ 5.97 seconds                    │
│ P95 Response Time          │ 7.09 seconds                    │
│ P99 Response Time          │ 7.83 seconds                    │
│ Std Dev Response Time      │ 1.48 seconds                    │
│ Theoretical Max Throughput │ 38.12 requests/second           │
│ Actual Throughput          │ 9.35 requests/second            │
│ Total Generated Tokens     │ 31203                           │
│ Tokens Per Second          │ 972.87                          │
│ Avg Tokens Per Request     │ 104.01                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 29.16 seconds
        SGLang Stress Test Results - 2025-04-27 13:43:47        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 29.16 seconds                   │
│ Min Response Time          │ 1.59 seconds                    │
│ Max Response Time          │ 4.04 seconds                    │
│ Average Response Time      │ 2.67 seconds                    │
│ Median Response Time       │ 2.42 seconds                    │
│ P90 Response Time          │ 3.71 seconds                    │
│ P95 Response Time          │ 3.73 seconds                    │
│ P99 Response Time          │ 4.04 seconds                    │
│ Std Dev Response Time      │ 0.69 seconds                    │
│ Theoretical Max Throughput │ 74.18 requests/second           │
│ Actual Throughput          │ 10.29 requests/second           │
│ Total Generated Tokens     │ 31072                           │
│ Tokens Per Second          │ 1065.63                         │
│ Avg Tokens Per Request     │ 103.57                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 39.04 seconds
        SGLang Stress Test Results - 2025-04-27 13:44:49        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 39.04 seconds                   │
│ Min Response Time          │ 1.83 seconds                    │
│ Max Response Time          │ 5.16 seconds                    │
│ Average Response Time      │ 3.29 seconds                    │
│ Median Response Time       │ 3.11 seconds                    │
│ P90 Response Time          │ 4.73 seconds                    │
│ P95 Response Time          │ 5.09 seconds                    │
│ P99 Response Time          │ 5.13 seconds                    │
│ Std Dev Response Time      │ 1.00 seconds                    │
│ Theoretical Max Throughput │ 58.10 requests/second           │
│ Actual Throughput          │ 7.68 requests/second            │
│ Total Generated Tokens     │ 31185                           │
│ Tokens Per Second          │ 798.84                          │
│ Avg Tokens Per Request     │ 103.95                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 32.17 seconds
        SGLang Stress Test Results - 2025-04-27 14:45:36        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 32.16 seconds                   │
│ Min Response Time          │ 2.02 seconds                    │
│ Max Response Time          │ 5.58 seconds                    │
│ Average Response Time      │ 3.01 seconds                    │
│ Median Response Time       │ 2.81 seconds                    │
│ P90 Response Time          │ 4.51 seconds                    │
│ P95 Response Time          │ 4.59 seconds                    │
│ P99 Response Time          │ 5.58 seconds                    │
│ Std Dev Response Time      │ 0.75 seconds                    │
│ Theoretical Max Throughput │ 53.80 requests/second           │
│ Actual Throughput          │ 9.33 requests/second            │
│ Total Generated Tokens     │ 31300                           │
│ Tokens Per Second          │ 973.11                          │
│ Avg Tokens Per Request     │ 104.33                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Prompt: 'Explain what makes a good API in one paragraph....' (47 chars)
Processing requests... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Test completed in 60.19 seconds
        SGLang Stress Test Results - 2025-04-27 14:47:25        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 60.19 seconds                   │
│ Min Response Time          │ 1.90 seconds                    │
│ Max Response Time          │ 12.31 seconds                   │
│ Average Response Time      │ 4.70 seconds                    │
│ Median Response Time       │ 3.76 seconds                    │
│ P90 Response Time          │ 7.87 seconds                    │
│ P95 Response Time          │ 8.43 seconds                    │
│ P99 Response Time          │ 12.31 seconds                   │
│ Std Dev Response Time      │ 2.57 seconds                    │
│ Theoretical Max Throughput │ 24.37 requests/second           │
│ Actual Throughput          │ 4.98 requests/second            │
│ Total Generated Tokens     │ 31191                           │
│ Tokens Per Second          │ 518.18                          │
│ Avg Tokens Per Request     │ 103.97                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 7.80 seconds
        SGLang Stress Test Results - 2025-04-27 13:45:33        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 5                               │
│ Total Requests             │ 20                              │
│ Successful Requests        │ 20 (100.0%)                     │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 7.80 seconds                    │
│ Min Response Time          │ 1.57 seconds                    │
│ Max Response Time          │ 2.38 seconds                    │
│ Average Response Time      │ 1.92 seconds                    │
│ Median Response Time       │ 1.80 seconds                    │
│ P90 Response Time          │ 2.38 seconds                    │
│ P95 Response Time          │ 2.38 seconds                    │
│ P99 Response Time          │ 2.38 seconds                    │
│ Std Dev Response Time      │ 0.27 seconds                    │
│ Theoretical Max Throughput │ 8.40 requests/second            │
│ Actual Throughput          │ 2.56 requests/second            │
│ Total Generated Tokens     │ 2131                            │
│ Tokens Per Second          │ 273.17                          │
│ Avg Tokens Per Request     │ 106.55                          │
│ Peak Requests In Flight    │ 5                               │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 12.11 seconds
        SGLang Stress Test Results - 2025-04-27 13:45:54        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 5                               │
│ Total Requests             │ 20                              │
│ Successful Requests        │ 20 (100.0%)                     │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 12.11 seconds                   │
│ Min Response Time          │ 1.49 seconds                    │
│ Max Response Time          │ 4.60 seconds                    │
│ Average Response Time      │ 2.95 seconds                    │
│ Median Response Time       │ 2.38 seconds                    │
│ P90 Response Time          │ 4.60 seconds                    │
│ P95 Response Time          │ 4.60 seconds                    │
│ P99 Response Time          │ 4.60 seconds                    │
│ Std Dev Response Time      │ 1.49 seconds                    │
│ Theoretical Max Throughput │ 4.35 requests/second            │
│ Actual Throughput          │ 1.65 requests/second            │
│ Total Generated Tokens     │ 1975                            │
│ Tokens Per Second          │ 163.10                          │
│ Avg Tokens Per Request     │ 98.75                           │
│ Peak Requests In Flight    │ 5                               │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 11.04 seconds
        SGLang Stress Test Results - 2025-04-27 13:50:56        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 5                               │
│ Total Requests             │ 20                              │
│ Successful Requests        │ 20 (100.0%)                     │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 11.04 seconds                   │
│ Min Response Time          │ 1.48 seconds                    │
│ Max Response Time          │ 4.52 seconds                    │
│ Average Response Time      │ 2.52 seconds                    │
│ Median Response Time       │ 1.89 seconds                    │
│ P90 Response Time          │ 4.51 seconds                    │
│ P95 Response Time          │ 4.52 seconds                    │
│ P99 Response Time          │ 4.52 seconds                    │
│ Std Dev Response Time      │ 1.13 seconds                    │
│ Theoretical Max Throughput │ 4.42 requests/second            │
│ Actual Throughput          │ 1.81 requests/second            │
│ Total Generated Tokens     │ 2083                            │
│ Tokens Per Second          │ 188.67                          │
│ Avg Tokens Per Request     │ 104.15                          │
│ Peak Requests In Flight    │ 5                               │
└────────────────────────────┴─────────────────────────────────┘


//再来

Test completed in 36.91 seconds
        SGLang Stress Test Results - 2025-04-27 14:00:37        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 36.91 seconds                   │
│ Min Response Time          │ 1.55 seconds                    │
│ Max Response Time          │ 8.58 seconds                    │
│ Average Response Time      │ 3.23 seconds                    │
│ Median Response Time       │ 2.31 seconds                    │
│ P90 Response Time          │ 8.53 seconds                    │
│ P95 Response Time          │ 8.55 seconds                    │
│ P99 Response Time          │ 8.58 seconds                    │
│ Std Dev Response Time      │ 1.94 seconds                    │
│ Theoretical Max Throughput │ 34.97 requests/second           │
│ Actual Throughput          │ 8.13 requests/second            │
│ Total Generated Tokens     │ 31515                           │
│ Tokens Per Second          │ 853.94                          │
│ Avg Tokens Per Request     │ 105.05                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 76.97 seconds
        SGLang Stress Test Results - 2025-04-27 14:02:13        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 76.97 seconds                   │
│ Min Response Time          │ 1.59 seconds                    │
│ Max Response Time          │ 42.59 seconds                   │
│ Average Response Time      │ 5.11 seconds                    │
│ Median Response Time       │ 2.74 seconds                    │
│ P90 Response Time          │ 7.21 seconds                    │
│ P95 Response Time          │ 21.17 seconds                   │
│ P99 Response Time          │ 37.93 seconds                   │
│ Std Dev Response Time      │ 7.00 seconds                    │
│ Theoretical Max Throughput │ 7.04 requests/second            │
│ Actual Throughput          │ 3.90 requests/second            │
│ Total Generated Tokens     │ 31213                           │
│ Tokens Per Second          │ 405.53                          │
│ Avg Tokens Per Request     │ 104.04                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 42.73 seconds
        SGLang Stress Test Results - 2025-04-27 14:03:19        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 42.73 seconds                   │
│ Min Response Time          │ 1.68 seconds                    │
│ Max Response Time          │ 11.69 seconds                   │
│ Average Response Time      │ 3.51 seconds                    │
│ Median Response Time       │ 3.20 seconds                    │
│ P90 Response Time          │ 4.36 seconds                    │
│ P95 Response Time          │ 5.21 seconds                    │
│ P99 Response Time          │ 11.69 seconds                   │
│ Std Dev Response Time      │ 1.70 seconds                    │
│ Theoretical Max Throughput │ 25.66 requests/second           │
│ Actual Throughput          │ 7.02 requests/second            │
│ Total Generated Tokens     │ 31132                           │
│ Tokens Per Second          │ 728.62                          │
│ Avg Tokens Per Request     │ 103.77                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘


Test completed in 60.97 seconds
        SGLang Stress Test Results - 2025-04-27 14:10:34        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 60.97 seconds                   │
│ Min Response Time          │ 1.52 seconds                    │
│ Max Response Time          │ 12.37 seconds                   │
│ Average Response Time      │ 4.44 seconds                    │
│ Median Response Time       │ 3.81 seconds                    │
│ P90 Response Time          │ 6.30 seconds                    │
│ P95 Response Time          │ 11.51 seconds                   │
│ P99 Response Time          │ 12.37 seconds                   │
│ Std Dev Response Time      │ 2.37 seconds                    │
│ Theoretical Max Throughput │ 24.26 requests/second           │
│ Actual Throughput          │ 4.92 requests/second            │
│ Total Generated Tokens     │ 31214                           │
│ Tokens Per Second          │ 511.92                          │
│ Avg Tokens Per Request     │ 104.05                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 49.14 seconds
        SGLang Stress Test Results - 2025-04-27 14:12:23        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 49.14 seconds                   │
│ Min Response Time          │ 1.61 seconds                    │
│ Max Response Time          │ 7.47 seconds                    │
│ Average Response Time      │ 3.60 seconds                    │
│ Median Response Time       │ 3.24 seconds                    │
│ P90 Response Time          │ 5.27 seconds                    │
│ P95 Response Time          │ 6.03 seconds                    │
│ P99 Response Time          │ 7.46 seconds                    │
│ Std Dev Response Time      │ 1.36 seconds                    │
│ Theoretical Max Throughput │ 40.14 requests/second           │
│ Actual Throughput          │ 6.11 requests/second            │
│ Total Generated Tokens     │ 31079                           │
│ Tokens Per Second          │ 632.51                          │
│ Avg Tokens Per Request     │ 103.60                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

vLLM

Test completed in 29.73 seconds
           vLLM Stress Test Results - 2025-04-27 14:25:12            
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:8000/v1/completions │
│ Model                      │ /home/vllm/llm/Qwen7B-awq            │
│ Max Tokens                 │ 256                                  │
│ Temperature                │ 0.7                                  │
│ Concurrent Requests        │ 30                                   │
│ Total Requests             │ 300                                  │
│ Successful Requests        │ 300 (100.0%)                         │
│ Failed Requests            │ 0 (0.0%)                             │
│ Total Test Duration        │ 29.72 seconds                        │
│ Min Response Time          │ 1.23 seconds                         │
│ Max Response Time          │ 3.62 seconds                         │
│ Average Response Time      │ 2.32 seconds                         │
│ Median Response Time       │ 2.35 seconds                         │
│ P90 Response Time          │ 2.60 seconds                         │
│ P95 Response Time          │ 2.70 seconds                         │
│ P99 Response Time          │ 3.32 seconds                         │
│ Std Dev Response Time      │ 0.28 seconds                         │
│ Theoretical Max Throughput │ 82.83 requests/second                │
│ Actual Throughput          │ 10.09 requests/second                │
│ Total Generated Tokens     │ 30492                                │
│ Tokens Per Second          │ 1025.81                              │
│ Avg Tokens Per Request     │ 101.64                               │
│ Peak Requests In Flight    │ 30                                   │
└────────────────────────────┴──────────────────────────────────────┘

Test completed in 27.73 seconds
           vLLM Stress Test Results - 2025-04-27 14:26:27            
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:8000/v1/completions │
│ Model                      │ /home/vllm/llm/Qwen7B-awq            │
│ Max Tokens                 │ 256                                  │
│ Temperature                │ 0.7                                  │
│ Concurrent Requests        │ 30                                   │
│ Total Requests             │ 300                                  │
│ Successful Requests        │ 300 (100.0%)                         │
│ Failed Requests            │ 0 (0.0%)                             │
│ Total Test Duration        │ 27.73 seconds                        │
│ Min Response Time          │ 1.34 seconds                         │
│ Max Response Time          │ 2.97 seconds                         │
│ Average Response Time      │ 2.26 seconds                         │
│ Median Response Time       │ 2.28 seconds                         │
│ P90 Response Time          │ 2.62 seconds                         │
│ P95 Response Time          │ 2.71 seconds                         │
│ P99 Response Time          │ 2.85 seconds                         │
│ Std Dev Response Time      │ 0.27 seconds                         │
│ Theoretical Max Throughput │ 101.01 requests/second               │
│ Actual Throughput          │ 10.82 requests/second                │
│ Total Generated Tokens     │ 29779                                │
│ Tokens Per Second          │ 1074.01                              │
│ Avg Tokens Per Request     │ 99.26                                │
│ Peak Requests In Flight    │ 30                                   │
└────────────────────────────┴──────────────────────────────────────┘

Test completed in 28.41 seconds
           vLLM Stress Test Results - 2025-04-27 14:29:22            
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:8000/v1/completions │
│ Model                      │ /home/vllm/llm/Qwen7B-awq            │
│ Max Tokens                 │ 256                                  │
│ Temperature                │ 0.7                                  │
│ Concurrent Requests        │ 30                                   │
│ Total Requests             │ 300                                  │
│ Successful Requests        │ 300 (100.0%)                         │
│ Failed Requests            │ 0 (0.0%)                             │
│ Total Test Duration        │ 28.41 seconds                        │
│ Min Response Time          │ 1.36 seconds                         │
│ Max Response Time          │ 3.53 seconds                         │
│ Average Response Time      │ 2.28 seconds                         │
│ Median Response Time       │ 2.28 seconds                         │
│ P90 Response Time          │ 2.57 seconds                         │
│ P95 Response Time          │ 2.67 seconds                         │
│ P99 Response Time          │ 2.96 seconds                         │
│ Std Dev Response Time      │ 0.26 seconds                         │
│ Theoretical Max Throughput │ 85.09 requests/second                │
│ Actual Throughput          │ 10.56 requests/second                │
│ Total Generated Tokens     │ 29861                                │
│ Tokens Per Second          │ 1051.13                              │
│ Avg Tokens Per Request     │ 99.54                                │
│ Peak Requests In Flight    │ 30                                   │
└────────────────────────────┴──────────────────────────────────────┘

root@iZ2zed7pdsnsrie8263p0kZ:/home# ./vllm-stress-test.py --concurrent 5 --total 20
Starting vLLM Stress Test
URL: http://localhost:8000/v1/completions
Model: /home/vllm/llm/Qwen7B-awq
Max Tokens: 256
Temperature: 0.7
Concurrent requests: 5
Total requests: 20
Prompt: 'Explain what makes a good API in one paragraph....' (47 chars)
Processing requests... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Test completed in 5.52 seconds
           vLLM Stress Test Results - 2025-04-27 14:27:41            
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:8000/v1/completions │
│ Model                      │ /home/vllm/llm/Qwen7B-awq            │
│ Max Tokens                 │ 256                                  │
│ Temperature                │ 0.7                                  │
│ Concurrent Requests        │ 5                                    │
│ Total Requests             │ 20                                   │
│ Successful Requests        │ 20 (100.0%)                          │
│ Failed Requests            │ 0 (0.0%)                             │
│ Total Test Duration        │ 5.52 seconds                         │
│ Min Response Time          │ 0.96 seconds                         │
│ Max Response Time          │ 1.44 seconds                         │
│ Average Response Time      │ 1.23 seconds                         │
│ Median Response Time       │ 1.22 seconds                         │
│ P90 Response Time          │ 1.39 seconds                         │
│ P95 Response Time          │ 1.44 seconds                         │
│ P99 Response Time          │ 1.44 seconds                         │
│ Std Dev Response Time      │ 0.14 seconds                         │
│ Theoretical Max Throughput │ 13.89 requests/second                │
│ Actual Throughput          │ 3.62 requests/second                 │
│ Total Generated Tokens     │ 1981                                 │
│ Tokens Per Second          │ 358.69                               │
│ Avg Tokens Per Request     │ 99.05                                │
│ Peak Requests In Flight    │ 5                                    │
└────────────────────────────┴──────────────────────────────────────┘
root@iZ2zed7pdsnsrie8263p0kZ:/home# ./vllm-stress-test.py --concurrent 5 --total 20
Starting vLLM Stress Test
URL: http://localhost:8000/v1/completions
Model: /home/vllm/llm/Qwen7B-awq
Max Tokens: 256
Temperature: 0.7
Concurrent requests: 5
Total requests: 20
Prompt: 'Explain what makes a good API in one paragraph....' (47 chars)
Processing requests... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Test completed in 5.53 seconds
           vLLM Stress Test Results - 2025-04-27 14:28:04            
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:8000/v1/completions │
│ Model                      │ /home/vllm/llm/Qwen7B-awq            │
│ Max Tokens                 │ 256                                  │
│ Temperature                │ 0.7                                  │
│ Concurrent Requests        │ 5                                    │
│ Total Requests             │ 20                                   │
│ Successful Requests        │ 20 (100.0%)                          │
│ Failed Requests            │ 0 (0.0%)                             │
│ Total Test Duration        │ 5.53 seconds                         │
│ Min Response Time          │ 0.78 seconds                         │
│ Max Response Time          │ 1.50 seconds                         │
│ Average Response Time      │ 1.18 seconds                         │
│ Median Response Time       │ 1.15 seconds                         │
│ P90 Response Time          │ 1.40 seconds                         │
│ P95 Response Time          │ 1.50 seconds                         │
│ P99 Response Time          │ 1.50 seconds                         │
│ Std Dev Response Time      │ 0.16 seconds                         │
│ Theoretical Max Throughput │ 13.33 requests/second                │
│ Actual Throughput          │ 3.62 requests/second                 │
│ Total Generated Tokens     │ 1908                                 │
│ Tokens Per Second          │ 345.18                               │
│ Avg Tokens Per Request     │ 95.40                                │
│ Peak Requests In Flight    │ 5                                    │
└────────────────────────────┴──────────────────────────────────────┘

One A10 has much better results for SGLang

For 5 & 30 concurrent requests, 2 A10 consistenly show unpredicted results!

Test completed in 11.02 seconds
        SGLang Stress Test Results - 2025-04-27 18:08:38        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 5                               │
│ Total Requests             │ 20                              │
│ Successful Requests        │ 20 (100.0%)                     │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 11.02 seconds                   │
│ Min Response Time          │ 1.67 seconds                    │
│ Max Response Time          │ 5.09 seconds                    │
│ Average Response Time      │ 2.58 seconds                    │
│ Median Response Time       │ 2.08 seconds                    │
│ P90 Response Time          │ 4.53 seconds                    │
│ P95 Response Time          │ 5.09 seconds                    │
│ P99 Response Time          │ 5.09 seconds                    │
│ Std Dev Response Time      │ 1.17 seconds                    │
│ Theoretical Max Throughput │ 3.93 requests/second            │
│ Actual Throughput          │ 1.82 requests/second            │
│ Total Generated Tokens     │ 2077                            │
│ Tokens Per Second          │ 188.56                          │
│ Avg Tokens Per Request     │ 103.85                          │
│ Peak Requests In Flight    │ 5                               │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 38.49 seconds
        SGLang Stress Test Results - 2025-04-27 18:05:16        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 38.49 seconds                   │
│ Min Response Time          │ 1.92 seconds                    │
│ Max Response Time          │ 8.23 seconds                    │
│ Average Response Time      │ 3.11 seconds                    │
│ Median Response Time       │ 2.63 seconds                    │
│ P90 Response Time          │ 5.08 seconds                    │
│ P95 Response Time          │ 6.09 seconds                    │
│ P99 Response Time          │ 8.17 seconds                    │
│ Std Dev Response Time      │ 1.27 seconds                    │
│ Theoretical Max Throughput │ 36.46 requests/second           │
│ Actual Throughput          │ 7.79 requests/second            │
│ Total Generated Tokens     │ 31184                           │
│ Tokens Per Second          │ 810.18                          │
│ Avg Tokens Per Request     │ 103.95                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘

Comapred to One A10 result

Test completed in 6.11 seconds
        SGLang Stress Test Results - 2025-04-27 18:10:03        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 5                               │
│ Total Requests             │ 20                              │
│ Successful Requests        │ 20 (100.0%)                     │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 6.11 seconds                    │
│ Min Response Time          │ 1.50 seconds                    │
│ Max Response Time          │ 1.59 seconds                    │
│ Average Response Time      │ 1.53 seconds                    │
│ Median Response Time       │ 1.51 seconds                    │
│ P90 Response Time          │ 1.59 seconds                    │
│ P95 Response Time          │ 1.59 seconds                    │
│ P99 Response Time          │ 1.59 seconds                    │
│ Std Dev Response Time      │ 0.04 seconds                    │
│ Theoretical Max Throughput │ 12.55 requests/second           │
│ Actual Throughput          │ 3.27 requests/second            │
│ Total Generated Tokens     │ 2083                            │
│ Tokens Per Second          │ 340.76                          │
│ Avg Tokens Per Request     │ 104.15                          │
│ Peak Requests In Flight    │ 5                               │
└────────────────────────────┴─────────────────────────────────┘

Test completed in 20.37 seconds
        SGLang Stress Test Results - 2025-04-27 18:12:37        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric                     ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ URL                        │ http://localhost:30000/generate │
│ Max Tokens                 │ 256                             │
│ Concurrent Requests        │ 30                              │
│ Total Requests             │ 300                             │
│ Successful Requests        │ 300 (100.0%)                    │
│ Failed Requests            │ 0 (0.0%)                        │
│ Total Test Duration        │ 20.37 seconds                   │
│ Min Response Time          │ 2.02 seconds                    │
│ Max Response Time          │ 2.04 seconds                    │
│ Average Response Time      │ 2.03 seconds                    │
│ Median Response Time       │ 2.04 seconds                    │
│ P90 Response Time          │ 2.04 seconds                    │
│ P95 Response Time          │ 2.04 seconds                    │
│ P99 Response Time          │ 2.04 seconds                    │
│ Std Dev Response Time      │ 0.01 seconds                    │
│ Theoretical Max Throughput │ 146.75 requests/second          │
│ Actual Throughput          │ 14.73 requests/second           │
│ Total Generated Tokens     │ 31032                           │
│ Tokens Per Second          │ 1523.72                         │
│ Avg Tokens Per Request     │ 103.44                          │
│ Peak Requests In Flight    │ 30                              │
└────────────────────────────┴─────────────────────────────────┘