Description
root@ins-hdfz6-7f7dfc5855-9qf4r:~/data# llm-optimizer --framework sglang --model Qwen/Qwen3-0.6B --output-json sglang_results.json
2025-12-11 02:35:07,936 - llm_optimizer.main - INFO - Detected GPU type: NVIDIA GeForce RTX 4090, Count: 1
2025-12-11 02:35:07,936 [INFO] Detected GPU type: NVIDIA GeForce RTX 4090, Count: 1
2025-12-11 02:35:07,936 - llm_optimizer.main - INFO - Generated 1 configuration(s) to run.
2025-12-11 02:35:07,936 [INFO] Generated 1 configuration(s) to run.
2025-12-11 02:35:07,938 - llm_optimizer.main - INFO - --------------------------------------------------------------------------------
2025-12-11 02:35:07,938 [INFO] --------------------------------------------------------------------------------
2025-12-11 02:35:07,938 - llm_optimizer.main - INFO - Starting run 1/1: default
2025-12-11 02:35:07,938 [INFO] Starting run 1/1: default
2025-12-11 02:35:07,938 - llm_optimizer.server_utils - INFO - Starting server with command: python3 -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --host 127.0.0.1 --port 30000
2025-12-11 02:35:07,938 [INFO] Starting server with command: python3 -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --host 127.0.0.1 --port 30000
2025-12-11 02:35:07,942 - llm_optimizer.server_utils - INFO - Waiting for server to be ready at http://127.0.0.1:30000/health...
2025-12-11 02:35:07,942 [INFO] Waiting for server to be ready at http://127.0.0.1:30000/health...
2025-12-11 02:35:07,951 - llm_optimizer.server_utils - INFO - Server is up after 1 attempts.
2025-12-11 02:35:07,951 [INFO] Server is up after 1 attempts.
benchmark_args={'backend': 'vllm', 'model': 'Qwen/Qwen3-0.6B', 'host': '127.0.0.1', 'port': 30000, 'dataset_name': 'sharegpt', 'num_prompts': 1000, 'request_rate': inf, 'seed': 1}
tokenizer_config.json: 9.73kB [00:00, 15.3MB/s]
vocab.json: 2.78MB [00:00, 4.72MB/s]
merges.txt: 1.67MB [00:00, 9.41MB/s]
tokenizer.json: 0%|▏ | 9.10k/11.4M [00:01<39:57, 4.76kB/s]
INFO 12-11 02:35:13 [init.py:239] Automatically detected platform cuda.
tokenizer.json: 100%|██████████| 11.4M/11.4M [00:02<00:00, 4.39MB/s]
{'backend': 'vllm', 'model': 'Qwen/Qwen3-0.6B', 'host': '127.0.0.1', 'port': 30000, 'dataset_name': 'sharegpt', 'num_prompts': 1000, 'request_rate': inf, 'seed': 1}
============ Serving Benchmark Result ============
Backend: vllm
Traffic request rate: inf
Max request concurrency: not set
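The log above shows `server_utils` polling `http://127.0.0.1:30000/health` until the launched server answers, then reporting "Server is up after 1 attempts." A minimal sketch of such a readiness poll is below; the function name, parameters, and injectable `probe` hook are hypothetical illustrations, not llm-optimizer's actual API:

```python
import time
import urllib.request
import urllib.error


def wait_for_health(url, probe=None, max_attempts=30, delay=1.0):
    """Poll a health endpoint until it returns HTTP 200 or attempts run out.

    `probe` is injectable for testing; by default it issues a real GET
    against `url`. Returns the attempt number on which the server answered,
    or raises TimeoutError if it never does.
    """
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False

    for attempt in range(1, max_attempts + 1):
        if probe(url):
            return attempt  # server responded healthy on this attempt
        time.sleep(delay)
    raise TimeoutError(f"server at {url} not healthy after {max_attempts} attempts")
```

In the run above the server answered immediately, so a poll like this would return 1 on the first iteration; a slower cold start (e.g. while model weights load) would simply take more attempts.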