Description
In `llm-optimizer estimate --interactive`, the framework selected is sglang, but the tuning commands output gives both sglang and vllm results.
Framework Selection
• sglang: Fast inference engine optimized for throughput
• vllm: Popular serving framework with good compatibility
• both: Generate configs for both frameworks
Framework [both]: sglang
=== Tuning Commands ===
--- SGLANG ---
Simple (concurrency + TP/DP):
llm-optimizer --framework sglang --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tp_size*dp_size=[(1, 2), (2, 1)]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_sglang.json --constraints "ttft<200ms"
Advanced (additional parameters):
llm-optimizer --framework sglang --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tp_size*dp_size=[(1, 2), (2, 1)];chunked_prefill_size=[1024, 2048, 3072];schedule_conservativeness=[0.3, 0.6, 1.0];schedule_policy=fcfs" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_sglang.json --constraints "ttft<200ms"
--- VLLM ---
Simple (concurrency + TP/DP):
llm-optimizer --framework vllm --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 2), (2, 1)]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json --constraints "ttft<200ms"
Advanced (additional parameters):
llm-optimizer --framework vllm --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 2), (2, 1)];max_num_batched_tokens=[16384, 24576, 32768]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json --constraints "ttft<200ms"
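For clarity, the expected behavior is that the emitted tuning commands are filtered by the framework chosen at the prompt, with "both" being the only selection that prints both sets. A minimal sketch of that filtering step (all names here are hypothetical illustrations, not taken from the llm-optimizer source):

```python
# Hypothetical sketch: how the interactive flow is expected to narrow
# the list of frameworks whose tuning commands get printed.
# frameworks_to_emit() is an illustrative name, not an llm-optimizer API.
def frameworks_to_emit(selected: str) -> list[str]:
    """Return the frameworks whose tuning commands should be shown."""
    if selected == "both":
        return ["sglang", "vllm"]
    return [selected]  # only the framework the user picked

# Selecting "sglang" should yield sglang commands only;
# the transcript above shows vllm commands appearing as well.
print(frameworks_to_emit("sglang"))  # ['sglang']
print(frameworks_to_emit("both"))    # ['sglang', 'vllm']
```

The transcript above suggests the command generator always iterates over both frameworks regardless of the selection, i.e. the equivalent of this filter is missing or ignored.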