llm-optimizer estimate --interactive: framework selection is sglang, but tuning commands give both sglang and vllm results #24

@liyuerich

When running llm-optimizer estimate --interactive and answering sglang at the framework prompt, the generated tuning commands still include both sglang and vllm. Prompt and output:

Framework Selection
• sglang: Fast inference engine optimized for throughput
• vllm: Popular serving framework with good compatibility
• both: Generate configs for both frameworks
Framework [both]: sglang

=== Tuning Commands ===

--- SGLANG ---
Simple (concurrency + TP/DP):
llm-optimizer --framework sglang --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tp_size*dp_size=[(1, 2), (2, 1)]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_sglang.json --constraints "ttft<200ms"
Advanced (additional parameters):
llm-optimizer --framework sglang --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tp_size*dp_size=[(1, 2), (2, 1)];chunked_prefill_size=[1024, 2048, 3072];schedule_conservativeness=[0.3, 0.6, 1.0];schedule_policy=fcfs" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_sglang.json --constraints "ttft<200ms"

--- VLLM ---
Simple (concurrency + TP/DP):
llm-optimizer --framework vllm --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 2), (2, 1)]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json --constraints "ttft<200ms"
Advanced (additional parameters):
llm-optimizer --framework vllm --model Qwen/Qwen3-0.6B --gpus 2 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 2), (2, 1)];max_num_batched_tokens=[16384, 24576, 32768]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=1000;max_concurrency=[32, 64, 96]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json --constraints "ttft<200ms"
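
Expected behavior, sketched in Python. This is only an illustration of the filtering I would expect: every name here (SUPPORTED_FRAMEWORKS, frameworks_to_emit, render_tuning_sections) is hypothetical and not taken from the llm-optimizer source, which may be structured quite differently.

```python
# Hypothetical sketch: none of these names come from the llm-optimizer codebase.
SUPPORTED_FRAMEWORKS = ("sglang", "vllm")


def frameworks_to_emit(selection: str) -> list[str]:
    """Map the answer given at the 'Framework [both]:' prompt to the
    frameworks whose tuning commands should be printed."""
    choice = selection.strip().lower()
    if choice == "both":
        return list(SUPPORTED_FRAMEWORKS)
    if choice in SUPPORTED_FRAMEWORKS:
        return [choice]
    raise ValueError(f"unknown framework: {selection!r}")


def render_tuning_sections(selection: str, commands_by_framework: dict[str, list[str]]) -> str:
    """Render the '=== Tuning Commands ===' block for the selected frameworks only."""
    lines = ["=== Tuning Commands ===", ""]
    for framework in frameworks_to_emit(selection):
        lines.append(f"--- {framework.upper()} ---")
        lines.extend(commands_by_framework[framework])
        lines.append("")
    return "\n".join(lines)
```

With this kind of filter, answering sglang would print only the SGLANG section, and the VLLM section would appear only for vllm or both.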
