# CLI Quick Reference

## Architecture

The CLI is auto-generated from Pydantic models in `config/schema.py` using
the `pydantic-settings` CliApp. `schema.py` is the single source of truth for
both YAML configs and CLI flags.

- **Flat aliases** (`-e`, `-m`, `-d`) for frequently used fields
- **All schema fields** available as CLI flags on each subcommand
- **Environment variables** supported via `pydantic-settings` (`ENDPOINT_CONFIG__ENDPOINTS=...`)
- **`${VAR}` interpolation** in YAML files (with `${VAR:-default}` fallback)
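As a stdlib-only illustration of this design (the real CLI is generated by the `pydantic-settings` CliApp from `config/schema.py`, not by argparse), flags can be derived from a single schema definition so the schema stays the only source of truth. The field names and short aliases below simply mirror the documented `-e`/`-m`/`-d` flags; treat this as a sketch of the idea, not the project's implementation.

```python
import argparse
import dataclasses

@dataclasses.dataclass
class BenchmarkConfig:
    # Illustrative stand-in for the real Pydantic schema
    endpoints: str = ""
    model: str = ""
    dataset: str = ""

# Hypothetical alias table mirroring the documented -e/-m/-d flags
SHORT = {"endpoints": "-e", "model": "-m", "dataset": "-d"}

def build_parser(schema) -> argparse.ArgumentParser:
    """Derive one CLI flag per schema field, so adding a field to the
    schema automatically adds a flag."""
    parser = argparse.ArgumentParser()
    for field in dataclasses.fields(schema):
        names = [f"--{field.name}"]
        if field.name in SHORT:
            names.insert(0, SHORT[field.name])
        parser.add_argument(*names, default=field.default)
    return parser

args = build_parser(BenchmarkConfig).parse_args(
    ["-e", "http://localhost:8000", "-m", "Qwen/Qwen3-8B", "-d", "data.pkl"]
)
print(args.endpoints)  # http://localhost:8000
```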

## Commands

### Performance Benchmarking

```bash
# Offline (max throughput)
inference-endpoint benchmark offline \
  -e URL \
  -m Qwen/Qwen3-8B \
  -d tests/datasets/dummy_1k.pkl

# Online (sustained QPS - requires --load_pattern, --target_qps)
inference-endpoint benchmark online \
  -e URL \
  -m Qwen/Qwen3-8B \
  -d tests/datasets/dummy_1k.pkl \
  --load_pattern poisson \
  --target_qps 100

# With detailed report generation
inference-endpoint benchmark offline \
  -e URL \
  -m Qwen/Qwen3-8B \
  -d tests/datasets/dummy_1k.pkl \
  --report_dir my_benchmark_report

# YAML-based
inference-endpoint benchmark from_config \
  -c test.yaml
```

**Default Test Dataset:** Use `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB) for local testing.

### Evaluation & Validation

```bash
# Accuracy evaluation
inference-endpoint eval --dataset gpqa,aime --endpoints URL

# Test endpoint connectivity
inference-endpoint probe \
  -e URL \
  --model gpt-3.5-turbo

# Validate YAML config
inference-endpoint validate -c test.yaml
```

### Utilities

```bash
inference-endpoint init --template offline  # or: online, eval, submission
inference-endpoint info
```

## Common Options (Benchmark Subcommands)

**Required:**

- `-e, --endpoints URL` - Endpoint URL(s), comma-separated
- `-m, --model NAME` - Model name (e.g., Qwen/Qwen3-8B)
- `-d, --dataset PATH` - Dataset file path

**Optional:**

- `--api_type {OPENAI,SGLANG}` - API type (default: OPENAI)
- `--api_key KEY` - API authentication
- `--workers N` - HTTP workers (-1 = auto, default: -1)
- `--max_connections N` - Max TCP connections (-1 = unlimited)
- `--duration_s SEC` - Duration in seconds
- `--num_samples N` - Explicit sample count (overrides duration calculation)
- `--streaming {AUTO,ON,OFF}` - Streaming mode (default: AUTO, resolves to OFF for offline, ON for online)
- `--mode {PERF,ACC,BOTH}` - Test mode (default: PERF)
- `--temperature FLOAT` - Sampling temperature
- `--max_output_tokens N` - Max output tokens (default: 1024)
- `--min_output_tokens N` - Min output tokens
- `--report_dir PATH` - Report output directory
- `--timeout SEC` - Timeout in seconds

**Online-specific (required for `benchmark online`):**

- `--load_pattern {POISSON,CONCURRENCY}` - Load pattern
- `--target_qps N` - Target QPS (for poisson)
- `--concurrency N` - Concurrent requests (for concurrency)

## Environment Variables

**In YAML files**, use `${VAR}` or `${VAR:-default}` syntax:

```yaml
endpoint_config:
  endpoints:
    - "${ENDPOINT_URL}"
  api_key: "${API_KEY:-sk-test}"
model_params:
  name: "${MODEL_NAME:-Qwen/Qwen3-8B}"
```
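The `${VAR}` / `${VAR:-default}` expansion can be sketched in a few lines of Python. This is only an approximation of the documented behavior, not the project's actual loader:

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def interpolate(text: str) -> str:
    """Replace ${VAR} with the env value, falling back to the
    ${VAR:-default} default when VAR is unset."""
    def repl(m: re.Match) -> str:
        value = os.environ.get(m.group(1))
        if value is not None:
            return value
        return m.group(2) or ""
    return _PATTERN.sub(repl, text)

os.environ["ENDPOINT_URL"] = "http://localhost:8000"
os.environ.pop("API_KEY", None)
print(interpolate("endpoints: ${ENDPOINT_URL}"))    # endpoints: http://localhost:8000
print(interpolate("api_key: ${API_KEY:-sk-test}"))  # api_key: sk-test
```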

**Via pydantic-settings**, env vars auto-map to nested fields using the `__` separator:

```bash
export ENDPOINT_CONFIG__ENDPOINTS='["http://prod:8000"]'
export MODEL_PARAMS__NAME="llama-2-70b"
inference-endpoint benchmark offline -e http://x -m M -d D
# env vars override CLI values
```
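The `__` mapping can be modeled with stdlib Python; pydantic-settings does the real work (including validation against the schema), so treat this only as a sketch of the mechanism:

```python
import json

def nested_from_env(environ: dict) -> dict:
    """Split KEY__SUBKEY env var names into nested dicts,
    JSON-decoding values where possible (lists, numbers, etc.)."""
    config: dict = {}
    for key, raw in environ.items():
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            value = raw  # plain string
        node = config
        *parents, leaf = key.lower().split("__")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

env = {
    "ENDPOINT_CONFIG__ENDPOINTS": '["http://prod:8000"]',
    "MODEL_PARAMS__NAME": "llama-2-70b",
}
print(nested_from_env(env))
# {'endpoint_config': {'endpoints': ['http://prod:8000']}, 'model_params': {'name': 'llama-2-70b'}}
```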

## Dataset Formats


**CLI Mode** (`benchmark offline` / `benchmark online`):

- All parameters from command line
- Quick testing and iteration
- Example: `benchmark offline -e URL -m NAME -d FILE`

**YAML Mode** (`benchmark from_config`):

- All configuration from YAML file
- Reproducible, shareable configs
- Supports `${VAR}` env var interpolation
- Optional `--timeout` and `--mode` overrides
- Example: `benchmark from_config -c file.yaml --timeout 600`

## Tips

**Sample Count Control:**

- Priority: `--num_samples` > dataset size (when duration_s=None) > calculated (target_qps × duration)
- Offline default: duration_s=None, which runs until the dataset is exhausted
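A hypothetical helper (not part of the CLI) that mirrors this priority order:

```python
import math
from typing import Optional

def resolve_num_samples(num_samples: Optional[int],
                        dataset_size: int,
                        duration_s: Optional[float] = None,
                        target_qps: Optional[float] = None) -> int:
    """Illustrative only: --num_samples wins, then dataset size when no
    duration is set, then a count derived from the target load."""
    if num_samples is not None:   # explicit count overrides everything
        return num_samples
    if duration_s is None:        # offline default: exhaust the dataset
        return dataset_size
    return math.ceil(target_qps * duration_s)  # online: target_qps x duration

print(resolve_num_samples(None, 1000))                                 # 1000
print(resolve_num_samples(None, 1000, duration_s=30, target_qps=100))  # 3000
print(resolve_num_samples(50, 1000, duration_s=30, target_qps=100))    # 50
```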

**Mode Requirements:**

- Online mode requires `--load_pattern` (poisson or concurrency)
- `poisson` requires `--target_qps`
- `concurrency` requires `--concurrency`
- Use `--mode BOTH` for combined perf + accuracy runs
- Streaming: AUTO (default) resolves to OFF for offline, ON for online
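The online-mode rules above, expressed as a hypothetical validation helper (the real checks are enforced by the Pydantic schema, not by this function):

```python
from typing import List, Optional

def check_online_config(load_pattern: Optional[str],
                        target_qps: Optional[float],
                        concurrency: Optional[int]) -> List[str]:
    """Return a list of error messages for an online benchmark config."""
    errors = []
    if load_pattern is None:
        errors.append("online mode requires --load_pattern")
    elif load_pattern == "poisson" and target_qps is None:
        errors.append("poisson requires --target_qps")
    elif load_pattern == "concurrency" and concurrency is None:
        errors.append("concurrency requires --concurrency")
    return errors

print(check_online_config("poisson", None, None))
# ['poisson requires --target_qps']
print(check_online_config("concurrency", None, 8))
# []
```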

**Best Practices:**

- Share YAML configs for reproducible results across systems
- Use `--report_dir` for detailed metrics with TTFT, TPOT, and token analysis
- Set `HF_TOKEN` environment variable for non-public models
- Use `${VAR:-default}` in YAML for environment-specific configs