Commit 18d8c4e

Commit message: updates
1 parent: f13db30

28 files changed: +2471 -2545 lines changed

docs/CLI_QUICK_REFERENCE.md

Lines changed: 92 additions & 55 deletions
@@ -1,34 +1,45 @@
 # CLI Quick Reference
 
+## Architecture
+
+The CLI is auto-generated from Pydantic models in `config/schema.py` using
+`pydantic-settings` CliApp. schema.py is the single source of truth for both
+YAML configs and CLI flags.
+
+- **Flat aliases** (`-e`, `-m`, `-d`) for frequently used fields
+- **All schema fields** available as CLI flags on each subcommand
+- **Environment variables** supported via `pydantic-settings` (`ENDPOINT_CONFIG__ENDPOINTS=...`)
+- **`${VAR}` interpolation** in YAML files (with `${VAR:-default}` fallback)
+
 ## Commands
 
 ### Performance Benchmarking
 
 ```bash
-# Offline (max throughput - CLI mode)
+# Offline (max throughput)
 inference-endpoint benchmark offline \
-  --endpoints URL \
-  --model Qwen/Qwen3-8B \
-  --dataset tests/datasets/dummy_1k.pkl
+  -e URL \
+  -m Qwen/Qwen3-8B \
+  -d tests/datasets/dummy_1k.pkl
 
-# Online (sustained QPS - CLI mode - requires --target-qps, --load-pattern)
+# Online (sustained QPS - requires --load_pattern, --target_qps)
 inference-endpoint benchmark online \
-  --endpoints URL \
-  --model Qwen/Qwen3-8B \
-  --dataset tests/datasets/dummy_1k.pkl \
-  --load-pattern poisson \
-  --target-qps 100
+  -e URL \
+  -m Qwen/Qwen3-8B \
+  -d tests/datasets/dummy_1k.pkl \
+  --load_pattern poisson \
+  --target_qps 100
 
 # With detailed report generation
 inference-endpoint benchmark offline \
-  --endpoints URL \
-  --model Qwen/Qwen3-8B \
-  --dataset tests/datasets/dummy_1k.pkl \
-  --report-dir my_benchmark_report
-
-# YAML-based (YAML mode - no CLI overrides)
-inference-endpoint benchmark from-config \
-  --config test.yaml
+  -e URL \
+  -m Qwen/Qwen3-8B \
+  -d tests/datasets/dummy_1k.pkl \
+  --report_dir my_benchmark_report
+
+# YAML-based
+inference-endpoint benchmark from_config \
+  -c test.yaml
 ```
 
 **Default Test Dataset:** Use `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB) for local testing.
@@ -44,12 +55,11 @@ inference-endpoint eval --dataset gpqa,aime --endpoints URL
 ```bash
 # Test endpoint connectivity
 inference-endpoint probe \
-  --endpoints URL \
-  --model gpt-3.5-turbo \
-  --api-key KEY
+  -e URL \
+  --model gpt-3.5-turbo
 
 # Validate YAML config
-inference-endpoint validate --config test.yaml
+inference-endpoint validate -c test.yaml
 ```
 
 ### Utilities
@@ -62,31 +72,57 @@ inference-endpoint init --template offline # or: online, eval, submission
 inference-endpoint info
 ```
 
-## Common Options
+## Common Options (Benchmark Subcommands)
+
+**Required:**
+
+- `-e, --endpoints URL` - Endpoint URL(s), comma-separated
+- `-m, --model NAME` - Model name (e.g., Qwen/Qwen3-8B)
+- `-d, --dataset PATH` - Dataset file path
 
-- `--endpoints, -e URL` - Endpoint URL (required for CLI mode)
-- `--model NAME` - Model name (required for CLI mode, e.g., Qwen/Qwen3-8B)
-- `--dataset, -d PATH` - Dataset file (required for CLI mode)
-- `--config, -c PATH` - YAML config file (required for from-config mode)
-- `--report-dir PATH` - Save detailed benchmark report with metrics
-- `--verbose, -v` - Increase verbosity (-vv for debug)
+**Optional:**
 
-## Benchmark Options (CLI Mode Only)
+- `--api_type {OPENAI,SGLANG}` - API type (default: OPENAI)
+- `--api_key KEY` - API authentication
+- `--workers N` - HTTP workers (-1=auto, default: -1)
+- `--max_connections N` - Max TCP connections (-1=unlimited)
+- `--duration_s SEC` - Duration in seconds
+- `--num_samples N` - Explicit sample count (overrides duration calculation)
+- `--streaming {AUTO,ON,OFF}` - Streaming mode (default: AUTO, resolves to OFF for offline, ON for online)
+- `--mode {PERF,ACC,BOTH}` - Test mode (default: PERF)
+- `--temperature FLOAT` - Sampling temperature
+- `--max_output_tokens N` - Max output tokens (default: 1024)
+- `--min_output_tokens N` - Min output tokens
+- `--report_dir PATH` - Report output directory
+- `--timeout SEC` - Timeout in seconds
 
-- `--api-key KEY` - API authentication
-- `--target-qps N` - Target queries per second (required when --load-pattern=poisson)
-- `--duration SEC` - Test duration in seconds (default: 0 - run until dataset exhausted)
-- `--num-samples N` - Number of samples to issue (overrides dataset size and duration calculation)
-- `--streaming MODE` - Streaming control: `auto` (default), `on`, or `off`. Streaming will enable token streaming in response.
-- `--workers N` - HTTP workers (default: 4)
-- `--mode MODE` - Test mode: `perf` (default), `acc`, or `both`
-- `--min-output-tokens N` - Min output tokens
-- `--max-output-tokens N` - Max output tokens
+**Online-specific (required for `benchmark online`):**
 
-## Online-Specific Options
+- `--load_pattern {POISSON,CONCURRENCY}` - Load pattern
+- `--target_qps N` - Target QPS (for poisson)
+- `--concurrency N` - Concurrent requests (for concurrency)
 
-- `--load-pattern TYPE` - Load pattern (required): `poisson`, `concurrency`
-- `--concurrency N` - Max concurrent requests (required when --load-pattern=concurrency)
+## Environment Variables
+
+**In YAML files** — use `${VAR}` or `${VAR:-default}` syntax:
+
+```yaml
+endpoint_config:
+  endpoints:
+    - "${ENDPOINT_URL}"
+  api_key: "${API_KEY:-sk-test}"
+model_params:
+  name: "${MODEL_NAME:-Qwen/Qwen3-8B}"
+```
+
+**Via pydantic-settings** — env vars auto-map to nested fields using `__` separator:
+
+```bash
+export ENDPOINT_CONFIG__ENDPOINTS='["http://prod:8000"]'
+export MODEL_PARAMS__NAME="llama-2-70b"
+inference-endpoint benchmark offline -e http://x -m M -d D
+# env vars override CLI values
+```
 
 ## Dataset Formats
 
@@ -252,33 +288,34 @@ endpoint_config:
 
 - All parameters from command line
 - Quick testing and iteration
-- Examples: `benchmark offline --endpoints URL --model NAME --dataset FILE`
+- Example: `benchmark offline -e URL -m NAME -d FILE`
 
-**YAML Mode** (`benchmark from-config`):
+**YAML Mode** (`benchmark from_config`):
 
 - All configuration from YAML file
 - Reproducible, shareable configs
-- No CLI parameter mixing (only `--timeout` auxiliary allowed)
-- Example: `benchmark from-config --config file.yaml --timeout 600`
+- Supports `${VAR}` env var interpolation
+- Optional `--timeout` and `--mode` overrides
+- Example: `benchmark from_config -c file.yaml --timeout 600`
 
 ## Tips
 
 **Sample Count Control:**
 
-- Sample priority: `--num-samples` > dataset size (duration=0) > calculated (target_qps × duration)
-- Default duration: 0 (runs until dataset exhausted or max_duration reached)
+- Priority: `--num_samples` > dataset size (duration_s=None) > calculated (target_qps × duration)
+- Offline default: duration_s=None → 0ms → runs until dataset exhausted
 
 **Mode Requirements:**
 
-- Online mode requires `--load-pattern` (poisson or concurrency)
-- `--load-pattern poisson` requires `--target-qps`
-- `--load-pattern concurrency` requires `--concurrency`
-- Use `--mode both` for combined perf + accuracy runs
-- Streaming: auto (default) enables streaming responses for online, disables for offline
+- Online mode requires `--load_pattern` (poisson or concurrency)
+- `poisson` requires `--target_qps`
+- `concurrency` requires `--concurrency`
+- Use `--mode BOTH` for combined perf + accuracy runs
+- Streaming: AUTO (default) resolves to OFF for offline, ON for online
 
 **Best Practices:**
 
 - Share YAML configs for reproducible results across systems
-- Use `--report-dir` for detailed metrics with TTFT, TPOT, and token analysis
+- Use `--report_dir` for detailed metrics with TTFT, TPOT, and token analysis
 - Set `HF_TOKEN` environment variable for non-public models
-- Use `--min-output-tokens` and `--max-output-tokens` to control output length
+- Use `${VAR:-default}` in YAML for environment-specific configs
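The `${VAR}` / `${VAR:-default}` syntax documented above can be illustrated with a short sketch. This is an illustration of the documented behavior only, not the repository's actual interpolation code; the `interpolate` helper and its regex are assumptions.

```python
# Sketch of ${VAR} / ${VAR:-default} interpolation as the doc describes it.
# Hypothetical helper -- not the repository's implementation.
import os
import re

_VAR = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}")


def interpolate(text: str) -> str:
    """Replace ${VAR} with the env value, falling back to ${VAR:-default}."""
    def repl(m: re.Match) -> str:
        value = os.environ.get(m.group("name"))
        if value is None:
            value = m.group("default")
        if value is None:
            raise KeyError(f"environment variable {m.group('name')!r} is not set")
        return value

    return _VAR.sub(repl, text)


os.environ.pop("API_KEY", None)
print(interpolate('api_key: "${API_KEY:-sk-test}"'))  # api_key: "sk-test"
```

With `API_KEY` unset, the `:-` fallback is used; a plain `${VAR}` with no fallback and no env value raises instead of silently emitting an empty string.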

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -47,6 +47,8 @@ dependencies = [
     "sentencepiece==0.2.1",
     "protobuf==6.33.0",
     "openai_harmony==0.0.8",
+    # CLI framework
+    "pydantic-settings>=2.7",
     # Color support for cross-platform terminals
     "colorama==0.4.6",
     # Fix pytz-2024 import warning
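The newly added `pydantic-settings` dependency is what powers the `__`-separated env var mapping shown in the quick reference (`ENDPOINT_CONFIG__ENDPOINTS=...`). A stdlib-only sketch of that mapping idea, for intuition — the real parsing (including validation against the schema) is done by pydantic-settings, and `env_to_nested` here is a hypothetical helper:

```python
# Illustration of how "__"-separated env vars map onto nested config fields,
# in the spirit of pydantic-settings' env_nested_delimiter. Not the real parser.
import json


def env_to_nested(environ: dict[str, str]) -> dict:
    """Split keys on '__', lowercase the parts, and build a nested dict."""
    config: dict = {}
    for key, raw in environ.items():
        parts = [p.lower() for p in key.split("__")]
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        try:
            node[parts[-1]] = json.loads(raw)  # complex values arrive as JSON
        except json.JSONDecodeError:
            node[parts[-1]] = raw  # plain strings stay as-is
    return config


print(env_to_nested({"ENDPOINT_CONFIG__ENDPOINTS": '["http://prod:8000"]'}))
# {'endpoint_config': {'endpoints': ['http://prod:8000']}}
```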
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Async runner with uvloop and eager_task_factory."""
+
+from __future__ import annotations
+
+import asyncio
+from collections.abc import Coroutine
+from typing import TypeVar
+
+import uvloop
+
+T = TypeVar("T")
+
+
+def run_async(coro: Coroutine[object, object, T]) -> T:
+    """Run a coroutine with uvloop and eager_task_factory.
+
+    Creates a fresh event loop per invocation. This is the standard way for
+    Typer command handlers (which are sync) to execute async logic.
+    """
+    with asyncio.Runner(loop_factory=uvloop.new_event_loop) as runner:
+        runner.get_loop().set_task_factory(asyncio.eager_task_factory)  # type: ignore[arg-type]
+        return runner.run(coro)
