# CLI Quick Reference

## Architecture

The CLI is auto-generated from Pydantic models in `config/schema.py` using
the `pydantic-settings` CliApp. `schema.py` is the single source of truth for
both YAML configs and CLI flags.

- **Flat aliases** (`-e`, `-m`, `-d`) for frequently used fields
- **All schema fields** available as CLI flags on each subcommand
- **Environment variables** supported via `pydantic-settings` (`ENDPOINT_CONFIG__ENDPOINTS=...`)
- **`${VAR}` interpolation** in YAML files (with `${VAR:-default}` fallback)
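As a stdlib-only illustration of this design (the real CLI is generated by the `pydantic-settings` CliApp from `config/schema.py`, not by argparse), flags can be derived from a single schema definition so the schema stays the only source of truth. The field names and short aliases below simply mirror the documented `-e`/`-m`/`-d` flags; treat this as a sketch of the idea, not the project's implementation.

```python
import argparse
import dataclasses

@dataclasses.dataclass
class BenchmarkConfig:
    # Illustrative stand-in for the real Pydantic schema
    endpoints: str = ""
    model: str = ""
    dataset: str = ""

# Hypothetical alias table mirroring the documented -e/-m/-d flags
SHORT = {"endpoints": "-e", "model": "-m", "dataset": "-d"}

def build_parser(schema) -> argparse.ArgumentParser:
    """Derive one CLI flag per schema field, so adding a field to the
    schema automatically adds a flag."""
    parser = argparse.ArgumentParser()
    for field in dataclasses.fields(schema):
        names = [f"--{field.name}"]
        if field.name in SHORT:
            names.insert(0, SHORT[field.name])
        parser.add_argument(*names, default=field.default)
    return parser

args = build_parser(BenchmarkConfig).parse_args(
    ["-e", "http://localhost:8000", "-m", "Qwen/Qwen3-8B", "-d", "data.pkl"]
)
print(args.endpoints)  # http://localhost:8000
```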

## Commands

### Performance Benchmarking

```bash
# Offline (max throughput)
inference-endpoint benchmark offline \
  -e URL \
  -m Qwen/Qwen3-8B \
  -d tests/datasets/dummy_1k.pkl

# Online (sustained QPS - requires --load_pattern, --target_qps)
inference-endpoint benchmark online \
  -e URL \
  -m Qwen/Qwen3-8B \
  -d tests/datasets/dummy_1k.pkl \
  --load_pattern poisson \
  --target_qps 100

# With detailed report generation
inference-endpoint benchmark offline \
  -e URL \
  -m Qwen/Qwen3-8B \
  -d tests/datasets/dummy_1k.pkl \
  --report_dir my_benchmark_report

# YAML-based
inference-endpoint benchmark from_config \
  -c test.yaml
```

**Default Test Dataset:** Use `tests/datasets/dummy_1k.pkl` (1000 samples, ~133 KB) for local testing.

### Evaluation & Validation

```bash
# Accuracy evaluation
inference-endpoint eval --dataset gpqa,aime --endpoints URL

# Test endpoint connectivity
inference-endpoint probe \
  -e URL \
  --model gpt-3.5-turbo

# Validate YAML config
inference-endpoint validate -c test.yaml
```

### Utilities

```bash
inference-endpoint init --template offline  # or: online, eval, submission
inference-endpoint info
```

## Common Options (Benchmark Subcommands)

**Required:**

- `-e, --endpoints URL` - Endpoint URL(s), comma-separated
- `-m, --model NAME` - Model name (e.g., Qwen/Qwen3-8B)
- `-d, --dataset PATH` - Dataset file path

**Optional:**

- `--api_type {OPENAI,SGLANG}` - API type (default: OPENAI)
- `--api_key KEY` - API authentication
- `--workers N` - HTTP workers (-1 = auto, default: -1)
- `--max_connections N` - Max TCP connections (-1 = unlimited)
- `--duration_s SEC` - Duration in seconds
- `--num_samples N` - Explicit sample count (overrides duration calculation)
- `--streaming {AUTO,ON,OFF}` - Streaming mode (default: AUTO, resolves to OFF for offline, ON for online)
- `--mode {PERF,ACC,BOTH}` - Test mode (default: PERF)
- `--temperature FLOAT` - Sampling temperature
- `--max_output_tokens N` - Max output tokens (default: 1024)
- `--min_output_tokens N` - Min output tokens
- `--report_dir PATH` - Report output directory
- `--timeout SEC` - Timeout in seconds

**Online-specific (required for `benchmark online`):**

- `--load_pattern {POISSON,CONCURRENCY}` - Load pattern
- `--target_qps N` - Target QPS (for poisson)
- `--concurrency N` - Concurrent requests (for concurrency)

## Environment Variables

**In YAML files**, use `${VAR}` or `${VAR:-default}` syntax:

```yaml
endpoint_config:
  endpoints:
    - "${ENDPOINT_URL}"
  api_key: "${API_KEY:-sk-test}"
model_params:
  name: "${MODEL_NAME:-Qwen/Qwen3-8B}"
```
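The `${VAR}` / `${VAR:-default}` expansion can be sketched in a few lines of Python. This is only an approximation of the documented behavior, not the project's actual loader:

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def interpolate(text: str) -> str:
    """Replace ${VAR} with the env value, falling back to the
    ${VAR:-default} default when VAR is unset."""
    def repl(m: re.Match) -> str:
        value = os.environ.get(m.group(1))
        if value is not None:
            return value
        return m.group(2) or ""
    return _PATTERN.sub(repl, text)

os.environ["ENDPOINT_URL"] = "http://localhost:8000"
os.environ.pop("API_KEY", None)
print(interpolate("endpoints: ${ENDPOINT_URL}"))    # endpoints: http://localhost:8000
print(interpolate("api_key: ${API_KEY:-sk-test}"))  # api_key: sk-test
```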

**Via pydantic-settings**, env vars auto-map to nested fields using the `__` separator:

```bash
export ENDPOINT_CONFIG__ENDPOINTS='["http://prod:8000"]'
export MODEL_PARAMS__NAME="llama-2-70b"
inference-endpoint benchmark offline -e http://x -m M -d D
# env vars override CLI values
```
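The `__` mapping can be modeled with stdlib Python; pydantic-settings does the real work (including validation against the schema), so treat this only as a sketch of the mechanism:

```python
import json

def nested_from_env(environ: dict) -> dict:
    """Split KEY__SUBKEY env var names into nested dicts,
    JSON-decoding values where possible (lists, numbers, etc.)."""
    config: dict = {}
    for key, raw in environ.items():
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            value = raw  # plain string
        node = config
        *parents, leaf = key.lower().split("__")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

env = {
    "ENDPOINT_CONFIG__ENDPOINTS": '["http://prod:8000"]',
    "MODEL_PARAMS__NAME": "llama-2-70b",
}
print(nested_from_env(env))
# {'endpoint_config': {'endpoints': ['http://prod:8000']}, 'model_params': {'name': 'llama-2-70b'}}
```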

## Dataset Formats


**CLI Mode** (`benchmark offline` / `benchmark online`):

- All parameters from command line
- Quick testing and iteration
- Example: `benchmark offline -e URL -m NAME -d FILE`

**YAML Mode** (`benchmark from_config`):

- All configuration from YAML file
- Reproducible, shareable configs
- Supports `${VAR}` env var interpolation
- Optional `--timeout` and `--mode` overrides
- Example: `benchmark from_config -c file.yaml --timeout 600`

## Tips

**Sample Count Control:**

- Priority: `--num_samples` > dataset size (when duration_s=None) > calculated (target_qps × duration)
- Offline default: duration_s=None, which runs until the dataset is exhausted
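A hypothetical helper (not part of the CLI) that mirrors this priority order:

```python
import math
from typing import Optional

def resolve_num_samples(num_samples: Optional[int],
                        dataset_size: int,
                        duration_s: Optional[float] = None,
                        target_qps: Optional[float] = None) -> int:
    """Illustrative only: --num_samples wins, then dataset size when no
    duration is set, then a count derived from the target load."""
    if num_samples is not None:   # explicit count overrides everything
        return num_samples
    if duration_s is None:        # offline default: exhaust the dataset
        return dataset_size
    return math.ceil(target_qps * duration_s)  # online: target_qps x duration

print(resolve_num_samples(None, 1000))                                 # 1000
print(resolve_num_samples(None, 1000, duration_s=30, target_qps=100))  # 3000
print(resolve_num_samples(50, 1000, duration_s=30, target_qps=100))    # 50
```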

**Mode Requirements:**

- Online mode requires `--load_pattern` (poisson or concurrency)
- `poisson` requires `--target_qps`
- `concurrency` requires `--concurrency`
- Use `--mode BOTH` for combined perf + accuracy runs
- Streaming: AUTO (default) resolves to OFF for offline, ON for online
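The online-mode rules above, expressed as a hypothetical validation helper (the real checks are enforced by the Pydantic schema, not by this function):

```python
from typing import List, Optional

def check_online_config(load_pattern: Optional[str],
                        target_qps: Optional[float],
                        concurrency: Optional[int]) -> List[str]:
    """Return a list of error messages for an online benchmark config."""
    errors = []
    if load_pattern is None:
        errors.append("online mode requires --load_pattern")
    elif load_pattern == "poisson" and target_qps is None:
        errors.append("poisson requires --target_qps")
    elif load_pattern == "concurrency" and concurrency is None:
        errors.append("concurrency requires --concurrency")
    return errors

print(check_online_config("poisson", None, None))
# ['poisson requires --target_qps']
print(check_online_config("concurrency", None, 8))
# []
```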

**Best Practices:**

- Share YAML configs for reproducible results across systems
- Use `--report_dir` for detailed metrics with TTFT, TPOT, and token analysis
- Set `HF_TOKEN` environment variable for non-public models
- Use `${VAR:-default}` in YAML for environment-specific configs