
Commit b4c9c69

hmellor authored and jingyu committed
[Docs] Add comprehensive CLI reference for all large vllm subcommands (vllm-project#22601)
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: jingyu <[email protected]>
1 parent 33ee436 commit b4c9c69

File tree

20 files changed: +199 -104 lines changed


docs/.nav.yml

Lines changed: 4 additions & 6 deletions
@@ -11,7 +11,7 @@ nav:
   - Quick Links:
     - User Guide: usage/README.md
     - Developer Guide: contributing/README.md
-    - API Reference: api/summary.md
+    - API Reference: api/README.md
     - CLI Reference: cli/README.md
   - Timeline:
     - Roadmap: https://roadmap.vllm.ai
@@ -58,11 +58,9 @@ nav:
   - CI: contributing/ci
   - Design Documents: design
   - API Reference:
-    - Summary: api/summary.md
-    - Contents:
-      - api/vllm/*
-  - CLI Reference:
-    - Summary: cli/README.md
+    - api/README.md
+    - api/vllm/*
+  - CLI Reference: cli
   - Community:
     - community/*
     - Blog: https://blog.vllm.ai
File renamed without changes: docs/api/summary.md → docs/api/README.md

docs/cli/.meta.yml

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+toc_depth: 3

docs/cli/.nav.yml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+nav:
+  - README.md
+  - serve.md
+  - chat.md
+  - complete.md
+  - run-batch.md
+  - vllm bench:
+    - bench/*.md
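
These two dotfiles are MkDocs configuration rather than page content: `.meta.yml` appears to cap the rendered table of contents at depth 3 for every page under `docs/cli/`, and `.nav.yml` pins the sidebar order of the new pages. A quick way to sanity-check nav changes like these is a local docs build; a minimal sketch, assuming the repository's standard MkDocs setup (the requirements path below is an assumption, not taken from this commit):

```bash
# Build and serve the docs locally, then inspect the new CLI Reference sidebar.
# requirements/docs.txt is assumed; use whichever docs requirements file the repo ships.
pip install -r requirements/docs.txt
mkdocs serve  # serves on http://127.0.0.1:8000 by default
```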

docs/cli/README.md

Lines changed: 43 additions & 42 deletions
@@ -1,7 +1,3 @@
----
-toc_depth: 4
----
-
 # vLLM CLI Guide
 
 The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
@@ -16,52 +12,48 @@ Available Commands:
 vllm {chat,complete,serve,bench,collect-env,run-batch}
 ```
 
-When passing JSON CLI arguments, the following sets of arguments are equivalent:
-
-- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
-- `--json-arg.key1 value1 --json-arg.key2.key3 value2`
-
-Additionally, list elements can be passed individually using `+`:
+## serve
 
-- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
-- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
+Starts the vLLM OpenAI Compatible API server.
 
-## serve
+Start with a model:
 
-Start the vLLM OpenAI Compatible API server.
+```bash
+vllm serve meta-llama/Llama-2-7b-hf
+```
 
-??? console "Examples"
+Specify the port:
 
-    ```bash
-    # Start with a model
-    vllm serve meta-llama/Llama-2-7b-hf
+```bash
+vllm serve meta-llama/Llama-2-7b-hf --port 8100
+```
 
-    # Specify the port
-    vllm serve meta-llama/Llama-2-7b-hf --port 8100
+Serve over a Unix domain socket:
 
-    # Serve over a Unix domain socket
-    vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
+```bash
+vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
+```
 
-    # Check with --help for more options
-    # To list all groups
-    vllm serve --help=listgroup
+Check with --help for more options:
 
-    # To view a argument group
-    vllm serve --help=ModelConfig
+```bash
+# To list all groups
+vllm serve --help=listgroup
 
-    # To view a single argument
-    vllm serve --help=max-num-seqs
+# To view an argument group
+vllm serve --help=ModelConfig
 
-    # To search by keyword
-    vllm serve --help=max
+# To view a single argument
+vllm serve --help=max-num-seqs
 
-    # To view full help with pager (less/more)
-    vllm serve --help=page
-    ```
+# To search by keyword
+vllm serve --help=max
 
-### Options
+# To view full help with pager (less/more)
+vllm serve --help=page
+```
 
---8<-- "docs/argparse/serve.md"
+See [vllm serve](./serve.md) for the full reference of all available arguments.
 
 ## chat
 
@@ -78,6 +70,8 @@ vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
 vllm chat --quick "hi"
 ```
 
+See [vllm chat](./chat.md) for the full reference of all available arguments.
+
 ## complete
 
 Generate text completions based on the given prompt via the running API server.
@@ -93,7 +87,7 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
 vllm complete --quick "The future of AI is"
 ```
 
-</details>
+See [vllm complete](./complete.md) for the full reference of all available arguments.
 
 ## bench
 
@@ -120,6 +114,8 @@ vllm bench latency \
   --load-format dummy
 ```
 
+See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.
+
 ### serve
 
 Benchmark the online serving throughput.
@@ -134,6 +130,8 @@ vllm bench serve \
   --num-prompts 5
 ```
 
+See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
+
 ### throughput
 
 Benchmark offline inference throughput.
@@ -147,6 +145,8 @@ vllm bench throughput \
   --load-format dummy
 ```
 
+See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.
+
 ## collect-env
 
 Start collecting environment information.
@@ -159,24 +159,25 @@ vllm collect-env
 
 Run batch prompts and write results to file.
 
-<details>
-<summary>Examples</summary>
+Running with a local file:
 
 ```bash
-# Running with a local file
 vllm run-batch \
   -i offline_inference/openai_batch/openai_example_batch.jsonl \
   -o results.jsonl \
   --model meta-llama/Meta-Llama-3-8B-Instruct
+```
 
-# Using remote file
+Using a remote file:
+
+```bash
 vllm run-batch \
   -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
   -o results.jsonl \
   --model meta-llama/Meta-Llama-3-8B-Instruct
 ```
 
-</details>
+See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
 
 ## More Help
 
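
The JSON-argument tip deleted from the guide above is not lost: the new bench stubs below pull in `docs/cli/json_tip.inc.md`, which presumably carries the same text. For concreteness, the two equivalent spellings the tip describes would look like this in practice; a minimal sketch using the tip's own `--json-arg` placeholder (not a real vllm option):

```bash
# --json-arg is the docs' placeholder for any JSON-valued flag, not an actual option.
# The following pairs of invocations are equivalent:
vllm serve meta-llama/Llama-2-7b-hf \
  --json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'
vllm serve meta-llama/Llama-2-7b-hf \
  --json-arg.key1 value1 --json-arg.key2.key3 value2

# List elements can also be appended individually with `+`:
vllm serve meta-llama/Llama-2-7b-hf \
  --json-arg '{"key4": ["value3", "value4", "value5"]}'
vllm serve meta-llama/Llama-2-7b-hf \
  --json-arg.key4+ value3 --json-arg.key4+='value4,value5'
```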

docs/cli/bench/latency.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# vllm bench latency
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Options
+
+--8<-- "docs/argparse/bench_latency.md"

docs/cli/bench/serve.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# vllm bench serve
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Options
+
+--8<-- "docs/argparse/bench_serve.md"

docs/cli/bench/throughput.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# vllm bench throughput
+
+## JSON CLI Arguments
+
+--8<-- "docs/cli/json_tip.inc.md"
+
+## Options
+
+--8<-- "docs/argparse/bench_throughput.md"
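
Each bench stub is deliberately thin: a title, the shared JSON tip include, and an options table generated into `docs/argparse/` (the `--8<--` markers are snippet includes that are expanded at build time). The same information is available straight from the CLI; a quick check, assuming a local vLLM install:

```bash
# Print the argparse help that these generated pages document.
vllm bench latency --help
vllm bench serve --help
vllm bench throughput --help
```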

docs/cli/chat.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# vllm chat
+
+## Options
+
+--8<-- "docs/argparse/chat.md"

docs/cli/complete.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# vllm complete
+
+## Options
+
+--8<-- "docs/argparse/complete.md"
