Commit f690570

[VLM] Add a CLI plugin system for mlperf-inf-mm-q3vl benchmark (mlcommons#2420)
* Introduce the mlperf-inf-mm-q3vl benchmark plugin system
* fix circular import
* [Automated Commit] Format Codebase
* update README

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 27db053 commit f690570

File tree

5 files changed: +302 −51 lines changed


multimodal/qwen3-vl/README.md

Lines changed: 195 additions & 0 deletions
@@ -268,6 +268,201 @@ bash submit.sh --help
- Testing duration $\ge$ 10 mins.
- Sample concatenation permutation is enabled.

## Plugin System for `mlperf-inf-mm-q3vl benchmark`

The `mlperf-inf-mm-q3vl` package supports a plugin system that allows third-party
packages to register additional subcommands under `mlperf-inf-mm-q3vl benchmark`. This
uses Python's standard entry points mechanism.

The purpose of this feature is to let benchmark result submitters customize and fit
`mlperf-inf-mm-q3vl` to the inference system they want to benchmark, **without**
directly modifying the source code of `mlperf-inf-mm-q3vl`, which is frozen once the
benchmark has been finalized.

### How it works

1. **Plugin Discovery**: When the CLI starts, it automatically discovers all registered
   plugins via the `mlperf_inf_mm_q3vl.benchmark_plugins` entry point group.
2. **Plugin Loading**: Each plugin's entry point function is called to retrieve either a
   single command or a Typer app.
3. **Command Registration**: The plugin's commands are automatically added to the
   `benchmark` subcommand group, as sketched below.
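
This flow can be pictured with a short sketch. It is illustrative only, using a plain
`typer.Typer` app as a stand-in for the `benchmark` subcommand group (the real loader
lives inside `mlperf-inf-mm-q3vl` and builds on `pydantic_typer`):

```python
from importlib.metadata import entry_points

from typer import Typer

benchmark_app = Typer()  # stand-in for the real `benchmark` subcommand group


def load_benchmark_plugins(app: Typer) -> None:
    """Discover registered plugins and mount their commands on the app."""
    for ep in entry_points(group="mlperf_inf_mm_q3vl.benchmark_plugins"):
        registered = ep.load()()  # import the entry point, then call it
        if isinstance(registered, tuple):
            sub_app, name = registered  # a Typer app with nested subcommands
            app.add_typer(sub_app, name=name)
        else:
            # A single command: the entry point name becomes the subcommand name.
            app.command(name=ep.name)(registered)


load_benchmark_plugins(benchmark_app)
```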

### Example: creating a `mlperf-inf-mm-q3vl-foo` plugin package for `mlperf-inf-mm-q3vl benchmark foo`

#### Step 1: Package Structure

Create a new Python package with the following structure:

```
mlperf-inf-mm-q3vl-foo/
├── pyproject.toml
└── src/
    └── mlperf_inf_mm_q3vl_foo/
        ├── __init__.py
        ├── schema.py
        ├── deploy.py
        └── plugin.py
```

Note that this is only a minimal, illustrative example. Users are free to structure and
name their Python packages and modules however they wish. A hypothetical `schema.py`
might simply extend the core endpoint schema, as sketched below.
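
For instance, `schema.py` could hold the `FooEndpoint` model used in the next step.
This is a hypothetical sketch assuming the core `Endpoint` schema is a pydantic model
that can be subclassed; the extra field is invented purely for illustration:

```python
"""Hypothetical schema.py for the Foo plugin."""

from mlperf_inf_mm_q3vl.schema import Endpoint


class FooEndpoint(Endpoint):
    """Endpoint spec for the Foo inference system."""

    # Invented example field: how many Foo worker replicas to deploy.
    num_foo_workers: int = 1
```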

#### Step 2: Implement the `mlperf-inf-mm-q3vl-foo` plugin

Create your plugin entry point function in `plugin.py`:

```python
"""Plugin to support benchmarking the Foo inference system."""

from collections.abc import Callable
from typing import Annotated

from loguru import logger
from typer import Option

# run_benchmark is the core entry point added in this commit; adjust the module
# path if your installed version exposes it elsewhere.
from mlperf_inf_mm_q3vl.benchmark import run_benchmark
from mlperf_inf_mm_q3vl.log import setup_loguru_for_benchmark
from mlperf_inf_mm_q3vl.schema import Dataset, Settings, Verbosity

from .schema import FooEndpoint


def register_foo_benchmark() -> Callable:
    """Entry point for the plugin to benchmark the Foo inference system.

    This function is called when the CLI discovers the plugin.
    It should return either:

    - a single command function (decorated with appropriate options), or
    - a tuple of (Typer app, command name) for more complex hierarchies.
    """

    def benchmark_foo(
        *,
        settings: Settings,
        dataset: Dataset,
        # Add your foo-specific parameters here.
        foo: FooEndpoint,
        custom_param: Annotated[
            int,
            Option(help="Custom parameter for the Foo backend."),
        ] = 2,
        random_seed: Annotated[
            int,
            Option(help="The seed for the random number generator."),
        ] = 12345,
        verbosity: Annotated[
            Verbosity,
            Option(help="The verbosity level of the logger."),
        ] = Verbosity.INFO,
    ) -> None:
        """Deploy and benchmark using the Foo backend.

        This command deploys a model using the Foo backend
        and runs the MLPerf benchmark against it.
        """
        from .deploy import FooDeployer

        setup_loguru_for_benchmark(settings=settings, verbosity=verbosity)
        logger.info(
            "Starting to benchmark the Foo inference system "
            "with endpoint spec {} and custom param {}",
            foo,
            custom_param,
        )
        # Your implementation here.
        with FooDeployer(endpoint=foo, settings=settings, custom_param=custom_param):
            # FooDeployer makes sure that Foo is deployed and currently healthy.
            # Run the benchmark using the core run_benchmark function.
            run_benchmark(
                settings=settings,
                dataset=dataset,
                endpoint=foo,
                random_seed=random_seed,
            )

    # Return the command function.
    # The entry point name will be used as the subcommand name.
    return benchmark_foo
```
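
The command above relies on `deploy.py` providing `FooDeployer`, a context manager that
brings Foo up before the benchmark and tears it down afterwards. A hypothetical minimal
sketch (none of these names or behaviors are part of `mlperf-inf-mm-q3vl` itself):

```python
"""Hypothetical deploy.py for the Foo plugin."""

from types import TracebackType

from mlperf_inf_mm_q3vl.schema import Settings

from .schema import FooEndpoint


class FooDeployer:
    """Deploys Foo on entry and shuts it down on exit."""

    def __init__(
        self, *, endpoint: FooEndpoint, settings: Settings, custom_param: int
    ) -> None:
        self.endpoint = endpoint
        self.settings = settings
        self.custom_param = custom_param

    def __enter__(self) -> "FooDeployer":
        ...  # Launch the Foo server and block until its health check passes.
        return self

    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc: BaseException | None,
        tb: TracebackType | None,
    ) -> None:
        ...  # Shut the Foo server down, even if the benchmark raised.
```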

#### Step 3: Configure `pyproject.toml`

Register the plugin in its package's `pyproject.toml`:

```toml
[project]
name = "mlperf-inf-mm-q3vl-foo"
version = "0.1.0"
description = "Enable mlperf-inf-mm-q3vl to benchmark the Foo inference system."
requires-python = ">=3.12"
dependencies = [
    "mlperf-inf-mm-q3vl @ git+https://github.com/mlcommons/inference.git#subdirectory=multimodal/qwen3-vl/",
    # Add your backend-specific dependencies here.
]

[project.entry-points."mlperf_inf_mm_q3vl.benchmark_plugins"]
# The key here becomes the subcommand name.
foo = "mlperf_inf_mm_q3vl_foo.plugin:register_foo_benchmark"

[build-system]
requires = ["setuptools>=80"]
build-backend = "setuptools.build_meta"
```
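
After installation, you can confirm the entry point is visible to Python's discovery
machinery with nothing but the standard library:

```python
from importlib.metadata import entry_points

# Lists every registered benchmark plugin; expect "foo" once the package is installed.
for ep in entry_points(group="mlperf_inf_mm_q3vl.benchmark_plugins"):
    print(ep.name, "->", ep.value)
```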

#### Step 4: Install and use `mlperf-inf-mm-q3vl benchmark foo`

```bash
# Install your plugin package
pip install mlperf-inf-mm-q3vl-foo

# The new subcommand is now available
mlperf-inf-mm-q3vl benchmark foo --help
mlperf-inf-mm-q3vl benchmark foo \
    --settings-file settings.toml \
    --dataset shopify-global-catalogue \
    --custom-param 3
```

#### Advanced: Nested Subcommands

If you want to create multiple subcommands under a single plugin (e.g.,
`mlperf-inf-mm-q3vl benchmark foo standard` and
`mlperf-inf-mm-q3vl benchmark foo optimized`), return a tuple of `(Typer app, name)`:

```python
from pydantic_typer import Typer


def register_foo_benchmark() -> tuple[Typer, str]:
    """Entry point that creates nested subcommands."""
    # Create a Typer app for your plugin
    foo_app = Typer(help="Benchmarking options for the Foo inference systems.")

    @foo_app.command(name="standard")
    def foo_standard() -> None:  # add your parameters here
        """Run the standard Foo benchmark."""
        ...  # Implementation

    @foo_app.command(name="optimized")
    def foo_optimized() -> None:  # add your parameters here
        """Run the optimized Foo benchmark with maximum performance."""
        ...  # Implementation

    # Return a tuple of (app, command name)
    return (foo_app, "foo")
```

This will create:

- `mlperf-inf-mm-q3vl benchmark foo standard`
- `mlperf-inf-mm-q3vl benchmark foo optimized`

### Best Practices

1. Dependencies: Declare `mlperf-inf-mm-q3vl` as a dependency in your plugin package.
2. Documentation: Provide clear docstrings for your plugin commands; they appear in
   `--help` output.
3. Schema Reuse: Reuse the core `Settings`, `Dataset`, and other schemas from
   `mlperf_inf_mm_q3vl.schema` for consistency and to minimize boilerplate code.
4. Lazy Imports: If your plugin has heavy dependencies, import them inside functions
   rather than at module level to avoid slowing down CLI startup.

## Developer Guide

multimodal/qwen3-vl/scripts/slurm/submit.sh

Lines changed: 2 additions & 2 deletions
@@ -99,12 +99,12 @@ while [[ $# -gt 0 ]]; do
         shift
         ;;
     -seq | --server-expected-qps)
-        server_expected_qps=$2
+        server_target_qps=$2
         shift
         shift
         ;;
     -seq=* | --server-expected-qps=*)
-        server_expected_qps=${1#*=}
+        server_target_qps=${1#*=}
         shift
         ;;
     -tps | --tensor-parallel-size)
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@

"""Core benchmark execution logic for the Qwen3-VL (Q3VL) benchmark."""

from __future__ import annotations

import mlperf_loadgen as lg
from loguru import logger

from .schema import Dataset, Endpoint, Settings
from .task import ShopifyGlobalCatalogue


def run_benchmark(
    settings: Settings,
    dataset: Dataset,
    endpoint: Endpoint,
    random_seed: int,
) -> None:
    """Run the Qwen3-VL (Q3VL) benchmark."""
    logger.info("Running Qwen3-VL (Q3VL) benchmark with settings: {}", settings)
    logger.info("Running Qwen3-VL (Q3VL) benchmark with dataset: {}", dataset)
    logger.info("Running Qwen3-VL (Q3VL) benchmark with OpenAI API endpoint: {}", endpoint)
    logger.info("Running Qwen3-VL (Q3VL) benchmark with random seed: {}", random_seed)
    test_settings, log_settings = settings.to_lgtype()
    task = ShopifyGlobalCatalogue(
        dataset=dataset,
        endpoint=endpoint,
        settings=settings.test,
        random_seed=random_seed,
    )
    sut = task.construct_sut()
    qsl = task.construct_qsl()
    logger.info("Starting the Qwen3-VL (Q3VL) benchmark with LoadGen...")
    lg.StartTestWithLogSettings(sut, qsl, test_settings, log_settings)
    logger.info("The Qwen3-VL (Q3VL) benchmark with LoadGen completed.")
    lg.DestroyQSL(qsl)
    lg.DestroySUT(sut)
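
For readers unfamiliar with LoadGen's Python bindings: `construct_sut` and
`construct_qsl` presumably wrap LoadGen's standard constructors. A minimal sketch of
that pattern with invented no-op callbacks (the real `ShopifyGlobalCatalogue` task
supplies callbacks that talk to the endpoint and the dataset):

```python
import mlperf_loadgen as lg


def issue_queries(query_samples) -> None:
    # Send each sample to the system under test; here we complete them immediately.
    lg.QuerySamplesComplete(
        [lg.QuerySampleResponse(sample.id, 0, 0) for sample in query_samples]
    )


def flush_queries() -> None:
    # Called by LoadGen when outstanding queries should be flushed.
    pass


sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(
    1024,  # total sample count (invented)
    1024,  # performance sample count (invented)
    lambda samples: None,  # load samples into memory
    lambda samples: None,  # unload samples
)
```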
