
Commit 0a20748

Authored by Victor49152, nvzhihanj, and github-code-quality[bot]
feat: Add shopify dataset for q3vl inference (#152)
* Initial commit for adding shopify dataset to predefined
* Temporal test script
* Tested shopify dataset
* Pre-commit formatting
* Remove unused counter and modified default image format
* Apply the suggestion to use df.to_dict instead of iterrows
* Unused import
* Update folder and class names
* Offline perf yaml verified
* Rename
* Accuracy results tested
* Add the updated Readme
* Add unit tests for new preset dataset and scorer
* Potential fix for pull request finding 'Unused global variable'
* Config and readme related updates
* Refactor metadata related schema
* load_from_huggingface returns an HF dataset instead of a pandas frame
* Follow-up fixes for updating load_from_huggingface
* Add Pillow to dependencies for dataset decoding
* Add logging for worker settings and make the worker init timeout configurable
* load_from_disk directly loads what's inside; no split kwarg supported
* No default cache dir needed
* Make zmq_recv/send buffer size configurable
* Finalize the yaml file
* Add new yaml config args to unit test
* Fix typing
* Put metadata in a separate file for better formatting
* Format fix; ruff fix
* Remove redundant type checks as HF dataset sample is fetched
* Update pytest as PIL image input is assumed
* Potential fix for pull request finding 'An assert statement has a side-effect'
* Revert "Adding Pillow to dependency for dataset decoding" (reverts commit 812fad1)
* Rename output to response to align with the function name
* Add example calculation in docstring and update the naming
* Revert the changes to load_from_huggingface; use HF's original load_dataset
* Remove uv file
* Fix Pillow version

Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
1 parent 56971a5 commit 0a20748

File tree

19 files changed: +1322 additions, -14 deletions
Lines changed: 79 additions & 0 deletions

# Running Endpoints with Qwen3-VL-235B-A22B on Shopify Product Catalogue

This document describes how to perform MLPerf Q3VL benchmarking using the inference endpoints with the [Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct) model and [Shopify's Product Catalogue dataset](https://huggingface.co/datasets/Shopify/product-catalogue) for multimodal product taxonomy classification.

## Get Dataset

The Shopify Product Catalogue dataset is loaded from HuggingFace and is generated automatically on the first run. Images are converted to base64 for storage.

```
# Dataset is auto-downloaded from https://huggingface.co/datasets/Shopify/product-catalogue
# No manual download required - DataLoaderFactory handles it
```
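The base64-for-storage step is just an encode/decode round trip. A minimal sketch (the helper names and the stand-in bytes are illustrative; the real loading and decoding live in `DataLoaderFactory`, not in this snippet):

```python
import base64

def encode_image(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 string for storage."""
    return base64.b64encode(image_bytes).decode("ascii")

def decode_image(b64: str) -> bytes:
    """Recover the raw bytes from the stored base64 string."""
    return base64.b64decode(b64)

# Stand-in for decoded image bytes (a fake PNG header plus padding).
fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 32
stored = encode_image(fake_image)
assert decode_image(stored) == fake_image
```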
## Get Model

Download the model checkpoint:

```bash
export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct
export HF_TOKEN=<your Hugging Face token>  # Optional for public model; may help with rate limits
hf download $MODEL_NAME
```
The model is available at [Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct); no access request is required.

**Note:** The Shopify Product Catalogue includes `ground_truth_category`, `ground_truth_brand`, and `ground_truth_is_secondhand` from the HuggingFace dataset. For accuracy evaluation, use the `shopify_category_f1` scorer, which computes hierarchical F1 over the category taxonomy (matching the [MLCommons Q3VL evaluation](https://github.com/mlcommons/inference/blob/master/multimodal/qwen3-vl/src/mlperf_inf_mm_q3vl/evaluation.py)).

To add accuracy evaluation, include an accuracy dataset alongside the performance dataset:

```yaml
datasets:
  - name: shopify_product_catalogue::q3vl
    type: "performance"
    force: true
  - name: shopify_product_catalogue::q3vl
    type: "accuracy"
    force: true
    accuracy_config:
      eval_method: "shopify_category_f1"
      ground_truth: "ground_truth_category"
      extractor: "identity_extractor"  # Required by benchmark; scorer parses JSON internally
      num_repeats: 1
```
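The authoritative scoring code is the linked MLCommons `evaluation.py`. As a rough sketch of the usual definition of hierarchical F1 (precision and recall computed over a category path's set of ancestor prefixes), the metric behaves like this; the `hierarchical_f1` name and the `" > "` separator are assumptions for illustration, not the scorer's actual API:

```python
def hierarchical_f1(pred: str, truth: str, sep: str = " > ") -> float:
    """Hierarchical F1 over ancestor sets: each taxonomy path expands to the
    set of its prefixes, and F1 is computed on the overlap of those sets."""
    def ancestors(path: str) -> set:
        parts = [p.strip() for p in path.split(sep)]
        return {tuple(parts[: i + 1]) for i in range(len(parts))}

    p_set, t_set = ancestors(pred), ancestors(truth)
    overlap = len(p_set & t_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_set)
    recall = overlap / len(t_set)
    return 2 * precision * recall / (precision + recall)

# A prediction that stops one level short of the truth gets partial credit:
print(hierarchical_f1("Apparel > Shoes", "Apparel > Shoes > Sneakers"))  # 0.8
```

Here the prediction contributes 2 correct ancestors out of 2 predicted (precision 1.0) and 2 out of 3 true ancestors (recall 2/3), giving F1 = 0.8, so near-miss classifications score far better than under exact match.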
## Benchmark Qwen3-VL-235B-A22B using a config file

Prepare the environment:

```bash
export MODEL_NAME=Qwen/Qwen3-VL-235B-A22B-Instruct
export HF_TOKEN=<your Hugging Face token>  # Optional for public model
export HF_HOME=<path to HuggingFace cache, e.g. ~/.cache/huggingface>
```

Launch the vLLM server (the vision model requires appropriate GPU resources):

```bash
docker run --runtime nvidia --gpus all \
    -p 8000:8000 \
    --ipc=host \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "VLLM_HTTP_TIMEOUT_KEEP_ALIVE=3600" \
    --env "VLLM_ENGINE_READY_TIMEOUT_S=3600" \
    -v ${HF_HOME}:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    --model ${MODEL_NAME} \
    --tensor-parallel-size 4 \
    --max-model-len=32768 \
    --async-scheduling \
    --limit-mm-per-prompt.video 0
```
Run the benchmark:

```bash
inference-endpoint benchmark from-config -c examples/08_Qwen3-VL-235B-A22B_Example/offline_qwen3_vl_235b_a22b_shopify.yaml --timeout 600
```

This config uses `test_mode: "acc"` for an accuracy-only run (hierarchical F1); change it to `"both"` for performance plus accuracy, or `"perf"` for performance only.
Lines changed: 58 additions & 0 deletions

# Offline Benchmark - Qwen3-VL-235B-A22B on Shopify Product Catalogue
# MLPerf Inference Q3VL benchmark: multimodal product taxonomy classification
name: "offline-qwen3-vl-235b-a22b-shopify-benchmark"
version: "1.0"
type: "offline"
timeout: 14400  # Perf + acc run takes over 3 hours; consider limiting n_samples_to_issue for the perf run, or remove the accuracy dataset to skip the accuracy run

model_params:
  name: "Qwen/Qwen3-VL-235B-A22B-Instruct"
  temperature: 0
  top_p: 1
  max_new_tokens: 150

datasets:
  - name: shopify_product_catalogue::q3vl
    type: "performance"
    force: true
  - name: shopify_product_catalogue::q3vl
    type: "accuracy"
    force: true
    accuracy_config:
      eval_method: "shopify_category_f1"
      ground_truth: "ground_truth_category"
      extractor: "identity_extractor"
      num_repeats: 1

settings:
  runtime:
    min_duration_ms: 600000  # 10 minutes
    n_samples_to_issue: 100  # Limit queries for testing (remove or increase for a full run)
    scheduler_random_seed: 42  # For Poisson/distribution sampling
    dataloader_random_seed: 42  # For dataset shuffling

  load_pattern:
    type: "max_throughput"

  client:
    workers: 2
    # ZMQ IPC buffers (bytes). Default 4MB; increase for large multimodal payloads (e.g. 16777216 = 16MB).
    zmq_recv_buffer_bytes: 16777216
    zmq_send_buffer_bytes: 16777216
    # Cap connections to avoid overwhelming the server. Default -1 uses ~25k (ephemeral ports),
    # causing warmup failures and connection timeouts. 256-512 is typical for vLLM.
    max_connections: 512
    # Increase the timeout for slow worker startup (spawn, imports). Default 40s may be too short.
    worker_initialization_timeout: 120

  metrics:
    collect:
      - "throughput"
      - "latency"

endpoint_config:
  endpoints:
    - "http://localhost:8000"
  api_key: null

report_dir: results/qwen3_vl_235b_a22b_shopify_benchmark_mlperf/
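The 16MB buffer choice is easier to justify once you account for base64's size inflation: every 3 input bytes become 4 output characters, so an image payload grows by a third before it ever crosses the IPC socket. A quick check of that overhead (the 3MB stand-in size is illustrative, not a property of the dataset):

```python
import base64

# base64 encodes every 3 input bytes as 4 output characters, a 4/3 inflation.
raw = b"\xff" * 3_000_000   # 3MB stand-in for decoded image bytes
encoded = base64.b64encode(raw)
print(len(encoded))         # 4000000
assert len(encoded) == 4 * len(raw) // 3
```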

pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -44,6 +44,7 @@ dependencies = [
     "transformers==4.57.1",
     "numpy==2.3.4",
     "datasets==4.1.1",
+    "Pillow==12.1.1",
     "sentencepiece==0.2.1",
     "protobuf==6.33.0",
     "openai_harmony==0.0.8",

src/inference_endpoint/async_utils/transport/zmq/transport.py

Lines changed: 2 additions & 1 deletion

@@ -92,6 +92,7 @@ class _ZMQSocketConfig:
     high_water_mark: int = 0  # 0 = unlimited
     linger: int = -1  # Block indefinitely on close to send pending messages
     immediate: int = 1  # Only enqueue on ready connections
+    # Default 4MB; increase for multimodal (VL) payloads via HTTPClientConfig / YAML / CLI.
     recv_buffer_size: int = 4 * 1024 * 1024  # 4MB
     send_buffer_size: int = 4 * 1024 * 1024  # 4MB

@@ -646,7 +647,7 @@ def create(
         num_workers: Number of workers (required).
         zmq_context: Managed ZMQ context (e.g. from ManagedZMQContext.scoped()).
         *args: Ignored - prevents any errors with extraneous args and adheres with WorkerPoolTransport.create().
-        **kwargs: Optional _ZMQSocketConfig overrides.
+        **kwargs: Optional _ZMQSocketConfig overrides (e.g. ``recv_buffer_size``, ``send_buffer_size``).
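ZMQ's receive/send buffer options ultimately hand these sizes to the kernel, which may round or clamp them. The same behavior can be observed with a plain stdlib socket (the 64KB request below is illustrative; this is not the transport's code path):

```python
import socket

# Request a 64KB receive buffer; the kernel may round it up (Linux doubles the
# request for bookkeeping) or clamp it (net.core.rmem_max), so read back the
# effective value rather than trusting the requested one.
requested = 65536
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
effective = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(effective)
s.close()
```

This is why very large buffer requests in the YAML config may silently not take full effect unless the host's socket memory limits are raised as well.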

src/inference_endpoint/cli.py

Lines changed: 20 additions & 0 deletions

@@ -91,6 +91,7 @@ def create_parser() -> argparse.ArgumentParser:
         "QPS is used to calculate total queries (QPS × duration).",
     )
     _add_shared_benchmark_args(offline_parser)
+    _add_zmq_buffer_args(offline_parser)
     _add_auxiliary_args(offline_parser)

     # benchmark online
@@ -101,6 +102,7 @@ def create_parser() -> argparse.ArgumentParser:
     )
     _add_shared_benchmark_args(online_parser)
     _add_online_specific_args(online_parser)
+    _add_zmq_buffer_args(online_parser)
     _add_auxiliary_args(online_parser)

     # benchmark from-config (YAML mode)
@@ -283,6 +285,24 @@ def _add_auxiliary_args(parser):
     )


+def _add_zmq_buffer_args(parser):
+    """ZMQ IPC buffer sizes for offline/online CLI mode only (not from-config)."""
+    parser.add_argument(
+        "--zmq-recv-buffer-bytes",
+        type=int,
+        default=argparse.SUPPRESS,
+        metavar="N",
+        help="ZMQ receive buffer size in bytes (default: 4MB; offline/online only)",
+    )
+    parser.add_argument(
+        "--zmq-send-buffer-bytes",
+        type=int,
+        default=argparse.SUPPRESS,
+        metavar="N",
+        help="ZMQ send buffer size in bytes (default: 4MB; offline/online only)",
+    )
+
+
 # Argparse structure enforces arg validity - no manual validation needed
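The `default=argparse.SUPPRESS` choice is what lets the config builder distinguish "flag omitted" from "flag set to its default": when suppressed, the attribute simply never appears on the namespace, so a `hasattr()` check downstream can leave the schema's own 4MB default untouched. A minimal standalone demonstration:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--zmq-recv-buffer-bytes", type=int, default=argparse.SUPPRESS)

# Flag omitted: the namespace has no such attribute at all.
args = parser.parse_args([])
print(hasattr(args, "zmq_recv_buffer_bytes"))  # False

# Flag given: the attribute exists and carries the parsed value.
args = parser.parse_args(["--zmq-recv-buffer-bytes", "16777216"])
print(args.zmq_recv_buffer_bytes)  # 16777216
```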

src/inference_endpoint/commands/benchmark.py

Lines changed: 26 additions & 9 deletions

@@ -64,7 +64,9 @@
 from inference_endpoint.dataset_manager.dataset import Dataset
 from inference_endpoint.dataset_manager.factory import DataLoaderFactory
 from inference_endpoint.endpoint_client.config import HTTPClientConfig
-from inference_endpoint.endpoint_client.cpu_affinity import pin_loadgen
+from inference_endpoint.endpoint_client.cpu_affinity import (
+    pin_loadgen,
+)
 from inference_endpoint.endpoint_client.http_client import HTTPEndpointClient
 from inference_endpoint.endpoint_client.http_sample_issuer import HttpClientSampleIssuer
 from inference_endpoint.evaluation import Extractor

@@ -292,6 +294,16 @@ def _build_config_from_cli(
     timeout = getattr(args, "timeout", None)
     verbose_level = getattr(args, "verbose", 0)
     api_type = APIType(getattr(args, "api_type", "openai"))
+    client_kwargs: dict[str, Any] = {
+        "workers": args.workers if args.workers else -1,
+        "log_level": "DEBUG" if verbose_level >= 2 else "INFO",
+        "warmup_connections": getattr(args, "warmup_connections", -1),
+        "max_connections": getattr(args, "max_connections", None) or -1,
+    }
+    if hasattr(args, "zmq_recv_buffer_bytes"):
+        client_kwargs["zmq_recv_buffer_bytes"] = args.zmq_recv_buffer_bytes
+    if hasattr(args, "zmq_send_buffer_bytes"):
+        client_kwargs["zmq_send_buffer_bytes"] = args.zmq_send_buffer_bytes
     # Build BenchmarkConfig from CLI params
     return BenchmarkConfig(
         name=f"cli_{benchmark_mode}",

@@ -322,12 +334,7 @@
             scheduler_random_seed=42,
             dataloader_random_seed=42,
         ),
-        client=ClientSettings(
-            workers=args.workers if args.workers else -1,
-            log_level="DEBUG" if verbose_level >= 2 else "INFO",
-            warmup_connections=getattr(args, "warmup_connections", -1),
-            max_connections=getattr(args, "max_connections", None) or -1,
-        ),
+        client=ClientSettings(**client_kwargs),
     ),
     model_params=ModelParams(
         name=args.model,

@@ -580,6 +587,13 @@ def _run_benchmark(
     try:
         api_type: APIType = config.endpoint_config.api_type
         assert api_type is not None
+        warmup = config.settings.client.warmup_connections
+        max_conn = config.settings.client.max_connections
+        init_timeout = config.settings.client.worker_initialization_timeout
+        logger.info(
+            f"HTTP client: workers={num_workers}, warmup_connections={warmup}, "
+            f"max_connections={max_conn}, worker_init_timeout={init_timeout}s"
+        )
         http_config = HTTPClientConfig(
             endpoint_urls=[urljoin(e, api_type.default_route()) for e in endpoints],
             api_type=api_type,

@@ -588,9 +602,12 @@
             event_logs_dir=report_dir,
             log_level=config.settings.client.log_level,
             cpu_affinity=affinity_plan,
-            warmup_connections=config.settings.client.warmup_connections,
-            max_connections=config.settings.client.max_connections,
+            warmup_connections=warmup,
+            max_connections=max_conn,
+            worker_initialization_timeout=init_timeout,
             api_key=config.endpoint_config.api_key,
+            zmq_recv_buffer_bytes=config.settings.client.zmq_recv_buffer_bytes,
+            zmq_send_buffer_bytes=config.settings.client.zmq_send_buffer_bytes,
         )
         http_client = HTTPEndpointClient(http_config, zmq_context=zmq_ctx)
         sample_issuer = HttpClientSampleIssuer(http_client)

src/inference_endpoint/config/schema.py

Lines changed: 16 additions & 0 deletions

@@ -299,6 +299,22 @@ class ClientSettings(BaseModel):
     # -1 = unlimited (bound by system ephemeral port limit)
     max_connections: int = -1

+    # Seconds to wait for workers to initialize (spawn, connect, signal ready).
+    # Increase for slow systems or when workers load heavy dependencies.
+    worker_initialization_timeout: float = 40.0
+
+    # ZMQ IPC socket buffer sizes (bytes). Increase for large multimodal requests.
+    zmq_recv_buffer_bytes: int = Field(
+        default=4 * 1024 * 1024,
+        ge=1,
+        description="ZMQ receive buffer size in bytes (default 4MB).",
+    )
+    zmq_send_buffer_bytes: int = Field(
+        default=4 * 1024 * 1024,
+        ge=1,
+        description="ZMQ send buffer size in bytes (default 4MB).",
+    )
+

 class Settings(BaseModel):
     """Test settings (can be overridden by CLI)."""
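The `ge=1` bound makes Pydantic reject non-positive buffer sizes at config-load time, instead of letting a zero reach the socket layer. The same guard, sketched with a stdlib dataclass (`ClientSettingsSketch` is an illustrative stand-in, not the project's Pydantic model):

```python
from dataclasses import dataclass

@dataclass
class ClientSettingsSketch:
    """Stdlib stand-in for the Pydantic model's ge=1 bound on buffer sizes."""
    zmq_recv_buffer_bytes: int = 4 * 1024 * 1024
    zmq_send_buffer_bytes: int = 4 * 1024 * 1024

    def __post_init__(self) -> None:
        # Mirror Field(ge=1): reject anything below 1 byte at construction time.
        for name in ("zmq_recv_buffer_bytes", "zmq_send_buffer_bytes"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")

ClientSettingsSketch()  # defaults pass validation
try:
    ClientSettingsSketch(zmq_recv_buffer_bytes=0)
except ValueError as exc:
    print(exc)  # zmq_recv_buffer_bytes must be >= 1
```

Failing fast here is the right design: a zero-sized buffer would otherwise surface much later as opaque ZMQ send/receive failures mid-benchmark.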

src/inference_endpoint/dataset_manager/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -27,6 +27,7 @@
 from .predefined.livecodebench import LiveCodeBench
 from .predefined.open_orca import OpenOrca
 from .predefined.random import RandomDataset
+from .predefined.shopify_product_catalogue import ShopifyProductCatalogue
 from .transforms import (
     AddStaticColumns,
     ColumnFilter,

@@ -56,4 +57,5 @@
     "LiveCodeBench",
     "CNNDailyMail",
     "RandomDataset",
+    "ShopifyProductCatalogue",
 ]

0 commit comments