Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new, self-contained example for benchmarking the Qwen2.5-0.5B-Instruct large language model. It provides a complete workflow for evaluating the model's performance using either vLLM or SGLang inference servers, covering both maximum-throughput (offline) and concurrency-profiling (online) scenarios. The addition aims to simplify the process for users to assess the model's capabilities and identify optimal deployment configurations.
Code Review
This pull request introduces a new comprehensive example for benchmarking the Qwen2.5-0.5B model, including scripts for running benchmarks, documentation, and configuration files. The changes are well-structured and add significant value. My review includes suggestions to improve the clarity of the documentation, enhance the robustness of the shell script, and make the Python summary script more resilient to missing optional dependencies.
```bash
if [[ "$SERVER_TYPE" == "vllm" ]]; then
    READY_PATTERN="Uvicorn running\|Application startup complete"
else
    READY_PATTERN="Uvicorn running\|Server is ready"
fi
```
When using `grep -E` (as suggested for line 138), the alternation operator is `|`, not `\|`. The backslash should be removed for the pattern to be interpreted correctly by extended regex.
Suggested change:

```diff
 if [[ "$SERVER_TYPE" == "vllm" ]]; then
-    READY_PATTERN="Uvicorn running\|Application startup complete"
+    READY_PATTERN="Uvicorn running|Application startup complete"
 else
-    READY_PATTERN="Uvicorn running\|Server is ready"
+    READY_PATTERN="Uvicorn running|Server is ready"
 fi
```
```bash
fi

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
    if docker logs ${CONTAINER_NAME} 2>&1 | grep -q "$READY_PATTERN"; then
```
The `grep -q` command uses Basic Regular Expressions (BRE) by default, where `|` (from the `READY_PATTERN`) is not a special character for alternation. To correctly match either part of the pattern, use `grep -qE` for Extended Regular Expressions, which is also used on line 142.
Suggested change:

```diff
-if docker logs ${CONTAINER_NAME} 2>&1 | grep -q "$READY_PATTERN"; then
+if docker logs ${CONTAINER_NAME} 2>&1 | grep -qE "$READY_PATTERN"; then
```
```bash
if curl -s http://localhost:${SERVER_PORT}/v1/models 2>/dev/null | grep -q "${MODEL_NAME}"; then
    echo "✅ Server is responding correctly"
elif curl -s http://localhost:${SERVER_PORT}/health 2>/dev/null | grep -q "ok\|healthy"; then
```
For portability and correct regex evaluation, use `grep -qE` with `|` for alternation instead of `grep -q` with `\|`.
Suggested change:

```diff
-elif curl -s http://localhost:${SERVER_PORT}/health 2>/dev/null | grep -q "ok\|healthy"; then
+elif curl -s http://localhost:${SERVER_PORT}/health 2>/dev/null | grep -qE "ok|healthy"; then
```
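The BRE/ERE distinction behind these suggestions is easy to verify in isolation (GNU grep semantics assumed; the test strings are illustrative):

```shell
# BRE (grep -q): "|" is a literal character; "\|" is alternation (a GNU extension).
# ERE (grep -qE): "|" is alternation; no backslash needed.
echo "Server is ready" | grep -q  "Uvicorn running|Server is ready" && echo "BRE: match" || echo "BRE: no match"
echo "Server is ready" | grep -qE "Uvicorn running|Server is ready" && echo "ERE: match" || echo "ERE: no match"
# With GNU grep this prints "BRE: no match" then "ERE: match".
```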
```python
def write_plots(rows: list[dict], path: Path) -> None:
    import matplotlib.pyplot as plt

    successful = [r for r in rows if r["status"] == "ok"]
    if not successful:
        print("No successful runs to plot.", file=sys.stderr)
        return

    x = [r["concurrency"] for r in successful]
    tps = [r["tps"] for r in successful]
    ttft_p99 = [r["ttft_p99_ms"] for r in successful]
    tpot_p50 = [r["tpot_p50_ms"] for r in successful]

    fig, axes = plt.subplots(3, 1, figsize=(9, 11), sharex=True)
    fig.suptitle("Concurrency Sweep Performance", fontsize=14, fontweight="bold")

    metrics = [
        (axes[0], tps, "TPS", "Tokens / s", "tab:blue"),
        (axes[1], ttft_p99, "TTFT P99 (ms)", "Latency (ms)", "tab:orange"),
        (axes[2], tpot_p50, "TPOT P50 (ms)", "Latency (ms)", "tab:green"),
    ]
    for ax, values, label, ylabel, color in metrics:
        ax.plot(x, values, marker="o", linewidth=2, color=color, label=label)
        ax.set_ylabel(ylabel)
        ax.set_xscale("log", base=2)
        ax.set_xticks(x)
        ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda v, _: str(int(v))))
        ax.legend(loc="upper left")
        ax.grid(True, which="both", linestyle="--", alpha=0.5)

    axes[-1].set_xlabel("Concurrency")
    fig.tight_layout()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```
The script will crash if matplotlib is not installed because of the direct import. Since plotting is an optional enhancement, it's better to handle this dependency gracefully. I suggest moving the import inside the `write_plots` function and wrapping it in a `try...except ImportError` block. This allows the script to produce text and CSV summaries even if matplotlib is unavailable.
```python
def write_plots(rows: list[dict], path: Path) -> None:
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        print("Warning: matplotlib not found. Skipping plot generation.", file=sys.stderr)
        print("Install it with: pip install matplotlib", file=sys.stderr)
        return

    successful = [r for r in rows if r["status"] == "ok"]
    if not successful:
        print("No successful runs to plot.", file=sys.stderr)
        return

    x = [r["concurrency"] for r in successful]
    tps = [r["tps"] for r in successful]
    ttft_p99 = [r["ttft_p99_ms"] for r in successful]
    tpot_p50 = [r["tpot_p50_ms"] for r in successful]

    fig, axes = plt.subplots(3, 1, figsize=(9, 11), sharex=True)
    fig.suptitle("Concurrency Sweep Performance", fontsize=14, fontweight="bold")

    metrics = [
        (axes[0], tps, "TPS", "Tokens / s", "tab:blue"),
        (axes[1], ttft_p99, "TTFT P99 (ms)", "Latency (ms)", "tab:orange"),
        (axes[2], tpot_p50, "TPOT P50 (ms)", "Latency (ms)", "tab:green"),
    ]
    for ax, values, label, ylabel, color in metrics:
        ax.plot(x, values, marker="o", linewidth=2, color=color, label=label)
        ax.set_ylabel(ylabel)
        ax.set_xscale("log", base=2)
        ax.set_xticks(x)
        ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda v, _: str(int(v))))
        ax.legend(loc="upper left")
        ax.grid(True, which="both", linestyle="--", alpha=0.5)

    axes[-1].set_xlabel("Concurrency")
    fig.tight_layout()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```
Pull request overview
Adds a Qwen2.5-0.5B example workflow plus new concurrency sweep helper scripts to run multi-concurrency online benchmarks and summarize the resulting result_summary.json metrics into tables/files.
Changes:
- Add `scripts/concurrency_sweep/run.py` to execute a benchmark repeatedly over a list of concurrency targets (one subdir per run).
- Add `scripts/concurrency_sweep/summarize.py` to aggregate `result_summary.json` files into terminal tables + CSV/Markdown/PNG outputs.
- Add a Qwen example package (YAML configs, dataset prep, run script, docs) for vLLM and SGLang.
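The sweep-runner behavior described above — one subdirectory per concurrency point, with targets deduplicated and sorted — can be sketched independently of the repo's actual CLI. The function name and callback signature here are illustrative, not the PR's real interface:

```python
from pathlib import Path
from typing import Callable

def run_sweep(
    sweep_dir: Path,
    concurrencies: list[int],
    run_one: Callable[[int, Path], None],
) -> list[Path]:
    """Invoke run_one once per unique concurrency level, each in its own subdir."""
    run_dirs: list[Path] = []
    for c in sorted(set(concurrencies)):          # dedupe, run in ascending order
        run_dir = sweep_dir / f"concurrency_{c}"  # one subdirectory per sweep point
        run_dir.mkdir(parents=True, exist_ok=True)
        run_one(c, run_dir)                       # e.g. shell out to the benchmark CLI
        run_dirs.append(run_dir)
    return run_dirs
```

In the PR, `run_one` would wrap the benchmark CLI invocation with a per-run config; the subdirectory naming is what `summarize.py` later globs for.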
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/concurrency_sweep/summarize.py | Parses per-run result_summary.json metrics, prints tables, and writes CSV/MD/PNG summaries. |
| scripts/concurrency_sweep/run.py | Generates per-concurrency configs from a template YAML and runs the benchmark CLI for each sweep point. |
| examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml | Online SGLang benchmark config used as sweep template input. |
| examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml | Offline SGLang benchmark config. |
| examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml | Online vLLM benchmark config used as sweep template input. |
| examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml | Offline vLLM benchmark config. |
| examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh | End-to-end runner: prepares dataset, launches server container, runs offline benchmark or online sweep, and prints next steps. |
| examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py | Produces a Qwen-specific dataset pickle with the expected prompt column. |
| examples/08_Qwen2.5-0.5B_Example/README.md | Full walkthrough for running Qwen benchmarks (manual + scripted). |
| examples/08_Qwen2.5-0.5B_Example/QUICKSTART.md | Short-form “one command” quick start instructions. |
| examples/08_Qwen2.5-0.5B_Example/.gitignore | Ignores example-local generated artifacts (data pickle, logs, etc.). |
```python
def collect_results(sweep_dir: Path) -> list[dict]:
    rows = []
    for concurrency_dir in sorted(
        sweep_dir.glob("concurrency_*"),
        key=lambda p: int(p.name.split("_")[1]),
```
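The numeric sort key matters here: plain lexicographic ordering of the `concurrency_*` directory names would interleave values. A quick illustration (directory names are examples):

```python
# Directory names sorted as strings vs. by their numeric suffix.
names = ["concurrency_16", "concurrency_2", "concurrency_4", "concurrency_128"]

lexicographic = sorted(names)
numeric = sorted(names, key=lambda n: int(n.split("_")[1]))

print(lexicographic)  # ['concurrency_128', 'concurrency_16', 'concurrency_2', 'concurrency_4']
print(numeric)        # ['concurrency_2', 'concurrency_4', 'concurrency_16', 'concurrency_128']
```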
```yaml
api_key: null
api_type: "sglang"

report_dir: "results/qwen_sglang_online_benchmark/"
```
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Force-pushed e2ea7dc to 2fd76c4
Pull request overview
Adds a concurrency sweep runner + result summarizer, along with a Qwen2.5-0.5B example (configs + scripts) to demonstrate offline and online benchmarking workflows in this repo.
Changes:
- Add `scripts/concurrency_sweep/run.py` to execute a benchmark sweep across multiple concurrency targets using a YAML config template.
- Add `scripts/concurrency_sweep/summarize.py` to aggregate `result_summary.json` files into tables plus CSV/Markdown/plot outputs.
- Add `examples/08_Qwen2.5-0.5B_Example/` with benchmark configs, dataset prep script, and a wrapper shell script for end-to-end runs on vLLM or SGLang.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/concurrency_sweep/summarize.py | Parses per-run result_summary.json and emits terminal tables + CSV/MD/PNG summaries. |
| scripts/concurrency_sweep/run.py | Runs inference-endpoint benchmark from-config repeatedly with varying concurrency and writes a sweep summary. |
| examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml | Online (concurrency) config for SGLang endpoint. |
| examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml | Offline config for SGLang endpoint. |
| examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml | Online (concurrency) config for vLLM/OpenAI-compatible endpoint. |
| examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml | Offline config for vLLM/OpenAI-compatible endpoint. |
| examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py | Converts the repo dummy dataset into the example’s required prompt format. |
| examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh | Automates dataset prep, container startup, readiness checks, and benchmark execution. |
| examples/08_Qwen2.5-0.5B_Example/README.md | Full walkthrough for running the example manually or via wrapper script. |
| examples/08_Qwen2.5-0.5B_Example/QUICKSTART.md | Short “fast path” instructions for running the example. |
| examples/08_Qwen2.5-0.5B_Example/.gitignore | Ignores example-local generated artifacts/logs. |
```python
        )
        return 1

    concurrency_values = sorted(set(args.concurrency))
```
```python
def f(v: object) -> float:
    """Return v as float, falling back to 0.0 for None/missing."""
    return float(v) if v is not None else 0.0

pct = lambda d, k: f(d.get("percentiles", {}).get(k))  # noqa: E731

return {
    "qps": round(f(data.get("qps")), 2),
    "tps": round(f(data.get("tps")), 2),
    "latency_mean_ms": ns_to_ms(f(lat.get("avg"))),
    "latency_p50_ms": ns_to_ms(f(lat.get("median"))),
    "latency_p90_ms": ns_to_ms(pct(lat, "90")),
    "latency_p95_ms": ns_to_ms(pct(lat, "95")),
    "latency_p99_ms": ns_to_ms(pct(lat, "99")),
    "ttft_mean_ms": ns_to_ms(f(ttft.get("avg"))),
    "ttft_p50_ms": ns_to_ms(f(ttft.get("median"))),
    "ttft_p90_ms": ns_to_ms(pct(ttft, "90")),
    "ttft_p99_ms": ns_to_ms(pct(ttft, "99")),
    "tpot_mean_ms": ns_to_ms(f(tpot.get("avg"))),
    "tpot_p50_ms": ns_to_ms(f(tpot.get("median"))),
    "tpot_p90_ms": ns_to_ms(pct(tpot, "90")),
    "tpot_p99_ms": ns_to_ms(pct(tpot, "99")),
    "n_completed": data.get("n_samples_completed", 0),
    "duration_s": round(f(data.get("duration_ns")) / 1e9, 1),
    "avg_output_tokens": round(f(osl.get("avg")), 1),
```
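`ns_to_ms` is referenced but not shown in this excerpt. A plausible definition, assuming nanosecond inputs and two-decimal rounding to match the other `round(...)` calls — the name comes from the excerpt, the body is an assumption:

```python
def ns_to_ms(ns: float) -> float:
    """Convert nanoseconds to milliseconds, rounded to two decimal places."""
    return round(ns / 1e6, 2)
```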
```python
cells = [str(r.get(k, r.get("status", "-"))) for k in col_keys]
lines.append("|" + "|".join(cells) + "|")
```
```python
if r["status"] != "ok":
    vals = [f"{r['concurrency']:>6}", f" {'-- ' + r['status']:>12}"]
    print(" ".join(vals))
    continue
parts = [f"{r['concurrency']:>6}"]
for _, key in columns:
    parts.append(f"{r[key]:>12}")
```
Force-pushed 53adb81 to 9cd8039
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Force-pushed 9cd8039 to 644f755
Pull request overview
Adds a new concurrency sweep workflow (runner + summarizer) and a complete Qwen2.5-0.5B example package to run offline/online benchmarks and visualize sweep results.
Changes:
- Introduces `scripts/concurrency_sweep/run.py` to execute a benchmark across multiple concurrency levels and record per-run logs plus a sweep summary.
- Introduces `scripts/concurrency_sweep/summarize.py` to tabulate sweep metrics and emit CSV/Markdown plus an optional matplotlib plot.
- Adds a full Qwen2.5-0.5B example (configs, dataset prep, wrapper script, README/quickstart) for vLLM and SGLang.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/concurrency_sweep/summarize.py | Parses result_summary.json per concurrency and outputs tables/CSV/MD/PNG summary artifacts. |
| scripts/concurrency_sweep/run.py | Runs inference-endpoint benchmark from-config repeatedly with concurrency overrides and writes sweep-level summaries/logs. |
| examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml | vLLM online benchmark template config used as sweep input. |
| examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml | vLLM offline benchmark template config. |
| examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml | SGLang online benchmark template config (includes api_type: sglang). |
| examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml | SGLang offline benchmark template config (includes api_type: sglang). |
| examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py | Converts the repo dummy dataset into a prompt-column dataset for the example configs. |
| examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh | Automated end-to-end wrapper (dataset prep, container start, run benchmark/sweep). |
| examples/08_Qwen2.5-0.5B_Example/README.md | Full step-by-step guide for running the sweep, summarizing, and running offline benchmarks. |
| examples/08_Qwen2.5-0.5B_Example/QUICKSTART.md | Short-form quickstart instructions. |
| examples/08_Qwen2.5-0.5B_Example/.gitignore | Ignores generated datasets, logs, and results under the example directory. |
```python
if verbose:
    with log_path.open("w") as log_file:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        assert proc.stdout is not None
        for line in proc.stdout:
            print(line, end="", flush=True)
            log_file.write(line)
        try:
            proc.wait(timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            raise
        returncode = proc.returncode
```
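The stream-and-tee pattern above can be exercised in isolation. This standalone sketch (the `run_and_tee` name is illustrative) mirrors the same Popen/read-loop/wait/kill flow; note the read loop blocks until the child closes stdout, so the `wait` timeout only applies after EOF:

```python
import subprocess
from pathlib import Path

def run_and_tee(cmd: list[str], log_path: Path, timeout_seconds: float = 60) -> int:
    """Stream a child's combined stdout/stderr to the console and a log file."""
    with log_path.open("w") as log_file:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        assert proc.stdout is not None
        for line in proc.stdout:        # blocks until the child closes stdout
            print(line, end="", flush=True)
            log_file.write(line)
        try:
            proc.wait(timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            raise
    return proc.returncode
```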
```python
hdr = "|" + "|".join(headers) + "|"
lines = [hdr, sep]
for r in data:
    cells = [str(r.get(k, r.get("status", "-"))) for k in col_keys]
```
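Filled out, a Markdown table writer built from the excerpted lines might look like this sketch (the `sep` construction and function name are assumptions, not the PR's code):

```python
def write_markdown(data: list[dict], col_keys: list[str], headers: list[str]) -> str:
    """Render rows of metric dicts as a GitHub-flavored Markdown table."""
    hdr = "|" + "|".join(headers) + "|"
    sep = "|" + "|".join("---" for _ in headers) + "|"
    lines = [hdr, sep]
    for r in data:
        # Fall back to the run's status (or "-") for metrics missing after a failure.
        cells = [str(r.get(k, r.get("status", "-"))) for k in col_keys]
        lines.append("|" + "|".join(cells) + "|")
    return "\n".join(lines)
```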
```python
    return 1

base_config = load_config(args.config)
```