Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files; approvers can indicate their approval by writing the appropriate command in a comment. The full list of commands accepted by this bot is available in its documentation.
Note: Reviews paused. It looks like this branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior, along with the commands and quick-action checkboxes for managing reviews, is configurable in the CodeRabbit settings.
📝 Walkthrough

Replaces multiturn benchmarking with GuideLLM-driven benchmarks across testing, toolbox, parsing, models, visualizations, and docs; adds a new GuideLLM performance report and five analysis/stat classes; removes multiturn models/parsers/templates; and switches toolbox/job invocation to accept CLI-style `guidellm_args`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as CI/Test Runner
    participant Toolbox as Llmd Toolbox
    participant Job as Kubernetes Job (guidellm)
    participant Artifacts as Job Artifacts/Logs
    participant Parser as LTS Parser
    participant Viz as Visualization/Report
    CI->>Toolbox: call run_guidellm_benchmark(name, guidellm_args)
    Toolbox->>Job: render job template with guidellm_args and create Job
    Job->>Artifacts: produce artifacts (guidellm_benchmark_job.logs, outputs.json)
    Parser->>Artifacts: discover and parse guidellm benchmark dirs
    Parser->>Parser: aggregate results.guidellm_benchmarks
    Viz->>Parser: request entry.results.guidellm_benchmarks
    Viz->>Viz: run GuidellmTokensConcurrency & Latency analyses
    Viz->>CI: produce report (HTML, plots, Key Insights)
```
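The CLI-style argument passing at the top of this flow can be sketched as below; the function name and flag spelling are illustrative assumptions, not the actual TOPSAIL toolbox or GuideLLM CLI.

```python
# Illustrative sketch only: how CLI-style guidellm_args could be rendered
# into a job command line before the template is applied.

def build_guidellm_command(guidellm_args):
    """Turn a dict of CLI-style arguments into an argv list."""
    cmd = ["guidellm", "benchmark"]
    for key, value in guidellm_args.items():
        # snake_case keys become --kebab-case flags
        cmd += [f"--{key.replace('_', '-')}", str(value)]
    return cmd

args = {"rate": "32,64,128", "max_seconds": 120, "data": "prompt_tokens=512"}
command = build_guidellm_command(args)
```

The rendered `command` list can then be substituted into the Job template's container args.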
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 2
🧹 Nitpick comments (5)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (2)
678-678: Remove extraneous `f` prefix from string without placeholders.

🔧 Proposed fix

```diff
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` at line 678, The code sets entry_name using entry.get_name(variables) if variables else f"Unknown Configuration" — remove the unnecessary f-string prefix because the fallback string has no placeholders; update the expression so the fallback is a plain string ("Unknown Configuration") while keeping the branch and the call to entry.get_name(variables) unchanged (look for entry_name and entry.get_name to locate the line).
648-668: Remove extraneous `f` prefixes from strings without placeholders.

The static analysis correctly identifies that these f-strings have no placeholders; they should be regular strings.
🔧 Proposed fix
```diff
- header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)
+ header += report.Plot_and_Text("Guidellm Tokens vs Concurrency", args)
...
- header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TTFT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TPOT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)
+ header += report.Plot_and_Text("Guidellm ITL Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
+ header += report.Plot_and_Text("Guidellm E2E Latency Analysis", args)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 648 - 668, The calls to report.Plot_and_Text use unnecessary f-strings with no placeholders (e.g., report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)); change these to plain string literals (remove the leading f) for each occurrence (Tokens vs Concurrency, TTFT Analysis, TPOT Analysis, ITL Analysis, E2E Latency Analysis) so report.Plot_and_Text and the header construction use regular strings while leaving header, args and surrounding HTML elements unchanged.

projects/llm-d/testing/config.yaml (1)
48-48: Inconsistent spacing in rate values.

The rate string uses spaces after commas (`32, 64, 128, 256, 512`), while other rate definitions in this file (lines 43 and 53) use no spaces (`1,10,50` and `1,10,50,100,200,300`). Consider using consistent formatting.

🔧 Suggested fix for consistency
```diff
- tests.llmd.benchmarks.guidellm.rate: 32, 64, 128, 256, 512
+ tests.llmd.benchmarks.guidellm.rate: 32,64,128,256,512
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` at line 48, The rate list for tests.llmd.benchmarks.guidellm.rate uses spaces after commas ("32, 64, 128, 256, 512") which is inconsistent with other rate keys (e.g., tests.llmd.benchmarks.someOther.rate values like "1,10,50" and "1,10,50,100,200,300"); update the value of tests.llmd.benchmarks.guidellm.rate to remove spaces after commas ("32,64,128,256,512") so formatting matches the rest of the file and maintain uniform CSV style across rate definitions.

projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (2)
518-518: Minor: Unused loop variable `i`.

The static analysis correctly identifies that `i` is not used in the loop body. Consider renaming it to `_i`, or using `_`, to indicate it is intentionally unused.

🔧 Proposed fix
```diff
- for i, config_name in enumerate(configurations):
+ for _i, config_name in enumerate(configurations):
```

Or simply:

```diff
- for i, config_name in enumerate(configurations):
+ for config_name in configurations:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 518, The loop uses an unused index variable `i` in the line "for i, config_name in enumerate(configurations):" — replace it with a clear unused name (e.g., "for _i, config_name in enumerate(configurations):" or "for _, config_name in enumerate(configurations):") or simply iterate without enumerate ("for config_name in configurations:") to remove the unused variable; update the loop header where `configurations` is iterated (the for-loop in throughput_analysis.py) accordingly.
660-949: Significant code duplication in standalone TPOT and ITL analysis classes.

These two standalone classes (lines 660-803 and 806-949) duplicate almost all logic from `GuidellmLatencyAnalysisBase.do_plot()`. If the intent is to use a consistent factorized pattern, consider removing these standalone classes and keeping only the base subclass versions (after fixing the dead code issue above). If there is a specific reason to keep standalone implementations, please document it; otherwise, the base class pattern used by `GuidellmTTFTAnalysis` and `GuidellmE2ELatencyAnalysis` should be applied consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 660 - 949, The two classes GuidellmTPOTAnalysis and GuidellmITLAnalysis duplicate the logic already implemented in GuidellmLatencyAnalysisBase.do_plot(); remove these standalone classes (or refactor them to subclass GuidellmLatencyAnalysisBase and override only what differs) so you reuse the base implementation used by GuidellmTTFTAnalysis and GuidellmE2ELatencyAnalysis; ensure any unique behavior (field names like 'TPOT P50 (ms)', 'ITL P50 (ms)', hover text, and conversion from seconds to ms) is handled via small overrides or parameters in the base class rather than duplicated code, and add a brief comment documenting why a standalone implementation is retained if you choose not to refactor.
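As a sketch of the factorized pattern the review recommends, the per-metric differences (label, source field, unit conversion) can live in small subclass attributes while the shared logic stays in the base class. The class names come from the review, but the attribute scheme and `extract()` helper below are hypothetical.

```python
# Sketch of the base-subclass pattern; not the actual plotting classes.

class GuidellmLatencyAnalysisBase:
    metric_label = None     # e.g. "TPOT P50 (ms)"
    source_field = None     # attribute name on the parsed benchmark object
    unit_conversion = 1     # factor to reach the unit named in the label

    def extract(self, benchmark):
        # Shared logic lives once here; subclasses only declare what differs.
        return getattr(benchmark, self.source_field) * self.unit_conversion


class GuidellmTPOTAnalysis(GuidellmLatencyAnalysisBase):
    metric_label = "TPOT P50 (ms)"
    source_field = "tpot_median"
    unit_conversion = 1000  # parsed in seconds, displayed in ms


class GuidellmITLAnalysis(GuidellmLatencyAnalysisBase):
    metric_label = "ITL P50 (ms)"
    source_field = "itl_median"
    unit_conversion = 1000
```

With this shape, the standalone `do_plot()` bodies collapse into one implementation parameterized by three class attributes.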
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 49: The config key tests.llmd.benchmarks.guidellm.max_requests is
declared but not used; update run_guidellm_benchmark to read max_requests from
the config (same place it reads rate, max_seconds, data, timeout) and wire it
into the benchmark invocation or loop that limits total requests (e.g., enforce
a hard cap on sent requests or pass as an argument to the runner), referencing
the run_guidellm_benchmark function and whatever helper/runner it calls; also
normalize the formatting of the rate list to match the other entries (remove or
add spaces consistently so lists like 32,64,128,256,512 match 1,10,50 style).
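A minimal sketch of the two config fixes asked for above: wiring `max_requests` into the benchmark arguments and normalizing the rate list. The config keys come from the review; the helper names and exact plumbing are invented for illustration.

```python
# Hypothetical helpers; the real code reads these keys inside
# run_guidellm_benchmark and its runner.

def normalize_rate(raw):
    """Canonicalize a comma-separated rate list: '32, 64' -> '32,64'."""
    return ",".join(part.strip() for part in str(raw).split(","))

def build_benchmark_args(config):
    prefix = "tests.llmd.benchmarks.guidellm."
    keys = ("rate", "max_seconds", "data", "timeout", "max_requests")
    args = {k: config[prefix + k] for k in keys if prefix + k in config}
    if "rate" in args:
        args["rate"] = normalize_rate(args["rate"])
    return args

config = {
    "tests.llmd.benchmarks.guidellm.rate": "32, 64, 128, 256, 512",
    "tests.llmd.benchmarks.guidellm.max_seconds": 120,
    "tests.llmd.benchmarks.guidellm.max_requests": 1000,
}
args = build_benchmark_args(config)
```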
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 621-657: The file contains dead duplicate class definitions:
GuidellmTPOTAnalysis and GuidellmITLAnalysis are defined twice (once as small
subclasses of GuidellmLatencyAnalysisBase around the first block and again as
full standalone classes later), so remove the redundant earlier subclass
definitions (the small factorized ones) OR alternatively delete the later
standalone implementations and keep the base-subclass pattern — ensure any
external references use the remaining class definitions; update imports/registry
usage if needed to point to the preserved classes (refer to
GuidellmTPOTAnalysis, GuidellmITLAnalysis, GuidellmLatencyAnalysisBase to locate
the duplicates).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8882473f-4a34-452c-a7c6-4e1ecdacf9b9
📒 Files selected for processing (4)
- projects/llm-d/testing/config.yaml
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 21 seconds. 🟢 • Link to the test results • Link to the reports index
/test jump-ci llm-d multi-flavor guidellm_heterogeneous_eval
🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 01 seconds. 🔴 • Link to the test results • No reports index generated. Failure indicator: Empty. (See run.log)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 686-714: The code currently clamps request_concurrency to 1 when
computing and displaying efficiency, which hides low (<1) mean concurrency
values from parsers.py (concurrency_mean) and can mis-rank "Most efficient";
update the selection and display to use the actual benchmark.request_concurrency
(remove the max(..., 1) clamp) when computing the key for best_efficiency_idx
and when formatting the efficiency string for best_efficiency, and add a minimal
guard (e.g., treat zero concurrency explicitly) to avoid division-by-zero rather
than forcing a floor of 1.
- Around line 648-668: The reported lint failures come from using unnecessary
f-strings with no placeholders; in the calls to report.Plot_and_Text (e.g.,
report.Plot_and_Text("Guidellm Tokens vs Concurrency", args),
report.Plot_and_Text("Guidellm TTFT Analysis", args),
report.Plot_and_Text("Guidellm TPOT Analysis", args),
report.Plot_and_Text("Guidellm ITL Analysis", args), and
report.Plot_and_Text("Guidellm E2E Latency Analysis", args)) remove the leading
f prefix so they are plain string literals, and likewise replace the f-prefixed
"Unknown Configuration" literal with a normal string; no other logic changes are
needed.
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 369-370: The TTFT values are stored in seconds but are being
labeled as milliseconds; update every place that sets 'TTFT (ms)' (and any hover
column that uses benchmark.ttft_median or benchmark.ttft_p95) to multiply those
values by 1000 before assigning or displaying them. Locate usages in
plotting/throughput_analysis.py (e.g., the dict entry 'TTFT (ms)':
benchmark.ttft_median and the other block around lines 608-618) and in the
GuidellmTTFTAnalysis hover column code and change benchmark.ttft_median and
benchmark.ttft_p95 to benchmark.ttft_median * 1000 and benchmark.ttft_p95 * 1000
so the label 'ms' matches the numeric units.
- Around line 334-344: The loop dereferences entry.results.guidellm_benchmarks
without guarding against missing GuideLLM data; change the checks before
iterating so you skip entries where guidellm_benchmarks is None or missing
(e.g., replace the current if not entry.results.guidellm_benchmarks: with a
guarded form like if not getattr(entry.results, "guidellm_benchmarks", None):
continue) and use that guarded attribute when iterating benchmarks; apply the
same guarded check to the second occurrence around the block handling benchmarks
(the loop that inspects benchmark.strategy) so both places safely skip entries
lacking guidellm_benchmarks.
- Around line 384-393: The scatter currently groups points only by 'Test
Configuration' so update the plotting call to split traces by strategy to avoid
drawing a single line across different strategies: include the strategy column
(e.g., 'benchmark.strategy' or whatever column name holds strategy) as an
explicit grouping dimension (use PX arguments like symbol='benchmark.strategy'
and/or line_group=['Test Configuration','benchmark.strategy']) or, if you prefer
no connecting lines, change the trace mode to markers only; apply the same
change to the other plot block referenced around lines 518-547 so traces are
created per (configuration, strategy) rather than per configuration alone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6b314f50-2878-4b3a-9a99-0054834cf70c
📒 Files selected for processing (3)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
✅ Files skipped from review due to trivial changes (1)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
```python
header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)

# TTFT Analysis Section
header.append(html.H3("⚡ Time to First Token (TTFT) Analysis"))
header.append(html.P("Latency analysis showing response time percentiles across different concurrency levels"))
header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)

# TPOT Analysis Section
header.append(html.H3("🔄 Time Per Output Token (TPOT) Analysis"))
header.append(html.P("Token generation speed analysis showing how quickly individual tokens are produced"))
header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)

# ITL Analysis Section
header.append(html.H3("⏱️ Inter-Token Latency (ITL) Analysis"))
header.append(html.P("Streaming responsiveness analysis showing the delay between consecutive tokens"))
header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)

# E2E Latency Analysis Section
header.append(html.H3("🎯 End-to-End (E2E) Latency Analysis"))
header.append(html.P("Complete request duration analysis from request initiation to final response completion"))
header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "report.py" -path "*/llm-d/visualizations/*" -type f
```

Repository: openshift-psap/topsail

Length of output: 1858
🌐 Web query:
Ruff F541 rule unused f-string prefix
💡 Result:
The Ruff rule F541, named "f-string-missing-placeholders", detects f-strings that do not contain any placeholder expressions (e.g., f"Hello, world!"). It flags these as unnecessary since a regular string literal should be used instead, as f-strings without placeholders can confuse readers or indicate a forgotten placeholder. The rule is derived from Pyflakes, added in Ruff v0.0.18, and is automatically fixable by removing the 'f' prefix. Example violation: f"Hello, world!" Suggested fix: "Hello, world!" Note on implicit concatenation: To match Pyflakes, F541 only flags f-strings without placeholders if they are part of an implicit concatenation where NONE of the f-string segments contain placeholders. For instance: ( f"Hello," f" {name}!" # Not flagged, due to the placeholder in the second. ) ( f"Hello," f" World!" # Flagged, no placeholders in either. )
Citations:
- 1: https://docs.astral.sh/ruff/rules/f-string-missing-placeholders/
- 2: [pyflakes] Identify f-string without a placeholder in concatenated strings (F541) astral-sh/ruff#10886
- 3: https://docs.astral.sh/ruff/rules/
- 4: F541 does not find violations in concatenations astral-sh/ruff#10885
Remove the no-op f prefixes from these string literals.

These f-strings lack placeholder expressions and will trigger Ruff's F541 rule (f-string-missing-placeholders), causing the file to fail linting. Remove the `f` prefix from:

- `f"Guidellm Tokens vs Concurrency"`
- `f"Guidellm TTFT Analysis"`
- `f"Guidellm TPOT Analysis"`
- `f"Guidellm ITL Analysis"`
- `f"Guidellm E2E Latency Analysis"`
- `f"Unknown Configuration"` (line 678)
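For reference, F541 only concerns the no-op prefix; the `f` is needed exactly when the string interpolates a value:

```python
# Ruff F541 flags f-strings with no placeholders; the fix is to drop the prefix.
title = f"Guidellm TTFT Analysis"   # F541: no placeholder, the 'f' is a no-op
title = "Guidellm TTFT Analysis"    # equivalent, passes linting

# The prefix is only needed when the string interpolates a value:
metric = "TTFT"
title = f"Guidellm {metric} Analysis"
```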
🧰 Tools
🪛 Ruff (0.15.6)
[error] 648-648: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 653-653: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 658-658: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 663-663: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 668-668: f-string without any placeholders
Remove extraneous f prefix
(F541)
```python
best_tokens_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second)
best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))

best_tokens = all_benchmarks[best_tokens_idx]
best_efficiency = all_benchmarks[best_efficiency_idx]

# Configuration analysis
configurations = {}
for i, benchmark in enumerate(all_benchmarks):
    config = benchmark_configs[i]
    if config not in configurations:
        configurations[config] = {
            'max_tokens': 0,
            'optimal_concurrency': 0,
            'strategies': 0
        }

    if benchmark.tokens_per_second > configurations[config]['max_tokens']:
        configurations[config]['max_tokens'] = benchmark.tokens_per_second
        configurations[config]['optimal_concurrency'] = benchmark.request_concurrency

    configurations[config]['strategies'] += 1

# Sort configurations by performance
sorted_configs = sorted(configurations.items(), key=lambda x: x[1]['max_tokens'], reverse=True)

insights = [
    f"🏆 Best token throughput: {best_tokens.strategy} in {benchmark_configs[best_tokens_idx]} ({best_tokens.tokens_per_second:.0f} tok/s)",
    f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
```
Use the real mean concurrency when ranking efficiency.
request_concurrency is populated from concurrency_mean in projects/llm-d/visualizations/llmd_inference/store/parsers.py:249-314, so values below 1 are valid. Clamping it to 1 here can pick the wrong "Most efficient" run and understate its score for low-rate benchmarks.
🛠️ Proposed fix

```diff
-    best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+    best_efficiency_idx = max(
+        range(len(all_benchmarks)),
+        key=lambda i: (
+            all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+            if all_benchmarks[i].request_concurrency > 0
+            else float("-inf")
+        ),
+    )
     best_tokens = all_benchmarks[best_tokens_idx]
     best_efficiency = all_benchmarks[best_efficiency_idx]
+    best_efficiency_score = (
+        best_efficiency.tokens_per_second / best_efficiency.request_concurrency
+        if best_efficiency.request_concurrency > 0
+        else 0
+    )
...
-    f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+    f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)",
```
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| best_tokens_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second) | |
| best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1)) | |
| best_tokens = all_benchmarks[best_tokens_idx] | |
| best_efficiency = all_benchmarks[best_efficiency_idx] | |
| # Configuration analysis | |
| configurations = {} | |
| for i, benchmark in enumerate(all_benchmarks): | |
| config = benchmark_configs[i] | |
| if config not in configurations: | |
| configurations[config] = { | |
| 'max_tokens': 0, | |
| 'optimal_concurrency': 0, | |
| 'strategies': 0 | |
| } | |
| if benchmark.tokens_per_second > configurations[config]['max_tokens']: | |
| configurations[config]['max_tokens'] = benchmark.tokens_per_second | |
| configurations[config]['optimal_concurrency'] = benchmark.request_concurrency | |
| configurations[config]['strategies'] += 1 | |
| # Sort configurations by performance | |
| sorted_configs = sorted(configurations.items(), key=lambda x: x[1]['max_tokens'], reverse=True) | |
| insights = [ | |
| f"🏆 Best token throughput: {best_tokens.strategy} in {benchmark_configs[best_tokens_idx]} ({best_tokens.tokens_per_second:.0f} tok/s)", | |
| f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)", | |
| best_tokens_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second) | |
| best_efficiency_idx = max( | |
| range(len(all_benchmarks)), | |
| key=lambda i: ( | |
| all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency | |
| if all_benchmarks[i].request_concurrency > 0 | |
| else float("-inf") | |
| ), | |
| ) | |
| best_tokens = all_benchmarks[best_tokens_idx] | |
| best_efficiency = all_benchmarks[best_efficiency_idx] | |
| best_efficiency_score = ( | |
| best_efficiency.tokens_per_second / best_efficiency.request_concurrency | |
| if best_efficiency.request_concurrency > 0 | |
| else 0 | |
| ) | |
| # Configuration analysis | |
| configurations = {} | |
| for i, benchmark in enumerate(all_benchmarks): | |
| config = benchmark_configs[i] | |
| if config not in configurations: | |
| configurations[config] = { | |
| 'max_tokens': 0, | |
| 'optimal_concurrency': 0, | |
| 'strategies': 0 | |
| } | |
| if benchmark.tokens_per_second > configurations[config]['max_tokens']: | |
| configurations[config]['max_tokens'] = benchmark.tokens_per_second | |
| configurations[config]['optimal_concurrency'] = benchmark.request_concurrency | |
| configurations[config]['strategies'] += 1 | |
| # Sort configurations by performance | |
| sorted_configs = sorted(configurations.items(), key=lambda x: x[1]['max_tokens'], reverse=True) | |
| insights = [ | |
| f"🏆 Best token throughput: {best_tokens.strategy} in {benchmark_configs[best_tokens_idx]} ({best_tokens.tokens_per_second:.0f} tok/s)", | |
| f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)", |
```python
for entry in entries:
    if not entry.results.guidellm_benchmarks:
        continue

    # Get unique name for this entry (includes flavor info)
    entry_name = entry.get_name(variables)

    # Include all strategies - let's show the full picture
    for benchmark in entry.results.guidellm_benchmarks:
        if benchmark.strategy == "throughput":
            continue
```
Guard guidellm_benchmarks before reading it.
GuidellmPerformanceAnalysisReport already treats missing GuideLLM data as a normal case. These helpers still dereference entry.results.guidellm_benchmarks unconditionally, so one mixed matrix entry will raise AttributeError before the plot can skip it.
🛡️ Proposed fix

```diff
-    if not entry.results.guidellm_benchmarks:
+    benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+    if not benchmarks:
         continue
     ...
-    for benchmark in entry.results.guidellm_benchmarks:
+    for benchmark in benchmarks:
```

Also applies to: 472-481
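The guarded-access pattern from the fix can be exercised in isolation; the entry/results object shapes below are stand-ins, not the real parser models.

```python
# Sketch: skip entries whose results lack GuideLLM data instead of raising.
from types import SimpleNamespace

def iter_guidellm_benchmarks(entries):
    """Yield benchmarks, skipping entries without GuideLLM data."""
    for entry in entries:
        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
        if not benchmarks:
            continue  # attribute missing or list empty: skip, no AttributeError
        for benchmark in benchmarks:
            yield benchmark

entries = [
    SimpleNamespace(results=SimpleNamespace()),                        # attribute missing
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=[])),  # empty list
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=["b1", "b2"])),
]
found = list(iter_guidellm_benchmarks(entries))
```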
```python
'TTFT (ms)': benchmark.ttft_median,
'Latency (ms)': benchmark.request_latency_median * 1000
```
Convert TTFT back to milliseconds before labeling it as ms.
projects/llm-d/visualizations/llmd_inference/store/parsers.py:249-314 stores ttft_median and ttft_p95 in seconds. The new hover column and GuidellmTTFTAnalysis currently display those values with an ms label without multiplying by 1000.
⏱️ Proposed fix

```diff
- 'TTFT (ms)': benchmark.ttft_median,
+ 'TTFT (ms)': benchmark.ttft_median * 1000,
```

```diff
- 'unit_conversion': 1, # Already in ms
+ 'unit_conversion': 1000, # Convert from seconds to ms
```

Also applies to: 608-618
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`
around lines 369 - 370, The TTFT values are stored in seconds but are being
labeled as milliseconds; update every place that sets 'TTFT (ms)' (and any hover
column that uses benchmark.ttft_median or benchmark.ttft_p95) to multiply those
values by 1000 before assigning or displaying them. Locate usages in
plotting/throughput_analysis.py (e.g., the dict entry 'TTFT (ms)':
benchmark.ttft_median and the other block around lines 608-618) and in the
GuidellmTTFTAnalysis hover column code and change benchmark.ttft_median and
benchmark.ttft_p95 to benchmark.ttft_median * 1000 and benchmark.ttft_p95 * 1000
so the label 'ms' matches the numeric units.
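The round trip behind this finding can be illustrated with a small sketch. `parse_ttft` and `hover_row` are hypothetical stand-ins for the parser and the display layer, assuming the parser stores TTFT in seconds as described:

```python
def parse_ttft(raw_ms: float) -> float:
    """Mirror of the parser behavior: store TTFT in seconds."""
    return raw_ms / 1000.0

def hover_row(ttft_stored_s: float) -> dict:
    """Display layer: convert back to ms so the label matches the unit."""
    return {"TTFT (ms)": ttft_stored_s * 1000}

stored = parse_ttft(142.0)   # ~0.142 s in storage
row = hover_row(stored)      # ~142.0 under the 'TTFT (ms)' key
```

Without the `* 1000` in the display layer, a 142 ms TTFT would be shown as 0.142 under an "ms" label.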
```
fig = px.scatter(df,
                 hover_data=df.columns,
                 x='Concurrency',
                 y='Tokens/s',
                 color='Test Configuration',
                 color_discrete_map=color_map,
                 title='Token Throughput vs Concurrency by Configuration<br><sub>Higher is better</sub>')

fig.update_traces(mode='lines+markers')
fig.update_layout(showlegend=True)
```
Don't draw a single line across different strategies.
Both plots collect multiple benchmark.strategy values per configuration, but the traces are built only per configuration. That joins unrelated strategies into one synthetic curve and can misstate the scaling trend. Split traces by (configuration, strategy) or keep these as markers only.
Also applies to: 518-547
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`
around lines 384 - 393, The scatter currently groups points only by 'Test
Configuration' so update the plotting call to split traces by strategy to avoid
drawing a single line across different strategies: include the strategy column
(e.g., 'benchmark.strategy' or whatever column name holds strategy) as an
explicit grouping dimension (use PX arguments like symbol='benchmark.strategy'
and/or line_group=['Test Configuration','benchmark.strategy']) or, if you prefer
no connecting lines, change the trace mode to markers only; apply the same
change to the other plot block referenced around lines 518-547 so traces are
created per (configuration, strategy) rather than per configuration alone.
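The grouping idea can be shown without plotly: build one trace per (configuration, strategy) pair so a line never joins points from different strategies. The field names below are illustrative, not the real column names:

```python
from collections import defaultdict

points = [
    {"config": "A", "strategy": "constant", "concurrency": 1, "tokens_s": 100},
    {"config": "A", "strategy": "constant", "concurrency": 4, "tokens_s": 320},
    {"config": "A", "strategy": "sweep",    "concurrency": 2, "tokens_s": 180},
]

# One trace per (configuration, strategy) pair.
traces = defaultdict(list)
for p in points:
    traces[(p["config"], p["strategy"])].append((p["concurrency"], p["tokens_s"]))

# Within a trace, connect points in concurrency order only.
for key in traces:
    traces[key].sort()

print(len(traces))  # → 2 traces instead of one line joining both strategies
```

In plotly terms, the same split falls out of passing the strategy column as an extra grouping dimension (e.g. `line_group` or `symbol`) instead of coloring by configuration alone.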
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 2
♻️ Duplicate comments (3)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (3)
678-678: ⚠️ Potential issue | 🟡 Minor

Remove extraneous `f` prefix.

Line 678 uses f"Unknown Configuration" without placeholders.

🔧 Proposed fix

```diff
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` at line 678, The code assigns entry_name using an unnecessary f-string when no placeholders are present; in the assignment that calls entry.get_name(variables) if variables else f"Unknown Configuration", remove the extraneous `f` prefix so the fallback is a plain string ("Unknown Configuration")—update the expression in the same function/method where entry_name is computed (the conditional using entry.get_name and variables) to use a regular string literal for the else branch.
648-668: ⚠️ Potential issue | 🟡 Minor

Remove extraneous `f` prefixes from string literals.

These f-strings have no placeholders, triggering Ruff F541. Remove the `f` prefix from each:

- Line 648: f"Guidellm Tokens vs Concurrency"
- Line 653: f"Guidellm TTFT Analysis"
- Line 658: f"Guidellm TPOT Analysis"
- Line 663: f"Guidellm ITL Analysis"
- Line 668: f"Guidellm E2E Latency Analysis"

🔧 Proposed fix

```diff
- header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)
+ header += report.Plot_and_Text("Guidellm Tokens vs Concurrency", args)
...
- header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TTFT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TPOT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)
+ header += report.Plot_and_Text("Guidellm ITL Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
+ header += report.Plot_and_Text("Guidellm E2E Latency Analysis", args)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 648 - 668, The string literals passed to report.Plot_and_Text and header.append are incorrectly using f-strings with no placeholders (e.g., calls to report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args) and similar for TTFT/TPOT/ITL/E2E sections); remove the unnecessary leading "f" on each string so they become plain string literals (e.g., report.Plot_and_Text("Guidellm Tokens vs Concurrency", args) and likewise for "Guidellm TTFT Analysis", "Guidellm TPOT Analysis", "Guidellm ITL Analysis", and "Guidellm E2E Latency Analysis") to satisfy Ruff F541.
687-687: ⚠️ Potential issue | 🟠 Major

Efficiency calculation clamps concurrency to 1, potentially misranking low-rate benchmarks.

The max(..., 1) clamp distorts efficiency scores when request_concurrency is legitimately below 1 (which is valid per parsers.py using concurrency_mean). This could select the wrong "Most efficient" run.

🛠️ Proposed fix

```diff
- best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+ best_efficiency_idx = max(
+     range(len(all_benchmarks)),
+     key=lambda i: (
+         all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+         if all_benchmarks[i].request_concurrency > 0
+         else float("-inf")
+     ),
+ )
```

And update line 714 similarly:

```diff
- f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / best_efficiency.request_concurrency if best_efficiency.request_concurrency > 0 else 0):.0f} tok/s per concurrency unit)",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` at line 687, The efficiency calculation currently forces a minimum concurrency of 1, which skews ranking for benchmarks with request_concurrency < 1; in the selection logic (best_efficiency_idx using all_benchmarks, tokens_per_second and request_concurrency) remove the hard clamp to 1 and instead guard against division-by-zero by either skipping entries with request_concurrency == 0 or dividing by a very small epsilon (e.g., float_min) so fractional concurrencies are honored; apply the same fix to the other efficiency calculation referenced near the second occurrence (the similar expression around line 714) so both use the true request_concurrency when computing tokens_per_second / request_concurrency.
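A minimal sketch of the suggested ranking, assuming a `Bench` record with the two fields the comment mentions; zero-concurrency entries are pushed to the bottom of the ranking instead of being clamped:

```python
from dataclasses import dataclass

@dataclass
class Bench:
    tokens_per_second: float
    request_concurrency: float

def best_efficiency_idx(benchmarks):
    """Index of the run with the most tokens/s per concurrency unit."""
    def score(i):
        b = benchmarks[i]
        if b.request_concurrency <= 0:
            return float("-inf")  # degenerate entry: never selected
        return b.tokens_per_second / b.request_concurrency
    return max(range(len(benchmarks)), key=score)

benchs = [Bench(100, 4), Bench(60, 0.5), Bench(10, 0)]
print(best_efficiency_idx(benchs))  # → 1: 120 tok/s per unit beats 25
```

With the old `max(..., 1)` clamp, the fractional-concurrency run would score 60 instead of 120 and lose to the first entry.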
🧹 Nitpick comments (1)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (1)
518-518: Unused loop variable `i`.

Per Ruff B007, the loop variable `i` is unused. Rename to `_i` or use `_`.

🔧 Proposed fix

```diff
- for i, config_name in enumerate(configurations):
+ for _i, config_name in enumerate(configurations):
```

Or simply:

```diff
- for i, config_name in enumerate(configurations):
+ for config_name in configurations:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 518, The loop in throughput_analysis.py uses an unused index variable `i` in the line `for i, config_name in enumerate(configurations):`; rename `i` to `_` or `_i` to satisfy Ruff B007 and avoid the unused-variable warning, ensuring there are no other references to `i` elsewhere in the function (e.g., inside any inner scope of the same loop such as in functions/methods used there like plot_throughput or aggregate_metrics) before committing the rename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 48-49: The rate value is being left as a YAML list and thus
rendered with brackets in the Jinja2 template; update toolbox/llmd.py (the logic
around the tuple conversion at lines handling llmd_run_guidellm_benchmark_rate)
to also detect lists and serialize them into the comma-separated string expected
by the guidellm CLI (e.g., join elements into "32,64,128,256,512") or explicitly
cast to the same format used for tuples; additionally, remove the unused
tests.llmd.benchmarks.guidellm.max_requests: 320 entry from config.yaml unless
you intend to add support for it—if adding support, expose a max_requests
parameter in run_guidellm_benchmark() and consume it in test_llmd.py
accordingly.
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 608-618: The GuidellmTTFTAnalysis initializer currently sets
unit_conversion to 1 with a misleading comment "Already in ms" but TTFT values
are stored in seconds (parsed as ttft_median / 1000.0); update the
unit_conversion in the GuidellmTTFTAnalysis constructor to 1000 so displayed
values convert seconds back to milliseconds (or alternatively move conversion to
the display layer), and adjust the comment to reflect that stored values are in
seconds; refer to GuidellmTTFTAnalysis and its 'unit_conversion' field to locate
the change.
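The list-to-CLI-string conversion described for config.yaml above might look like the following; `normalize_rate` is a hypothetical helper, not the actual toolbox code:

```python
def normalize_rate(rate):
    """Serialize a scalar, tuple, or list rate into the comma-separated
    string form expected by the guidellm CLI."""
    if isinstance(rate, (list, tuple)):
        return ",".join(str(r) for r in rate)
    return str(rate)

print(normalize_rate([32, 64, 128, 256, 512]))  # → 32,64,128,256,512
print(normalize_rate(64))                       # → 64
```

Handling lists and tuples in one branch avoids the bracketed `[32, 64, ...]` rendering that a raw YAML list would produce in the Jinja2 template.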
---
Duplicate comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Line 678: The code assigns entry_name using an unnecessary f-string when no
placeholders are present; in the assignment that calls entry.get_name(variables)
if variables else f"Unknown Configuration", remove the extraneous `f` prefix so
the fallback is a plain string ("Unknown Configuration")—update the expression
in the same function/method where entry_name is computed (the conditional using
entry.get_name and variables) to use a regular string literal for the else
branch.
- Around line 648-668: The string literals passed to report.Plot_and_Text and
header.append are incorrectly using f-strings with no placeholders (e.g., calls
to report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args) and similar for
TTFT/TPOT/ITL/E2E sections); remove the unnecessary leading "f" on each string
so they become plain string literals (e.g., report.Plot_and_Text("Guidellm
Tokens vs Concurrency", args) and likewise for "Guidellm TTFT Analysis",
"Guidellm TPOT Analysis", "Guidellm ITL Analysis", and "Guidellm E2E Latency
Analysis") to satisfy Ruff F541.
- Line 687: The efficiency calculation currently forces a minimum concurrency of
1, which skews ranking for benchmarks with request_concurrency < 1; in the
selection logic (best_efficiency_idx using all_benchmarks, tokens_per_second and
request_concurrency) remove the hard clamp to 1 and instead guard against
division-by-zero by either skipping entries with request_concurrency == 0 or
dividing by a very small epsilon (e.g., float_min) so fractional concurrencies
are honored; apply the same fix to the other efficiency calculation referenced
near the second occurrence (the similar expression around line 714) so both use
the true request_concurrency when computing tokens_per_second /
request_concurrency.
---
Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Line 518: The loop in throughput_analysis.py uses an unused index variable `i`
in the line `for i, config_name in enumerate(configurations):`; rename `i` to
`_` or `_i` to satisfy Ruff B007 and avoid the unused-variable warning, ensuring
there are no other references to `i` elsewhere in the function (e.g., inside any
inner scope of the same loop such as in functions/methods used there like
plot_throughput or aggregate_metrics) before committing the rename.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aeb662b9-94bf-4107-94e4-9ea5c29cece2
📒 Files selected for processing (4)
- projects/llm-d/testing/config.yaml
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
✅ Files skipped from review due to trivial changes (1)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
```
class GuidellmTTFTAnalysis(GuidellmLatencyAnalysisBase):
    def __init__(self):
        super().__init__({
            'name': 'TTFT',
            'description': 'TTFT Latency',
            'p50_field': 'ttft_median',
            'p95_field': 'ttft_p95',
            'unit_conversion': 1,  # Already in ms
            'note_text': 'P1 and P90 percentiles require additional data extraction from logs.',
            'insight_text': 'Lower TTFT values indicate better responsiveness. Look for configurations with consistently low and stable TTFT across concurrency levels.'
        })
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check how ttft_median is parsed/stored in parsers.py
rg -n 'ttft_median' --type=py -C5 projects/llm-d/visualizations/llmd_inference/store/
```

Repository: openshift-psap/topsail

Length of output: 2934
Fix TTFT unit conversion - stored values are in seconds, not milliseconds.
The parsers.py file (lines 303-304) explicitly converts TTFT from milliseconds to seconds: ttft_median / 1000.0. The comment "Already in ms" at line 615 is incorrect. Since benchmark values are stored in seconds, unit_conversion should be 1000 to correctly convert seconds back to milliseconds for display, or this conversion should be handled separately in the display logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`
around lines 608 - 618, The GuidellmTTFTAnalysis initializer currently sets
unit_conversion to 1 with a misleading comment "Already in ms" but TTFT values
are stored in seconds (parsed as ttft_median / 1000.0); update the
unit_conversion in the GuidellmTTFTAnalysis constructor to 1000 so displayed
values convert seconds back to milliseconds (or alternatively move conversion to
the display layer), and adjust the comment to reflect that stored values are in
seconds; refer to GuidellmTTFTAnalysis and its 'unit_conversion' field to locate
the change.
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 14 minutes 00 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
Actionable comments posted: 1
♻️ Duplicate comments (7)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (2)
335-355: ⚠️ Potential issue | 🟡 Minor

Drop the no-op `f` prefixes in these literals.

These strings have no placeholders, so Ruff F541 will keep flagging the file until they are plain string literals.

Suggested fix

```diff
- header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)
+ header += report.Plot_and_Text("Guidellm Tokens vs Concurrency", args)
@@
- header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TTFT Analysis", args)
@@
- header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TPOT Analysis", args)
@@
- header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)
+ header += report.Plot_and_Text("Guidellm ITL Analysis", args)
@@
- header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
+ header += report.Plot_and_Text("Guidellm E2E Latency Analysis", args)
@@
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```

Also applies to: 365-365
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 335 - 355, Remove the unnecessary f-string prefixes on literal strings passed to report.Plot_and_Text and any other plain string literals in this block (e.g., calls like report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args), report.Plot_and_Text(f"Guidellm TTFT Analysis", args), report.Plot_and_Text(f"Guidellm TPOT Analysis", args), report.Plot_and_Text(f"Guidellm ITL Analysis", args), report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args) and similar at the other location), replacing them with normal string literals so Ruff F541 no longer flags the file.
374-375: ⚠️ Potential issue | 🟠 Major

Use the real mean concurrency when ranking efficiency.

`request_concurrency` here is already the parsed mean concurrency. Flooring it to 1 hides valid sub-1 values, which can mis-rank the most efficient run and understate its score.

Suggested fix

```diff
- best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+ best_efficiency_idx = max(
+     range(len(all_benchmarks)),
+     key=lambda i: (
+         all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+         if all_benchmarks[i].request_concurrency > 0
+         else float("-inf")
+     ),
+ )
  best_tokens = all_benchmarks[best_tokens_idx]
  best_efficiency = all_benchmarks[best_efficiency_idx]
+ best_efficiency_score = (
+     best_efficiency.tokens_per_second / best_efficiency.request_concurrency
+     if best_efficiency.request_concurrency > 0
+     else 0
+ )
@@
- f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)",
```

Also applies to: 400-401
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 374 - 375, The ranking currently uses max(..., key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1)) which floors request_concurrency to 1 and hides sub-1 mean concurrencies; change the comparator in best_efficiency_idx (and the similar logic around the blocks at the other occurrence) to divide tokens_per_second by the actual all_benchmarks[i].request_concurrency value instead, but guard against a zero value by using a tiny epsilon (e.g., treat 0 as 1e-9) or skip zero-concurrency entries to avoid division-by-zero—update only the comparator used in best_efficiency_idx and the matching comparator at the other spot.

projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (5)
607-617: ⚠️ Potential issue | 🟠 Major

Verify TTFT unit conversion - comment may be incorrect.

The comment states "Already in ms" but past review identified that parsers.py converts TTFT to seconds (ttft_median / 1000.0). If stored in seconds, unit_conversion should be 1000 to display in milliseconds.

⏱️ Proposed fix (if values are in seconds)

```diff
- 'unit_conversion': 1,     # Already in ms
+ 'unit_conversion': 1000,  # Convert from seconds to ms
```

```shell
#!/bin/bash
# Check how ttft_median is stored in parsers.py
rg -n 'ttft_median' --type=py -C5 projects/llm-d/visualizations/llmd_inference/store/
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 607 - 617, The TTFT unit_conversion is likely wrong: verify how ttft_median is stored (check parsers.py where ttft_median is computed/converted) and if ttft_median is stored in seconds (e.g., divided by 1000.0), update GuidellmTTFTAnalysis to set unit_conversion to 1000 (instead of 1) and correct the comment accordingly; if ttft_median is indeed already in milliseconds, leave unit_conversion as 1 but update the comment to confirm units. Ensure you modify the GuidellmTTFTAnalysis constructor (unit_conversion and comment) to reflect the actual stored unit.
383-392: ⚠️ Potential issue | 🟡 Minor

Lines connect points across different strategies within the same configuration.

The scatter plot with mode='lines+markers' connects all data points per Test Configuration, but each configuration may contain multiple strategies. This creates misleading connecting lines between unrelated strategies.

Consider either:

- Remove lines (mode='markers' only), or
- Add strategy as an additional grouping dimension

♻️ Option 1: Markers only

```diff
- fig.update_traces(mode='lines+markers')
+ fig.update_traces(mode='markers')
```
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 383 - 392, The plotted lines are connecting points across different strategies within the same 'Test Configuration'; update the px.scatter call or trace settings so lines only connect comparable points: either change the trace drawing to markers-only by removing lines (set mode to 'markers' via fig.update_traces or px.scatter parameters) or add an explicit grouping by strategy (e.g., include 'Strategy' as an additional grouping dimension such as line_group='Strategy' or use symbol/facet for 'Strategy') so lines are drawn per strategy instead of across strategies; adjust references around px.scatter, fig.update_traces, 'Test Configuration', 'Strategy', 'Concurrency', and 'Tokens/s' accordingly.
472-473: ⚠️ Potential issue | 🟠 Major

Guard `guidellm_benchmarks` before reading it.

Same issue as GuidellmTokensConcurrency - attribute access without guard.

🛡️ Proposed fix

```diff
- if not entry.results.guidellm_benchmarks:
+ benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+ if not benchmarks:
      continue
  ...
- for benchmark in entry.results.guidellm_benchmarks:
+ for benchmark in benchmarks:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 472 - 473, The current code reads entry.results.guidellm_benchmarks without ensuring the attribute exists, which can raise AttributeError; change the guard to safely probe the attribute (e.g., use getattr(entry.results, "guidellm_benchmarks", None) or hasattr(entry.results, "guidellm_benchmarks")) before using it so the loop continues when the attribute is missing or falsy — apply the same pattern used in GuidellmTokensConcurrency to guard access to entry.results.guidellm_benchmarks.
334-335: ⚠️ Potential issue | 🟠 Major

Guard `guidellm_benchmarks` before reading it.

`entry.results.guidellm_benchmarks` is accessed unconditionally. If `entry.results` lacks this attribute, an `AttributeError` will be raised before the check executes.

🛡️ Proposed fix

```diff
- if not entry.results.guidellm_benchmarks:
+ benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+ if not benchmarks:
      continue
  ...
- for benchmark in entry.results.guidellm_benchmarks:
+ for benchmark in benchmarks:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 334 - 335, The code reads entry.results.guidellm_benchmarks before checking it which can raise AttributeError if results lacks that attribute; change the guard to safely test for the attribute and its truthiness (e.g., use hasattr(entry.results, "guidellm_benchmarks") or getattr(entry.results, "guidellm_benchmarks", None) and then continue if missing or falsy) so the loop skips entries without guidellm_benchmarks instead of throwing; update the conditional that currently references entry.results.guidellm_benchmarks to use one of these safe checks.
368-369: ⚠️ Potential issue | 🟠 Major

Convert TTFT to milliseconds before labeling it as ms.

Based on the parser (parsers.py), ttft_median is stored in seconds (converted via ttft_median / 1000.0). The hover data labels this as ms but uses the raw value in seconds.

⏱️ Proposed fix

```diff
- 'TTFT (ms)': benchmark.ttft_median,
+ 'TTFT (ms)': benchmark.ttft_median * 1000,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 368 - 369, The hover label 'TTFT (ms)' is populated with benchmark.ttft_median which is in seconds; convert it to milliseconds to match the unit label by replacing benchmark.ttft_median with benchmark.ttft_median * 1000 (similar to how request_latency_median is converted) where the dict with key 'TTFT (ms)' is constructed in throughput_analysis.py so the displayed value matches the "ms" unit.
🧹 Nitpick comments (1)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (1)
517-517: Unused loop variable `i`.

The loop variable `i` is not used. Rename to `_` to indicate it's intentionally unused.

```diff
- for i, config_name in enumerate(configurations):
+ for _, config_name in enumerate(configurations):
```

Or simply iterate without enumerate:

```diff
- for i, config_name in enumerate(configurations):
+ for config_name in configurations:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 517, The loop uses an unused index variable `i` in "for i, config_name in enumerate(configurations):"; remove the unused variable by either changing the loop to "for config_name in configurations:" or replacing `i` with `_` (i.e., "for _, config_name in enumerate(configurations):") to make intent clear; update any references to `i` if present (there should be none) and keep `config_name` as the iteration variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 864-865: The suffix computation uses len(rate) which can be a
scalar and causes a TypeError; change the condition to inspect the normalized
list rate_values instead (i.e., set suffix = f"_rate{rate_value}" if
len(rate_values) > 1 else None) so the artifact suffix logic uses the list
created earlier (rate_values and rate_value).
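The corrected suffix logic can be sketched as follows; `artifact_suffix` is a hypothetical helper that reuses the `rate_values`/`rate_value` names from the prompt:

```python
from typing import Optional

def artifact_suffix(rate_values: list, rate_value) -> Optional[str]:
    """Suffix artifacts only when several rates run, so directories stay
    unique; a single-rate run needs no suffix."""
    return f"_rate{rate_value}" if len(rate_values) > 1 else None

print(artifact_suffix([32, 64], 32))  # → _rate32
print(artifact_suffix([64], 64))      # → None
```

Testing `len(rate_values)` on the normalized list avoids the `TypeError` that `len(rate)` raises when the configured rate is a scalar.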
---
Duplicate comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 335-355: Remove the unnecessary f-string prefixes on literal
strings passed to report.Plot_and_Text and any other plain string literals in
this block (e.g., calls like report.Plot_and_Text(f"Guidellm Tokens vs
Concurrency", args), report.Plot_and_Text(f"Guidellm TTFT Analysis", args),
report.Plot_and_Text(f"Guidellm TPOT Analysis", args),
report.Plot_and_Text(f"Guidellm ITL Analysis", args),
report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args) and similar at the
other location), replacing them with normal string literals so Ruff F541 no
longer flags the file.
- Around line 374-375: The ranking currently uses max(..., key=lambda i:
all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency,
1)) which floors request_concurrency to 1 and hides sub-1 mean concurrencies;
change the comparator in best_efficiency_idx (and the similar logic around the
blocks at the other occurrence) to divide tokens_per_second by the actual
all_benchmarks[i].request_concurrency value instead, but guard against a zero
value by using a tiny epsilon (e.g., treat 0 as 1e-9) or skip zero-concurrency
entries to avoid division-by-zero—update only the comparator used in
best_efficiency_idx and the matching comparator at the other spot.
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 607-617: The TTFT unit_conversion is likely wrong: verify how
ttft_median is stored (check parsers.py where ttft_median is computed/converted)
and if ttft_median is stored in seconds (e.g., divided by 1000.0), update
GuidellmTTFTAnalysis to set unit_conversion to 1000 (instead of 1) and correct
the comment accordingly; if ttft_median is indeed already in milliseconds, leave
unit_conversion as 1 but update the comment to confirm units. Ensure you modify
the GuidellmTTFTAnalysis constructor (unit_conversion and comment) to reflect
the actual stored unit.
- Around line 383-392: The plotted lines are connecting points across different
strategies within the same 'Test Configuration'; update the px.scatter call or
trace settings so lines only connect comparable points: either change the trace
drawing to markers-only by removing lines (set mode to 'markers' via
fig.update_traces or px.scatter parameters) or add an explicit grouping by
strategy (e.g., include 'Strategy' as an additional grouping dimension such as
line_group='Strategy' or use symbol/facet for 'Strategy') so lines are drawn per
strategy instead of across strategies; adjust references around px.scatter,
fig.update_traces, 'Test Configuration', 'Strategy', 'Concurrency', and
'Tokens/s' accordingly.
- Around line 472-473: The current code reads entry.results.guidellm_benchmarks
without ensuring the attribute exists, which can raise AttributeError; change
the guard to safely probe the attribute (e.g., use getattr(entry.results,
"guidellm_benchmarks", None) or hasattr(entry.results, "guidellm_benchmarks"))
before using it so the loop continues when the attribute is missing or falsy —
apply the same pattern used in GuidellmTokensConcurrency to guard access to
entry.results.guidellm_benchmarks.
- Around line 334-335: The code reads entry.results.guidellm_benchmarks before
checking it which can raise AttributeError if results lacks that attribute;
change the guard to safely test for the attribute and its truthiness (e.g., use
hasattr(entry.results, "guidellm_benchmarks") or getattr(entry.results,
"guidellm_benchmarks", None) and then continue if missing or falsy) so the loop
skips entries without guidellm_benchmarks instead of throwing; update the
conditional that currently references entry.results.guidellm_benchmarks to use
one of these safe checks.
- Around line 368-369: The hover label 'TTFT (ms)' is populated with
benchmark.ttft_median which is in seconds; convert it to milliseconds to match
the unit label by replacing benchmark.ttft_median with benchmark.ttft_median *
1000 (similar to how request_latency_median is converted) where the dict with
key 'TTFT (ms)' is constructed in throughput_analysis.py so the displayed value
matches the "ms" unit.
---
Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Line 517: The loop uses an unused index variable `i` in "for i, config_name in
enumerate(configurations):"; remove the unused variable by either changing the
loop to "for config_name in configurations:" or replacing `i` with `_` (i.e.,
"for _, config_name in enumerate(configurations):") to make intent clear; update
any references to `i` if present (there should be none) and keep `config_name`
as the iteration variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b6bd5929-619d-486d-a1b4-021f7fb0da3a
📒 Files selected for processing (19)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (8)
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- docs/toolbox.generated/index.rst
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
✅ Files skipped from review due to trivial changes (2)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
|
🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 29 minutes 23 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
72eaad9 to f30dd18 (Compare)
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
There was a problem hiding this comment.
♻️ Duplicate comments (6)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (2)
363-365: ⚠️ Potential issue | 🟡 Minor
Drop the no-op `f` prefix here.
`f"Unknown Configuration"` has no placeholders and still triggers Ruff F541.
🧹 Suggested fix

```diff
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 363 - 365, The code creates entry_name using entry.get_name(variables) or an f-string with no placeholders; remove the no-op f prefix to avoid Ruff F541 by changing the fallback from f"Unknown Configuration" to a plain string "Unknown Configuration" where entry_name is assigned inside the loop that iterates over entries (see variables, entry.get_name, and entry_name).
373-375: ⚠️ Potential issue | 🟠 Major
Use the real mean concurrency when ranking efficiency.
In projects/llm-d/visualizations/llmd_inference/store/parsers.py, lines 178-179 and 201, `request_concurrency` comes from GuideLLM's mean concurrency, so values below 1 are valid. Clamping it to 1 here can pick the wrong "Most efficient" run and understate its score; guard only against zero.
🛠️ Suggested fix

```diff
- best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+ best_efficiency_idx = max(
+     range(len(all_benchmarks)),
+     key=lambda i: (
+         all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+         if all_benchmarks[i].request_concurrency > 0
+         else float("-inf")
+     ),
+ )
@@
- f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / best_efficiency.request_concurrency):.0f} tok/s per concurrency unit)" if best_efficiency.request_concurrency > 0 else
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} (undefined for zero concurrency)",
```

Also applies to: 399-401
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 373 - 375, The efficiency ranking currently clamps all_benchmarks[i].request_concurrency to 1 which incorrectly treats valid fractional mean concurrencies as >=1; update the computation of best_efficiency_idx to avoid clamping to 1 and instead only guard against zero (e.g., divide by max(all_benchmarks[i].request_concurrency, tiny_eps) where tiny_eps is a very small positive value) so that fractional concurrencies are preserved; change the same pattern where used (the other occurrence around best_tokens_idx/best_efficiency_idx).
projects/llm-d/testing/test_llmd.py (1)
894-895: ⚠️ Potential issue | 🔴 Critical
Use `rate_values` when deciding whether to suffix artifacts.
With the default `tests.llmd.benchmarks.guidellm.rate: 1`, `len(rate)` raises `TypeError` before the first benchmark starts. This branch needs to check the normalized list, not the raw config value.
🐛 Suggested fix

```diff
- suffix = f"_rate{rate_value}" if len(rate) > 1\
+ suffix = f"_rate{rate_value}" if len(rate_values) > 1\
      else None
```
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 894 - 895, The suffix logic is checking len(rate) which can be a non-iterable and raise TypeError; change the condition to inspect the normalized list (rate_values) instead of the raw config (rate) so the branch uses len(rate_values) when deciding the suffix for rate_value (i.e., set suffix = f"_rate{rate_value}" if len(rate_values) > 1 else None).
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (3)
333-345: ⚠️ Potential issue | 🟠 Major
Guard `guidellm_benchmarks` before iterating.
One mixed matrix entry without parsed GuideLLM results will raise `AttributeError` here and abort the stat. These new paths need the same guarded access pattern as the report layer.
🛡️ Suggested fix

```diff
-    for entry in entries:
-        if not entry.results.guidellm_benchmarks:
+    for entry in entries:
+        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+        if not benchmarks:
             continue
@@
-        for benchmark in entry.results.guidellm_benchmarks:
+        for benchmark in benchmarks:
             if benchmark.strategy == "throughput":
                 continue
```

```diff
-    for entry in entries:
-        if not entry.results.guidellm_benchmarks:
+    for entry in entries:
+        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+        if not benchmarks:
             continue
@@
-        for benchmark in entry.results.guidellm_benchmarks:
+        for benchmark in benchmarks:
             if benchmark.strategy == "throughput":
                 continue
```

Also applies to: 471-479
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 333 - 345, The loop over entry.results.guidellm_benchmarks assumes that attribute exists and raises AttributeError for mixed entries; before iterating, guard access by checking the attribute (e.g., if not getattr(entry.results, "guidellm_benchmarks", None): continue) so only entries with parsed GuideLLM results are processed; apply the same guarded pattern to the second loop that checks benchmark.strategy (the similar block around where benchmark.strategy is inspected) and keep using entry.get_name(variables) only for entries that pass this guard.
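The getattr-based guard the comment describes can be sketched independently of the real parser models; here `SimpleNamespace` stands in for the actual entry/results objects, so the names are illustrative:

```python
from types import SimpleNamespace

def iter_guidellm_benchmarks(entries):
    """Yield benchmarks only from entries whose results actually expose them."""
    for entry in entries:
        # getattr() with a default never raises AttributeError, and the
        # falsy check also skips entries whose benchmark list parsed empty.
        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
        if not benchmarks:
            continue
        yield from benchmarks

# One entry with benchmarks, one missing the attribute, one with an empty list.
entries = [
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=["b1", "b2"])),
    SimpleNamespace(results=SimpleNamespace()),
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=[])),
]
print(list(iter_guidellm_benchmarks(entries)))  # ['b1', 'b2']
```

The same helper would cover both flagged loops (lines 333-345 and 471-479) without repeating the guard.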
359-369: ⚠️ Potential issue | 🟠 Major
Convert TTFT back to milliseconds before labeling it ms.
projects/llm-d/visualizations/llmd_inference/store/parsers.py (lines 213-214) stores `ttft_median` and `ttft_p95` in seconds. These displays currently label seconds as milliseconds, so TTFT is understated by 1000x.
⏱️ Suggested fix

```diff
 data.append({
     'Test Configuration': entry_name,
     'Concurrency': benchmark.request_concurrency,
     'Tokens/s': benchmark.tokens_per_second,
     'Input Tokens/s': benchmark.input_tokens_per_second,
     'Output Tokens/s': benchmark.output_tokens_per_second,
     'Strategy': benchmark.strategy,
     'Strategy Type': strategy_type,
     'Request Rate (req/s)': benchmark.request_rate,
-    'TTFT (ms)': benchmark.ttft_median,
+    'TTFT (ms)': benchmark.ttft_median * 1000,
     'Latency (ms)': benchmark.request_latency_median * 1000
 })
```

```diff
 class GuidellmTTFTAnalysis(GuidellmLatencyAnalysisBase):
     def __init__(self):
         super().__init__({
             'name': 'TTFT',
             'description': 'TTFT Latency',
             'p50_field': 'ttft_median',
             'p95_field': 'ttft_p95',
-            'unit_conversion': 1,  # Already in ms
+            'unit_conversion': 1000,  # Stored in seconds
             'note_text': 'P1 and P90 percentiles require additional data extraction from logs.',
             'insight_text': 'Lower TTFT values indicate better responsiveness. Look for configurations with consistently low and stable TTFT across concurrency levels.'
         })
```

Also applies to: 607-617
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 359 - 369, The TTFT values stored in parsers.py (ttft_median and ttft_p95 are in seconds) are being labeled as milliseconds in throughput_analysis.py; update the construction of the data dict in the function that builds rows (the block that appends entries with keys 'TTFT (ms)' and later the similar block around lines 607-617) to multiply benchmark.ttft_median and benchmark.ttft_p95 by 1000 before assigning them to the 'TTFT (ms)' and corresponding p95 fields so the displayed values are in milliseconds while keeping the label unchanged.
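The unit fix above is a single scale factor applied at display time. A minimal sketch, with an illustrative benchmark object and only the two affected columns:

```python
from types import SimpleNamespace

def build_latency_row(entry_name, benchmark):
    """Build one display row; ttft_median and request_latency_median are
    stored in seconds, so scale by 1000 to match the '(ms)' column labels."""
    return {
        "Test Configuration": entry_name,
        "TTFT (ms)": benchmark.ttft_median * 1000,
        "Latency (ms)": benchmark.request_latency_median * 1000,
    }

benchmark = SimpleNamespace(ttft_median=0.5, request_latency_median=1.5)
row = build_latency_row("cfgA", benchmark)
print(row)  # {'Test Configuration': 'cfgA', 'TTFT (ms)': 500.0, 'Latency (ms)': 1500.0}
```

Keeping the conversion next to the label makes a seconds/milliseconds mismatch hard to reintroduce.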
383-392: ⚠️ Potential issue | 🟠 Major
Don't draw a single line across different strategies.
Both plots collect multiple `benchmark.strategy` values per configuration, but each trace is built only per configuration. That stitches unrelated strategies into one synthetic curve and misstates the scaling trend. Split traces by (configuration, strategy) or fall back to markers-only.
📉 Minimal safe fix

```diff
- fig.update_traces(mode='lines+markers')
+ fig.update_traces(mode='markers')
```

```diff
-     mode='lines+markers',
+     mode='markers',
@@
-     mode='lines+markers',
+     mode='markers',
```

Also applies to: 517-546
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 383 - 392, The scatter currently stitches different benchmark.strategy values into one continuous line because traces are only grouped by 'Test Configuration'; update the px.scatter call in throughput_analysis.py (the fig = px.scatter(...) call and subsequent fig.update_traces(...)) to either (a) create separate traces per (configuration, strategy) by adding a grouping like line_group='benchmark.strategy' and/or symbol='benchmark.strategy' (or include benchmark.strategy in the color/symbol combination) so each (Test Configuration, benchmark.strategy) pair gets its own trace, or (b) if you prefer a minimal safe change, remove the line component and use markers-only by changing fig.update_traces(mode='markers') to avoid connecting unrelated strategies into one curve.
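Independent of the plotting library, the split-by-(configuration, strategy) option amounts to keying traces on the pair instead of on configuration alone; in plotly express terms this is roughly what passing `symbol='Strategy'` or a combined grouping key to `px.scatter` achieves. A library-free sketch with row dicts mirroring the columns named in the comment:

```python
from collections import defaultdict

def traces_by_config_and_strategy(rows):
    """Key each trace on (configuration, strategy) so a plotted line never
    joins points from different strategies. Rows are illustrative dicts."""
    traces = defaultdict(list)
    for row in rows:
        key = (row["Test Configuration"], row["Strategy"])
        traces[key].append((row["Concurrency"], row["Tokens/s"]))
    # Sort each trace along x so a connecting line is drawn left to right.
    return {key: sorted(points) for key, points in traces.items()}

rows = [
    {"Test Configuration": "cfgA", "Strategy": "constant", "Concurrency": 4, "Tokens/s": 900.0},
    {"Test Configuration": "cfgA", "Strategy": "poisson",  "Concurrency": 2, "Tokens/s": 500.0},
    {"Test Configuration": "cfgA", "Strategy": "constant", "Concurrency": 2, "Tokens/s": 600.0},
]
for key, points in traces_by_config_and_strategy(rows).items():
    print(key, points)
```

Grouping by configuration alone would have placed all three points on one curve, mixing the constant and poisson strategies.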
🧹 Nitpick comments (2)
projects/llm-d/toolbox/llmd.py (1)
55-57: Avoid a shared default `guidellm_args` list.
This default is allocated once and then reused across calls. If any downstream code mutates it, later invocations inherit stale args.
♻️ Suggested fix
```diff
 def run_guidellm_benchmark(
     self,
     endpoint_url,
     name="guidellm-benchmark",
     namespace="",
     image="ghcr.io/vllm-project/guidellm",
     version="pr-590",
     timeout=900,
-    guidellm_args=[],
+    guidellm_args=None,
 ):
@@
-    return RunAnsibleRole(locals())
+    if guidellm_args is None:
+        guidellm_args = []
+
+    return RunAnsibleRole(locals())
```
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/toolbox/llmd.py` around lines 55 - 57, The default parameter guidellm_args is a shared mutable list (guidellm_args=[]) which can be mutated across invocations; change the function/class signature to use guidellm_args=None and inside the function (where image, version, timeout are used) set guidellm_args = [] if guidellm_args is None so each call gets a fresh list; update any callers that rely on the previous default behavior only if they passed nothing (no other code changes needed).
projects/llm-d/testing/test_llmd.py (1)
831-859: Replace `eval()` with a constrained expression evaluator.
These placeholders only need arithmetic on `rate`, but `eval()` still executes arbitrary Python, and the fallback returns the raw `{...}` token on errors. That turns config typos into late, opaque CLI failures.
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 831 - 859, The function apply_rate_scaleup uses eval() inside evaluate_expression which allows arbitrary code execution and silently returns the raw "{...}" token on errors; replace eval() with a safe AST-based evaluator: in evaluate_expression, parse expression with ast.parse(expression, mode='eval'), walk/validate nodes to only allow Expr, BinOp, UnaryOp, Constant/Num, Name (only 'rate'), and allowed operators (+, -, *, /, //, %, **, <<, >>, |, &, ^), then evaluate the AST recursively using a small operator mapping and the context {"rate": rate}; on parse/validation/evaluation errors, log the error via logging.warning and raise a ValueError (do not return the original token) so config typos fail fast; update evaluate_expression and references in apply_rate_scaleup accordingly.
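A constrained AST-based evaluator along the lines the prompt describes might look like the sketch below. Only arithmetic over the single name `rate` is permitted; anything else (calls, attribute access, other names) raises `ValueError` instead of silently falling through:

```python
import ast
import operator

# Whitelisted operator nodes mapped to their implementations.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv, ast.Mod: operator.mod,
    ast.Pow: operator.pow,
    ast.USub: operator.neg, ast.UAdd: operator.pos,
}

def safe_eval(expression, rate):
    """Evaluate a small arithmetic expression over 'rate' without eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id == "rate":
            return rate
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"disallowed expression: {expression!r}")
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"invalid expression: {expression!r}") from exc
    return _eval(tree)

print(safe_eval("rate * 2 + 1", 4))  # 9
print(safe_eval("-rate / 2", 5))     # -2.5
```

Config typos now fail fast with a `ValueError` naming the bad expression, instead of leaking the raw token into the guidellm CLI invocation.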
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 894-895: The suffix logic is checking len(rate) which can be a
non-iterable and raise TypeError; change the condition to inspect the normalized
list (rate_values) instead of the raw config (rate) so the branch uses
len(rate_values) when deciding the suffix for rate_value (i.e., set suffix =
f"_rate{rate_value}" if len(rate_values) > 1 else None).
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 363-365: The code creates entry_name using
entry.get_name(variables) or an f-string with no placeholders; remove the no-op
f prefix to avoid Ruff F541 by changing the fallback from f"Unknown
Configuration" to a plain string "Unknown Configuration" where entry_name is
assigned inside the loop that iterates over entries (see variables,
entry.get_name, and entry_name).
- Around line 373-375: The efficiency ranking currently clamps
all_benchmarks[i].request_concurrency to 1 which incorrectly treats valid
fractional mean concurrencies as >=1; update the computation of
best_efficiency_idx to avoid clamping to 1 and instead only guard against zero
(e.g., divide by max(all_benchmarks[i].request_concurrency, tiny_eps) where
tiny_eps is a very small positive value) so that fractional concurrencies are
preserved; change the same pattern where used (the other occurrence around
best_tokens_idx/best_efficiency_idx).
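The zero-only guard described above matters precisely because GuideLLM mean concurrency can be fractional; a sketch with illustrative benchmark objects shows how clamping to 1 would distort the ranking:

```python
from types import SimpleNamespace

def most_efficient_index(benchmarks):
    """Rank runs by tokens/s per unit of mean concurrency. GuideLLM mean
    concurrency can be fractional (below 1), so it must not be clamped to 1;
    only zero is excluded, by scoring it -inf."""
    return max(
        range(len(benchmarks)),
        key=lambda i: (
            benchmarks[i].tokens_per_second / benchmarks[i].request_concurrency
            if benchmarks[i].request_concurrency > 0
            else float("-inf")
        ),
    )

benchmarks = [
    SimpleNamespace(tokens_per_second=300.0, request_concurrency=2.0),  # 150 tok/s per unit
    SimpleNamespace(tokens_per_second=100.0, request_concurrency=0.5),  # 200 tok/s per unit
    SimpleNamespace(tokens_per_second=999.0, request_concurrency=0.0),  # excluded
]
print(most_efficient_index(benchmarks))  # 1
```

With `max(request_concurrency, 1)` the same data would score run 1 as only 100 tok/s per unit and pick the zero-concurrency run 2 at 999, so neither the winner nor its score would be right.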
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 333-345: The loop over entry.results.guidellm_benchmarks assumes
that attribute exists and raises AttributeError for mixed entries; before
iterating, guard access by checking the attribute (e.g., if not
getattr(entry.results, "guidellm_benchmarks", None): continue) so only entries
with parsed GuideLLM results are processed; apply the same guarded pattern to
the second loop that checks benchmark.strategy (the similar block around where
benchmark.strategy is inspected) and keep using entry.get_name(variables) only
for entries that pass this guard.
- Around line 359-369: The TTFT values stored in parsers.py (ttft_median and
ttft_p95 are in seconds) are being labeled as milliseconds in
throughput_analysis.py; update the construction of the data dict in the function
that builds rows (the block that appends entries with keys 'TTFT (ms)' and later
the similar block around lines 607-617) to multiply benchmark.ttft_median and
benchmark.ttft_p95 by 1000 before assigning them to the 'TTFT (ms)' and
corresponding p95 fields so the displayed values are in milliseconds while
keeping the label unchanged.
- Around line 383-392: The scatter currently stitches different
benchmark.strategy values into one continuous line because traces are only
grouped by 'Test Configuration'; update the px.scatter call in
throughput_analysis.py (the fig = px.scatter(...) call and subsequent
fig.update_traces(...)) to either (a) create separate traces per (configuration,
strategy) by adding a grouping like line_group='benchmark.strategy' and/or
symbol='benchmark.strategy' (or include benchmark.strategy in the color/symbol
combination) so each (Test Configuration, benchmark.strategy) pair gets its own
trace, or (b) if you prefer a minimal safe change, remove the line component and
use markers-only by changing fig.update_traces(mode='markers') to avoid
connecting unrelated strategies into one curve.
---
Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 831-859: The function apply_rate_scaleup uses eval() inside
evaluate_expression which allows arbitrary code execution and silently returns
the raw "{...}" token on errors; replace eval() with a safe AST-based evaluator:
in evaluate_expression, parse expression with ast.parse(expression,
mode='eval'), walk/validate nodes to only allow Expr, BinOp, UnaryOp,
Constant/Num, Name (only 'rate'), and allowed operators (+, -, *, /, //, %, **,
<<, >>, |, &, ^), then evaluate the AST recursively using a small operator
mapping and the context {"rate": rate}; on parse/validation/evaluation errors,
log the error via logging.warning and raise a ValueError (do not return the
original token) so config typos fail fast; update evaluate_expression and
references in apply_rate_scaleup accordingly.
In `@projects/llm-d/toolbox/llmd.py`:
- Around line 55-57: The default parameter guidellm_args is a shared mutable
list (guidellm_args=[]) which can be mutated across invocations; change the
function/class signature to use guidellm_args=None and inside the function
(where image, version, timeout are used) set guidellm_args = [] if guidellm_args
is None so each call gets a fresh list; update any callers that rely on the
previous default behavior only if they passed nothing (no other code changes
needed).
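The `None`-default idiom requested for `guidellm_args` can be sketched in isolation; the function body below is illustrative, not the toolbox's real implementation:

```python
def run_guidellm_benchmark(endpoint_url, name="guidellm-benchmark", guidellm_args=None):
    """Use None as the default so each call gets a fresh list instead of
    sharing one list allocated when the function was defined."""
    if guidellm_args is None:
        guidellm_args = []
    guidellm_args.append(f"--target={endpoint_url}")  # illustrative mutation
    return guidellm_args

first = run_guidellm_benchmark("http://svc-a")
second = run_guidellm_benchmark("http://svc-b")
print(first)   # ['--target=http://svc-a']
print(second)  # ['--target=http://svc-b']
```

With `guidellm_args=[]` in the signature, `second` would also contain the first call's `--target` flag, since both calls would mutate the same list object.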
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f1ae502b-4e9a-4221-9c96-bc6ff6dd221c
📒 Files selected for processing (19)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (8)
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- docs/toolbox.generated/index.rst
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
✅ Files skipped from review due to trivial changes (1)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
|
🟢 Test of 'llm-d test test_ci' succeeded after 02 hours 51 minutes 54 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 18 minutes 48 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
🔴 Test of 'llm-d test test_ci' failed after 01 hours 19 minutes 09 seconds. 🔴 • Link to the test results. • No reports index generated... Test configuration: Failure indicator: Empty. (See run.log) |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval gpt-oss |
|
🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 41 minutes 30 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 22 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 37 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
Summary by CodeRabbit
New Features
Chores
Documentation