Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
📝 Walkthrough
The PR removes multi-turn benchmark support and the related Ansible role/templates, models, parsers, and visualizations, and refactors the GuideLLM benchmark flow to accept a unified `guidellm_args` list.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Test Orchestration
    participant Config as Test Config
    participant Toolbox as Llmd Toolbox
    participant Ansible as Ansible Role
    participant Guidellm as GuideLLM Job
    participant Storage as Artifact Storage

    Test->>Config: load benchmark preset
    Test->>Test: apply_rate_scaleup(data, max_requests, rate)
    loop per rate_value
        Test->>Test: build guidellm_args (CLI list)
        Test->>Toolbox: run_guidellm_benchmark(endpoint, guidellm_args, artifact_dir_suffix)
        Toolbox->>Ansible: RunAnsibleRole(locals())
        Ansible->>Guidellm: create K8s Job with container args from guidellm_args
        Guidellm->>Storage: emit benchmark logs/artifacts
        Ansible->>Storage: collect job manifests & logs
        Ansible-->>Toolbox: role returns status
        Toolbox-->>Test: return per-rate result
    end
    Test->>Storage: aggregate per-rate artifacts
```
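Read as code, the loop in the diagram corresponds to roughly the following sketch. This is illustrative only: `apply_rate_scaleup`, `run_guidellm_benchmark`, and the `artifact_dir_suffix` parameter are taken from the diagram and the review comments below, while the preset field names are assumed.

```python
# Hypothetical sketch of the per-rate orchestration described in the diagram.
for rate_value in rate_values:
    guidellm_args = [f"--rate={rate_value}"]          # CLI list forwarded to the job

    # Scale request counts with the rate, e.g. "{10*rate}" -> "320" for rate=32
    max_requests = apply_rate_scaleup(preset.get("max_requests"), rate_value)
    if max_requests:
        guidellm_args.append(f"--max-requests={max_requests}")

    # The toolbox call wraps RunAnsibleRole(locals()), which renders a K8s Job
    llmd.run_guidellm_benchmark(
        endpoint_url=endpoint,
        guidellm_args=guidellm_args,
        artifact_dir_suffix=f"_rate{rate_value}",     # keep per-rate artifacts apart
    )
```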
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
projects/llm-d/testing/config.yaml (1)
198-206: ⚠️ Potential issue | 🟡 Minor
Duplicate `--enable-prefix-caching` flag.
Line 200 already specifies `--enable-prefix-caching`, and it is duplicated at line 206. Remove the duplicate entry.
🐛 Proposed fix
- "--gpu-memory-utilization=0.9" - - "--enable-prefix-caching"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` around lines 198 - 206, The vllm_args list contains a duplicate flag "--enable-prefix-caching" (appears in the vllm_args array twice); remove the redundant entry so the vllm_args array only includes a single "--enable-prefix-caching" element (locate the vllm_args block in projects/llm-d/testing/config.yaml and delete the duplicate instance).
🧹 Nitpick comments (4)
projects/llm-d/toolbox/llmd.py (1)
56-57: Consider using `None` as default instead of mutable `[]`.
Using a mutable default argument (`[]`) is a Python anti-pattern (Ruff B006). While the current usage with `RunAnsibleRole(locals())` likely doesn't mutate the list, using `None` with an in-function default is safer and more idiomatic.
♻️ Proposed fix
```diff
 @AnsibleRole("llmd_run_guidellm_benchmark")
 @AnsibleMappedParams
 def run_guidellm_benchmark(
     self,
     endpoint_url,
     name="guidellm-benchmark",
     namespace="",
     image="ghcr.io/vllm-project/guidellm",
     version="pr-590",
     timeout=900,
-    guidellm_args=[],
+    guidellm_args=None,
 ):
     """
     ...
     guidellm_args: List of additional guidellm arguments (e.g., ["--rate=10", "--max-seconds=30"])
     """
+    if guidellm_args is None:
+        guidellm_args = []
+
     return RunAnsibleRole(locals())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/toolbox/llmd.py` around lines 56 - 57, The default argument guidellm_args is a mutable list; change its function or call-site default from [] to None and inside the function (the place that constructs or uses RunAnsibleRole(locals())) set guidellm_args = [] if guidellm_args is None before use; update any references to guidellm_args (e.g., in the RunAnsibleRole(locals()) call) so the code passes the canonical list and avoids the mutable-default anti-pattern.

projects/llm-d/testing/test_llmd.py (1)
847-856: Security: `eval()` usage with restricted context.
While the `eval()` is restricted to only the `rate` variable with no builtins, this pattern can be fragile. Consider a safer approach using a simple expression parser for the limited arithmetic needed (`{N*rate}`).
♻️ Safer alternative using regex-based arithmetic
```diff
 def apply_rate_scaleup(value, rate):
     """
     Apply rate-based scaling to configuration values.

     Evaluates expressions like:
     - "{10*rate}" with rate=32 -> "320"
     - "prefix_count={2*rate}" with rate=32 -> "prefix_count=64"
     """
     if not isinstance(value, str):
         return value

     import re

     # Find all expressions in curly braces
     pattern = r'\{([^}]+)\}'

     def evaluate_expression(match):
         expression = match.group(1)
         try:
-            # Create a safe evaluation context with only 'rate' variable
-            context = {"rate": rate}
-            result = eval(expression, {"__builtins__": {}}, context)
-            return str(result)
+            # Parse simple arithmetic expressions like "10*rate" or "rate*2"
+            mult_match = re.match(r'^\s*(\d+)\s*\*\s*rate\s*$', expression)
+            if mult_match:
+                return str(int(mult_match.group(1)) * rate)
+            mult_match = re.match(r'^\s*rate\s*\*\s*(\d+)\s*$', expression)
+            if mult_match:
+                return str(rate * int(mult_match.group(1)))
+            # Fallback for unsupported expressions
+            logging.warning(f"Unsupported expression '{expression}', returning as-is")
+            return match.group(0)
         except Exception as e:
             logging.warning(f"Failed to evaluate expression '{expression}' with rate={rate}: {e}")
             return match.group(0)  # Return original if evaluation fails

     # Replace all expressions with their evaluated results
     return re.sub(pattern, evaluate_expression, value)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 847 - 856, The evaluate_expression function currently uses eval(expression, {"__builtins__": {}}, {"rate": rate}) which is fragile; replace it with a safe evaluator that parses the expression with ast.parse and only permits a whitelist of nodes (Module/Expr/BinOp/UnaryOp/Constant/Name) and allowed operators (+, -, *, /, //, %, **, unary +/-, parentheses), only allows the Name "rate" and numeric constants, and then evaluates the AST deterministically (e.g., using a small recursive evaluator function) returning the computed string; keep the existing exception handling/logging behavior and fall back to returning match.group(0) on any validation or evaluation error. Reference: evaluate_expression function and its use of rate.

projects/llm-d/visualizations/llmd_inference/plotting/report.py (1)
303-307: Consider prefixing unused unpacked variables with an underscore.
The `ordered_vars` and `cfg` variables are unpacked but never used in this method. Prefixing them with underscores would make the intent clearer and silence linter warnings.
♻️ Proposed fix
```diff
 def do_plot(self, *args):
     """
     Generate focused report showing token throughput and TTFT performance analysis
     """
-    ordered_vars, settings, setting_lists, variables, cfg = args
+    _ordered_vars, settings, setting_lists, variables, _cfg = args

     entries = list(common.Matrix.all_records(settings, setting_lists))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 303 - 307, The do_plot method unpacks ordered_vars and cfg but never uses them; to indicate they are intentionally unused and silence linters, rename those two unpacked variables to _ordered_vars and _cfg (i.e., change the unpacking in def do_plot(self, *args): ordered_vars, settings, setting_lists, variables, cfg = args to use _ordered_vars and _cfg), leaving the remaining variable names and logic unchanged.

projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (1)
517-546: Unused loop variable `i`.
The `enumerate()` is unnecessary since `i` is not used in the loop body. Iterate directly over `configurations`.
♻️ Proposed fix
```diff
-for i, config_name in enumerate(configurations):
+for config_name in configurations:
     config_df = df[df['Test Configuration'] == config_name].sort_values('Concurrency')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 517 - 546, The loop uses an unused enumerate variable i; change the loop header from "for i, config_name in enumerate(configurations):" to iterate directly over configurations (e.g., "for config_name in configurations:") so only the needed variable config_name is used when building traces (see usages in the fig.add_trace blocks referencing config_df, color_map and f'{config["name"]} P50/P95 (ms)').
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 47-50: Fix the typo in the inline comment for the guidellm
multiturn config: change "supporte" to "supported" in the comment adjacent to
the tests.llmd.benchmarks.guidellm.rate entry so it reads "multi-rate not
supported by guidellm-multiturn".
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 825-829: The rate parsing only handles list/tuple so
comma-separated strings like "1,10,50" become a single element; update the logic
in the block that sets rate_values (used by apply_rate_scaleup) to also detect
str and split on commas (and strip whitespace) to produce a list of individual
rate strings/numbers, and convert elements to the expected numeric type before
downstream arithmetic in apply_rate_scaleup().
- Line 881: The CLI is passing rate_value as a string which breaks numeric
expressions in apply_rate_scaleup; convert rate_value to a numeric type before
adding it to guidellm_args (or ensure apply_rate_scaleup coerces rate to a
number). Locate where guidellm_args is built and change the append of
f"--rate={rate_value}" to supply a numeric conversion (e.g., float(rate_value)
or int where appropriate) so that apply_rate_scaleup and its eval context
receive a numeric rate; alternatively, update apply_rate_scaleup to explicitly
cast the context["rate"] to a number before evaluating expressions.
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 363-369: The code builds entry_name using
entry.get_name(variables) or an f-string literal f"Unknown Configuration";
remove the unnecessary f prefix so the fallback is just "Unknown Configuration"
to avoid misleading usage of an f-string in the block handling entries ->
entry.results.guidellm_benchmarks and appending to all_benchmarks and
benchmark_configs.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 378-381: The current color assignment uses
px.colors.qualitative.Set1 which has only 9 colors and can raise a KeyError when
len(configurations) > 9; fix by cycling the palette when assigning colors to
each entry in configurations (use itertools.cycle or modulo indexing) so
color_map = {config: next(color_cycle) for config in configurations} (or
color_map = {config: colors[i % len(colors)] for i, config in
enumerate(configurations)}) and ensure you import itertools if using cycle;
reference the variables configurations, colors, color_map and the palette
px.colors.qualitative.Set1.
- Around line 512-515: The current color selection uses
px.colors.qualitative.Set1 sliced to the number of configurations, which fails
when there are >9 configurations; update the logic that builds colors and
color_map (variables configurations, colors, color_map) to cycle/repeat the base
qualitative palette when len(configurations) exceeds the palette length (e.g.,
use an iterator or repeat/extend the palette until it covers
len(configurations)), so each configuration gets a deterministic color even for
>9 entries while preserving existing colors for <=9.
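A minimal sketch of the palette-cycling fix both throughput_analysis.py items above describe, using the modulo-indexing variant (illustrative configuration names; assumes plotly is installed):

```python
import plotly.express as px

configurations = [f"config-{n}" for n in range(12)]   # more entries than Set1's 9 colors
colors = px.colors.qualitative.Set1

# Wrap around the palette instead of failing once the 9 colors run out
color_map = {config: colors[i % len(colors)] for i, config in enumerate(configurations)}
```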
---
Outside diff comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 198-206: The vllm_args list contains a duplicate flag
"--enable-prefix-caching" (appears in the vllm_args array twice); remove the
redundant entry so the vllm_args array only includes a single
"--enable-prefix-caching" element (locate the vllm_args block in
projects/llm-d/testing/config.yaml and delete the duplicate instance).
---
Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 847-856: The evaluate_expression function currently uses
eval(expression, {"__builtins__": {}}, {"rate": rate}) which is fragile; replace
it with a safe evaluator that parses the expression with ast.parse and only
permits a whitelist of nodes (Module/Expr/BinOp/UnaryOp/Constant/Name) and
allowed operators (+, -, *, /, //, %, **, unary +/-, parentheses), only allows
the Name "rate" and numeric constants, and then evaluates the AST
deterministically (e.g., using a small recursive evaluator function) returning
the computed string; keep the existing exception handling/logging behavior and
fall back to returning match.group(0) on any validation or evaluation error.
Reference: evaluate_expression function and its use of rate.
In `@projects/llm-d/toolbox/llmd.py`:
- Around line 56-57: The default argument guidellm_args is a mutable list;
change its function or call-site default from [] to None and inside the function
(the place that constructs or uses RunAnsibleRole(locals())) set guidellm_args =
[] if guidellm_args is None before use; update any references to guidellm_args
(e.g., in the RunAnsibleRole(locals()) call) so the code passes the canonical
list and avoids the mutable-default anti-pattern.
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 303-307: The do_plot method unpacks ordered_vars and cfg but never
uses them; to indicate they are intentionally unused and silence linters, rename
those two unpacked variables to _ordered_vars and _cfg (i.e., change the
unpacking in def do_plot(self, *args): ordered_vars, settings, setting_lists,
variables, cfg = args to use _ordered_vars and _cfg), leaving the remaining
variable names and logic unchanged.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 517-546: The loop uses an unused enumerate variable i; change the
loop header from "for i, config_name in enumerate(configurations):" to iterate
directly over configurations (e.g., "for config_name in configurations:") so
only the needed variable config_name is used when building traces (see usages in
the fig.add_trace blocks referencing config_df, color_map and f'{config["name"]}
P50/P95 (ms)').
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c2983814-6c88-43cd-b600-2a3a76db5a5c
📒 Files selected for processing (19)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (8)
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- docs/toolbox.generated/index.rst
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
Diff context (projects/llm-d/testing/test_llmd.py):

```python
failed = False
# Add rate parameter
guidellm_args.append(f"--rate={rate_value}")
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify how rate values are used in the codebase
ast-grep --pattern 'apply_rate_scaleup($_, $_)'
```
Repository: openshift-psap/topsail
Length of output: 335
🏁 Script executed:
```bash
# First, let's look at the context around line 881 to understand how rate_value is assigned
sed -n '860,900p' projects/llm-d/testing/test_llmd.py | cat -n
```
Repository: openshift-psap/topsail
Length of output: 1807
🏁 Script executed:
```bash
# Find the apply_rate_scaleup function definition
rg 'def apply_rate_scaleup' -A 5
```
Repository: openshift-psap/topsail
Length of output: 473
🏁 Script executed:
```bash
# Look for where rate_values is initially parsed/assigned
rg 'rate_values.*=' projects/llm-d/testing/test_llmd.py | head -20
```
Repository: openshift-psap/topsail
Length of output: 121
🏁 Script executed:
```bash
# Get the full apply_rate_scaleup function definition
rg 'def apply_rate_scaleup' -A 20 projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 717
🏁 Script executed:
```bash
# Find where rate values are parsed/loaded (search for rate parsing in run_guidellm_benchmark)
sed -n '800,880p' projects/llm-d/testing/test_llmd.py | cat -n
```
Repository: openshift-psap/topsail
Length of output: 3597
🏁 Script executed:
```bash
# Check the function signature to see how the rate parameter comes in
rg 'def run_guidellm_benchmark' -A 3 projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 180
🏁 Script executed:
```bash
# Check if rate is converted to a numeric type anywhere before use
sed -n '815,865p' projects/llm-d/testing/test_llmd.py | cat -n
```
Repository: openshift-psap/topsail
Length of output: 2500
🏁 Script executed:
```bash
# Look for any type conversion or int/float casting of rate
rg 'int\(rate|float\(rate' projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 48
🏁 Script executed:
```bash
# Check if there are any config conversions that might handle type casting
rg 'rate.*int|rate.*float' projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 175
Rate values must be numeric when used in apply_rate_scaleup expressions.
The `--rate={rate_value}` CLI argument on line 881 works with string or numeric types. However, `apply_rate_scaleup` (used at lines 888 and 892) evaluates expressions like `{10*rate}` using `eval()` with the rate in the context. If `rate` is a string from the YAML configuration, these expressions silently fail: the evaluation catches the `TypeError` and returns the original `{expression}` unchanged, preventing the scaling from occurring.
Ensure rate values are converted to a numeric type before being passed to `apply_rate_scaleup`, or add explicit type conversion within the function itself.
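A minimal sketch of the coercion this finding asks for, assuming `rate_value` arrives from the YAML config as either a number or a numeric string (names illustrative, not the project's code):

```python
guidellm_args = []

def coerce_rate(rate_value):
    """Return the rate as int when possible, else float, so expressions
    like {10*rate} do arithmetic instead of string repetition."""
    if isinstance(rate_value, str):
        return float(rate_value) if "." in rate_value else int(rate_value)
    return rate_value

rate = coerce_rate("32")                # -> 32 (int)
guidellm_args.append(f"--rate={rate}")  # rendered CLI flag is unchanged
```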
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/testing/test_llmd.py` at line 881, The CLI is passing
rate_value as a string which breaks numeric expressions in apply_rate_scaleup;
convert rate_value to a numeric type before adding it to guidellm_args (or
ensure apply_rate_scaleup coerces rate to a number). Locate where guidellm_args
is built and change the append of f"--rate={rate_value}" to supply a numeric
conversion (e.g., float(rate_value) or int where appropriate) so that
apply_rate_scaleup and its eval context receive a numeric rate; alternatively,
update apply_rate_scaleup to explicitly cast the context["rate"] to a number
before evaluating expressions.
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Force-pushed from 3c586e1 to 8487263 (Compare)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 1
♻️ Duplicate comments (1)
projects/llm-d/testing/config.yaml (1)
49: ⚠️ Potential issue | 🟡 Minor
Fix the typo in the inline comment.
Line 49 still says `supporte`.
✏️ Proposed fix
```diff
-  tests.llmd.benchmarks.guidellm.rate: [32, 64, 128, 256, 512] # keep as a list, multi-rate not supporte by guidellm-multiturn
+  tests.llmd.benchmarks.guidellm.rate: [32, 64, 128, 256, 512] # keep as a list, multi-rate not supported by guidellm-multiturn
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` at line 49, The inline comment on the config key tests.llmd.benchmarks.guidellm.rate contains a typo ("supporte"); update that comment to read "supported" so it becomes "...keep as a list, multi-rate not supported by guidellm-multiturn" (edit the comment on the same line where tests.llmd.benchmarks.guidellm.rate is defined).
🧹 Nitpick comments (1)
projects/llm-d/testing/config.yaml (1)
198-206: Remove the duplicated `--enable-prefix-caching` flag.
The same vLLM flag appears on line 200 and line 206. If `vllm_args` are forwarded verbatim, the rendered command line carries a duplicate boolean for no benefit and becomes harder to audit. Please verify the launcher is not relying on upstream deduplication here.
✂️ Proposed cleanup
```diff
 vllm_args:
 - "--disable-uvicorn-access-log"
 - "--enable-prefix-caching"
 - "--uvicorn-log-level=debug"
 - "--trust-remote-code"
 - "--disable-log-requests"
 - "--max-model-len=40960"
 - "--gpu-memory-utilization=0.9"
-- "--enable-prefix-caching"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` around lines 198 - 206, The vllm_args array contains a duplicated flag "--enable-prefix-caching"; remove the redundant entry from the vllm_args list so each flag appears only once and update any code that constructs or validates this list (look for references to the vllm_args symbol or the config loader that reads it) to ensure flags are not intentionally duplicated; after removing the duplicate, verify the launcher or command-construction path that forwards vllm_args does not rely on downstream deduplication by ensuring it passes the cleaned list verbatim (or adds explicit deduplication if needed).
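For a quick audit of lists like this one, a throwaway check along these lines flags repeated flags (stand-in data; the real list lives in config.yaml):

```python
from collections import Counter

# Stand-in for the parsed vllm_args list from config.yaml
vllm_args = [
    "--enable-prefix-caching",
    "--gpu-memory-utilization=0.9",
    "--enable-prefix-caching",
]

dupes = [flag for flag, count in Counter(vllm_args).items() if count > 1]
print(dupes)  # ['--enable-prefix-caching']
```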
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 224-227: The base preset in projects/llm-d/testing/config.yaml
leaves both max_seconds and max_requests null, causing runs without a preset to
rely on guidellm's internal defaults (see guidellm invocation in test_llmd.py
around the runner that would pass --max-seconds/--max-requests); update the base
preset to include an explicit stop bound (e.g., set max_seconds: 900 or a
sensible max_requests) so the runner always has an explicit termination limit,
or alternatively add a comment in the base config documenting the exact guidellm
fallback behavior and why it is acceptable.
---
Duplicate comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 49: The inline comment on the config key
tests.llmd.benchmarks.guidellm.rate contains a typo ("supporte"); update that
comment to read "supported" so it becomes "...keep as a list, multi-rate not
supported by guidellm-multiturn" (edit the comment on the same line where
tests.llmd.benchmarks.guidellm.rate is defined).
---
Nitpick comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 198-206: The vllm_args array contains a duplicated flag
"--enable-prefix-caching"; remove the redundant entry from the vllm_args list so
each flag appears only once and update any code that constructs or validates
this list (look for references to the vllm_args symbol or the config loader that
reads it) to ensure flags are not intentionally duplicated; after removing the
duplicate, verify the launcher or command-construction path that forwards
vllm_args does not rely on downstream deduplication by ensuring it passes the
cleaned list verbatim (or adds explicit deduplication if needed).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bae76231-62b7-474a-82ba-dcae89cb6c21
📒 Files selected for processing (1)
projects/llm-d/testing/config.yaml
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 16 minutes 39 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
Force-pushed from 8487263 to c9fd4ea (Compare)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 2
🧹 Nitpick comments (5)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (2)
518: Unused loop variable `i`.
The loop index `i` is not used within the loop body. Rename it to `_i` or use direct iteration.
♻️ Proposed fix
```diff
-for i, config_name in enumerate(configurations):
+for config_name in configurations:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 518, The loop uses an unused index variable `i` in the statement "for i, config_name in enumerate(configurations):"; replace it with a direct iteration over `configurations` (i.e., "for config_name in configurations:") or rename `i` to `_i` to indicate it's unused so linters won't flag it; update the loop in throughput_analysis.py where `configurations` is iterated (the line currently using enumerate) accordingly.
401: Consider defensive handling for zero concurrency.
If `request_concurrency` is zero for any benchmark, the efficiency calculation will produce infinity, potentially selecting that point as "most efficient."
♻️ Proposed defensive fix
```diff
-best_efficiency = df.loc[(df['Tokens/s'] / df['Concurrency']).idxmax()]
+efficiency_df = df[df['Concurrency'] > 0].copy()
+if not efficiency_df.empty:
+    best_efficiency = efficiency_df.loc[(efficiency_df['Tokens/s'] / efficiency_df['Concurrency']).idxmax()]
+else:
+    best_efficiency = best_tokens  # Fallback to best throughput
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 401, The current best_efficiency selection divides df['Tokens/s'] by df['Concurrency'] without guarding against zero concurrency; update the logic in the best_efficiency selection to first filter out or ignore rows where df['Concurrency'] <= 0 (e.g., df_nonzero = df[df['Concurrency'] > 0]) and compute idxmax on df_nonzero['Tokens/s'] / df_nonzero['Concurrency']; also handle the edge case where df_nonzero is empty by providing a sensible fallback (skip selection or choose the row with max 'Tokens/s' using df['Tokens/s'].idxmax()) so the code in best_efficiency never picks an infinite value.

projects/llm-d/testing/test_llmd.py (2)
847-856: Don't execute YAML expressions with `eval()`.
`max_requests` and `data` come from configuration, so this turns config data into executable Python. The supported syntax here is just arithmetic on `rate`; a tiny AST whitelist or dedicated parser would keep the feature without widening the runner's execution surface. I'd also fail fast on invalid expressions instead of forwarding the raw `{...}` placeholder.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 847 - 856, The evaluate_expression function currently uses eval(expression, ...) which executes arbitrary config-supplied code; replace it with a safe evaluator that parses the expression with ast.parse and only allows a whitelist of AST node types (e.g., Expression, BinOp, UnaryOp, Num/Constant, Name with only the 'rate' identifier, operators like Add/Sub/Mul/Div, Pow, Mod) and then evaluate the AST manually or via a visitor that computes the result using the provided rate; update evaluate_expression to raise a ValueError (fail fast) when the expression contains any disallowed nodes or invalid syntax instead of returning the raw placeholder, and ensure the surrounding code catches/logs that exception with context about the offending expression and rate.
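A compact version of the AST-whitelist evaluator this prompt describes could look like the sketch below. It is one possible implementation under the stated constraints, not the project's code:

```python
import ast
import operator

# Whitelisted operators for arithmetic over numeric literals and 'rate'
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod, ast.Pow: operator.pow,
    ast.UAdd: operator.pos, ast.USub: operator.neg,
}

def safe_eval_rate(expression, rate):
    """Evaluate an arithmetic expression that may only use numbers and 'rate'."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id == "rate":
            return rate
        raise ValueError(f"disallowed expression: {expression!r}")  # fail fast
    return walk(ast.parse(expression, mode="eval"))

print(safe_eval_rate("10*rate", 32))  # 320
# safe_eval_rate("__import__('os')", 32) raises ValueError instead of executing
```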
816-818: Verify the GuideLLM profile flag name.
GuideLLM's current README uses `--profile`, and the v0.4.0 release notes describe `--rate-type` as renamed. If the benchmark image no longer keeps a compatibility alias, emitting `--rate-type=...` here will fail at runtime; please verify that first or switch the generated arg to `--profile`. (github.com)
🔧 Suggested update
```diff
-rate_type = config.project.get_config("tests.llmd.benchmarks.guidellm.rate_type")
+profile = config.project.get_config("tests.llmd.benchmarks.guidellm.rate_type")
@@
-if rate_type:
-    guidellm_args.append(f"--rate-type={rate_type}")
+if profile:
+    guidellm_args.append(f"--profile={profile}")
```
Also applies to: 877-878
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 816 - 818, The code retrieves benchmark flags into backend_type and rate_type and then generates CLI args for GuideLLM; verify whether GuideLLM expects --profile (per README) instead of --rate-type (renamed in v0.4.0) and update the argument generation that uses rate_type to emit --profile=... if the image no longer provides a compatibility alias; adjust both occurrences (the current uses around variables backend_type, rate_type, max_seconds and the duplicate at the later occurrence) so the produced CLI flag matches the actual GuideLLM version or make the flag selection conditional based on a verified compatibility check.projects/llm-d/toolbox/llmd.py (1)
56-57: Avoid a shared default list for `guidellm_args`.
`guidellm_args=[]` is created once at function definition time, so any future mutation will leak across calls. Default to `None` and normalize it locally before `RunAnsibleRole(locals())`.
♻️ Suggested fix
```diff
 def run_guidellm_benchmark(
     self,
     endpoint_url,
     name="guidellm-benchmark",
     namespace="",
     image="ghcr.io/vllm-project/guidellm",
     version="pr-590",
     timeout=900,
-    guidellm_args=[],
+    guidellm_args=None,
 ):
     """
     Runs a Guidellm benchmark job against the LLM inference service
@@
     guidellm_args: List of additional guidellm arguments (e.g., ["--rate=10", "--max-seconds=30"])
     """
+    if guidellm_args is None:
+        guidellm_args = []
+
     return RunAnsibleRole(locals())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/toolbox/llmd.py` around lines 56 - 57, The function currently uses a mutable default guidellm_args=[] which can leak state across calls; change the function signature to default guidellm_args to None and inside the function normalize it (e.g., if guidellm_args is None: guidellm_args = []) before calling RunAnsibleRole(locals()); update any references to guidellm_args in that function to use the normalized local variable so RunAnsibleRole receives a fresh list each invocation.
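As a standalone illustration of the pitfall (hypothetical function, not the toolbox code), the shared default only shows up once the list is mutated:

```python
def collect(item, acc=[]):     # acc is created once, at function definition time
    acc.append(item)
    return acc

print(collect(1))  # [1]
print(collect(2))  # [1, 2]  <- state leaked in from the first call

def collect_safe(item, acc=None):
    if acc is None:            # fresh list on every call
        acc = []
    acc.append(item)
    return acc

print(collect_safe(1))  # [1]
print(collect_safe(2))  # [2]
```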
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 608-618: The TTFT analysis is using unit_conversion=1 while
ttft_median and ttft_p95 are parsed in seconds; update GuidellmTTFTAnalysis
(class name) to use unit_conversion=1000 so those p50_field 'ttft_median' and
p95_field 'ttft_p95' values are converted from seconds back to milliseconds for
display, and adjust any unit label/note_text if present to remain consistent
with "(ms)".
In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py`:
- Around line 245-252: Remove the unnecessary f-string prefixes on the two
logging.debug calls that contain no placeholders: change the lines that log "
Raw table rows:" and " Problem row data:" so they use plain strings instead of
f-strings; you can find these in the parser where summary_row, latency_rows, and
throughput_rows are logged (the logging.debug lines referencing "Raw table
rows:" and "Problem row data:") and update them to logging.debug(" Raw table
rows:") and logging.debug(" Problem row data:").
---
Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 847-856: The evaluate_expression function currently uses
eval(expression, ...) which executes arbitrary config-supplied code; replace it
with a safe evaluator that parses the expression with ast.parse and only allows
a whitelist of AST node types (e.g., Expression, BinOp, UnaryOp, Num/Constant,
Name with only the 'rate' identifier, operators like Add/Sub/Mul/Div, Pow, Mod)
and then evaluate the AST manually or via a visitor that computes the result
using the provided rate; update evaluate_expression to raise a ValueError (fail
fast) when the expression contains any disallowed nodes or invalid syntax
instead of returning the raw placeholder, and ensure the surrounding code
catches/logs that exception with context about the offending expression and
rate.
- Around line 816-818: The code retrieves benchmark flags into backend_type and
rate_type and then generates CLI args for GuideLLM; verify whether GuideLLM
expects --profile (per README) instead of --rate-type (renamed in v0.4.0) and
update the argument generation that uses rate_type to emit --profile=... if the
image no longer provides a compatibility alias; adjust both occurrences (the
current uses around variables backend_type, rate_type, max_seconds and the
duplicate at the later occurrence) so the produced CLI flag matches the actual
GuideLLM version or make the flag selection conditional based on a verified
compatibility check.
In `@projects/llm-d/toolbox/llmd.py`:
- Around line 56-57: The function currently uses a mutable default
guidellm_args=[] which can leak state across calls; change the function
signature to default guidellm_args to None and inside the function normalize it
(e.g., if guidellm_args is None: guidellm_args = []) before calling
RunAnsibleRole(locals()); update any references to guidellm_args in that
function to use the normalized local variable so RunAnsibleRole receives a fresh
list each invocation.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Line 518: The loop uses an unused index variable `i` in the statement "for i,
config_name in enumerate(configurations):"; replace it with a direct iteration
over `configurations` (i.e., "for config_name in configurations:") or rename `i`
to `_i` to indicate it's unused so linters won't flag it; update the loop in
throughput_analysis.py where `configurations` is iterated (the line currently
using enumerate) accordingly.
- Line 401: The current best_efficiency selection divides df['Tokens/s'] by
df['Concurrency'] without guarding against zero concurrency; update the logic in
the best_efficiency selection to first filter out or ignore rows where
df['Concurrency'] <= 0 (e.g., df_nonzero = df[df['Concurrency'] > 0]) and
compute idxmax on df_nonzero['Tokens/s'] / df_nonzero['Concurrency']; also
handle the edge case where df_nonzero is empty by providing a sensible fallback
(skip selection or choose the row with max 'Tokens/s' using
df['Tokens/s'].idxmax()) so the code in best_efficiency never picks an infinite
value.
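The zero-concurrency failure mode from the item above is easy to reproduce in isolation (illustrative data; pandas turns the division by zero into `inf`, which `idxmax` then selects):

```python
import pandas as pd

df = pd.DataFrame({"Tokens/s": [100.0, 50.0], "Concurrency": [4, 0]})

efficiency = df["Tokens/s"] / df["Concurrency"]   # second row becomes inf
print(df.loc[efficiency.idxmax()])                # picks the Concurrency=0 row
```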
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d2f0fc95-a80a-48d3-bcba-09d4f32e9851
📒 Files selected for processing (20)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/llmisvcs/llama-3-1-8b-instruct-fp8.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (9)
- projects/llm-d/testing/llmisvcs/llama-3-1-8b-instruct-fp8.yaml
- docs/toolbox.generated/index.rst
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
🚧 Files skipped from review as they are similar to previous changes (5)
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 23 minutes 13 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss |
♻️ Duplicate comments (2)
projects/llm-d/testing/config.yaml (2)
228-229: ⚠️ Potential issue | 🔴 Critical
Give the base GuideLLM config an explicit stop condition.
Lines 228-229 are still both `null`, so any run that doesn't layer a benchmark-specific preset relies on GuideLLM defaults or the 900s job timeout to terminate. Please set one bound here or document the fallback explicitly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` around lines 228 - 229, The GuideLLM base config currently has both max_seconds and max_requests set to null so runs can hang until an external timeout; set an explicit stop condition by assigning a sensible non-null default (e.g., max_seconds: 600 or max_requests: 10) in the base config, or add a clear comment in the GuideLLM config stating which external fallback (e.g., 900s job timeout) will be relied on; update the entries for max_seconds and/or max_requests so the base GuideLLM preset always enforces a termination bound.
49: ⚠️ Potential issue | 🟡 Minor
Fix the inline comment typo.
Line 49 still says `supporte`; please change it to `supported`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` at line 49, Update the inline comment for the config key tests.llmd.benchmarks.guidellm.rate by correcting the typo "supporte" to "supported" so the comment reads "...multi-rate not supported by guidellm-multiturn"; the change should be made on the same line containing the tests.llmd.benchmarks.guidellm.rate list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 228-229: The GuideLLM base config currently has both max_seconds
and max_requests set to null so runs can hang until an external timeout; set an
explicit stop condition by assigning a sensible non-null default (e.g.,
max_seconds: 600 or max_requests: 10) in the base config, or add a clear comment
in the GuideLLM config stating which external fallback (e.g., 900s job timeout)
will be relied on; update the entries for max_seconds and/or max_requests so the
base GuideLLM preset always enforces a termination bound.
- Line 49: Update the inline comment for the config key
tests.llmd.benchmarks.guidellm.rate by correcting the typo "supporte" to
"supported" so the comment reads "...multi-rate not supported by
guidellm-multiturn"; the change should be made on the same line containing the
tests.llmd.benchmarks.guidellm.rate list.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 41f1a6f1-aeda-44b3-bd91-3accf9795725
📒 Files selected for processing (1)
projects/llm-d/testing/config.yaml
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 20 minutes 36 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 12 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 22 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
Summary by CodeRabbit
Refactor
New Features
Documentation
Chores