Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Details: needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
📝 Walkthrough
The PR removes multi-turn benchmark support and the related Ansible role/templates, models, parsers, and visualizations, and refactors the GuideLLM benchmark flow to accept a unified `guidellm_args` list.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Test Orchestration
    participant Config as Test Config
    participant Toolbox as Llmd Toolbox
    participant Ansible as Ansible Role
    participant Guidellm as GuideLLM Job
    participant Storage as Artifact Storage

    Test->>Config: load benchmark preset
    Test->>Test: apply_rate_scaleup(data, max_requests, rate)
    loop per rate_value
        Test->>Test: build guidellm_args (CLI list)
        Test->>Toolbox: run_guidellm_benchmark(endpoint, guidellm_args, artifact_dir_suffix)
        Toolbox->>Ansible: RunAnsibleRole(locals())
        Ansible->>Guidellm: create K8s Job with container args from guidellm_args
        Guidellm->>Storage: emit benchmark logs/artifacts
        Ansible->>Storage: collect job manifests & logs
        Ansible-->>Toolbox: role returns status
        Toolbox-->>Test: return per-rate result
    end
    Test->>Storage: aggregate per-rate artifacts
```
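Read as code, the loop in the diagram corresponds to roughly the following sketch. This is illustrative only: `apply_rate_scaleup`, `run_guidellm_benchmark`, and the `artifact_dir_suffix` parameter are taken from the diagram and the review comments below, while the preset field names are assumed.

```python
# Hypothetical sketch of the per-rate orchestration described in the diagram.
for rate_value in rate_values:
    guidellm_args = [f"--rate={rate_value}"]          # CLI list forwarded to the job

    # Scale request counts with the rate, e.g. "{10*rate}" -> "320" for rate=32
    max_requests = apply_rate_scaleup(preset.get("max_requests"), rate_value)
    if max_requests:
        guidellm_args.append(f"--max-requests={max_requests}")

    # The toolbox call wraps RunAnsibleRole(locals()), which renders a K8s Job
    llmd.run_guidellm_benchmark(
        endpoint_url=endpoint,
        guidellm_args=guidellm_args,
        artifact_dir_suffix=f"_rate{rate_value}",     # keep per-rate artifacts apart
    )
```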
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
projects/llm-d/testing/config.yaml (1)
198-206: ⚠️ Potential issue | 🟡 Minor
Duplicate `--enable-prefix-caching` flag.
Line 200 already specifies `--enable-prefix-caching`, and it is duplicated at line 206. Remove the duplicate entry.
🐛 Proposed fix
- "--gpu-memory-utilization=0.9" - - "--enable-prefix-caching"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` around lines 198 - 206, The vllm_args list contains a duplicate flag "--enable-prefix-caching" (appears in the vllm_args array twice); remove the redundant entry so the vllm_args array only includes a single "--enable-prefix-caching" element (locate the vllm_args block in projects/llm-d/testing/config.yaml and delete the duplicate instance).
🧹 Nitpick comments (4)
projects/llm-d/toolbox/llmd.py (1)
56-57: Consider using `None` as default instead of mutable `[]`.
Using a mutable default argument (`[]`) is a Python anti-pattern (Ruff B006). While the current usage with `RunAnsibleRole(locals())` likely doesn't mutate the list, using `None` with an in-function default is safer and more idiomatic.
♻️ Proposed fix
```diff
 @AnsibleRole("llmd_run_guidellm_benchmark")
 @AnsibleMappedParams
 def run_guidellm_benchmark(
     self,
     endpoint_url,
     name="guidellm-benchmark",
     namespace="",
     image="ghcr.io/vllm-project/guidellm",
     version="pr-590",
     timeout=900,
-    guidellm_args=[],
+    guidellm_args=None,
 ):
     """
     ...
     guidellm_args: List of additional guidellm arguments (e.g., ["--rate=10", "--max-seconds=30"])
     """
+    if guidellm_args is None:
+        guidellm_args = []
+
     return RunAnsibleRole(locals())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/toolbox/llmd.py` around lines 56 - 57, The default argument guidellm_args is a mutable list; change its function or call-site default from [] to None and inside the function (the place that constructs or uses RunAnsibleRole(locals())) set guidellm_args = [] if guidellm_args is None before use; update any references to guidellm_args (e.g., in the RunAnsibleRole(locals()) call) so the code passes the canonical list and avoids the mutable-default anti-pattern.

projects/llm-d/testing/test_llmd.py (1)
847-856: Security: `eval()` usage with restricted context.
While the `eval()` is restricted to only the `rate` variable with no builtins, this pattern can be fragile. Consider a safer approach using a simple expression parser for the limited arithmetic needed (`{N*rate}`).
♻️ Safer alternative using regex-based arithmetic
```diff
 def apply_rate_scaleup(value, rate):
     """
     Apply rate-based scaling to configuration values.

     Evaluates expressions like:
     - "{10*rate}" with rate=32 -> "320"
     - "prefix_count={2*rate}" with rate=32 -> "prefix_count=64"
     """
     if not isinstance(value, str):
         return value

     import re

     # Find all expressions in curly braces
     pattern = r'\{([^}]+)\}'

     def evaluate_expression(match):
         expression = match.group(1)
         try:
-            # Create a safe evaluation context with only 'rate' variable
-            context = {"rate": rate}
-            result = eval(expression, {"__builtins__": {}}, context)
-            return str(result)
+            # Parse simple arithmetic expressions like "10*rate" or "rate*2"
+            mult_match = re.match(r'^\s*(\d+)\s*\*\s*rate\s*$', expression)
+            if mult_match:
+                return str(int(mult_match.group(1)) * rate)
+            mult_match = re.match(r'^\s*rate\s*\*\s*(\d+)\s*$', expression)
+            if mult_match:
+                return str(rate * int(mult_match.group(1)))
+            # Fallback for unsupported expressions
+            logging.warning(f"Unsupported expression '{expression}', returning as-is")
+            return match.group(0)
         except Exception as e:
             logging.warning(f"Failed to evaluate expression '{expression}' with rate={rate}: {e}")
             return match.group(0)  # Return original if evaluation fails

     # Replace all expressions with their evaluated results
     return re.sub(pattern, evaluate_expression, value)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 847 - 856, The evaluate_expression function currently uses eval(expression, {"__builtins__": {}}, {"rate": rate}) which is fragile; replace it with a safe evaluator that parses the expression with ast.parse and only permits a whitelist of nodes (Module/Expr/BinOp/UnaryOp/Constant/Name) and allowed operators (+, -, *, /, //, %, **, unary +/-, parentheses), only allows the Name "rate" and numeric constants, and then evaluates the AST deterministically (e.g., using a small recursive evaluator function) returning the computed string; keep the existing exception handling/logging behavior and fall back to returning match.group(0) on any validation or evaluation error. Reference: evaluate_expression function and its use of rate.

projects/llm-d/visualizations/llmd_inference/plotting/report.py (1)
303-307: Consider prefixing unused unpacked variables with an underscore.
The `ordered_vars` and `cfg` variables are unpacked but never used in this method. Prefixing them with underscores would make the intent clearer and silence linter warnings.
♻️ Proposed fix
```diff
 def do_plot(self, *args):
     """
     Generate focused report showing token throughput and TTFT performance analysis
     """
-    ordered_vars, settings, setting_lists, variables, cfg = args
+    _ordered_vars, settings, setting_lists, variables, _cfg = args

     entries = list(common.Matrix.all_records(settings, setting_lists))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 303 - 307, The do_plot method unpacks ordered_vars and cfg but never uses them; to indicate they are intentionally unused and silence linters, rename those two unpacked variables to _ordered_vars and _cfg (i.e., change the unpacking in def do_plot(self, *args): ordered_vars, settings, setting_lists, variables, cfg = args to use _ordered_vars and _cfg), leaving the remaining variable names and logic unchanged.

projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (1)
517-546: Unused loop variable `i`.
The `enumerate()` is unnecessary since `i` is not used in the loop body. Iterate directly over `configurations`.
♻️ Proposed fix
```diff
-for i, config_name in enumerate(configurations):
+for config_name in configurations:
     config_df = df[df['Test Configuration'] == config_name].sort_values('Concurrency')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 517 - 546, The loop uses an unused enumerate variable i; change the loop header from "for i, config_name in enumerate(configurations):" to iterate directly over configurations (e.g., "for config_name in configurations:") so only the needed variable config_name is used when building traces (see usages in the fig.add_trace blocks referencing config_df, color_map and f'{config["name"]} P50/P95 (ms)').
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 47-50: Fix the typo in the inline comment for the guidellm
multiturn config: change "supporte" to "supported" in the comment adjacent to
the tests.llmd.benchmarks.guidellm.rate entry so it reads "multi-rate not
supported by guidellm-multiturn".
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 825-829: The rate parsing only handles list/tuple so
comma-separated strings like "1,10,50" become a single element; update the logic
in the block that sets rate_values (used by apply_rate_scaleup) to also detect
str and split on commas (and strip whitespace) to produce a list of individual
rate strings/numbers, and convert elements to the expected numeric type before
downstream arithmetic in apply_rate_scaleup().
- Line 881: The CLI is passing rate_value as a string which breaks numeric
expressions in apply_rate_scaleup; convert rate_value to a numeric type before
adding it to guidellm_args (or ensure apply_rate_scaleup coerces rate to a
number). Locate where guidellm_args is built and change the append of
f"--rate={rate_value}" to supply a numeric conversion (e.g., float(rate_value)
or int where appropriate) so that apply_rate_scaleup and its eval context
receive a numeric rate; alternatively, update apply_rate_scaleup to explicitly
cast the context["rate"] to a number before evaluating expressions.
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 363-369: The code builds entry_name using
entry.get_name(variables) or an f-string literal f"Unknown Configuration";
remove the unnecessary f prefix so the fallback is just "Unknown Configuration"
to avoid misleading usage of an f-string in the block handling entries ->
entry.results.guidellm_benchmarks and appending to all_benchmarks and
benchmark_configs.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 378-381: The current color assignment uses
px.colors.qualitative.Set1 which has only 9 colors and can raise a KeyError when
len(configurations) > 9; fix by cycling the palette when assigning colors to
each entry in configurations (use itertools.cycle or modulo indexing) so
color_map = {config: next(color_cycle) for config in configurations} (or
color_map = {config: colors[i % len(colors)] for i, config in
enumerate(configurations)}) and ensure you import itertools if using cycle;
reference the variables configurations, colors, color_map and the palette
px.colors.qualitative.Set1.
- Around line 512-515: The current color selection uses
px.colors.qualitative.Set1 sliced to the number of configurations, which fails
when there are >9 configurations; update the logic that builds colors and
color_map (variables configurations, colors, color_map) to cycle/repeat the base
qualitative palette when len(configurations) exceeds the palette length (e.g.,
use an iterator or repeat/extend the palette until it covers
len(configurations)), so each configuration gets a deterministic color even for
>9 entries while preserving existing colors for <=9.
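A minimal sketch of the palette-cycling fix both throughput_analysis.py items above describe, using the modulo-indexing variant (illustrative configuration names; assumes plotly is installed):

```python
import plotly.express as px

configurations = [f"config-{n}" for n in range(12)]   # more entries than Set1's 9 colors
colors = px.colors.qualitative.Set1

# Wrap around the palette instead of failing once the 9 colors run out
color_map = {config: colors[i % len(colors)] for i, config in enumerate(configurations)}
```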
---
Outside diff comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 198-206: The vllm_args list contains a duplicate flag
"--enable-prefix-caching" (appears in the vllm_args array twice); remove the
redundant entry so the vllm_args array only includes a single
"--enable-prefix-caching" element (locate the vllm_args block in
projects/llm-d/testing/config.yaml and delete the duplicate instance).
---
Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 847-856: The evaluate_expression function currently uses
eval(expression, {"__builtins__": {}}, {"rate": rate}) which is fragile; replace
it with a safe evaluator that parses the expression with ast.parse and only
permits a whitelist of nodes (Module/Expr/BinOp/UnaryOp/Constant/Name) and
allowed operators (+, -, *, /, //, %, **, unary +/-, parentheses), only allows
the Name "rate" and numeric constants, and then evaluates the AST
deterministically (e.g., using a small recursive evaluator function) returning
the computed string; keep the existing exception handling/logging behavior and
fall back to returning match.group(0) on any validation or evaluation error.
Reference: evaluate_expression function and its use of rate.
In `@projects/llm-d/toolbox/llmd.py`:
- Around line 56-57: The default argument guidellm_args is a mutable list;
change its function or call-site default from [] to None and inside the function
(the place that constructs or uses RunAnsibleRole(locals())) set guidellm_args =
[] if guidellm_args is None before use; update any references to guidellm_args
(e.g., in the RunAnsibleRole(locals()) call) so the code passes the canonical
list and avoids the mutable-default anti-pattern.
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 303-307: The do_plot method unpacks ordered_vars and cfg but never
uses them; to indicate they are intentionally unused and silence linters, rename
those two unpacked variables to _ordered_vars and _cfg (i.e., change the
unpacking in def do_plot(self, *args): ordered_vars, settings, setting_lists,
variables, cfg = args to use _ordered_vars and _cfg), leaving the remaining
variable names and logic unchanged.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 517-546: The loop uses an unused enumerate variable i; change the
loop header from "for i, config_name in enumerate(configurations):" to iterate
directly over configurations (e.g., "for config_name in configurations:") so
only the needed variable config_name is used when building traces (see usages in
the fig.add_trace blocks referencing config_df, color_map and f'{config["name"]}
P50/P95 (ms)').
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c2983814-6c88-43cd-b600-2a3a76db5a5c
📒 Files selected for processing (19)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (8)
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- docs/toolbox.generated/index.rst
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
Diff context (projects/llm-d/testing/test_llmd.py):

```python
failed = False
# Add rate parameter
guidellm_args.append(f"--rate={rate_value}")
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify how rate values are used in the codebase
ast-grep --pattern 'apply_rate_scaleup($_, $_)'
```
Repository: openshift-psap/topsail
Length of output: 335
🏁 Script executed:
```bash
# First, let's look at the context around line 881 to understand how rate_value is assigned
sed -n '860,900p' projects/llm-d/testing/test_llmd.py | cat -n
```
Repository: openshift-psap/topsail
Length of output: 1807
🏁 Script executed:
```bash
# Find the apply_rate_scaleup function definition
rg 'def apply_rate_scaleup' -A 5
```
Repository: openshift-psap/topsail
Length of output: 473
🏁 Script executed:
```bash
# Look for where rate_values is initially parsed/assigned
rg 'rate_values.*=' projects/llm-d/testing/test_llmd.py | head -20
```
Repository: openshift-psap/topsail
Length of output: 121
🏁 Script executed:
```bash
# Get the full apply_rate_scaleup function definition
rg 'def apply_rate_scaleup' -A 20 projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 717
🏁 Script executed:
```bash
# Find where rate values are parsed/loaded (search for rate parsing in run_guidellm_benchmark)
sed -n '800,880p' projects/llm-d/testing/test_llmd.py | cat -n
```
Repository: openshift-psap/topsail
Length of output: 3597
🏁 Script executed:
```bash
# Check the function signature to see how the rate parameter comes in
rg 'def run_guidellm_benchmark' -A 3 projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 180
🏁 Script executed:
```bash
# Check if rate is converted to a numeric type anywhere before use
sed -n '815,865p' projects/llm-d/testing/test_llmd.py | cat -n
```
Repository: openshift-psap/topsail
Length of output: 2500
🏁 Script executed:
```bash
# Look for any type conversion or int/float casting of rate
rg 'int\(rate|float\(rate' projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 48
🏁 Script executed:
```bash
# Check if there are any config conversions that might handle type casting
rg 'rate.*int|rate.*float' projects/llm-d/testing/test_llmd.py
```
Repository: openshift-psap/topsail
Length of output: 175
Rate values must be numeric when used in apply_rate_scaleup expressions.
The `--rate={rate_value}` CLI argument on line 881 works with string or numeric types. However, `apply_rate_scaleup` (used at lines 888 and 892) evaluates expressions like `{10*rate}` using `eval()` with the rate in the context. If `rate` is a string from the YAML configuration, these expressions silently fail: the evaluation catches the `TypeError` and returns the original `{expression}` unchanged, preventing the scaling from occurring.
Ensure rate values are converted to a numeric type before being passed to `apply_rate_scaleup`, or add explicit type conversion within the function itself.
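A minimal sketch of the coercion this finding asks for, assuming `rate_value` arrives from the YAML config as either a number or a numeric string (names illustrative, not the project's code):

```python
guidellm_args = []

def coerce_rate(rate_value):
    """Return the rate as int when possible, else float, so expressions
    like {10*rate} do arithmetic instead of string repetition."""
    if isinstance(rate_value, str):
        return float(rate_value) if "." in rate_value else int(rate_value)
    return rate_value

rate = coerce_rate("32")                # -> 32 (int)
guidellm_args.append(f"--rate={rate}")  # rendered CLI flag is unchanged
```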
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/testing/test_llmd.py` at line 881, The CLI is passing
rate_value as a string which breaks numeric expressions in apply_rate_scaleup;
convert rate_value to a numeric type before adding it to guidellm_args (or
ensure apply_rate_scaleup coerces rate to a number). Locate where guidellm_args
is built and change the append of f"--rate={rate_value}" to supply a numeric
conversion (e.g., float(rate_value) or int where appropriate) so that
apply_rate_scaleup and its eval context receive a numeric rate; alternatively,
update apply_rate_scaleup to explicitly cast the context["rate"] to a number
before evaluating expressions.
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Force-pushed from 3c586e1 to 8487263 (Compare)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 1
♻️ Duplicate comments (1)
projects/llm-d/testing/config.yaml (1)
49: ⚠️ Potential issue | 🟡 Minor
Fix the typo in the inline comment.
Line 49 still says `supporte`.
✏️ Proposed fix
```diff
-  tests.llmd.benchmarks.guidellm.rate: [32, 64, 128, 256, 512] # keep as a list, multi-rate not supporte by guidellm-multiturn
+  tests.llmd.benchmarks.guidellm.rate: [32, 64, 128, 256, 512] # keep as a list, multi-rate not supported by guidellm-multiturn
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` at line 49, The inline comment on the config key tests.llmd.benchmarks.guidellm.rate contains a typo ("supporte"); update that comment to read "supported" so it becomes "...keep as a list, multi-rate not supported by guidellm-multiturn" (edit the comment on the same line where tests.llmd.benchmarks.guidellm.rate is defined).
🧹 Nitpick comments (1)
projects/llm-d/testing/config.yaml (1)
198-206: Remove the duplicated `--enable-prefix-caching` flag.
The same vLLM flag appears on line 200 and line 206. If `vllm_args` are forwarded verbatim, the rendered command line carries a duplicate boolean for no benefit and becomes harder to audit. Please verify the launcher is not relying on upstream deduplication here.
✂️ Proposed cleanup
```diff
 vllm_args:
 - "--disable-uvicorn-access-log"
 - "--enable-prefix-caching"
 - "--uvicorn-log-level=debug"
 - "--trust-remote-code"
 - "--disable-log-requests"
 - "--max-model-len=40960"
 - "--gpu-memory-utilization=0.9"
-- "--enable-prefix-caching"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` around lines 198 - 206, The vllm_args array contains a duplicated flag "--enable-prefix-caching"; remove the redundant entry from the vllm_args list so each flag appears only once and update any code that constructs or validates this list (look for references to the vllm_args symbol or the config loader that reads it) to ensure flags are not intentionally duplicated; after removing the duplicate, verify the launcher or command-construction path that forwards vllm_args does not rely on downstream deduplication by ensuring it passes the cleaned list verbatim (or adds explicit deduplication if needed).
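For a quick audit of lists like this one, a throwaway check along these lines flags repeated flags (stand-in data; the real list lives in config.yaml):

```python
from collections import Counter

# Stand-in for the parsed vllm_args list from config.yaml
vllm_args = [
    "--enable-prefix-caching",
    "--gpu-memory-utilization=0.9",
    "--enable-prefix-caching",
]

dupes = [flag for flag, count in Counter(vllm_args).items() if count > 1]
print(dupes)  # ['--enable-prefix-caching']
```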
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 224-227: The base preset in projects/llm-d/testing/config.yaml
leaves both max_seconds and max_requests null, causing runs without a preset to
rely on guidellm's internal defaults (see guidellm invocation in test_llmd.py
around the runner that would pass --max-seconds/--max-requests); update the base
preset to include an explicit stop bound (e.g., set max_seconds: 900 or a
sensible max_requests) so the runner always has an explicit termination limit,
or alternatively add a comment in the base config documenting the exact guidellm
fallback behavior and why it is acceptable.
---
Duplicate comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 49: The inline comment on the config key
tests.llmd.benchmarks.guidellm.rate contains a typo ("supporte"); update that
comment to read "supported" so it becomes "...keep as a list, multi-rate not
supported by guidellm-multiturn" (edit the comment on the same line where
tests.llmd.benchmarks.guidellm.rate is defined).
---
Nitpick comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 198-206: The vllm_args array contains a duplicated flag
"--enable-prefix-caching"; remove the redundant entry from the vllm_args list so
each flag appears only once and update any code that constructs or validates
this list (look for references to the vllm_args symbol or the config loader that
reads it) to ensure flags are not intentionally duplicated; after removing the
duplicate, verify the launcher or command-construction path that forwards
vllm_args does not rely on downstream deduplication by ensuring it passes the
cleaned list verbatim (or adds explicit deduplication if needed).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bae76231-62b7-474a-82ba-dcae89cb6c21
📒 Files selected for processing (1)
projects/llm-d/testing/config.yaml
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 16 minutes 39 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
Force-pushed from 8487263 to c9fd4ea (Compare)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 2
🧹 Nitpick comments (5)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (2)
518: Unused loop variable `i`.
The loop index `i` is not used within the loop body. Rename it to `_i` or use direct iteration.
♻️ Proposed fix
```diff
-for i, config_name in enumerate(configurations):
+for config_name in configurations:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 518, The loop uses an unused index variable `i` in the statement "for i, config_name in enumerate(configurations):"; replace it with a direct iteration over `configurations` (i.e., "for config_name in configurations:") or rename `i` to `_i` to indicate it's unused so linters won't flag it; update the loop in throughput_analysis.py where `configurations` is iterated (the line currently using enumerate) accordingly.
401: Consider defensive handling for zero concurrency.
If `request_concurrency` is zero for any benchmark, the efficiency calculation will produce infinity, potentially selecting that point as "most efficient."
♻️ Proposed defensive fix
```diff
-best_efficiency = df.loc[(df['Tokens/s'] / df['Concurrency']).idxmax()]
+efficiency_df = df[df['Concurrency'] > 0].copy()
+if not efficiency_df.empty:
+    best_efficiency = efficiency_df.loc[(efficiency_df['Tokens/s'] / efficiency_df['Concurrency']).idxmax()]
+else:
+    best_efficiency = best_tokens  # Fallback to best throughput
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 401, The current best_efficiency selection divides df['Tokens/s'] by df['Concurrency'] without guarding against zero concurrency; update the logic in the best_efficiency selection to first filter out or ignore rows where df['Concurrency'] <= 0 (e.g., df_nonzero = df[df['Concurrency'] > 0]) and compute idxmax on df_nonzero['Tokens/s'] / df_nonzero['Concurrency']; also handle the edge case where df_nonzero is empty by providing a sensible fallback (skip selection or choose the row with max 'Tokens/s' using df['Tokens/s'].idxmax()) so the code in best_efficiency never picks an infinite value.

projects/llm-d/testing/test_llmd.py (2)
847-856: Don't execute YAML expressions with `eval()`.
`max_requests` and `data` come from configuration, so this turns config data into executable Python. The supported syntax here is just arithmetic on `rate`; a tiny AST whitelist or dedicated parser would keep the feature without widening the runner's execution surface. I'd also fail fast on invalid expressions instead of forwarding the raw `{...}` placeholder.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 847 - 856, The evaluate_expression function currently uses eval(expression, ...) which executes arbitrary config-supplied code; replace it with a safe evaluator that parses the expression with ast.parse and only allows a whitelist of AST node types (e.g., Expression, BinOp, UnaryOp, Num/Constant, Name with only the 'rate' identifier, operators like Add/Sub/Mul/Div, Pow, Mod) and then evaluate the AST manually or via a visitor that computes the result using the provided rate; update evaluate_expression to raise a ValueError (fail fast) when the expression contains any disallowed nodes or invalid syntax instead of returning the raw placeholder, and ensure the surrounding code catches/logs that exception with context about the offending expression and rate.
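A compact version of the AST-whitelist evaluator this prompt describes could look like the sketch below. It is one possible implementation under the stated constraints, not the project's code:

```python
import ast
import operator

# Whitelisted operators for arithmetic over numeric literals and 'rate'
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod, ast.Pow: operator.pow,
    ast.UAdd: operator.pos, ast.USub: operator.neg,
}

def safe_eval_rate(expression, rate):
    """Evaluate an arithmetic expression that may only use numbers and 'rate'."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id == "rate":
            return rate
        raise ValueError(f"disallowed expression: {expression!r}")  # fail fast
    return walk(ast.parse(expression, mode="eval"))

print(safe_eval_rate("10*rate", 32))  # 320
# safe_eval_rate("__import__('os')", 32) raises ValueError instead of executing
```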
816-818: Verify the GuideLLM profile flag name.
GuideLLM's current README uses `--profile`, and the v0.4.0 release notes describe `--rate-type` as renamed. If the benchmark image no longer keeps a compatibility alias, emitting `--rate-type=...` here will fail at runtime; please verify that first or switch the generated arg to `--profile`. (github.com)
🔧 Suggested update
```diff
-rate_type = config.project.get_config("tests.llmd.benchmarks.guidellm.rate_type")
+profile = config.project.get_config("tests.llmd.benchmarks.guidellm.rate_type")
@@
-if rate_type:
-    guidellm_args.append(f"--rate-type={rate_type}")
+if profile:
+    guidellm_args.append(f"--profile={profile}")
```
Also applies to: 877-878
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 816 - 818, The code retrieves benchmark flags into backend_type and rate_type and then generates CLI args for GuideLLM; verify whether GuideLLM expects --profile (per README) instead of --rate-type (renamed in v0.4.0) and update the argument generation that uses rate_type to emit --profile=... if the image no longer provides a compatibility alias; adjust both occurrences (the current uses around variables backend_type, rate_type, max_seconds and the duplicate at the later occurrence) so the produced CLI flag matches the actual GuideLLM version or make the flag selection conditional based on a verified compatibility check.projects/llm-d/toolbox/llmd.py (1)
56-57: Avoid a shared default list for `guidellm_args`.
`guidellm_args=[]` is created once at function definition time, so any future mutation will leak across calls. Default to `None` and normalize it locally before `RunAnsibleRole(locals())`.
♻️ Suggested fix
```diff
 def run_guidellm_benchmark(
     self,
     endpoint_url,
     name="guidellm-benchmark",
     namespace="",
     image="ghcr.io/vllm-project/guidellm",
     version="pr-590",
     timeout=900,
-    guidellm_args=[],
+    guidellm_args=None,
 ):
     """
     Runs a Guidellm benchmark job against the LLM inference service
@@
     guidellm_args: List of additional guidellm arguments (e.g., ["--rate=10", "--max-seconds=30"])
     """
+    if guidellm_args is None:
+        guidellm_args = []
+
     return RunAnsibleRole(locals())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/toolbox/llmd.py` around lines 56 - 57, The function currently uses a mutable default guidellm_args=[] which can leak state across calls; change the function signature to default guidellm_args to None and inside the function normalize it (e.g., if guidellm_args is None: guidellm_args = []) before calling RunAnsibleRole(locals()); update any references to guidellm_args in that function to use the normalized local variable so RunAnsibleRole receives a fresh list each invocation.
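As a standalone illustration of the pitfall (hypothetical function, not the toolbox code), the shared default only shows up once the list is mutated:

```python
def collect(item, acc=[]):     # acc is created once, at function definition time
    acc.append(item)
    return acc

print(collect(1))  # [1]
print(collect(2))  # [1, 2]  <- state leaked in from the first call

def collect_safe(item, acc=None):
    if acc is None:            # fresh list on every call
        acc = []
    acc.append(item)
    return acc

print(collect_safe(1))  # [1]
print(collect_safe(2))  # [2]
```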
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 608-618: The TTFT analysis is using unit_conversion=1 while
ttft_median and ttft_p95 are parsed in seconds; update GuidellmTTFTAnalysis
(class name) to use unit_conversion=1000 so those p50_field 'ttft_median' and
p95_field 'ttft_p95' values are converted from seconds back to milliseconds for
display, and adjust any unit label/note_text if present to remain consistent
with "(ms)".
In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py`:
- Around line 245-252: Remove the unnecessary f-string prefixes on the two
logging.debug calls that contain no placeholders: change the lines that log "
Raw table rows:" and " Problem row data:" so they use plain strings instead of
f-strings; you can find these in the parser where summary_row, latency_rows, and
throughput_rows are logged (the logging.debug lines referencing "Raw table
rows:" and "Problem row data:") and update them to logging.debug(" Raw table
rows:") and logging.debug(" Problem row data:").
---
Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 847-856: The evaluate_expression function currently uses
eval(expression, ...) which executes arbitrary config-supplied code; replace it
with a safe evaluator that parses the expression with ast.parse and only allows
a whitelist of AST node types (e.g., Expression, BinOp, UnaryOp, Num/Constant,
Name with only the 'rate' identifier, operators like Add/Sub/Mul/Div, Pow, Mod)
and then evaluate the AST manually or via a visitor that computes the result
using the provided rate; update evaluate_expression to raise a ValueError (fail
fast) when the expression contains any disallowed nodes or invalid syntax
instead of returning the raw placeholder, and ensure the surrounding code
catches/logs that exception with context about the offending expression and
rate.
- Around line 816-818: The code retrieves benchmark flags into backend_type and
rate_type and then generates CLI args for GuideLLM; verify whether GuideLLM
expects --profile (per README) instead of --rate-type (renamed in v0.4.0) and
update the argument generation that uses rate_type to emit --profile=... if the
image no longer provides a compatibility alias; adjust both occurrences (the
current uses around variables backend_type, rate_type, max_seconds and the
duplicate at the later occurrence) so the produced CLI flag matches the actual
GuideLLM version or make the flag selection conditional based on a verified
compatibility check.
In `@projects/llm-d/toolbox/llmd.py`:
- Around line 56-57: The function currently uses a mutable default
guidellm_args=[] which can leak state across calls; change the function
signature to default guidellm_args to None and inside the function normalize it
(e.g., if guidellm_args is None: guidellm_args = []) before calling
RunAnsibleRole(locals()); update any references to guidellm_args in that
function to use the normalized local variable so RunAnsibleRole receives a fresh
list each invocation.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Line 518: The loop uses an unused index variable `i` in the statement "for i,
config_name in enumerate(configurations):"; replace it with a direct iteration
over `configurations` (i.e., "for config_name in configurations:") or rename `i`
to `_i` to indicate it's unused so linters won't flag it; update the loop in
throughput_analysis.py where `configurations` is iterated (the line currently
using enumerate) accordingly.
- Line 401: The current best_efficiency selection divides df['Tokens/s'] by
df['Concurrency'] without guarding against zero concurrency; update the logic in
the best_efficiency selection to first filter out or ignore rows where
df['Concurrency'] <= 0 (e.g., df_nonzero = df[df['Concurrency'] > 0]) and
compute idxmax on df_nonzero['Tokens/s'] / df_nonzero['Concurrency']; also
handle the edge case where df_nonzero is empty by providing a sensible fallback
(skip selection or choose the row with max 'Tokens/s' using
df['Tokens/s'].idxmax()) so the code in best_efficiency never picks an infinite
value.
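The zero-concurrency failure mode from the item above is easy to reproduce in isolation (illustrative data; pandas turns the division by zero into `inf`, which `idxmax` then selects):

```python
import pandas as pd

df = pd.DataFrame({"Tokens/s": [100.0, 50.0], "Concurrency": [4, 0]})

efficiency = df["Tokens/s"] / df["Concurrency"]   # second row becomes inf
print(df.loc[efficiency.idxmax()])                # picks the Concurrency=0 row
```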
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d2f0fc95-a80a-48d3-bcba-09d4f32e9851
📒 Files selected for processing (20)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/llmisvcs/llama-3-1-8b-instruct-fp8.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (9)
- projects/llm-d/testing/llmisvcs/llama-3-1-8b-instruct-fp8.yaml
- docs/toolbox.generated/index.rst
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
🚧 Files skipped from review as they are similar to previous changes (5)
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 23 minutes 13 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss |
♻️ Duplicate comments (2)
projects/llm-d/testing/config.yaml (2)
228-229: ⚠️ Potential issue | 🔴 Critical
Give the base GuideLLM config an explicit stop condition.
Lines 228-229 are still both `null`, so any run that doesn't layer a benchmark-specific preset relies on GuideLLM defaults or the 900s job timeout to terminate. Please set one bound here or document the fallback explicitly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` around lines 228 - 229, The GuideLLM base config currently has both max_seconds and max_requests set to null so runs can hang until an external timeout; set an explicit stop condition by assigning a sensible non-null default (e.g., max_seconds: 600 or max_requests: 10) in the base config, or add a clear comment in the GuideLLM config stating which external fallback (e.g., 900s job timeout) will be relied on; update the entries for max_seconds and/or max_requests so the base GuideLLM preset always enforces a termination bound.
49: ⚠️ Potential issue | 🟡 Minor
Fix the inline comment typo.
Line 49 still says `supporte`; please change it to `supported`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` at line 49, Update the inline comment for the config key tests.llmd.benchmarks.guidellm.rate by correcting the typo "supporte" to "supported" so the comment reads "...multi-rate not supported by guidellm-multiturn"; the change should be made on the same line containing the tests.llmd.benchmarks.guidellm.rate list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 228-229: The GuideLLM base config currently has both max_seconds
and max_requests set to null so runs can hang until an external timeout; set an
explicit stop condition by assigning a sensible non-null default (e.g.,
max_seconds: 600 or max_requests: 10) in the base config, or add a clear comment
in the GuideLLM config stating which external fallback (e.g., 900s job timeout)
will be relied on; update the entries for max_seconds and/or max_requests so the
base GuideLLM preset always enforces a termination bound.
- Line 49: Update the inline comment for the config key
tests.llmd.benchmarks.guidellm.rate by correcting the typo "supporte" to
"supported" so the comment reads "...multi-rate not supported by
guidellm-multiturn"; the change should be made on the same line containing the
tests.llmd.benchmarks.guidellm.rate list.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 41f1a6f1-aeda-44b3-bd91-3accf9795725
📒 Files selected for processing (1)
projects/llm-d/testing/config.yaml
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 20 minutes 36 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 12 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 22 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
Summary by CodeRabbit
Refactor
New Features
Documentation
Chores