Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files; approvers can indicate their approval by writing the appropriate command in a comment. The full list of commands accepted by this bot is available in its documentation.
Note: Reviews paused. It looks like this branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior, along with the commands and quick-action checkboxes for managing reviews, is configurable in the CodeRabbit settings.
📝 Walkthrough

Replaces multiturn benchmarking with GuideLLM-driven benchmarks across testing, toolbox, parsing, models, visualizations, and docs; adds a new GuideLLM performance report and five analysis/stat classes; removes multiturn models/parsers/templates; and switches toolbox/job invocation to accept CLI-style `guidellm_args`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as CI/Test Runner
    participant Toolbox as Llmd Toolbox
    participant Job as Kubernetes Job (guidellm)
    participant Artifacts as Job Artifacts/Logs
    participant Parser as LTS Parser
    participant Viz as Visualization/Report
    CI->>Toolbox: call run_guidellm_benchmark(name, guidellm_args)
    Toolbox->>Job: render job template with guidellm_args and create Job
    Job->>Artifacts: produce artifacts (guidellm_benchmark_job.logs, outputs.json)
    Parser->>Artifacts: discover and parse guidellm benchmark dirs
    Parser->>Parser: aggregate results.guidellm_benchmarks
    Viz->>Parser: request entry.results.guidellm_benchmarks
    Viz->>Viz: run GuidellmTokensConcurrency & Latency analyses
    Viz->>CI: produce report (HTML, plots, Key Insights)
```
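The CLI-style argument passing at the top of this flow can be sketched as below; the function name and flag spelling are illustrative assumptions, not the actual TOPSAIL toolbox or GuideLLM CLI.

```python
# Illustrative sketch only: how CLI-style guidellm_args could be rendered
# into a job command line before the template is applied.

def build_guidellm_command(guidellm_args):
    """Turn a dict of CLI-style arguments into an argv list."""
    cmd = ["guidellm", "benchmark"]
    for key, value in guidellm_args.items():
        # snake_case keys become --kebab-case flags
        cmd += [f"--{key.replace('_', '-')}", str(value)]
    return cmd

args = {"rate": "32,64,128", "max_seconds": 120, "data": "prompt_tokens=512"}
command = build_guidellm_command(args)
```

The rendered `command` list can then be substituted into the Job template's container args.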
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 2
🧹 Nitpick comments (5)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (2)
678-678: Remove extraneous `f` prefix from string without placeholders.

🔧 Proposed fix

```diff
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` at line 678, The code sets entry_name using entry.get_name(variables) if variables else f"Unknown Configuration" — remove the unnecessary f-string prefix because the fallback string has no placeholders; update the expression so the fallback is a plain string ("Unknown Configuration") while keeping the branch and the call to entry.get_name(variables) unchanged (look for entry_name and entry.get_name to locate the line).
648-668: Remove extraneous `f` prefixes from strings without placeholders.

The static analysis correctly identifies that these f-strings have no placeholders; they should be regular strings.
🔧 Proposed fix
```diff
- header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)
+ header += report.Plot_and_Text("Guidellm Tokens vs Concurrency", args)
...
- header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TTFT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TPOT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)
+ header += report.Plot_and_Text("Guidellm ITL Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
+ header += report.Plot_and_Text("Guidellm E2E Latency Analysis", args)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 648 - 668, The calls to report.Plot_and_Text use unnecessary f-strings with no placeholders (e.g., report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)); change these to plain string literals (remove the leading f) for each occurrence (Tokens vs Concurrency, TTFT Analysis, TPOT Analysis, ITL Analysis, E2E Latency Analysis) so report.Plot_and_Text and the header construction use regular strings while leaving header, args and surrounding HTML elements unchanged.

projects/llm-d/testing/config.yaml (1)
48-48: Inconsistent spacing in rate values.

The rate string uses spaces after commas (`32, 64, 128, 256, 512`), while other rate definitions in this file (lines 43 and 53) use no spaces (`1,10,50` and `1,10,50,100,200,300`). Consider using consistent formatting.

🔧 Suggested fix for consistency
```diff
- tests.llmd.benchmarks.guidellm.rate: 32, 64, 128, 256, 512
+ tests.llmd.benchmarks.guidellm.rate: 32,64,128,256,512
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/config.yaml` at line 48, The rate list for tests.llmd.benchmarks.guidellm.rate uses spaces after commas ("32, 64, 128, 256, 512") which is inconsistent with other rate keys (e.g., tests.llmd.benchmarks.someOther.rate values like "1,10,50" and "1,10,50,100,200,300"); update the value of tests.llmd.benchmarks.guidellm.rate to remove spaces after commas ("32,64,128,256,512") so formatting matches the rest of the file and maintain uniform CSV style across rate definitions.

projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (2)
518-518: Minor: Unused loop variable `i`.

The static analysis correctly identifies that `i` is not used in the loop body. Consider renaming it to `_i`, or using `_`, to indicate it is intentionally unused.

🔧 Proposed fix
```diff
- for i, config_name in enumerate(configurations):
+ for _i, config_name in enumerate(configurations):
```

Or simply:

```diff
- for i, config_name in enumerate(configurations):
+ for config_name in configurations:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 518, The loop uses an unused index variable `i` in the line "for i, config_name in enumerate(configurations):" — replace it with a clear unused name (e.g., "for _i, config_name in enumerate(configurations):" or "for _, config_name in enumerate(configurations):") or simply iterate without enumerate ("for config_name in configurations:") to remove the unused variable; update the loop header where `configurations` is iterated (the for-loop in throughput_analysis.py) accordingly.
660-949: Significant code duplication in standalone TPOT and ITL analysis classes.

These two standalone classes (lines 660-803 and 806-949) duplicate almost all logic from `GuidellmLatencyAnalysisBase.do_plot()`. If the intent is to use a consistent factorized pattern, consider removing these standalone classes and keeping only the base subclass versions (after fixing the dead code issue above). If there is a specific reason to keep standalone implementations, please document it; otherwise, the base class pattern used by `GuidellmTTFTAnalysis` and `GuidellmE2ELatencyAnalysis` should be applied consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 660 - 949, The two classes GuidellmTPOTAnalysis and GuidellmITLAnalysis duplicate the logic already implemented in GuidellmLatencyAnalysisBase.do_plot(); remove these standalone classes (or refactor them to subclass GuidellmLatencyAnalysisBase and override only what differs) so you reuse the base implementation used by GuidellmTTFTAnalysis and GuidellmE2ELatencyAnalysis; ensure any unique behavior (field names like 'TPOT P50 (ms)', 'ITL P50 (ms)', hover text, and conversion from seconds to ms) is handled via small overrides or parameters in the base class rather than duplicated code, and add a brief comment documenting why a standalone implementation is retained if you choose not to refactor.
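As a sketch of the factorized pattern the review recommends, the per-metric differences (label, source field, unit conversion) can live in small subclass attributes while the shared logic stays in the base class. The class names come from the review, but the attribute scheme and `extract()` helper below are hypothetical.

```python
# Sketch of the base-subclass pattern; not the actual plotting classes.

class GuidellmLatencyAnalysisBase:
    metric_label = None     # e.g. "TPOT P50 (ms)"
    source_field = None     # attribute name on the parsed benchmark object
    unit_conversion = 1     # factor to reach the unit named in the label

    def extract(self, benchmark):
        # Shared logic lives once here; subclasses only declare what differs.
        return getattr(benchmark, self.source_field) * self.unit_conversion


class GuidellmTPOTAnalysis(GuidellmLatencyAnalysisBase):
    metric_label = "TPOT P50 (ms)"
    source_field = "tpot_median"
    unit_conversion = 1000  # parsed in seconds, displayed in ms


class GuidellmITLAnalysis(GuidellmLatencyAnalysisBase):
    metric_label = "ITL P50 (ms)"
    source_field = "itl_median"
    unit_conversion = 1000
```

With this shape, the standalone `do_plot()` bodies collapse into one implementation parameterized by three class attributes.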
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Line 49: The config key tests.llmd.benchmarks.guidellm.max_requests is
declared but not used; update run_guidellm_benchmark to read max_requests from
the config (same place it reads rate, max_seconds, data, timeout) and wire it
into the benchmark invocation or loop that limits total requests (e.g., enforce
a hard cap on sent requests or pass as an argument to the runner), referencing
the run_guidellm_benchmark function and whatever helper/runner it calls; also
normalize the formatting of the rate list to match the other entries (remove or
add spaces consistently so lists like 32,64,128,256,512 match 1,10,50 style).
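A minimal sketch of the two config fixes asked for above: wiring `max_requests` into the benchmark arguments and normalizing the rate list. The config keys come from the review; the helper names and exact plumbing are invented for illustration.

```python
# Hypothetical helpers; the real code reads these keys inside
# run_guidellm_benchmark and its runner.

def normalize_rate(raw):
    """Canonicalize a comma-separated rate list: '32, 64' -> '32,64'."""
    return ",".join(part.strip() for part in str(raw).split(","))

def build_benchmark_args(config):
    prefix = "tests.llmd.benchmarks.guidellm."
    keys = ("rate", "max_seconds", "data", "timeout", "max_requests")
    args = {k: config[prefix + k] for k in keys if prefix + k in config}
    if "rate" in args:
        args["rate"] = normalize_rate(args["rate"])
    return args

config = {
    "tests.llmd.benchmarks.guidellm.rate": "32, 64, 128, 256, 512",
    "tests.llmd.benchmarks.guidellm.max_seconds": 120,
    "tests.llmd.benchmarks.guidellm.max_requests": 1000,
}
args = build_benchmark_args(config)
```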
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 621-657: The file contains dead duplicate class definitions:
GuidellmTPOTAnalysis and GuidellmITLAnalysis are defined twice (once as small
subclasses of GuidellmLatencyAnalysisBase around the first block and again as
full standalone classes later), so remove the redundant earlier subclass
definitions (the small factorized ones) OR alternatively delete the later
standalone implementations and keep the base-subclass pattern — ensure any
external references use the remaining class definitions; update imports/registry
usage if needed to point to the preserved classes (refer to
GuidellmTPOTAnalysis, GuidellmITLAnalysis, GuidellmLatencyAnalysisBase to locate
the duplicates).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8882473f-4a34-452c-a7c6-4e1ecdacf9b9
📒 Files selected for processing (4)
- projects/llm-d/testing/config.yaml
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 21 seconds. 🟢 • Link to the test results • Link to the reports index
/test jump-ci llm-d multi-flavor guidellm_heterogeneous_eval
🔴 Test of 'llm-d test test_ci' failed after 00 hours 00 minutes 01 seconds. 🔴 • Link to the test results • No reports index generated. Failure indicator: Empty. (See run.log)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 686-714: The code currently clamps request_concurrency to 1 when
computing and displaying efficiency, which hides low (<1) mean concurrency
values from parsers.py (concurrency_mean) and can mis-rank "Most efficient";
update the selection and display to use the actual benchmark.request_concurrency
(remove the max(..., 1) clamp) when computing the key for best_efficiency_idx
and when formatting the efficiency string for best_efficiency, and add a minimal
guard (e.g., treat zero concurrency explicitly) to avoid division-by-zero rather
than forcing a floor of 1.
- Around line 648-668: The reported lint failures come from using unnecessary
f-strings with no placeholders; in the calls to report.Plot_and_Text (e.g.,
report.Plot_and_Text("Guidellm Tokens vs Concurrency", args),
report.Plot_and_Text("Guidellm TTFT Analysis", args),
report.Plot_and_Text("Guidellm TPOT Analysis", args),
report.Plot_and_Text("Guidellm ITL Analysis", args), and
report.Plot_and_Text("Guidellm E2E Latency Analysis", args)) remove the leading
f prefix so they are plain string literals, and likewise replace the f-prefixed
"Unknown Configuration" literal with a normal string; no other logic changes are
needed.
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 369-370: The TTFT values are stored in seconds but are being
labeled as milliseconds; update every place that sets 'TTFT (ms)' (and any hover
column that uses benchmark.ttft_median or benchmark.ttft_p95) to multiply those
values by 1000 before assigning or displaying them. Locate usages in
plotting/throughput_analysis.py (e.g., the dict entry 'TTFT (ms)':
benchmark.ttft_median and the other block around lines 608-618) and in the
GuidellmTTFTAnalysis hover column code and change benchmark.ttft_median and
benchmark.ttft_p95 to benchmark.ttft_median * 1000 and benchmark.ttft_p95 * 1000
so the label 'ms' matches the numeric units.
- Around line 334-344: The loop dereferences entry.results.guidellm_benchmarks
without guarding against missing GuideLLM data; change the checks before
iterating so you skip entries where guidellm_benchmarks is None or missing
(e.g., replace the current if not entry.results.guidellm_benchmarks: with a
guarded form like if not getattr(entry.results, "guidellm_benchmarks", None):
continue) and use that guarded attribute when iterating benchmarks; apply the
same guarded check to the second occurrence around the block handling benchmarks
(the loop that inspects benchmark.strategy) so both places safely skip entries
lacking guidellm_benchmarks.
- Around line 384-393: The scatter currently groups points only by 'Test
Configuration' so update the plotting call to split traces by strategy to avoid
drawing a single line across different strategies: include the strategy column
(e.g., 'benchmark.strategy' or whatever column name holds strategy) as an
explicit grouping dimension (use PX arguments like symbol='benchmark.strategy'
and/or line_group=['Test Configuration','benchmark.strategy']) or, if you prefer
no connecting lines, change the trace mode to markers only; apply the same
change to the other plot block referenced around lines 518-547 so traces are
created per (configuration, strategy) rather than per configuration alone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6b314f50-2878-4b3a-9a99-0054834cf70c
📒 Files selected for processing (3)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
✅ Files skipped from review due to trivial changes (1)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
```python
header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)

# TTFT Analysis Section
header.append(html.H3("⚡ Time to First Token (TTFT) Analysis"))
header.append(html.P("Latency analysis showing response time percentiles across different concurrency levels"))
header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)

# TPOT Analysis Section
header.append(html.H3("🔄 Time Per Output Token (TPOT) Analysis"))
header.append(html.P("Token generation speed analysis showing how quickly individual tokens are produced"))
header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)

# ITL Analysis Section
header.append(html.H3("⏱️ Inter-Token Latency (ITL) Analysis"))
header.append(html.P("Streaming responsiveness analysis showing the delay between consecutive tokens"))
header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)

# E2E Latency Analysis Section
header.append(html.H3("🎯 End-to-End (E2E) Latency Analysis"))
header.append(html.P("Complete request duration analysis from request initiation to final response completion"))
header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "report.py" -path "*/llm-d/visualizations/*" -type f
```

Repository: openshift-psap/topsail

Length of output: 1858
🌐 Web query:
Ruff F541 rule unused f-string prefix
💡 Result:
The Ruff rule F541, named "f-string-missing-placeholders", detects f-strings that do not contain any placeholder expressions (e.g., f"Hello, world!"). It flags these as unnecessary since a regular string literal should be used instead, as f-strings without placeholders can confuse readers or indicate a forgotten placeholder. The rule is derived from Pyflakes, added in Ruff v0.0.18, and is automatically fixable by removing the 'f' prefix. Example violation: f"Hello, world!" Suggested fix: "Hello, world!" Note on implicit concatenation: To match Pyflakes, F541 only flags f-strings without placeholders if they are part of an implicit concatenation where NONE of the f-string segments contain placeholders. For instance: ( f"Hello," f" {name}!" # Not flagged, due to the placeholder in the second. ) ( f"Hello," f" World!" # Flagged, no placeholders in either. )
Citations:
- 1: https://docs.astral.sh/ruff/rules/f-string-missing-placeholders/
- 2: [pyflakes] Identify f-string without a placeholder in concatenated strings (F541) astral-sh/ruff#10886
- 3: https://docs.astral.sh/ruff/rules/
- 4: F541 does not find violations in concatenations astral-sh/ruff#10885
Remove the no-op f prefixes from these string literals.

These f-strings lack placeholder expressions and will trigger Ruff's F541 rule (f-string-missing-placeholders), causing the file to fail linting. Remove the `f` prefix from:

- `f"Guidellm Tokens vs Concurrency"`
- `f"Guidellm TTFT Analysis"`
- `f"Guidellm TPOT Analysis"`
- `f"Guidellm ITL Analysis"`
- `f"Guidellm E2E Latency Analysis"`
- `f"Unknown Configuration"` (line 678)
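For reference, F541 only concerns the no-op prefix; the `f` is needed exactly when the string interpolates a value:

```python
# Ruff F541 flags f-strings with no placeholders; the fix is to drop the prefix.
title = f"Guidellm TTFT Analysis"   # F541: no placeholder, the 'f' is a no-op
title = "Guidellm TTFT Analysis"    # equivalent, passes linting

# The prefix is only needed when the string interpolates a value:
metric = "TTFT"
title = f"Guidellm {metric} Analysis"
```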
🧰 Tools
🪛 Ruff (0.15.6)
[error] 648-648: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 653-653: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 658-658: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 663-663: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 668-668: f-string without any placeholders
Remove extraneous f prefix
(F541)
```python
best_tokens_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second)
best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))

best_tokens = all_benchmarks[best_tokens_idx]
best_efficiency = all_benchmarks[best_efficiency_idx]

# Configuration analysis
configurations = {}
for i, benchmark in enumerate(all_benchmarks):
    config = benchmark_configs[i]
    if config not in configurations:
        configurations[config] = {
            'max_tokens': 0,
            'optimal_concurrency': 0,
            'strategies': 0
        }

    if benchmark.tokens_per_second > configurations[config]['max_tokens']:
        configurations[config]['max_tokens'] = benchmark.tokens_per_second
        configurations[config]['optimal_concurrency'] = benchmark.request_concurrency

    configurations[config]['strategies'] += 1

# Sort configurations by performance
sorted_configs = sorted(configurations.items(), key=lambda x: x[1]['max_tokens'], reverse=True)

insights = [
    f"🏆 Best token throughput: {best_tokens.strategy} in {benchmark_configs[best_tokens_idx]} ({best_tokens.tokens_per_second:.0f} tok/s)",
    f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
```
Use the real mean concurrency when ranking efficiency.
request_concurrency is populated from concurrency_mean in projects/llm-d/visualizations/llmd_inference/store/parsers.py:249-314, so values below 1 are valid. Clamping it to 1 here can pick the wrong "Most efficient" run and understate its score for low-rate benchmarks.
🛠️ Proposed fix

```diff
-    best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+    best_efficiency_idx = max(
+        range(len(all_benchmarks)),
+        key=lambda i: (
+            all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+            if all_benchmarks[i].request_concurrency > 0
+            else float("-inf")
+        ),
+    )
     best_tokens = all_benchmarks[best_tokens_idx]
     best_efficiency = all_benchmarks[best_efficiency_idx]
+    best_efficiency_score = (
+        best_efficiency.tokens_per_second / best_efficiency.request_concurrency
+        if best_efficiency.request_concurrency > 0
+        else 0
+    )
...
-    f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+    f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)",
```
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| best_tokens_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second) | |
| best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1)) | |
| best_tokens = all_benchmarks[best_tokens_idx] | |
| best_efficiency = all_benchmarks[best_efficiency_idx] | |
| # Configuration analysis | |
| configurations = {} | |
| for i, benchmark in enumerate(all_benchmarks): | |
| config = benchmark_configs[i] | |
| if config not in configurations: | |
| configurations[config] = { | |
| 'max_tokens': 0, | |
| 'optimal_concurrency': 0, | |
| 'strategies': 0 | |
| } | |
| if benchmark.tokens_per_second > configurations[config]['max_tokens']: | |
| configurations[config]['max_tokens'] = benchmark.tokens_per_second | |
| configurations[config]['optimal_concurrency'] = benchmark.request_concurrency | |
| configurations[config]['strategies'] += 1 | |
| # Sort configurations by performance | |
| sorted_configs = sorted(configurations.items(), key=lambda x: x[1]['max_tokens'], reverse=True) | |
| insights = [ | |
| f"🏆 Best token throughput: {best_tokens.strategy} in {benchmark_configs[best_tokens_idx]} ({best_tokens.tokens_per_second:.0f} tok/s)", | |
| f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)", | |
| best_tokens_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second) | |
| best_efficiency_idx = max( | |
| range(len(all_benchmarks)), | |
| key=lambda i: ( | |
| all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency | |
| if all_benchmarks[i].request_concurrency > 0 | |
| else float("-inf") | |
| ), | |
| ) | |
| best_tokens = all_benchmarks[best_tokens_idx] | |
| best_efficiency = all_benchmarks[best_efficiency_idx] | |
| best_efficiency_score = ( | |
| best_efficiency.tokens_per_second / best_efficiency.request_concurrency | |
| if best_efficiency.request_concurrency > 0 | |
| else 0 | |
| ) | |
| # Configuration analysis | |
| configurations = {} | |
| for i, benchmark in enumerate(all_benchmarks): | |
| config = benchmark_configs[i] | |
| if config not in configurations: | |
| configurations[config] = { | |
| 'max_tokens': 0, | |
| 'optimal_concurrency': 0, | |
| 'strategies': 0 | |
| } | |
| if benchmark.tokens_per_second > configurations[config]['max_tokens']: | |
| configurations[config]['max_tokens'] = benchmark.tokens_per_second | |
| configurations[config]['optimal_concurrency'] = benchmark.request_concurrency | |
| configurations[config]['strategies'] += 1 | |
| # Sort configurations by performance | |
| sorted_configs = sorted(configurations.items(), key=lambda x: x[1]['max_tokens'], reverse=True) | |
| insights = [ | |
| f"🏆 Best token throughput: {best_tokens.strategy} in {benchmark_configs[best_tokens_idx]} ({best_tokens.tokens_per_second:.0f} tok/s)", | |
| f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)", |
```python
for entry in entries:
    if not entry.results.guidellm_benchmarks:
        continue

    # Get unique name for this entry (includes flavor info)
    entry_name = entry.get_name(variables)

    # Include all strategies - let's show the full picture
    for benchmark in entry.results.guidellm_benchmarks:
        if benchmark.strategy == "throughput":
            continue
```
Guard guidellm_benchmarks before reading it.
GuidellmPerformanceAnalysisReport already treats missing GuideLLM data as a normal case. These helpers still dereference entry.results.guidellm_benchmarks unconditionally, so one mixed matrix entry will raise AttributeError before the plot can skip it.
🛡️ Proposed fix

```diff
-    if not entry.results.guidellm_benchmarks:
+    benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+    if not benchmarks:
         continue
     ...
-    for benchmark in entry.results.guidellm_benchmarks:
+    for benchmark in benchmarks:
```

Also applies to: 472-481
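The guarded-access pattern from the fix can be exercised in isolation; the entry/results object shapes below are stand-ins, not the real parser models.

```python
# Sketch: skip entries whose results lack GuideLLM data instead of raising.
from types import SimpleNamespace

def iter_guidellm_benchmarks(entries):
    """Yield benchmarks, skipping entries without GuideLLM data."""
    for entry in entries:
        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
        if not benchmarks:
            continue  # attribute missing or list empty: skip, no AttributeError
        for benchmark in benchmarks:
            yield benchmark

entries = [
    SimpleNamespace(results=SimpleNamespace()),                        # attribute missing
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=[])),  # empty list
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=["b1", "b2"])),
]
found = list(iter_guidellm_benchmarks(entries))
```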
```python
'TTFT (ms)': benchmark.ttft_median,
'Latency (ms)': benchmark.request_latency_median * 1000
```
Convert TTFT back to milliseconds before labeling it as ms.
projects/llm-d/visualizations/llmd_inference/store/parsers.py:249-314 stores ttft_median and ttft_p95 in seconds. The new hover column and GuidellmTTFTAnalysis currently display those values with an ms label without multiplying by 1000.
⏱️ Proposed fix

```diff
- 'TTFT (ms)': benchmark.ttft_median,
+ 'TTFT (ms)': benchmark.ttft_median * 1000,
```

```diff
- 'unit_conversion': 1, # Already in ms
+ 'unit_conversion': 1000, # Convert from seconds to ms
```

Also applies to: 608-618
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`
around lines 369 - 370, The TTFT values are stored in seconds but are being
labeled as milliseconds; update every place that sets 'TTFT (ms)' (and any hover
column that uses benchmark.ttft_median or benchmark.ttft_p95) to multiply those
values by 1000 before assigning or displaying them. Locate usages in
plotting/throughput_analysis.py (e.g., the dict entry 'TTFT (ms)':
benchmark.ttft_median and the other block around lines 608-618) and in the
GuidellmTTFTAnalysis hover column code and change benchmark.ttft_median and
benchmark.ttft_p95 to benchmark.ttft_median * 1000 and benchmark.ttft_p95 * 1000
so the label 'ms' matches the numeric units.
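The round trip behind this finding can be illustrated with a small sketch. `parse_ttft` and `hover_row` are hypothetical stand-ins for the parser and the display layer, assuming the parser stores TTFT in seconds as described:

```python
def parse_ttft(raw_ms: float) -> float:
    """Mirror of the parser behavior: store TTFT in seconds."""
    return raw_ms / 1000.0

def hover_row(ttft_stored_s: float) -> dict:
    """Display layer: convert back to ms so the label matches the unit."""
    return {"TTFT (ms)": ttft_stored_s * 1000}

stored = parse_ttft(142.0)   # ~0.142 s in storage
row = hover_row(stored)      # ~142.0 under the 'TTFT (ms)' key
```

Without the `* 1000` in the display layer, a 142 ms TTFT would be shown as 0.142 under an "ms" label.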
```
fig = px.scatter(df,
                 hover_data=df.columns,
                 x='Concurrency',
                 y='Tokens/s',
                 color='Test Configuration',
                 color_discrete_map=color_map,
                 title='Token Throughput vs Concurrency by Configuration<br><sub>Higher is better</sub>')

fig.update_traces(mode='lines+markers')
fig.update_layout(showlegend=True)
```
Don't draw a single line across different strategies.
Both plots collect multiple benchmark.strategy values per configuration, but the traces are built only per configuration. That joins unrelated strategies into one synthetic curve and can misstate the scaling trend. Split traces by (configuration, strategy) or keep these as markers only.
Also applies to: 518-547
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`
around lines 384 - 393, The scatter currently groups points only by 'Test
Configuration' so update the plotting call to split traces by strategy to avoid
drawing a single line across different strategies: include the strategy column
(e.g., 'benchmark.strategy' or whatever column name holds strategy) as an
explicit grouping dimension (use PX arguments like symbol='benchmark.strategy'
and/or line_group=['Test Configuration','benchmark.strategy']) or, if you prefer
no connecting lines, change the trace mode to markers only; apply the same
change to the other plot block referenced around lines 518-547 so traces are
created per (configuration, strategy) rather than per configuration alone.
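The grouping idea can be shown without plotly: build one trace per (configuration, strategy) pair so a line never joins points from different strategies. The field names below are illustrative, not the real column names:

```python
from collections import defaultdict

points = [
    {"config": "A", "strategy": "constant", "concurrency": 1, "tokens_s": 100},
    {"config": "A", "strategy": "constant", "concurrency": 4, "tokens_s": 320},
    {"config": "A", "strategy": "sweep",    "concurrency": 2, "tokens_s": 180},
]

# One trace per (configuration, strategy) pair.
traces = defaultdict(list)
for p in points:
    traces[(p["config"], p["strategy"])].append((p["concurrency"], p["tokens_s"]))

# Within a trace, connect points in concurrency order only.
for key in traces:
    traces[key].sort()

print(len(traces))  # → 2 traces instead of one line joining both strategies
```

In plotly terms, the same split falls out of passing the strategy column as an extra grouping dimension (e.g. `line_group` or `symbol`) instead of coloring by configuration alone.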
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval |
Actionable comments posted: 2
♻️ Duplicate comments (3)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (3)
678-678: ⚠️ Potential issue | 🟡 Minor

Remove extraneous `f` prefix.

Line 678 uses f"Unknown Configuration" without placeholders.

🔧 Proposed fix

```diff
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` at line 678, The code assigns entry_name using an unnecessary f-string when no placeholders are present; in the assignment that calls entry.get_name(variables) if variables else f"Unknown Configuration", remove the extraneous `f` prefix so the fallback is a plain string ("Unknown Configuration")—update the expression in the same function/method where entry_name is computed (the conditional using entry.get_name and variables) to use a regular string literal for the else branch.
648-668: ⚠️ Potential issue | 🟡 Minor

Remove extraneous `f` prefixes from string literals.

These f-strings have no placeholders, triggering Ruff F541. Remove the `f` prefix from each:

- Line 648: f"Guidellm Tokens vs Concurrency"
- Line 653: f"Guidellm TTFT Analysis"
- Line 658: f"Guidellm TPOT Analysis"
- Line 663: f"Guidellm ITL Analysis"
- Line 668: f"Guidellm E2E Latency Analysis"

🔧 Proposed fix

```diff
- header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)
+ header += report.Plot_and_Text("Guidellm Tokens vs Concurrency", args)
...
- header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TTFT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TPOT Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)
+ header += report.Plot_and_Text("Guidellm ITL Analysis", args)
...
- header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
+ header += report.Plot_and_Text("Guidellm E2E Latency Analysis", args)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 648 - 668, The string literals passed to report.Plot_and_Text and header.append are incorrectly using f-strings with no placeholders (e.g., calls to report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args) and similar for TTFT/TPOT/ITL/E2E sections); remove the unnecessary leading "f" on each string so they become plain string literals (e.g., report.Plot_and_Text("Guidellm Tokens vs Concurrency", args) and likewise for "Guidellm TTFT Analysis", "Guidellm TPOT Analysis", "Guidellm ITL Analysis", and "Guidellm E2E Latency Analysis") to satisfy Ruff F541.
687-687: ⚠️ Potential issue | 🟠 Major

Efficiency calculation clamps concurrency to 1, potentially misranking low-rate benchmarks.

The max(..., 1) clamp distorts efficiency scores when request_concurrency is legitimately below 1 (which is valid per parsers.py using concurrency_mean). This could select the wrong "Most efficient" run.

🛠️ Proposed fix

```diff
- best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+ best_efficiency_idx = max(
+     range(len(all_benchmarks)),
+     key=lambda i: (
+         all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+         if all_benchmarks[i].request_concurrency > 0
+         else float("-inf")
+     ),
+ )
```

And update line 714 similarly:

```diff
- f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / best_efficiency.request_concurrency if best_efficiency.request_concurrency > 0 else 0):.0f} tok/s per concurrency unit)",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` at line 687, The efficiency calculation currently forces a minimum concurrency of 1, which skews ranking for benchmarks with request_concurrency < 1; in the selection logic (best_efficiency_idx using all_benchmarks, tokens_per_second and request_concurrency) remove the hard clamp to 1 and instead guard against division-by-zero by either skipping entries with request_concurrency == 0 or dividing by a very small epsilon (e.g., float_min) so fractional concurrencies are honored; apply the same fix to the other efficiency calculation referenced near the second occurrence (the similar expression around line 714) so both use the true request_concurrency when computing tokens_per_second / request_concurrency.
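A minimal sketch of the suggested ranking, assuming a `Bench` record with the two fields the comment mentions; zero-concurrency entries are pushed to the bottom of the ranking instead of being clamped:

```python
from dataclasses import dataclass

@dataclass
class Bench:
    tokens_per_second: float
    request_concurrency: float

def best_efficiency_idx(benchmarks):
    """Index of the run with the most tokens/s per concurrency unit."""
    def score(i):
        b = benchmarks[i]
        if b.request_concurrency <= 0:
            return float("-inf")  # degenerate entry: never selected
        return b.tokens_per_second / b.request_concurrency
    return max(range(len(benchmarks)), key=score)

benchs = [Bench(100, 4), Bench(60, 0.5), Bench(10, 0)]
print(best_efficiency_idx(benchs))  # → 1: 120 tok/s per unit beats 25
```

With the old `max(..., 1)` clamp, the fractional-concurrency run would score 60 instead of 120 and lose to the first entry.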
🧹 Nitpick comments (1)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (1)
518-518: Unused loop variable `i`.

Per Ruff B007, the loop variable `i` is unused. Rename to `_i` or use `_`.

🔧 Proposed fix

```diff
- for i, config_name in enumerate(configurations):
+ for _i, config_name in enumerate(configurations):
```

Or simply:

```diff
- for i, config_name in enumerate(configurations):
+ for config_name in configurations:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 518, The loop in throughput_analysis.py uses an unused index variable `i` in the line `for i, config_name in enumerate(configurations):`; rename `i` to `_` or `_i` to satisfy Ruff B007 and avoid the unused-variable warning, ensuring there are no other references to `i` elsewhere in the function (e.g., inside any inner scope of the same loop such as in functions/methods used there like plot_throughput or aggregate_metrics) before committing the rename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 48-49: The rate value is being left as a YAML list and thus
rendered with brackets in the Jinja2 template; update toolbox/llmd.py (the logic
around the tuple conversion at lines handling llmd_run_guidellm_benchmark_rate)
to also detect lists and serialize them into the comma-separated string expected
by the guidellm CLI (e.g., join elements into "32,64,128,256,512") or explicitly
cast to the same format used for tuples; additionally, remove the unused
tests.llmd.benchmarks.guidellm.max_requests: 320 entry from config.yaml unless
you intend to add support for it—if adding support, expose a max_requests
parameter in run_guidellm_benchmark() and consume it in test_llmd.py
accordingly.
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 608-618: The GuidellmTTFTAnalysis initializer currently sets
unit_conversion to 1 with a misleading comment "Already in ms" but TTFT values
are stored in seconds (parsed as ttft_median / 1000.0); update the
unit_conversion in the GuidellmTTFTAnalysis constructor to 1000 so displayed
values convert seconds back to milliseconds (or alternatively move conversion to
the display layer), and adjust the comment to reflect that stored values are in
seconds; refer to GuidellmTTFTAnalysis and its 'unit_conversion' field to locate
the change.
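The list-to-CLI-string conversion described for config.yaml above might look like the following; `normalize_rate` is a hypothetical helper, not the actual toolbox code:

```python
def normalize_rate(rate):
    """Serialize a scalar, tuple, or list rate into the comma-separated
    string form expected by the guidellm CLI."""
    if isinstance(rate, (list, tuple)):
        return ",".join(str(r) for r in rate)
    return str(rate)

print(normalize_rate([32, 64, 128, 256, 512]))  # → 32,64,128,256,512
print(normalize_rate(64))                       # → 64
```

Handling lists and tuples in one branch avoids the bracketed `[32, 64, ...]` rendering that a raw YAML list would produce in the Jinja2 template.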
---
Duplicate comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Line 678: The code assigns entry_name using an unnecessary f-string when no
placeholders are present; in the assignment that calls entry.get_name(variables)
if variables else f"Unknown Configuration", remove the extraneous `f` prefix so
the fallback is a plain string ("Unknown Configuration")—update the expression
in the same function/method where entry_name is computed (the conditional using
entry.get_name and variables) to use a regular string literal for the else
branch.
- Around line 648-668: The string literals passed to report.Plot_and_Text and
header.append are incorrectly using f-strings with no placeholders (e.g., calls
to report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args) and similar for
TTFT/TPOT/ITL/E2E sections); remove the unnecessary leading "f" on each string
so they become plain string literals (e.g., report.Plot_and_Text("Guidellm
Tokens vs Concurrency", args) and likewise for "Guidellm TTFT Analysis",
"Guidellm TPOT Analysis", "Guidellm ITL Analysis", and "Guidellm E2E Latency
Analysis") to satisfy Ruff F541.
- Line 687: The efficiency calculation currently forces a minimum concurrency of
1, which skews ranking for benchmarks with request_concurrency < 1; in the
selection logic (best_efficiency_idx using all_benchmarks, tokens_per_second and
request_concurrency) remove the hard clamp to 1 and instead guard against
division-by-zero by either skipping entries with request_concurrency == 0 or
dividing by a very small epsilon (e.g., float_min) so fractional concurrencies
are honored; apply the same fix to the other efficiency calculation referenced
near the second occurrence (the similar expression around line 714) so both use
the true request_concurrency when computing tokens_per_second /
request_concurrency.
---
Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Line 518: The loop in throughput_analysis.py uses an unused index variable `i`
in the line `for i, config_name in enumerate(configurations):`; rename `i` to
`_` or `_i` to satisfy Ruff B007 and avoid the unused-variable warning, ensuring
there are no other references to `i` elsewhere in the function (e.g., inside any
inner scope of the same loop such as in functions/methods used there like
plot_throughput or aggregate_metrics) before committing the rename.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aeb662b9-94bf-4107-94e4-9ea5c29cece2
📒 Files selected for processing (4)
- projects/llm-d/testing/config.yaml
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
✅ Files skipped from review due to trivial changes (1)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
```
class GuidellmTTFTAnalysis(GuidellmLatencyAnalysisBase):
    def __init__(self):
        super().__init__({
            'name': 'TTFT',
            'description': 'TTFT Latency',
            'p50_field': 'ttft_median',
            'p95_field': 'ttft_p95',
            'unit_conversion': 1,  # Already in ms
            'note_text': 'P1 and P90 percentiles require additional data extraction from logs.',
            'insight_text': 'Lower TTFT values indicate better responsiveness. Look for configurations with consistently low and stable TTFT across concurrency levels.'
        })
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check how ttft_median is parsed/stored in parsers.py
rg -n 'ttft_median' --type=py -C5 projects/llm-d/visualizations/llmd_inference/store/
```

Repository: openshift-psap/topsail

Length of output: 2934
Fix TTFT unit conversion - stored values are in seconds, not milliseconds.
The parsers.py file (lines 303-304) explicitly converts TTFT from milliseconds to seconds: ttft_median / 1000.0. The comment "Already in ms" at line 615 is incorrect. Since benchmark values are stored in seconds, unit_conversion should be 1000 to correctly convert seconds back to milliseconds for display, or this conversion should be handled separately in the display logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`
around lines 608 - 618, The GuidellmTTFTAnalysis initializer currently sets
unit_conversion to 1 with a misleading comment "Already in ms" but TTFT values
are stored in seconds (parsed as ttft_median / 1000.0); update the
unit_conversion in the GuidellmTTFTAnalysis constructor to 1000 so displayed
values convert seconds back to milliseconds (or alternatively move conversion to
the display layer), and adjust the comment to reflect that stored values are in
seconds; refer to GuidellmTTFTAnalysis and its 'unit_conversion' field to locate
the change.
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 14 minutes 00 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
Actionable comments posted: 1
♻️ Duplicate comments (7)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (2)
335-355: ⚠️ Potential issue | 🟡 Minor

Drop the no-op `f` prefixes in these literals.

These strings have no placeholders, so Ruff F541 will keep flagging the file until they are plain string literals.

Suggested fix

```diff
- header += report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args)
+ header += report.Plot_and_Text("Guidellm Tokens vs Concurrency", args)
@@
- header += report.Plot_and_Text(f"Guidellm TTFT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TTFT Analysis", args)
@@
- header += report.Plot_and_Text(f"Guidellm TPOT Analysis", args)
+ header += report.Plot_and_Text("Guidellm TPOT Analysis", args)
@@
- header += report.Plot_and_Text(f"Guidellm ITL Analysis", args)
+ header += report.Plot_and_Text("Guidellm ITL Analysis", args)
@@
- header += report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args)
+ header += report.Plot_and_Text("Guidellm E2E Latency Analysis", args)
@@
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```

Also applies to: 365-365
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 335 - 355, Remove the unnecessary f-string prefixes on literal strings passed to report.Plot_and_Text and any other plain string literals in this block (e.g., calls like report.Plot_and_Text(f"Guidellm Tokens vs Concurrency", args), report.Plot_and_Text(f"Guidellm TTFT Analysis", args), report.Plot_and_Text(f"Guidellm TPOT Analysis", args), report.Plot_and_Text(f"Guidellm ITL Analysis", args), report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args) and similar at the other location), replacing them with normal string literals so Ruff F541 no longer flags the file.
374-375: ⚠️ Potential issue | 🟠 Major

Use the real mean concurrency when ranking efficiency.

`request_concurrency` here is already the parsed mean concurrency. Flooring it to 1 hides valid sub-1 values, which can mis-rank the most efficient run and understate its score.

Suggested fix

```diff
- best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+ best_efficiency_idx = max(
+     range(len(all_benchmarks)),
+     key=lambda i: (
+         all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+         if all_benchmarks[i].request_concurrency > 0
+         else float("-inf")
+     ),
+ )
  best_tokens = all_benchmarks[best_tokens_idx]
  best_efficiency = all_benchmarks[best_efficiency_idx]
+ best_efficiency_score = (
+     best_efficiency.tokens_per_second / best_efficiency.request_concurrency
+     if best_efficiency.request_concurrency > 0
+     else 0
+ )
@@
- f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({best_efficiency_score:.0f} tok/s per concurrency unit)",
```

Also applies to: 400-401
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 374 - 375, The ranking currently uses max(..., key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1)) which floors request_concurrency to 1 and hides sub-1 mean concurrencies; change the comparator in best_efficiency_idx (and the similar logic around the blocks at the other occurrence) to divide tokens_per_second by the actual all_benchmarks[i].request_concurrency value instead, but guard against a zero value by using a tiny epsilon (e.g., treat 0 as 1e-9) or skip zero-concurrency entries to avoid division-by-zero—update only the comparator used in best_efficiency_idx and the matching comparator at the other spot.

projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (5)
607-617: ⚠️ Potential issue | 🟠 Major

Verify TTFT unit conversion - comment may be incorrect.

The comment states "Already in ms" but past review identified that parsers.py converts TTFT to seconds (ttft_median / 1000.0). If stored in seconds, unit_conversion should be 1000 to display in milliseconds.

⏱️ Proposed fix (if values are in seconds)

```diff
- 'unit_conversion': 1,     # Already in ms
+ 'unit_conversion': 1000,  # Convert from seconds to ms
```

```shell
#!/bin/bash
# Check how ttft_median is stored in parsers.py
rg -n 'ttft_median' --type=py -C5 projects/llm-d/visualizations/llmd_inference/store/
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 607 - 617, The TTFT unit_conversion is likely wrong: verify how ttft_median is stored (check parsers.py where ttft_median is computed/converted) and if ttft_median is stored in seconds (e.g., divided by 1000.0), update GuidellmTTFTAnalysis to set unit_conversion to 1000 (instead of 1) and correct the comment accordingly; if ttft_median is indeed already in milliseconds, leave unit_conversion as 1 but update the comment to confirm units. Ensure you modify the GuidellmTTFTAnalysis constructor (unit_conversion and comment) to reflect the actual stored unit.
383-392: ⚠️ Potential issue | 🟡 Minor

Lines connect points across different strategies within the same configuration.

The scatter plot with mode='lines+markers' connects all data points per Test Configuration, but each configuration may contain multiple strategies. This creates misleading connecting lines between unrelated strategies.

Consider either:

- Remove lines (mode='markers' only), or
- Add strategy as an additional grouping dimension

♻️ Option 1: Markers only

```diff
- fig.update_traces(mode='lines+markers')
+ fig.update_traces(mode='markers')
```
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 383 - 392, The plotted lines are connecting points across different strategies within the same 'Test Configuration'; update the px.scatter call or trace settings so lines only connect comparable points: either change the trace drawing to markers-only by removing lines (set mode to 'markers' via fig.update_traces or px.scatter parameters) or add an explicit grouping by strategy (e.g., include 'Strategy' as an additional grouping dimension such as line_group='Strategy' or use symbol/facet for 'Strategy') so lines are drawn per strategy instead of across strategies; adjust references around px.scatter, fig.update_traces, 'Test Configuration', 'Strategy', 'Concurrency', and 'Tokens/s' accordingly.
472-473: ⚠️ Potential issue | 🟠 Major

Guard `guidellm_benchmarks` before reading it.

Same issue as GuidellmTokensConcurrency - attribute access without guard.

🛡️ Proposed fix

```diff
- if not entry.results.guidellm_benchmarks:
+ benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+ if not benchmarks:
      continue
  ...
- for benchmark in entry.results.guidellm_benchmarks:
+ for benchmark in benchmarks:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 472 - 473, The current code reads entry.results.guidellm_benchmarks without ensuring the attribute exists, which can raise AttributeError; change the guard to safely probe the attribute (e.g., use getattr(entry.results, "guidellm_benchmarks", None) or hasattr(entry.results, "guidellm_benchmarks")) before using it so the loop continues when the attribute is missing or falsy — apply the same pattern used in GuidellmTokensConcurrency to guard access to entry.results.guidellm_benchmarks.
334-335: ⚠️ Potential issue | 🟠 Major

Guard `guidellm_benchmarks` before reading it.

`entry.results.guidellm_benchmarks` is accessed unconditionally. If `entry.results` lacks this attribute, an `AttributeError` will be raised before the check executes.

🛡️ Proposed fix

```diff
- if not entry.results.guidellm_benchmarks:
+ benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+ if not benchmarks:
      continue
  ...
- for benchmark in entry.results.guidellm_benchmarks:
+ for benchmark in benchmarks:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 334 - 335, The code reads entry.results.guidellm_benchmarks before checking it which can raise AttributeError if results lacks that attribute; change the guard to safely test for the attribute and its truthiness (e.g., use hasattr(entry.results, "guidellm_benchmarks") or getattr(entry.results, "guidellm_benchmarks", None) and then continue if missing or falsy) so the loop skips entries without guidellm_benchmarks instead of throwing; update the conditional that currently references entry.results.guidellm_benchmarks to use one of these safe checks.
368-369: ⚠️ Potential issue | 🟠 Major

Convert TTFT to milliseconds before labeling it as ms.

Based on the parser (parsers.py), ttft_median is stored in seconds (converted via ttft_median / 1000.0). The hover data labels this as ms but uses the raw value in seconds.

⏱️ Proposed fix

```diff
- 'TTFT (ms)': benchmark.ttft_median,
+ 'TTFT (ms)': benchmark.ttft_median * 1000,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 368 - 369, The hover label 'TTFT (ms)' is populated with benchmark.ttft_median which is in seconds; convert it to milliseconds to match the unit label by replacing benchmark.ttft_median with benchmark.ttft_median * 1000 (similar to how request_latency_median is converted) where the dict with key 'TTFT (ms)' is constructed in throughput_analysis.py so the displayed value matches the "ms" unit.
🧹 Nitpick comments (1)
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (1)
517-517: Unused loop variable `i`.

The loop variable `i` is not used. Rename to `_` to indicate it's intentionally unused.

```diff
- for i, config_name in enumerate(configurations):
+ for _, config_name in enumerate(configurations):
```

Or simply iterate without enumerate:

```diff
- for i, config_name in enumerate(configurations):
+ for config_name in configurations:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` at line 517, The loop uses an unused index variable `i` in "for i, config_name in enumerate(configurations):"; remove the unused variable by either changing the loop to "for config_name in configurations:" or replacing `i` with `_` (i.e., "for _, config_name in enumerate(configurations):") to make intent clear; update any references to `i` if present (there should be none) and keep `config_name` as the iteration variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 864-865: The suffix computation uses len(rate) which can be a
scalar and causes a TypeError; change the condition to inspect the normalized
list rate_values instead (i.e., set suffix = f"_rate{rate_value}" if
len(rate_values) > 1 else None) so the artifact suffix logic uses the list
created earlier (rate_values and rate_value).
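The corrected suffix logic can be sketched as follows; `artifact_suffix` is a hypothetical helper that reuses the `rate_values`/`rate_value` names from the prompt:

```python
from typing import Optional

def artifact_suffix(rate_values: list, rate_value) -> Optional[str]:
    """Suffix artifacts only when several rates run, so directories stay
    unique; a single-rate run needs no suffix."""
    return f"_rate{rate_value}" if len(rate_values) > 1 else None

print(artifact_suffix([32, 64], 32))  # → _rate32
print(artifact_suffix([64], 64))      # → None
```

Testing `len(rate_values)` on the normalized list avoids the `TypeError` that `len(rate)` raises when the configured rate is a scalar.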
---
Duplicate comments:
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 335-355: Remove the unnecessary f-string prefixes on literal
strings passed to report.Plot_and_Text and any other plain string literals in
this block (e.g., calls like report.Plot_and_Text(f"Guidellm Tokens vs
Concurrency", args), report.Plot_and_Text(f"Guidellm TTFT Analysis", args),
report.Plot_and_Text(f"Guidellm TPOT Analysis", args),
report.Plot_and_Text(f"Guidellm ITL Analysis", args),
report.Plot_and_Text(f"Guidellm E2E Latency Analysis", args) and similar at the
other location), replacing them with normal string literals so Ruff F541 no
longer flags the file.
- Around line 374-375: The ranking currently uses max(..., key=lambda i:
all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency,
1)) which floors request_concurrency to 1 and hides sub-1 mean concurrencies;
change the comparator in best_efficiency_idx (and the similar logic around the
blocks at the other occurrence) to divide tokens_per_second by the actual
all_benchmarks[i].request_concurrency value instead, but guard against a zero
value by using a tiny epsilon (e.g., treat 0 as 1e-9) or skip zero-concurrency
entries to avoid division-by-zero—update only the comparator used in
best_efficiency_idx and the matching comparator at the other spot.
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 607-617: The TTFT unit_conversion is likely wrong: verify how
ttft_median is stored (check parsers.py where ttft_median is computed/converted)
and if ttft_median is stored in seconds (e.g., divided by 1000.0), update
GuidellmTTFTAnalysis to set unit_conversion to 1000 (instead of 1) and correct
the comment accordingly; if ttft_median is indeed already in milliseconds, leave
unit_conversion as 1 but update the comment to confirm units. Ensure you modify
the GuidellmTTFTAnalysis constructor (unit_conversion and comment) to reflect
the actual stored unit.
- Around line 383-392: The plotted lines are connecting points across different
strategies within the same 'Test Configuration'; update the px.scatter call or
trace settings so lines only connect comparable points: either change the trace
drawing to markers-only by removing lines (set mode to 'markers' via
fig.update_traces or px.scatter parameters) or add an explicit grouping by
strategy (e.g., include 'Strategy' as an additional grouping dimension such as
line_group='Strategy' or use symbol/facet for 'Strategy') so lines are drawn per
strategy instead of across strategies; adjust references around px.scatter,
fig.update_traces, 'Test Configuration', 'Strategy', 'Concurrency', and
'Tokens/s' accordingly.
- Around line 472-473: The current code reads entry.results.guidellm_benchmarks
without ensuring the attribute exists, which can raise AttributeError; change
the guard to safely probe the attribute (e.g., use getattr(entry.results,
"guidellm_benchmarks", None) or hasattr(entry.results, "guidellm_benchmarks"))
before using it so the loop continues when the attribute is missing or falsy —
apply the same pattern used in GuidellmTokensConcurrency to guard access to
entry.results.guidellm_benchmarks.
- Around line 334-335: The code reads entry.results.guidellm_benchmarks before
checking it which can raise AttributeError if results lacks that attribute;
change the guard to safely test for the attribute and its truthiness (e.g., use
hasattr(entry.results, "guidellm_benchmarks") or getattr(entry.results,
"guidellm_benchmarks", None) and then continue if missing or falsy) so the loop
skips entries without guidellm_benchmarks instead of throwing; update the
conditional that currently references entry.results.guidellm_benchmarks to use
one of these safe checks.
- Around line 368-369: The hover label 'TTFT (ms)' is populated with
benchmark.ttft_median which is in seconds; convert it to milliseconds to match
the unit label by replacing benchmark.ttft_median with benchmark.ttft_median *
1000 (similar to how request_latency_median is converted) where the dict with
key 'TTFT (ms)' is constructed in throughput_analysis.py so the displayed value
matches the "ms" unit.
---
Nitpick comments:
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Line 517: The loop uses an unused index variable `i` in "for i, config_name in
enumerate(configurations):"; remove the unused variable by either changing the
loop to "for config_name in configurations:" or replacing `i` with `_` (i.e.,
"for _, config_name in enumerate(configurations):") to make intent clear; update
any references to `i` if present (there should be none) and keep `config_name`
as the iteration variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b6bd5929-619d-486d-a1b4-021f7fb0da3a
📒 Files selected for processing (19)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (8)
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- docs/toolbox.generated/index.rst
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
✅ Files skipped from review due to trivial changes (2)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
|
🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 29 minutes 23 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
72eaad9 to f30dd18 (Compare)
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
There was a problem hiding this comment.
♻️ Duplicate comments (6)
projects/llm-d/visualizations/llmd_inference/plotting/report.py (2)
363-365: ⚠️ Potential issue | 🟡 Minor
Drop the no-op `f` prefix here.
`f"Unknown Configuration"` has no placeholders and still triggers Ruff F541.
🧹 Suggested fix

```diff
- entry_name = entry.get_name(variables) if variables else f"Unknown Configuration"
+ entry_name = entry.get_name(variables) if variables else "Unknown Configuration"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 363 - 365, The code creates entry_name using entry.get_name(variables) or an f-string with no placeholders; remove the no-op f prefix to avoid Ruff F541 by changing the fallback from f"Unknown Configuration" to a plain string "Unknown Configuration" where entry_name is assigned inside the loop that iterates over entries (see variables, entry.get_name, and entry_name).
373-375: ⚠️ Potential issue | 🟠 Major
Use the real mean concurrency when ranking efficiency.
In projects/llm-d/visualizations/llmd_inference/store/parsers.py, lines 178-179 and 201, `request_concurrency` comes from GuideLLM's mean concurrency, so values below 1 are valid. Clamping it to 1 here can pick the wrong "Most efficient" run and understate its score; guard only against zero.
🛠️ Suggested fix

```diff
- best_efficiency_idx = max(range(len(all_benchmarks)), key=lambda i: all_benchmarks[i].tokens_per_second / max(all_benchmarks[i].request_concurrency, 1))
+ best_efficiency_idx = max(
+     range(len(all_benchmarks)),
+     key=lambda i: (
+         all_benchmarks[i].tokens_per_second / all_benchmarks[i].request_concurrency
+         if all_benchmarks[i].request_concurrency > 0
+         else float("-inf")
+     ),
+ )
@@
- f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / max(best_efficiency.request_concurrency, 1)):.0f} tok/s per concurrency unit)",
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} ({(best_efficiency.tokens_per_second / best_efficiency.request_concurrency):.0f} tok/s per concurrency unit)" if best_efficiency.request_concurrency > 0 else
+ f"⚡ Most efficient: {best_efficiency.strategy} in {benchmark_configs[best_efficiency_idx]} (undefined for zero concurrency)",
```

Also applies to: 399-401
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py` around lines 373 - 375, The efficiency ranking currently clamps all_benchmarks[i].request_concurrency to 1 which incorrectly treats valid fractional mean concurrencies as >=1; update the computation of best_efficiency_idx to avoid clamping to 1 and instead only guard against zero (e.g., divide by max(all_benchmarks[i].request_concurrency, tiny_eps) where tiny_eps is a very small positive value) so that fractional concurrencies are preserved; change the same pattern where used (the other occurrence around best_tokens_idx/best_efficiency_idx).
projects/llm-d/testing/test_llmd.py (1)
894-895: ⚠️ Potential issue | 🔴 Critical
Use `rate_values` when deciding whether to suffix artifacts.
With the default `tests.llmd.benchmarks.guidellm.rate: 1`, `len(rate)` raises `TypeError` before the first benchmark starts. This branch needs to check the normalized list, not the raw config value.
🐛 Suggested fix

```diff
- suffix = f"_rate{rate_value}" if len(rate) > 1\
+ suffix = f"_rate{rate_value}" if len(rate_values) > 1\
      else None
```
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 894 - 895, The suffix logic is checking len(rate) which can be a non-iterable and raise TypeError; change the condition to inspect the normalized list (rate_values) instead of the raw config (rate) so the branch uses len(rate_values) when deciding the suffix for rate_value (i.e., set suffix = f"_rate{rate_value}" if len(rate_values) > 1 else None).
projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py (3)
333-345: ⚠️ Potential issue | 🟠 Major
Guard `guidellm_benchmarks` before iterating.
One mixed matrix entry without parsed GuideLLM results will raise `AttributeError` here and abort the stat. These new paths need the same guarded access pattern as the report layer.
🛡️ Suggested fix

```diff
-    for entry in entries:
-        if not entry.results.guidellm_benchmarks:
+    for entry in entries:
+        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+        if not benchmarks:
             continue
@@
-        for benchmark in entry.results.guidellm_benchmarks:
+        for benchmark in benchmarks:
             if benchmark.strategy == "throughput":
                 continue
```

```diff
-    for entry in entries:
-        if not entry.results.guidellm_benchmarks:
+    for entry in entries:
+        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
+        if not benchmarks:
             continue
@@
-        for benchmark in entry.results.guidellm_benchmarks:
+        for benchmark in benchmarks:
             if benchmark.strategy == "throughput":
                 continue
```

Also applies to: 471-479
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 333 - 345, The loop over entry.results.guidellm_benchmarks assumes that attribute exists and raises AttributeError for mixed entries; before iterating, guard access by checking the attribute (e.g., if not getattr(entry.results, "guidellm_benchmarks", None): continue) so only entries with parsed GuideLLM results are processed; apply the same guarded pattern to the second loop that checks benchmark.strategy (the similar block around where benchmark.strategy is inspected) and keep using entry.get_name(variables) only for entries that pass this guard.
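The getattr-based guard the comment describes can be sketched independently of the real parser models; here `SimpleNamespace` stands in for the actual entry/results objects, so the names are illustrative:

```python
from types import SimpleNamespace

def iter_guidellm_benchmarks(entries):
    """Yield benchmarks only from entries whose results actually expose them."""
    for entry in entries:
        # getattr() with a default never raises AttributeError, and the
        # falsy check also skips entries whose benchmark list parsed empty.
        benchmarks = getattr(entry.results, "guidellm_benchmarks", None)
        if not benchmarks:
            continue
        yield from benchmarks

# One entry with benchmarks, one missing the attribute, one with an empty list.
entries = [
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=["b1", "b2"])),
    SimpleNamespace(results=SimpleNamespace()),
    SimpleNamespace(results=SimpleNamespace(guidellm_benchmarks=[])),
]
print(list(iter_guidellm_benchmarks(entries)))  # ['b1', 'b2']
```

The same helper would cover both flagged loops (lines 333-345 and 471-479) without repeating the guard.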
359-369: ⚠️ Potential issue | 🟠 Major
Convert TTFT back to milliseconds before labeling it ms.
projects/llm-d/visualizations/llmd_inference/store/parsers.py (lines 213-214) stores `ttft_median` and `ttft_p95` in seconds. These displays currently label seconds as milliseconds, so TTFT is understated by 1000x.
⏱️ Suggested fix

```diff
 data.append({
     'Test Configuration': entry_name,
     'Concurrency': benchmark.request_concurrency,
     'Tokens/s': benchmark.tokens_per_second,
     'Input Tokens/s': benchmark.input_tokens_per_second,
     'Output Tokens/s': benchmark.output_tokens_per_second,
     'Strategy': benchmark.strategy,
     'Strategy Type': strategy_type,
     'Request Rate (req/s)': benchmark.request_rate,
-    'TTFT (ms)': benchmark.ttft_median,
+    'TTFT (ms)': benchmark.ttft_median * 1000,
     'Latency (ms)': benchmark.request_latency_median * 1000
 })
```

```diff
 class GuidellmTTFTAnalysis(GuidellmLatencyAnalysisBase):
     def __init__(self):
         super().__init__({
             'name': 'TTFT',
             'description': 'TTFT Latency',
             'p50_field': 'ttft_median',
             'p95_field': 'ttft_p95',
-            'unit_conversion': 1,  # Already in ms
+            'unit_conversion': 1000,  # Stored in seconds
             'note_text': 'P1 and P90 percentiles require additional data extraction from logs.',
             'insight_text': 'Lower TTFT values indicate better responsiveness. Look for configurations with consistently low and stable TTFT across concurrency levels.'
         })
```

Also applies to: 607-617
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 359 - 369, The TTFT values stored in parsers.py (ttft_median and ttft_p95 are in seconds) are being labeled as milliseconds in throughput_analysis.py; update the construction of the data dict in the function that builds rows (the block that appends entries with keys 'TTFT (ms)' and later the similar block around lines 607-617) to multiply benchmark.ttft_median and benchmark.ttft_p95 by 1000 before assigning them to the 'TTFT (ms)' and corresponding p95 fields so the displayed values are in milliseconds while keeping the label unchanged.
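The unit fix above is a single scale factor applied at display time. A minimal sketch, with an illustrative benchmark object and only the two affected columns:

```python
from types import SimpleNamespace

def build_latency_row(entry_name, benchmark):
    """Build one display row; ttft_median and request_latency_median are
    stored in seconds, so scale by 1000 to match the '(ms)' column labels."""
    return {
        "Test Configuration": entry_name,
        "TTFT (ms)": benchmark.ttft_median * 1000,
        "Latency (ms)": benchmark.request_latency_median * 1000,
    }

benchmark = SimpleNamespace(ttft_median=0.5, request_latency_median=1.5)
row = build_latency_row("cfgA", benchmark)
print(row)  # {'Test Configuration': 'cfgA', 'TTFT (ms)': 500.0, 'Latency (ms)': 1500.0}
```

Keeping the conversion next to the label makes a seconds/milliseconds mismatch hard to reintroduce.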
383-392: ⚠️ Potential issue | 🟠 Major
Don't draw a single line across different strategies.
Both plots collect multiple `benchmark.strategy` values per configuration, but each trace is built only per configuration. That stitches unrelated strategies into one synthetic curve and misstates the scaling trend. Split traces by (configuration, strategy) or fall back to markers-only.
📉 Minimal safe fix

```diff
- fig.update_traces(mode='lines+markers')
+ fig.update_traces(mode='markers')
```

```diff
-     mode='lines+markers',
+     mode='markers',
@@
-     mode='lines+markers',
+     mode='markers',
```

Also applies to: 517-546
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py` around lines 383 - 392, The scatter currently stitches different benchmark.strategy values into one continuous line because traces are only grouped by 'Test Configuration'; update the px.scatter call in throughput_analysis.py (the fig = px.scatter(...) call and subsequent fig.update_traces(...)) to either (a) create separate traces per (configuration, strategy) by adding a grouping like line_group='benchmark.strategy' and/or symbol='benchmark.strategy' (or include benchmark.strategy in the color/symbol combination) so each (Test Configuration, benchmark.strategy) pair gets its own trace, or (b) if you prefer a minimal safe change, remove the line component and use markers-only by changing fig.update_traces(mode='markers') to avoid connecting unrelated strategies into one curve.
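Independent of the plotting library, the split-by-(configuration, strategy) option amounts to keying traces on the pair instead of on configuration alone; in plotly express terms this is roughly what passing `symbol='Strategy'` or a combined grouping key to `px.scatter` achieves. A library-free sketch with row dicts mirroring the columns named in the comment:

```python
from collections import defaultdict

def traces_by_config_and_strategy(rows):
    """Key each trace on (configuration, strategy) so a plotted line never
    joins points from different strategies. Rows are illustrative dicts."""
    traces = defaultdict(list)
    for row in rows:
        key = (row["Test Configuration"], row["Strategy"])
        traces[key].append((row["Concurrency"], row["Tokens/s"]))
    # Sort each trace along x so a connecting line is drawn left to right.
    return {key: sorted(points) for key, points in traces.items()}

rows = [
    {"Test Configuration": "cfgA", "Strategy": "constant", "Concurrency": 4, "Tokens/s": 900.0},
    {"Test Configuration": "cfgA", "Strategy": "poisson",  "Concurrency": 2, "Tokens/s": 500.0},
    {"Test Configuration": "cfgA", "Strategy": "constant", "Concurrency": 2, "Tokens/s": 600.0},
]
for key, points in traces_by_config_and_strategy(rows).items():
    print(key, points)
```

Grouping by configuration alone would have placed all three points on one curve, mixing the constant and poisson strategies.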
🧹 Nitpick comments (2)
projects/llm-d/toolbox/llmd.py (1)
55-57: Avoid a shared default `guidellm_args` list.
This default is allocated once and then reused across calls. If any downstream code mutates it, later invocations inherit stale args.
♻️ Suggested fix
```diff
 def run_guidellm_benchmark(
     self,
     endpoint_url,
     name="guidellm-benchmark",
     namespace="",
     image="ghcr.io/vllm-project/guidellm",
     version="pr-590",
     timeout=900,
-    guidellm_args=[],
+    guidellm_args=None,
 ):
@@
-    return RunAnsibleRole(locals())
+    if guidellm_args is None:
+        guidellm_args = []
+
+    return RunAnsibleRole(locals())
```
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/toolbox/llmd.py` around lines 55 - 57, The default parameter guidellm_args is a shared mutable list (guidellm_args=[]) which can be mutated across invocations; change the function/class signature to use guidellm_args=None and inside the function (where image, version, timeout are used) set guidellm_args = [] if guidellm_args is None so each call gets a fresh list; update any callers that rely on the previous default behavior only if they passed nothing (no other code changes needed).
projects/llm-d/testing/test_llmd.py (1)
831-859: Replace `eval()` with a constrained expression evaluator.
These placeholders only need arithmetic on `rate`, but `eval()` still executes arbitrary Python, and the fallback returns the raw `{...}` token on errors. That turns config typos into late, opaque CLI failures.
Verify each finding against the current code and only fix it if needed. In `@projects/llm-d/testing/test_llmd.py` around lines 831 - 859, The function apply_rate_scaleup uses eval() inside evaluate_expression which allows arbitrary code execution and silently returns the raw "{...}" token on errors; replace eval() with a safe AST-based evaluator: in evaluate_expression, parse expression with ast.parse(expression, mode='eval'), walk/validate nodes to only allow Expr, BinOp, UnaryOp, Constant/Num, Name (only 'rate'), and allowed operators (+, -, *, /, //, %, **, <<, >>, |, &, ^), then evaluate the AST recursively using a small operator mapping and the context {"rate": rate}; on parse/validation/evaluation errors, log the error via logging.warning and raise a ValueError (do not return the original token) so config typos fail fast; update evaluate_expression and references in apply_rate_scaleup accordingly.
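A constrained AST-based evaluator along the lines the prompt describes might look like the sketch below. Only arithmetic over the single name `rate` is permitted; anything else (calls, attribute access, other names) raises `ValueError` instead of silently falling through:

```python
import ast
import operator

# Whitelisted operator nodes mapped to their implementations.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv, ast.Mod: operator.mod,
    ast.Pow: operator.pow,
    ast.USub: operator.neg, ast.UAdd: operator.pos,
}

def safe_eval(expression, rate):
    """Evaluate a small arithmetic expression over 'rate' without eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id == "rate":
            return rate
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"disallowed expression: {expression!r}")
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"invalid expression: {expression!r}") from exc
    return _eval(tree)

print(safe_eval("rate * 2 + 1", 4))  # 9
print(safe_eval("-rate / 2", 5))     # -2.5
```

Config typos now fail fast with a `ValueError` naming the bad expression, instead of leaking the raw token into the guidellm CLI invocation.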
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 894-895: The suffix logic is checking len(rate) which can be a
non-iterable and raise TypeError; change the condition to inspect the normalized
list (rate_values) instead of the raw config (rate) so the branch uses
len(rate_values) when deciding the suffix for rate_value (i.e., set suffix =
f"_rate{rate_value}" if len(rate_values) > 1 else None).
In `@projects/llm-d/visualizations/llmd_inference/plotting/report.py`:
- Around line 363-365: The code creates entry_name using
entry.get_name(variables) or an f-string with no placeholders; remove the no-op
f prefix to avoid Ruff F541 by changing the fallback from f"Unknown
Configuration" to a plain string "Unknown Configuration" where entry_name is
assigned inside the loop that iterates over entries (see variables,
entry.get_name, and entry_name).
- Around line 373-375: The efficiency ranking currently clamps
all_benchmarks[i].request_concurrency to 1 which incorrectly treats valid
fractional mean concurrencies as >=1; update the computation of
best_efficiency_idx to avoid clamping to 1 and instead only guard against zero
(e.g., divide by max(all_benchmarks[i].request_concurrency, tiny_eps) where
tiny_eps is a very small positive value) so that fractional concurrencies are
preserved; change the same pattern where used (the other occurrence around
best_tokens_idx/best_efficiency_idx).
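The zero-only guard described above matters precisely because GuideLLM mean concurrency can be fractional; a sketch with illustrative benchmark objects shows how clamping to 1 would distort the ranking:

```python
from types import SimpleNamespace

def most_efficient_index(benchmarks):
    """Rank runs by tokens/s per unit of mean concurrency. GuideLLM mean
    concurrency can be fractional (below 1), so it must not be clamped to 1;
    only zero is excluded, by scoring it -inf."""
    return max(
        range(len(benchmarks)),
        key=lambda i: (
            benchmarks[i].tokens_per_second / benchmarks[i].request_concurrency
            if benchmarks[i].request_concurrency > 0
            else float("-inf")
        ),
    )

benchmarks = [
    SimpleNamespace(tokens_per_second=300.0, request_concurrency=2.0),  # 150 tok/s per unit
    SimpleNamespace(tokens_per_second=100.0, request_concurrency=0.5),  # 200 tok/s per unit
    SimpleNamespace(tokens_per_second=999.0, request_concurrency=0.0),  # excluded
]
print(most_efficient_index(benchmarks))  # 1
```

With `max(request_concurrency, 1)` the same data would score run 1 as only 100 tok/s per unit and pick the zero-concurrency run 2 at 999, so neither the winner nor its score would be right.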
In
`@projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py`:
- Around line 333-345: The loop over entry.results.guidellm_benchmarks assumes
that attribute exists and raises AttributeError for mixed entries; before
iterating, guard access by checking the attribute (e.g., if not
getattr(entry.results, "guidellm_benchmarks", None): continue) so only entries
with parsed GuideLLM results are processed; apply the same guarded pattern to
the second loop that checks benchmark.strategy (the similar block around where
benchmark.strategy is inspected) and keep using entry.get_name(variables) only
for entries that pass this guard.
- Around line 359-369: The TTFT values stored in parsers.py (ttft_median and
ttft_p95 are in seconds) are being labeled as milliseconds in
throughput_analysis.py; update the construction of the data dict in the function
that builds rows (the block that appends entries with keys 'TTFT (ms)' and later
the similar block around lines 607-617) to multiply benchmark.ttft_median and
benchmark.ttft_p95 by 1000 before assigning them to the 'TTFT (ms)' and
corresponding p95 fields so the displayed values are in milliseconds while
keeping the label unchanged.
- Around line 383-392: The scatter currently stitches different
benchmark.strategy values into one continuous line because traces are only
grouped by 'Test Configuration'; update the px.scatter call in
throughput_analysis.py (the fig = px.scatter(...) call and subsequent
fig.update_traces(...)) to either (a) create separate traces per (configuration,
strategy) by adding a grouping like line_group='benchmark.strategy' and/or
symbol='benchmark.strategy' (or include benchmark.strategy in the color/symbol
combination) so each (Test Configuration, benchmark.strategy) pair gets its own
trace, or (b) if you prefer a minimal safe change, remove the line component and
use markers-only by changing fig.update_traces(mode='markers') to avoid
connecting unrelated strategies into one curve.
---
Nitpick comments:
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 831-859: The function apply_rate_scaleup uses eval() inside
evaluate_expression which allows arbitrary code execution and silently returns
the raw "{...}" token on errors; replace eval() with a safe AST-based evaluator:
in evaluate_expression, parse expression with ast.parse(expression,
mode='eval'), walk/validate nodes to only allow Expr, BinOp, UnaryOp,
Constant/Num, Name (only 'rate'), and allowed operators (+, -, *, /, //, %, **,
<<, >>, |, &, ^), then evaluate the AST recursively using a small operator
mapping and the context {"rate": rate}; on parse/validation/evaluation errors,
log the error via logging.warning and raise a ValueError (do not return the
original token) so config typos fail fast; update evaluate_expression and
references in apply_rate_scaleup accordingly.
In `@projects/llm-d/toolbox/llmd.py`:
- Around line 55-57: The default parameter guidellm_args is a shared mutable
list (guidellm_args=[]) which can be mutated across invocations; change the
function/class signature to use guidellm_args=None and inside the function
(where image, version, timeout are used) set guidellm_args = [] if guidellm_args
is None so each call gets a fresh list; update any callers that rely on the
previous default behavior only if they passed nothing (no other code changes
needed).
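The `None`-default idiom requested for `guidellm_args` can be sketched in isolation; the function body below is illustrative, not the toolbox's real implementation:

```python
def run_guidellm_benchmark(endpoint_url, name="guidellm-benchmark", guidellm_args=None):
    """Use None as the default so each call gets a fresh list instead of
    sharing one list allocated when the function was defined."""
    if guidellm_args is None:
        guidellm_args = []
    guidellm_args.append(f"--target={endpoint_url}")  # illustrative mutation
    return guidellm_args

first = run_guidellm_benchmark("http://svc-a")
second = run_guidellm_benchmark("http://svc-b")
print(first)   # ['--target=http://svc-a']
print(second)  # ['--target=http://svc-b']
```

With `guidellm_args=[]` in the signature, `second` would also contain the first call's `--target` flag, since both calls would mutate the same list object.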
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f1ae502b-4e9a-4221-9c96-bc6ff6dd221c
📒 Files selected for processing (19)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- docs/toolbox.generated/index.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/plotting/report.py
- projects/llm-d/visualizations/llmd_inference/plotting/throughput_analysis.py
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
💤 Files with no reviewable changes (8)
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/templates/multiturn_benchmark_job.yaml.j2
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/vars/main/resources.yml
- docs/toolbox.generated/index.rst
- projects/llm-d/visualizations/llmd_inference/analyze/__init__.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/defaults/main/config.yml
- projects/llm-d/visualizations/llmd_inference/plotting/error_report.py
- projects/llm-d/visualizations/llmd_inference/models/lts.py
- projects/llm-d/toolbox/llmd_run_multiturn_benchmark/tasks/main.yml
✅ Files skipped from review due to trivial changes (1)
- projects/llm-d/visualizations/llmd_inference/data/plots.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- projects/llm-d/visualizations/llmd_inference/store/lts_parser.py
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
|
🟢 Test of 'llm-d test test_ci' succeeded after 02 hours 51 minutes 54 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 18 minutes 48 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval |
|
🔴 Test of 'llm-d test test_ci' failed after 01 hours 19 minutes 09 seconds. 🔴 • Link to the test results. • No reports index generated... Test configuration: Failure indicator: Empty. (See run.log) |
|
/test jump-ci llm-d cks multi-flavor guidellm_multiturn_eval gpt-oss |
|
🟢 Test of 'llm-d test test_ci' succeeded after 00 hours 41 minutes 30 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 22 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
|
🟢 Test of 'llm-d test generate_plots_from_pr_args' succeeded after 00 hours 00 minutes 37 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration: |
Summary by CodeRabbit
New Features
Chores
Documentation