Commit 51c57bd

[llm-d] Keep working (#907)
Summary by CodeRabbit:

* New Features
  * Added a GuideLLM Performance Analysis report with new throughput/latency charts and ranked key insights.
* Chores
  * CI presets and flavor matrix updated; added heterogeneous evaluation and adjusted rates/max-requests.
  * Test and toolbox workflows now run GuideLLM via CLI-style guidellm_args and no longer execute the multi-turn benchmark.
  * Visualization and parsing pipelines removed multi-turn outputs and now consume GuideLLM benchmark data.
* Documentation
  * Replaced separate benchmark params with a single guidellm_args CLI-argument list; multiturn doc entry removed.
2 parents 337a112 + 7dba32a commit 51c57bd

File tree

20 files changed: +624, -917 lines


docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst

Lines changed: 2 additions & 18 deletions

@@ -56,23 +56,7 @@ Parameters

 * default value: ``900``


-``rate``
+``guidellm_args``

-* Request rate for the benchmark
-
-* default value: ``1``
-
-
-``max_seconds``
-
-* Maximum seconds to run benchmark
-
-* default value: ``30``
-
-
-``data``
-
-* Data configuration
-
-* default value: ``prompt_tokens=256,output_tokens=128``
+* List of additional guidellm arguments (e.g., ["--rate=10", "--max-seconds=30"])
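The single guidellm_args list replaces the old per-parameter knobs; each entry is passed straight through to the guidellm CLI. A minimal sketch of how such a list flattens into the final command line, with the fixed prefix/suffix mirroring the job template elsewhere in this commit (the endpoint URL is a hypothetical placeholder):

```python
# Sketch: how a guidellm_args list maps onto the guidellm CLI invocation.
# The surrounding fixed arguments mirror guidellm_benchmark_job.yaml.j2;
# the target endpoint is made up for illustration.
guidellm_args = ["--rate=10", "--max-seconds=30", "--data=prompt_tokens=256,output_tokens=128"]

cmd = [
    "guidellm", "benchmark", "run",
    "--target=http://llm-d.example.svc:8000/v1",  # hypothetical endpoint
    *guidellm_args,
    "--outputs=json",
]
print(" ".join(cmd))
```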

docs/toolbox.generated/index.rst

Lines changed: 0 additions & 1 deletion

@@ -191,7 +191,6 @@ Toolbox Documentation

 * :doc:`deploy_gateway <Llmd.deploy_gateway>` Deploys a GatewayClass and Gateway object
 * :doc:`deploy_llm_inference_service <Llmd.deploy_llm_inference_service>` Deploys an LLM InferenceService from a YAML file
 * :doc:`run_guidellm_benchmark <Llmd.run_guidellm_benchmark>` Runs a Guidellm benchmark job against the LLM inference service
-* :doc:`run_multiturn_benchmark <Llmd.run_multiturn_benchmark>` Runs a multi-turn benchmark job against the LLM inference service

 ``local_ci``
 ************

projects/llm-d/testing/config.yaml

Lines changed: 19 additions & 14 deletions

@@ -30,24 +30,33 @@ ci_presets:
         operator: NotIn
         values:
         - gf48e48
+        - gf4334a
     prepare.preload.extra_images:
       vllm-cuda-rhel9: registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:094db84a1da5e8a575d0c9eade114fa30f4a2061064a338e3e032f3578f8082a
       llm-d-inference-scheduler: ghcr.io/opendatahub-io/rhaii-on-xks/llm-d-inference-scheduler:e6b5db0@sha256:43e8b8edc158f31535c8b23d77629f8cde111cc762a8f4ee5f2f884470566211
       guidellm: ghcr.io/vllm-project/guidellm:v0.5.4

   multi-flavor:
-    tests.llmd.flavors: [simple-tp8-x2, intelligentrouting-x2-tp8, simple, simple-x2, simple-tp8, intelligentrouting-tp8]
+    tests.llmd.flavors: [simple-tp4, simple-tp2-x4, intelligentrouting-tp2-x4]

   guidellm_light:
     tests.llmd.benchmarks.guidellm.data: prompt_tokens=256,output_tokens=128
-    tests.llmd.benchmarks.guidellm.rate: 1,10,50
+    tests.llmd.benchmarks.guidellm.rate: "1,10,50"
     tests.llmd.benchmarks.guidellm.max_seconds: 30

   guidellm_multiturn_eval:
-    tests.llmd.benchmarks.guidellm.data: prompt_tokens=8000,prompt_tokens_stdev=4500,prompt_tokens_min=50,prompt_tokens_max=30000,output_tokens=800,output_tokens_stdev=1500,output_tokens_min=20,output_tokens_max=8000
-    tests.llmd.benchmarks.guidellm.rate: 1,10,50,100,200,300
+    tests.llmd.benchmarks.guidellm.data: "prompt_tokens=128,output_tokens=128,turns=5,prefix_tokens=10000,prefix_count={2*rate}"
+    tests.llmd.benchmarks.guidellm.rate: [32, 64, 128, 256, 512] # keep as a list, multi-rate not supported by guidellm-multiturn
+    tests.llmd.benchmarks.guidellm.max_requests: "{10*rate}"
+
+  guidellm_heterogeneous_eval:
+    tests.llmd.benchmarks.guidellm.data: prompt_tokens=8000,prompt_tokens_stdev=8500,prompt_tokens_min=50,prompt_tokens_max=30000,output_tokens=800,output_tokens_stdev=1500,output_tokens_min=20,output_tokens_max=8000
+    tests.llmd.benchmarks.guidellm.rate: "1,10,50,100,200,300"
     tests.llmd.benchmarks.guidellm.max_seconds: 600

+  gpt-oss:
+    tests.llmd.inference_service.model: gpt-oss-120
+
 clusters:
   cleanup_on_exit: false

@@ -181,7 +190,7 @@ tests:

     inference_service:
       skip_deployment: false
-      name: llama-llm-d
+      name: llm-d
       yaml_file: llama-3-1-8b-instruct-fp8.yaml
       timeout: 900
       do_simple_test: true
@@ -196,7 +205,7 @@ tests:
        - "--trust-remote-code"
        - "--disable-log-requests"
        - "--max-model-len=40960"
-       - "--gpu-memory-utilization=0.9"
+       - "--gpu-memory-utilization=0.92"

     kueue:
       enabled: false
@@ -211,17 +220,13 @@ tests:
       extra_properties: {}

     benchmarks:
-      multiturn:
-        enabled: false
-        name: multiturn-benchmark
-        parallel: 9
-        timeout: 900
-
       guidellm:
         enabled: true
         name: guidellm-benchmark
-        rate-type: concurrent
-        max_seconds: 60
+        backend_type: openai_http
+        rate_type: concurrent
+        max_seconds: null
+        max_requests: null
         timeout: 900
         data: prompt_tokens=256,output_tokens=128
         rate: 1
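The presets above mix two rate shapes: a quoted comma string ("1,10,50") that is forwarded to guidellm as one argument, and a YAML list ([32, 64, ...]) that the test harness in test_llmd.py iterates over, launching one benchmark job per rate. A small sketch of that normalization, assuming the list-vs-scalar handling shown in the test_llmd.py diff of this commit:

```python
def normalize_rates(rate):
    """Mirror the rate handling in test_llmd.py: a list/tuple means one
    benchmark job per rate; anything else (an int, or a comma string that
    guidellm parses itself) means a single job."""
    return list(rate) if isinstance(rate, (list, tuple)) else [rate]

print(normalize_rates([32, 64, 128]))  # three separate jobs
print(normalize_rates("1,10,50"))      # one job; guidellm receives the comma list
print(normalize_rates(1))              # one job
```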

projects/llm-d/testing/llmisvcs/llama-3-1-8b-instruct-fp8.yaml

Lines changed: 0 additions & 1 deletion

@@ -49,7 +49,6 @@ spec:
       apiVersion: inference.networking.x-k8s.io/v1alpha1
       kind: EndpointPickerConfig
       plugins:
-      - type: single-profile-handler
       - type: queue-scorer
       - type: kv-cache-utilization-scorer
       - type: prefix-cache-scorer

projects/llm-d/testing/test_llmd.py

Lines changed: 86 additions & 49 deletions

@@ -67,9 +67,6 @@ def test_single_flavor(flavor, flavor_index, total_flavors, namespace):
         raise RuntimeError("Simple inference test failed :/")

     # Run benchmarks
-    if config.project.get_config("tests.llmd.benchmarks.multiturn.enabled"):
-        flavor_failed |= run_multiturn_benchmark(endpoint_url, llmisvc_name, namespace)
-
     if config.project.get_config("tests.llmd.benchmarks.guidellm.enabled"):
         flavor_failed |= run_guidellm_benchmark(endpoint_url, llmisvc_name, namespace)

@@ -804,74 +801,114 @@ def get_llm_inference_url(llmisvc_name, namespace, flavor):
     return endpoint_url


-def run_multiturn_benchmark(endpoint_url, llmisvc_name, namespace):
+def run_guidellm_benchmark(endpoint_url, llmisvc_name, namespace):
     """
-    Runs the multi-turn benchmark
+    Runs the Guidellm benchmark
     """

-    if not config.project.get_config("tests.llmd.benchmarks.multiturn.enabled"):
+    if not config.project.get_config("tests.llmd.benchmarks.guidellm.enabled"):
         return False

-    logging.info("Running multi-turn benchmark")
+    logging.info("Running Guidellm benchmark")

-    benchmark_name = config.project.get_config("tests.llmd.benchmarks.multiturn.name")
-    parallel = config.project.get_config("tests.llmd.benchmarks.multiturn.parallel")
-    timeout = config.project.get_config("tests.llmd.benchmarks.multiturn.timeout")
+    benchmark_name = config.project.get_config("tests.llmd.benchmarks.guidellm.name")
+    rate = config.project.get_config("tests.llmd.benchmarks.guidellm.rate")
+    backend_type = config.project.get_config("tests.llmd.benchmarks.guidellm.backend_type")
+    rate_type = config.project.get_config("tests.llmd.benchmarks.guidellm.rate_type")
+    max_seconds = config.project.get_config("tests.llmd.benchmarks.guidellm.max_seconds")
+    max_requests = config.project.get_config("tests.llmd.benchmarks.guidellm.max_requests")
+    timeout = config.project.get_config("tests.llmd.benchmarks.guidellm.timeout")
+    data = config.project.get_config("tests.llmd.benchmarks.guidellm.data")

     failed = False

-    endpoint_url = f"{endpoint_url}/v1"
+    # Handle rate as list/tuple - iterate over each rate value
+    if isinstance(rate, (list, tuple)):
+        rate_values = rate
+    else:
+        rate_values = [rate]

-    try:
-        run.run_toolbox("llmd", "run_multiturn_benchmark",
-                        endpoint_url=endpoint_url,
-                        name=benchmark_name,
-                        namespace=namespace,
-                        parallel=parallel,
-                        timeout=timeout)
+    def apply_rate_scaleup(value, rate):
+        """
+        Apply rate-based scaling to configuration values.

-        logging.info("Multi-turn benchmark completed successfully")
+        Evaluates expressions like:
+        - "{10*rate}" with rate=32 -> "320"
+        - "prefix_count={2*rate}" with rate=32 -> "prefix_count=64"
+        """
+        if not isinstance(value, str):
+            return value

-    except Exception as e:
-        logging.error(f"Multi-turn benchmark failed: {e}")
-        failed = True
+        import re

-    return failed
+        # Find all expressions in curly braces
+        pattern = r'\{([^}]+)\}'

+        def evaluate_expression(match):
+            expression = match.group(1)
+            try:
+                # Create a safe evaluation context with only 'rate' variable
+                context = {"rate": rate}
+                result = eval(expression, {"__builtins__": {}}, context)
+                return str(result)
+            except Exception as e:
+                logging.warning(f"Failed to evaluate expression '{expression}' with rate={rate}: {e}")
+                return match.group(0)  # Return original if evaluation fails
+
+        # Replace all expressions with their evaluated results
+        return re.sub(pattern, evaluate_expression, value)
+
+    for rate_value in rate_values:
+        try:
+            logging.info(f"Running Guidellm benchmark with rate: {rate_value}")

-def run_guidellm_benchmark(endpoint_url, llmisvc_name, namespace):
-    """
-    Runs the Guidellm benchmark
-    """
+            # Create unique name for each rate if multiple rates
+            current_name = benchmark_name
+            if len(rate_values) > 1:
+                current_name = f"{benchmark_name}-rate-{rate_value}"

-    if not config.project.get_config("tests.llmd.benchmarks.guidellm.enabled"):
-        return False
+            # Construct guidellm arguments list
+            guidellm_args = []

-    logging.info("Running Guidellm benchmark")
+            # Add default parameters from config
+            if backend_type:
+                guidellm_args.append(f"--backend-type={backend_type}")

-    benchmark_name = config.project.get_config("tests.llmd.benchmarks.guidellm.name")
-    rate = config.project.get_config("tests.llmd.benchmarks.guidellm.rate")
-    max_seconds = config.project.get_config("tests.llmd.benchmarks.guidellm.max_seconds")
-    timeout = config.project.get_config("tests.llmd.benchmarks.guidellm.timeout")
-    data = config.project.get_config("tests.llmd.benchmarks.guidellm.data")
+            if rate_type:
+                guidellm_args.append(f"--rate-type={rate_type}")

-    failed = False
+            # Add rate parameter
+            guidellm_args.append(f"--rate={rate_value}")

-    try:
-        run.run_toolbox("llmd", "run_guidellm_benchmark",
-                        endpoint_url=endpoint_url,
-                        name=benchmark_name,
-                        namespace=namespace,
-                        rate=rate,
-                        max_seconds=max_seconds,
-                        timeout=timeout,
-                        data=data)
+            # Add optional parameters if provided
+            if max_seconds is not None:
+                guidellm_args.append(f"--max-seconds={max_seconds}")

-        logging.info("Guidellm benchmark completed successfully")
+            if max_requests is not None:
+                guidellm_args.append(f"--max-requests={apply_rate_scaleup(max_requests, rate_value)}")

-    except Exception as e:
-        logging.error(f"Guidellm benchmark failed: {e}")
-        failed = True
+            # Add data parameter
+            if data:
+                guidellm_args.append(f"--data={apply_rate_scaleup(data, rate_value)}")
+
+            suffix = f"_rate{rate_value}" if len(rate_values) > 1 \
+                else None
+
+            run.run_toolbox(
+                "llmd", "run_guidellm_benchmark",
+                endpoint_url=endpoint_url,
+                name=current_name,
+                namespace=namespace,
+                timeout=timeout,
+                guidellm_args=guidellm_args,
+                artifact_dir_suffix=suffix,
+            )
+
+            logging.info(f"Guidellm benchmark completed successfully for rate: {rate_value}")
+
+        except Exception as e:
+            logging.error(f"Guidellm benchmark failed for rate {rate_value}: {e}")
+            failed = True

     return failed
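The rate-templating helper added above can be exercised standalone. This sketch reproduces its logic (same regex and restricted eval) outside the test harness, without the logging dependency:

```python
import re

def apply_rate_scaleup(value, rate):
    """Evaluate '{...}' expressions against the current rate, mirroring
    the helper added in test_llmd.py (restricted eval, no builtins)."""
    if not isinstance(value, str):
        return value

    def evaluate(match):
        try:
            return str(eval(match.group(1), {"__builtins__": {}}, {"rate": rate}))
        except Exception:
            return match.group(0)  # leave the expression untouched on failure

    return re.sub(r"\{([^}]+)\}", evaluate, value)

print(apply_rate_scaleup("{10*rate}", 32))              # 320
print(apply_rate_scaleup("prefix_count={2*rate}", 32))  # prefix_count=64
```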

projects/llm-d/toolbox/llmd.py

Lines changed: 3 additions & 32 deletions

@@ -46,39 +46,15 @@ def deploy_llm_inference_service(self, name, namespace, yaml_file):

         return RunAnsibleRole(locals())

-    @AnsibleRole("llmd_run_multiturn_benchmark")
-    @AnsibleMappedParams
-    def run_multiturn_benchmark(
-            self,
-            endpoint_url,
-            name="multi-turn-benchmark", namespace="",
-            image="quay.io/hayesphilip/multi-turn-benchmark", version="0.0.1",
-            timeout=900, parallel=9
-    ):
-        """
-        Runs a multi-turn benchmark job against the LLM inference service
-
-        Args:
-          endpoint_url: Endpoint URL for the LLM inference service to benchmark
-          name: Name of the benchmark job
-          namespace: Namespace to run the benchmark job in (empty string auto-detects current namespace)
-          image: Container image for the benchmark
-          version: Version tag for the benchmark image
-          timeout: Timeout in seconds to wait for job completion
-          parallel: Number of parallel connections
-        """
-
-        return RunAnsibleRole(locals())
-
     @AnsibleRole("llmd_run_guidellm_benchmark")
     @AnsibleMappedParams
     def run_guidellm_benchmark(
             self,
             endpoint_url,
             name="guidellm-benchmark", namespace="",
             image="ghcr.io/vllm-project/guidellm", version="pr-590",
-            timeout=900, rate=1, max_seconds=30,
-            data="prompt_tokens=256,output_tokens=128"
+            timeout=900,
+            guidellm_args=[],
     ):
         """
         Runs a Guidellm benchmark job against the LLM inference service
@@ -90,14 +66,9 @@ def run_guidellm_benchmark(
           image: Container image for the benchmark
           version: Version tag for the benchmark image
           timeout: Timeout in seconds to wait for job completion
-          rate: Request rate for the benchmark
-          max_seconds: Maximum seconds to run benchmark
-          data: Data configuration
+          guidellm_args: List of additional guidellm arguments (e.g., ["--rate=10", "--max-seconds=30"])
         """

-        if isinstance(rate, tuple):
-            rate = ",".join(map(str, rate))
-
         return RunAnsibleRole(locals())

     @AnsibleRole("llmd_capture_isvc_state")
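One Python caveat worth flagging in the new signature: a mutable default such as guidellm_args=[] is created once at function-definition time and shared across all calls. The toolbox method passes it through unmodified, so this is harmless here, but the conventional defensive idiom is a None default plus a copy. A simplified, hypothetical sketch (not the actual toolbox code):

```python
def run_guidellm_benchmark(endpoint_url, guidellm_args=None):
    """Hypothetical stand-in illustrating the None-default idiom for a
    list parameter like guidellm_args."""
    # Copy so that appending here can never leak state into later calls.
    guidellm_args = list(guidellm_args or [])
    guidellm_args.append("--outputs=json")
    return guidellm_args

first = run_guidellm_benchmark("http://a/v1")
second = run_guidellm_benchmark("http://b/v1")  # unaffected by the first call
```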

projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml

Lines changed: 2 additions & 8 deletions

@@ -22,14 +22,8 @@ llmd_run_guidellm_benchmark_version: pr-590
 # Timeout in seconds to wait for job completion
 llmd_run_guidellm_benchmark_timeout: 900

-# Request rate for the benchmark
-llmd_run_guidellm_benchmark_rate: 1
-
-# Maximum seconds to run benchmark
-llmd_run_guidellm_benchmark_max_seconds: 30
-
-# Data configuration
-llmd_run_guidellm_benchmark_data: prompt_tokens=256,output_tokens=128
+# List of additional guidellm arguments (e.g., ["--rate=10", "--max-seconds=30"])
+llmd_run_guidellm_benchmark_guidellm_args: []

 # Default Ansible variables
 # Default value for ansible_os_family to ensure role remains standalone

projects/llm-d/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2

Lines changed: 3 additions & 5 deletions

@@ -18,11 +18,9 @@ spec:
         - benchmark
         - run
         - --target={{ benchmark_endpoint_url }}
-        - --backend-type=openai_http
-        - --rate-type=concurrent
-        - --rate={{ llmd_run_guidellm_benchmark_rate }}
-        - --max-seconds={{ llmd_run_guidellm_benchmark_max_seconds }}
-        - --data={{ llmd_run_guidellm_benchmark_data }}
+{% for arg in llmd_run_guidellm_benchmark_guidellm_args %}
+        - {{ arg }}
+{% endfor %}
         - --outputs=json
         env:
         - name: USER # version 0.6.0-pr590 currently needs that ...
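The Jinja loop in this template emits one YAML list item per entry of guidellm_args, between the fixed --target prefix and the fixed --outputs=json suffix. A plain-Python sketch of the resulting container args (mimicking the loop rather than invoking Jinja; the endpoint is a made-up placeholder):

```python
def render_job_args(endpoint_url, guidellm_args):
    """Mimic the loop in guidellm_benchmark_job.yaml.j2: fixed prefix,
    one '- <arg>' item per CLI argument, then the fixed '--outputs=json'."""
    lines = ["- benchmark", "- run", f"- --target={endpoint_url}"]
    lines += [f"- {arg}" for arg in guidellm_args]
    lines.append("- --outputs=json")
    return "\n".join(lines)

print(render_job_args("http://llm-d.example/v1",  # hypothetical endpoint
                      ["--backend-type=openai_http", "--rate=10"]))
```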
