
Commit 7411e9c

Update to Offline Benchmarking (#102)
* Added new blueprints for offline and online inference benchmarks using vLLM and LLMPerf. Included sample JSON and YAML configuration files, along with detailed README documentation for each blueprint. Removed outdated files related to previous inference methods to streamline the documentation.
* Update README for offline inference to include link to shared node pool sample
* Update README for offline inference to specify that logged metrics will be visible in MlFlow
* Refine README for offline inference by clarifying model mounting from Object Storage and enhancing input/output object storage sections for better clarity.
1 parent 0177652 commit 7411e9c

14 files changed (+228 −300 lines)

docs/custom_blueprints/blueprint_json_schema.json

Lines changed: 1 addition & 4 deletions
@@ -439,10 +439,7 @@
       "description": "Local filesystem path to mount to container. This will be read / write path, and is local to the node the container runs on. Any written data will persist to node, and will be subject to available storage on node.",
       "items": {
         "additionalProperties": false,
-        "required": [
-          "node_directory_path",
-          "mount_location"
-        ],
+        "required": ["node_directory_path", "mount_location"],
         "properties": {
           "node_directory_path": {
             "type": "string",
Lines changed: 119 additions & 0 deletions
# Offline Inference Blueprint - Infra (vLLM)

#### Run offline LLM inference benchmarks using vLLM with automated performance tracking and MLflow logging.

This blueprint provides a configurable framework to run **offline LLM inference benchmarks** using vLLM as the inference engine. It is designed for cloud GPU environments and supports automated performance benchmarking with MLflow logging.

This blueprint enables you to:

- Run inference locally on GPU nodes using pre-loaded models
- Benchmark token throughput, latency, and request performance
- Push results to MLflow for comparison and analysis

---

## Pre-Filled Samples

| Feature Showcase | Title | Description | Blueprint File |
| ---------------- | ----- | ----------- | -------------- |
| Spin up GPU infrastructure via a shared node pool to avoid hardware recycle times between different blueprint benchmark runs | Create VM A10 Shared Node Pool | Creates a shared node pool using the selector `shared_pool` and the `VM.GPU.A10.2` shape | [shared_node_pool_a10.json](shared_node_pool_a10.json) |
| Benchmark LLM performance using the vLLM backend with offline inference for token throughput analysis | Offline inference with LLAMA 3 - vLLM | Benchmarks the Meta-Llama-3.1-8B model using vLLM on VM.GPU.A10.2 with 2 GPUs. | [offline-benchmark-blueprint.json](offline-benchmark-blueprint.json) |

You can access these pre-filled samples from the OCI AI Blueprint portal.

---

## When to Use Offline Inference

Offline inference is ideal for:

- Accurate performance benchmarking (no API or network bottlenecks)
- Comparing GPU hardware performance (A10, A100, H100, MI300X)
- Evaluating backend frameworks (inference engines) like vLLM

---

## Supported Backends

| Backend | Description |
| ------- | ----------- |
| vllm    | Token streaming inference engine for LLMs with speculative decoding |

---
## Running the Benchmark

Things needed to run the benchmark:

- Your MLflow URL (this can be found via the GET `workspace/` endpoint or under the `Deployments` tab if using the portal)
- A node pool with GPU hardware (this can be set up by deploying the shared node pool pre-filled sample [here](shared_node_pool_a10.json))
- Model checkpoints pre-downloaded and stored in Object Storage
  - Make sure to get a PAR for the Object Storage bucket where the models are saved, with listing, read, and write permissions (one way to create such a PAR is sketched below)
- A configured benchmarking blueprint - make sure to update the MLflow URL (e.g. `https://mlflow.121-158-72-41.nip.io`)

This blueprint supports benchmark execution via job mode (the benchmarking container will spin up, benchmark, then spin down once the benchmarking is complete). The recipe mounts a model from Object Storage (hence the need for a PAR link), runs offline inference, and logs metrics to MLflow.
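Below is a minimal sketch of creating such a bucket-level PAR with the OCI Python SDK (this assumes the `oci` package is installed and an `~/.oci/config` profile is set up; the bucket name `mymodels`, the PAR name, and the 30-day expiry are illustrative):

```python
from datetime import datetime, timedelta, timezone

import oci

# Assumes a standard ~/.oci/config profile; adjust as needed.
config = oci.config.from_file()
client = oci.object_storage.ObjectStorageClient(config)
namespace = client.get_namespace().data

details = oci.object_storage.models.CreatePreauthenticatedRequestDetails(
    name="offline-benchmark-models-par",
    access_type="AnyObjectReadWrite",      # read + write on all objects in the bucket
    bucket_listing_action="ListObjects",   # allows listing, as required above
    time_expires=datetime.now(timezone.utc) + timedelta(days=30),
)

par = client.create_preauthenticated_request(
    namespace_name=namespace,
    bucket_name="mymodels",                # illustrative bucket name
    create_preauthenticated_request_details=details,
).data

# The PAR URL to paste into the blueprint's input_object_storage "par" field.
print(f"https://objectstorage.{config['region']}.oraclecloud.com{par.access_uri}")
```

The printed URL is what goes into the `par` field of the blueprint's `input_object_storage` section.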
---
## Metrics Logged (will be visible in MLflow)

- `requests_per_second`
- `input_tokens_per_second`
- `output_tokens_per_second`
- `total_tokens_per_second`
- `elapsed_time`
- `total_input_tokens`
- `total_output_tokens`
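Once a run completes, these metrics can be pulled back programmatically. A minimal sketch using the MLflow Python client (the tracking URI and experiment name mirror the sample blueprint and are placeholders for your own values; it assumes the run logged the metric names above):

```python
import mlflow

# Point the client at your MLflow deployment (same value as the blueprint's mlflow-uri).
mlflow.set_tracking_uri("https://<YOUR_MLFLOW_URL>")

# Fetch all runs logged under the experiment used in the blueprint.
runs = mlflow.search_runs(experiment_names=["experiment-1"])

# Each benchmark metric is exposed as a "metrics.<name>" column of the returned DataFrame.
print(runs[[
    "tags.mlflow.runName",
    "metrics.requests_per_second",
    "metrics.total_tokens_per_second",
    "metrics.elapsed_time",
]])
```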
### Top-level Deployment Keys

| Key | Description |
| --- | ----------- |
| `recipe_id`         | Identifier of the recipe to run; here, it's an offline vLLM benchmark job. |
| `recipe_mode`       | Specifies this is a `job`, meaning it runs to completion and exits. |
| `deployment_name`   | Human-readable name for the job. |
| `recipe_image_uri`  | Docker image containing the benchmark code and dependencies. |
| `recipe_node_shape` | Shape of the VM or GPU node to run the job (e.g., VM.GPU.A10.2). |

### Input Object Storage

| Key | Description |
| --- | ----------- |
| `input_object_storage` | List of inputs to mount from Object Storage. |
| `par`                  | Pre-Authenticated Request (PAR) link to a bucket/folder. |
| `mount_location`       | Files are mounted to this path inside the container. |
| `volume_size_in_gbs`   | Size of the mount volume. |

### Runtime & Infra Settings

| Key | Description |
| --- | ----------- |
| `recipe_container_command_args`                | Command args passed to the benchmarking container; their values are resolved from the `recipe_container_env` entries. |
| `recipe_container_env`                         | Container environment variables used to configure the benchmarking run (see below for specifics). |
| `recipe_replica_count`                         | Number of job replicas to run (usually 1 for inference). |
| `recipe_container_port`                        | Port (optional for offline mode; required if an API is exposed). |
| `recipe_nvidia_gpu_count`                      | Number of GPUs allocated to this job. |
| `recipe_node_pool_size`                        | Number of nodes in the pool (1 means 1 VM). |
| `recipe_node_boot_volume_size_in_gbs`          | Disk size for OS + dependencies. |
| `recipe_ephemeral_storage_size`                | Local scratch space in GBs. |
| `recipe_shared_memory_volume_size_limit_in_mb` | Shared memory in MB (used by some inference engines). |
### Recipe Container Environment Variables

These are the environment variables set in the `recipe_container_env` field. They are used as the values for the command args passed to the benchmarking container via the `recipe_container_command_args` field of the blueprint, as in the sketch below.
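Each `$(key)` placeholder in `recipe_container_command_args` is resolved against the matching `recipe_container_env` entry. A minimal Python sketch of that substitution (illustrative only; the real resolution happens when the container is launched):

```python
import re

# Trimmed-down versions of the fields from the sample blueprint below.
recipe_container_env = [
    {"key": "backend", "value": "vllm"},
    {"key": "model", "value": "NousResearch/Meta-Llama-3.1-8B"},
    {"key": "tensor-parallel-size", "value": "2"},
]
recipe_container_command_args = [
    "--backend", "$(backend)",
    "--model", "$(model)",
    "--tensor-parallel-size", "$(tensor-parallel-size)",
]

# Replace every $(key) placeholder with its value from recipe_container_env.
env = {item["key"]: item["value"] for item in recipe_container_env}
resolved = [re.sub(r"\$\(([^)]+)\)", lambda m: env[m.group(1)], arg)
            for arg in recipe_container_command_args]
print(resolved)
# ['--backend', 'vllm', '--model', 'NousResearch/Meta-Llama-3.1-8B',
#  '--tensor-parallel-size', '2']
```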
| Key | Description |
| --- | ----------- |
| `backend`              | Set to `vllm` since we are using vLLM as the inference engine backend. |
| `model`                | Name of the model - note this should be the same as the path to the model directory (already mounted via Object Storage). |
| `tokenizer`            | Name of the tokenizer - this will almost always be the same as the model name and model path. |
| `input-len`            | Number of tokens in the input prompt. |
| `output-len`           | Number of tokens to generate per prompt. |
| `num-prompts`          | Total number of prompts to run (e.g., 64 prompts x 128 output tokens each). |
| `tensor-parallel-size` | Number of GPUs across which model tensors are partitioned for parallel computation. This should almost always be set to the number of GPUs per node. |
| `max-model-len`        | Largest context length (prompt plus output) allowed for the given model. |
| `dtype`                | Precision (e.g., float16, bfloat16, auto). |
| `mlflow-uri`           | MLflow server to log performance metrics. Make sure to include `https://` before the URL, but do not include the port it is listening on (e.g., `:5000`). |
| `experiment-name`      | Experiment name to group runs in the MLflow UI. |
| `run-name`             | Custom name to identify this particular run. |
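To make these knobs concrete, here is a rough sketch of an equivalent offline run written directly against vLLM's Python API (this is not claimed to be the benchmarking container's actual entrypoint; the model name, placeholder prompts, and parameter values simply mirror the sample blueprint below):

```python
import time

from vllm import LLM, SamplingParams

# Values mirror the sample blueprint: tensor-parallel-size=2, max-model-len=2048, dtype=float16.
llm = LLM(
    model="NousResearch/Meta-Llama-3.1-8B",  # assumed to resolve to the mounted model directory
    tensor_parallel_size=2,
    max_model_len=2048,
    dtype="float16",
)

# num-prompts=2 and output-len=12 in the sample; the prompt text here is a placeholder.
prompts = ["Summarize the benefits of offline benchmarking."] * 2
sampling = SamplingParams(max_tokens=12)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Rough throughput numbers analogous to the metrics logged to MLflow.
out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"elapsed_time: {elapsed:.2f}s, output_tokens_per_second: {out_tokens / elapsed:.1f}")
```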
---
Lines changed: 100 additions & 0 deletions
{
  "recipe_id": "offline-benchmark-vllm",
  "recipe_mode": "job",
  "deployment_name": "offline-benchmark-vllm",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-bm-offline-082025-v1",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_use_shared_node_pool": true,
  "recipe_shared_node_pool_selector": "shared_pool",
  "input_object_storage": [
    {
      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/0T99iRADcM08aVpumM6smqMIcnIJTFtV2D8ZIIWidUP9eL8GSRyDMxOb9Va9rmRc/n/iduyx1qnmway/b/mymodels/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 500,
      "include": ["NousResearch/Meta-Llama-3.1-8B"]
    }
  ],
  "recipe_container_env": [
    {
      "key": "backend",
      "value": "vllm"
    },
    {
      "key": "model",
      "value": "NousResearch/Meta-Llama-3.1-8B"
    },
    {
      "key": "tokenizer",
      "value": "NousResearch/Meta-Llama-3.1-8B"
    },
    {
      "key": "input-len",
      "value": "12"
    },
    {
      "key": "output-len",
      "value": "12"
    },
    {
      "key": "num-prompts",
      "value": "2"
    },
    {
      "key": "tensor-parallel-size",
      "value": "2"
    },
    {
      "key": "max-model-len",
      "value": "2048"
    },
    {
      "key": "dtype",
      "value": "float16"
    },
    {
      "key": "mlflow-uri",
      "value": "https://<YOUR_MLFLOW_URL>"
    },
    {
      "key": "experiment-name",
      "value": "experiment-1"
    },
    {
      "key": "run-name",
      "value": "initial-run"
    }
  ],
  "recipe_container_command_args": [
    "--backend",
    "$(backend)",
    "--model",
    "$(model)",
    "--tokenizer",
    "$(tokenizer)",
    "--input-len",
    "$(input-len)",
    "--output-len",
    "$(output-len)",
    "--num-prompts",
    "$(num-prompts)",
    "--tensor-parallel-size",
    "$(tensor-parallel-size)",
    "--max-model-len",
    "$(max-model-len)",
    "--dtype",
    "$(dtype)",
    "--mlflow-uri",
    "$(mlflow-uri)",
    "--experiment-name",
    "$(experiment-name)",
    "--run-name",
    "$(run-name)"
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 32000
}
Lines changed: 8 additions & 0 deletions
{
  "deployment_name": "VM A10 shared pool",
  "recipe_mode": "shared_node_pool",
  "shared_node_pool_size": 1,
  "shared_node_pool_shape": "VM.GPU.A10.2",
  "shared_node_pool_boot_volume_size_in_gbs": 500,
  "shared_node_pool_selector": "shared_pool"
}
File renamed without changes.
