
Commit 8ec78b4

better readme with extra pre-filled samples for offline inference
1 parent 66ecb28 commit 8ec78b4

File tree

  • docs/sample_blueprints/offline-inference-infra

1 file changed: +165 -27 lines changed

docs/sample_blueprints/offline-inference-infra/README.md

Lines changed: 165 additions & 27 deletions
@@ -14,6 +14,7 @@ This blueprint enables you to:
| Title | Description |
|------------------------------|-----------------------------------------------------------------------------|
|Offline inference with LLaMA 3|Benchmarks Meta-Llama-3.1-8B model using SGLang on VM.GPU.A10.2 with 2 GPUs. |
+|Offline inference with LLaMA 3 - vLLM|Benchmarks Meta-Llama-3.1-8B model using vLLM on VM.GPU.A10.2 with 2 GPUs.|

You can access these pre-filled samples from the OCI AI Blueprint portal.

@@ -46,33 +47,41 @@ This blueprint supports benchmark execution via a job-mode recipe using a YAML c

```json
{
-  "recipe_id": "offline_inference_sglang",
-  "recipe_mode": "job",
-  "deployment_name": "Offline Inference Benchmark",
-  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v2",
-  "recipe_node_shape": "VM.GPU.A10.2",
-  "input_object_storage": [
-    {
-      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/Z2q73uuLCAxCbGXJ99CIeTxnCTNipsE-1xHE9HYfCz0RBYPTcCbqi9KHViUEH-Wq/n/iduyx1qnmway/b/mymodels/o/",
-      "mount_location": "/models",
-      "volume_size_in_gbs": 500,
-      "include": [
-        "example_sglang.yaml",
-        "NousResearch/Meta-Llama-3.1-8B"
-      ]
-    }
-  ],
-  "recipe_container_command_args": [
-    "/models/example_sglang.yaml"
-  ],
-  "recipe_replica_count": 1,
-  "recipe_container_port": "8000",
-  "recipe_nvidia_gpu_count": 2,
-  "recipe_node_pool_size": 1,
-  "recipe_node_boot_volume_size_in_gbs": 200,
-  "recipe_ephemeral_storage_size": 100,
-  "recipe_shared_memory_volume_size_limit_in_mb": 200
-}
+  "recipe_id": "offline_inference_sglang",
+  "recipe_mode": "job",
+  "deployment_name": "Offline Inference Benchmark",
+  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v4",
+  "recipe_node_shape": "VM.GPU.A10.2",
+  "input_object_storage": [
+    {
+      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/0T99iRADcM08aVpumM6smqMIcnIJTFtV2D8ZIIWidUP9eL8GSRyDMxOb9Va9rmRc/n/iduyx1qnmway/b/mymodels/o/",
+      "mount_location": "/models",
+      "volume_size_in_gbs": 500,
+      "include": [
+        "new_example_sglang.yaml",
+        "NousResearch/Meta-Llama-3.1-8B"
+      ]
+    }
+  ],
+  "output_object_storage": [
+    {
+      "bucket_name": "inference_output",
+      "mount_location": "/mlcommons_output",
+      "volume_size_in_gbs": 200
+    }
+  ],
+  "recipe_container_command_args": [
+    "/models/new_example_sglang.yaml"
+  ],
+  "recipe_replica_count": 1,
+  "recipe_container_port": "8000",
+  "recipe_nvidia_gpu_count": 2,
+  "recipe_node_pool_size": 1,
+  "recipe_node_boot_volume_size_in_gbs": 200,
+  "recipe_ephemeral_storage_size": 100,
+  "recipe_shared_memory_volume_size_limit_in_mb": 200
+}
+
```

---
@@ -100,6 +109,43 @@ top_p: 0.9
mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
experiment_name: "sglang-bench-doc-test-new"
run_name: "llama3-8b-sglang-test"
+
+
+save_metrics_path: /mlcommons_output/benchmark_output_llama3_sglang.json
+
+```
+
+```yaml
+benchmark_type: offline
+model: /models/NousResearch/Meta-Llama-3.1-8B
+tokenizer: /models/NousResearch/Meta-Llama-3.1-8B
+
+input_len: 12
+output_len: 12
+num_prompts: 2
+seed: 42
+tensor_parallel_size: 8
+
+# vLLM-specific
+#quantization: awq
+dtype: half
+gpu_memory_utilization: 0.99
+num_scheduler_steps: 10
+device: cuda
+enforce_eager: true
+kv_cache_dtype: auto
+enable_prefix_caching: true
+distributed_executor_backend: mp
+
+# Output
+#output_json: ./128_128.json
+
+# MLflow
+mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
+experiment_name: test-bm-suite-doc
+run_name: llama3-vllm-test
+save_metrics_path: /mlcommons_output/benchmark_output_llama3_vllm.json
+
```

---
@@ -116,3 +162,95 @@ run_name: "llama3-8b-sglang-test"

If a dataset is provided:
- `accuracy`
+
+
+### Top-level Deployment Keys
+
+| Key | Description |
+|-----|-------------|
+| `recipe_id` | Identifier of the recipe to run; here, it's an offline SGLang benchmark job. |
+| `recipe_mode` | Specifies this is a `job`, meaning it runs to completion and exits. |
+| `deployment_name` | Human-readable name for the job. |
+| `recipe_image_uri` | Docker image containing the benchmark code and dependencies. |
+| `recipe_node_shape` | Shape of the VM or GPU node to run the job (e.g., VM.GPU.A10.2). |
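
For quick reference, these five keys are the opening fields of the deployment JSON above; isolated as a fragment (values copied from that sample, not requirements), they look like this:

```json
{
  "recipe_id": "offline_inference_sglang",
  "recipe_mode": "job",
  "deployment_name": "Offline Inference Benchmark",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v4",
  "recipe_node_shape": "VM.GPU.A10.2"
}
```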
+
+### Input Object Storage
+
+| Key | Description |
+|-----|-------------|
+| `input_object_storage` | List of inputs to mount from Object Storage. |
+| `par` | Pre-Authenticated Request (PAR) link to a bucket/folder. |
+| `mount_location` | Files are mounted to this path inside the container. |
+| `volume_size_in_gbs` | Size of the mount volume. |
+| `include` | Only these files/folders from the bucket are mounted (e.g., model + config). |
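
A sketch of an `input_object_storage` entry combining these keys, following the shape of the sample deployment above; the PAR below is a placeholder, not a working link:

```json
"input_object_storage": [
  {
    "par": "https://objectstorage.<region>.oraclecloud.com/p/<par-token>/n/<namespace>/b/<bucket>/o/",
    "mount_location": "/models",
    "volume_size_in_gbs": 500,
    "include": ["new_example_sglang.yaml", "NousResearch/Meta-Llama-3.1-8B"]
  }
]
```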
+
+### Output Object Storage
+
+| Key | Description |
+|-----|-------------|
+| `output_object_storage` | Where to store outputs like benchmark logs or results. |
+| `bucket_name` | Name of the output bucket in OCI Object Storage. |
+| `mount_location` | Mount point inside container where outputs are written. |
+| `volume_size_in_gbs` | Size of this volume in GBs. |
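
A minimal `output_object_storage` entry, mirroring the sample deployment above (the bucket name and size are the sample's values, not requirements):

```json
"output_object_storage": [
  {
    "bucket_name": "inference_output",
    "mount_location": "/mlcommons_output",
    "volume_size_in_gbs": 200
  }
]
```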
+
+### Runtime & Infra Settings
+
+| Key | Description |
+|-----|-------------|
+| `recipe_container_command_args` | Path to the YAML config that defines benchmark parameters. |
+| `recipe_replica_count` | Number of job replicas to run (usually 1 for inference). |
+| `recipe_container_port` | Port (optional for offline mode; required if API is exposed). |
+| `recipe_nvidia_gpu_count` | Number of GPUs allocated to this job. |
+| `recipe_node_pool_size` | Number of nodes in the pool (1 means 1 VM). |
+| `recipe_node_boot_volume_size_in_gbs` | Disk size for OS + dependencies. |
+| `recipe_ephemeral_storage_size` | Local scratch space in GBs. |
+| `recipe_shared_memory_volume_size_limit_in_mb` | Shared memory (used by some inference engines). |
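
These keys close out the deployment JSON; the fragment below simply restates the sample's values for the VM.GPU.A10.2 shape in one block:

```json
"recipe_container_command_args": ["/models/new_example_sglang.yaml"],
"recipe_replica_count": 1,
"recipe_container_port": "8000",
"recipe_nvidia_gpu_count": 2,
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_ephemeral_storage_size": 100,
"recipe_shared_memory_volume_size_limit_in_mb": 200
```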
+
+---
+
+## **Sample Config File (`example_sglang.yaml`)**
+
+This file is consumed by the container during execution to configure the benchmark run.
+
+### Inference Setup
+
+| Key | Description |
+|-----|-------------|
+| `benchmark_type` | Set to `offline` to indicate local execution with no HTTP server. |
+| `offline_backend` | Backend engine to use (`sglang` or `vllm`). |
+| `model_path` | Path to the model directory (already mounted via Object Storage). |
+| `tokenizer_path` | Path to the tokenizer (usually same as model path). |
+| `trust_remote_code` | Enables loading models that require custom code (Hugging Face). |
+| `conv_template` | Prompt formatting template to use (e.g., `llama-2`). |
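
As a sketch, an inference-setup block built from the keys in this table might look like the following; note the pre-filled samples above use `model`/`tokenizer` rather than `model_path`/`tokenizer_path`, and the backend and template values here are illustrative, not taken from the shipped samples:

```yaml
benchmark_type: offline
offline_backend: sglang        # or vllm
model_path: /models/NousResearch/Meta-Llama-3.1-8B
tokenizer_path: /models/NousResearch/Meta-Llama-3.1-8B
trust_remote_code: true        # illustrative; only needed for models that ship custom code
conv_template: llama-2         # example template name from the table
```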
+
+### Benchmark Parameters
+
+| Key | Description |
+|-----|-------------|
+| `input_len` | Number of tokens in the input prompt. |
+| `output_len` | Number of tokens to generate. |
+| `num_prompts` | Number of total prompts to run (e.g., 64 prompts x 128 output tokens). |
+| `max_seq_len` | Max sequence length supported by the model (e.g., 4096). |
+| `max_batch_size` | Max batch size per inference run (depends on GPU memory). |
+| `dtype` | Precision (e.g., float16, bfloat16, auto). |
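
An illustrative set of benchmark parameters; the token counts and batch size are hypothetical examples for a larger run (the pre-filled vLLM sample above uses much smaller values such as `input_len: 12`):

```yaml
input_len: 128        # tokens per prompt (illustrative)
output_len: 128       # tokens generated per prompt (illustrative)
num_prompts: 64       # total prompts in the run
max_seq_len: 4096     # supported context length, example value from the table
max_batch_size: 32    # assumed value; tune to available GPU memory
dtype: float16
```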
+
+### Sampling Settings
+
+| Key | Description |
+|-----|-------------|
+| `temperature` | Controls randomness in generation (lower = more deterministic). |
+| `top_p` | Top-p sampling for diversity (0.9 keeps most probable tokens). |
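
For example (the `top_p` value matches the SGLang sample earlier in this README; the temperature is an illustrative choice):

```yaml
temperature: 0.7   # illustrative; lower values make output more deterministic
top_p: 0.9
```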
+
+### MLflow Logging
+
+| Key | Description |
+|-----|-------------|
+| `mlflow_uri` | MLflow server to log performance metrics. |
+| `experiment_name` | Experiment name to group runs in MLflow UI. |
+| `run_name` | Custom name to identify this particular run. |
+
+### Output
+
+| Key | Description |
+|-----|-------------|
+| `save_metrics_path` | Path inside the container where metrics will be saved as JSON. |
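
Putting the MLflow and output keys together, the logging tail of a config mirrors the vLLM sample above; the path must sit under the mounted output location so results land in the output bucket:

```yaml
mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
experiment_name: test-bm-suite-doc
run_name: llama3-vllm-test
save_metrics_path: /mlcommons_output/benchmark_output_llama3_vllm.json
```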
