| Title | Description |
| ------------------------------| -----------------------------------------------------------------------------|
| Offline inference with LLaMA 3| Benchmarks the Meta-Llama-3.1-8B model using SGLang on VM.GPU.A10.2 with 2 GPUs. |
| Offline inference with LLaMA 3 - vLLM| Benchmarks the Meta-Llama-3.1-8B model using vLLM on VM.GPU.A10.2 with 2 GPUs. |

You can access these pre-filled samples from the OCI AI Blueprint portal.

This blueprint supports benchmark execution via a job-mode recipe using a YAML config file. A sample deployment payload:

```json
{
  "recipe_id": "offline_inference_sglang",
  "recipe_mode": "job",
  "deployment_name": "Offline Inference Benchmark",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v4",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/0T99iRADcM08aVpumM6smqMIcnIJTFtV2D8ZIIWidUP9eL8GSRyDMxOb9Va9rmRc/n/iduyx1qnmway/b/mymodels/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 500,
      "include": [
        "new_example_sglang.yaml",
        "NousResearch/Meta-Llama-3.1-8B"
      ]
    }
  ],
  "output_object_storage": [
    {
      "bucket_name": "inference_output",
      "mount_location": "/mlcommons_output",
      "volume_size_in_gbs": 200
    }
  ],
  "recipe_container_command_args": [
    "/models/new_example_sglang.yaml"
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 200
}
```
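
Once the payload is ready, it can be submitted through your AI Blueprint instance's deployment API instead of the portal. The following is a minimal sketch only; the endpoint path and auth scheme shown here are placeholders, not the documented API:

```python
# Minimal sketch: POST the recipe JSON to an AI Blueprint deployment endpoint.
# URL, path, and token are placeholders -- consult your Blueprint instance's
# API documentation for the real values.
import json

import requests

API_URL = "https://<blueprint-api-host>/deployment"  # hypothetical endpoint
TOKEN = "<api-token>"                                # hypothetical credential

with open("offline_inference_sglang.json") as f:     # the payload shown above
    payload = json.load(f)

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```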

---
mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
experiment_name: "sglang-bench-doc-test-new"
run_name: "llama3-8b-sglang-test"

save_metrics_path: /mlcommons_output/benchmark_output_llama3_sglang.json
```

The same benchmark can be pointed at the vLLM backend with a config such as:

```yaml
benchmark_type: offline
model: /models/NousResearch/Meta-Llama-3.1-8B
tokenizer: /models/NousResearch/Meta-Llama-3.1-8B

input_len: 12
output_len: 12
num_prompts: 2
seed: 42
tensor_parallel_size: 8

# vLLM-specific
# quantization: awq
dtype: half
gpu_memory_utilization: 0.99
num_scheduler_steps: 10
device: cuda
enforce_eager: true
kv_cache_dtype: auto
enable_prefix_caching: true
distributed_executor_backend: mp

# Output
# output_json: ./128_128.json

# MLflow
mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
experiment_name: test-bm-suite-doc
run_name: llama3-vllm-test
save_metrics_path: /mlcommons_output/benchmark_output_llama3_vllm.json
```
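
For orientation, the vLLM values above map onto vLLM's offline `LLM` API roughly as follows. This is a minimal sketch, not the benchmark harness itself: the real job also builds `input_len`-token prompts, times the run, and writes metrics to `save_metrics_path`.

```python
# Minimal sketch of the offline vLLM path driven by the config above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/NousResearch/Meta-Llama-3.1-8B",
    tokenizer="/models/NousResearch/Meta-Llama-3.1-8B",
    tensor_parallel_size=2,        # match recipe_nvidia_gpu_count on VM.GPU.A10.2
    dtype="half",
    gpu_memory_utilization=0.99,
    enforce_eager=True,
    enable_prefix_caching=True,
    seed=42,
)

# output_len tokens per prompt; ignore_eos keeps generation at a fixed length
params = SamplingParams(max_tokens=12, ignore_eos=True)
outputs = llm.generate(["Hello world"] * 2, params)  # num_prompts placeholder prompts
for out in outputs:
    print(len(out.outputs[0].token_ids), "tokens generated")
```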

---

If a dataset is provided:
- `accuracy`

### Top-level Deployment Keys

| Key | Description |
|-----|-------------|
| `recipe_id` | Identifier of the recipe to run; here, it's an offline SGLang benchmark job. |
| `recipe_mode` | Specifies this is a `job`, meaning it runs to completion and exits. |
| `deployment_name` | Human-readable name for the job. |
| `recipe_image_uri` | Docker image containing the benchmark code and dependencies. |
| `recipe_node_shape` | Shape of the VM or GPU node to run the job (e.g., VM.GPU.A10.2). |

### Input Object Storage

| Key | Description |
|-----|-------------|
| `input_object_storage` | List of inputs to mount from Object Storage. |
| `par` | Pre-Authenticated Request (PAR) link to a bucket/folder. |
| `mount_location` | Files are mounted to this path inside the container. |
| `volume_size_in_gbs` | Size of the mount volume. |
| `include` | Only these files/folders from the bucket are mounted (e.g., model + config). |
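
The `par` value is a Pre-Authenticated Request URL for the bucket holding the model and config. As a sketch, one way to mint such a PAR with the OCI Python SDK; the bucket name, PAR name, and expiry window here are placeholders:

```python
# Minimal sketch: create a read-only PAR so the blueprint can mount the bucket.
from datetime import datetime, timedelta, timezone

import oci

config = oci.config.from_file()  # reads ~/.oci/config
client = oci.object_storage.ObjectStorageClient(config)
namespace = client.get_namespace().data

details = oci.object_storage.models.CreatePreauthenticatedRequestDetails(
    name="models-par",                 # placeholder PAR name
    access_type="AnyObjectRead",       # read access to all objects in the bucket
    time_expires=datetime.now(timezone.utc) + timedelta(days=7),
)
par = client.create_preauthenticated_request(namespace, "mymodels", details).data

# The usable URL is the regional endpoint plus the returned access URI.
print(f"https://objectstorage.{config['region']}.oraclecloud.com{par.access_uri}")
```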

### Output Object Storage

| Key | Description |
|-----|-------------|
| `output_object_storage` | Where to store outputs like benchmark logs or results. |
| `bucket_name` | Name of the output bucket in OCI Object Storage. |
| `mount_location` | Mount point inside the container where outputs are written. |
| `volume_size_in_gbs` | Size of this volume in GBs. |

### Runtime & Infra Settings

| Key | Description |
|-----|-------------|
| `recipe_container_command_args` | Path to the YAML config that defines benchmark parameters. |
| `recipe_replica_count` | Number of job replicas to run (usually 1 for inference). |
| `recipe_container_port` | Port (optional for offline mode; required if an API is exposed). |
| `recipe_nvidia_gpu_count` | Number of GPUs allocated to this job. |
| `recipe_node_pool_size` | Number of nodes in the pool (1 means one VM). |
| `recipe_node_boot_volume_size_in_gbs` | Disk size for the OS and dependencies. |
| `recipe_ephemeral_storage_size` | Local scratch space in GBs. |
| `recipe_shared_memory_volume_size_limit_in_mb` | Shared memory in MBs (used by some inference engines). |

---

## **Sample Config File (`new_example_sglang.yaml`)**

This file is consumed by the container during execution to configure the benchmark run.

### Inference Setup

| Key | Description |
|-----|-------------|
| `benchmark_type` | Set to `offline` to indicate local execution with no HTTP server. |
| `offline_backend` | Backend engine to use (`sglang` or `vllm`). |
| `model_path` | Path to the model directory (already mounted via Object Storage). |
| `tokenizer_path` | Path to the tokenizer (usually the same as the model path). |
| `trust_remote_code` | Enables loading models that require custom code (Hugging Face). |
| `conv_template` | Prompt formatting template to use (e.g., `llama-2`). |

### Benchmark Parameters

| Key | Description |
|-----|-------------|
| `input_len` | Number of tokens in the input prompt. |
| `output_len` | Number of tokens to generate. |
| `num_prompts` | Total number of prompts to run (e.g., 64 prompts x 128 output tokens). |
| `max_seq_len` | Max sequence length supported by the model (e.g., 4096). |
| `max_batch_size` | Max batch size per inference run (depends on GPU memory). |
| `dtype` | Precision (e.g., float16, bfloat16, auto). |

### Sampling Settings

| Key | Description |
|-----|-------------|
| `temperature` | Controls randomness in generation (lower = more deterministic). |
| `top_p` | Top-p sampling for diversity (0.9 keeps the most probable tokens). |

### MLflow Logging

| Key | Description |
|-----|-------------|
| `mlflow_uri` | MLflow server to log performance metrics. |
| `experiment_name` | Experiment name to group runs in the MLflow UI. |
| `run_name` | Custom name to identify this particular run. |
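
Runs logged this way can be retrieved programmatically. A small sketch using the standard MLflow client against the tracking server from the sample configs:

```python
# Minimal sketch: list benchmark runs logged to the MLflow tracking server.
import mlflow

mlflow.set_tracking_uri("http://mlflow-benchmarking.corrino-oci.com:5000")

# Returns a DataFrame with one row per run; logged metrics appear as
# "metrics.<name>" columns.
runs = mlflow.search_runs(experiment_names=["test-bm-suite-doc"])
print(runs[["run_id", "status", "start_time"]])
```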

### Output

| Key | Description |
|-----|-------------|
| `save_metrics_path` | Path inside the container where metrics will be saved as JSON. |
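
Because `/mlcommons_output` is mounted from the `inference_output` bucket, the metrics JSON lands in Object Storage when the job completes. A minimal sketch for inspecting it after download; the exact keys in the file depend on the benchmark's schema and are not documented here:

```python
# Minimal sketch: inspect the metrics JSON after downloading it from the
# inference_output bucket. Key names depend on the benchmark's actual schema.
import json

with open("benchmark_output_llama3_sglang.json") as f:
    metrics = json.load(f)

for key, value in metrics.items():
    print(f"{key}: {value}")
```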