Commit cecdde8 (parent 28f9190): docs for offline inference

1 file changed (+120, -0): docs/sample_blueprints/offline-inference-infra

# Offline Inference Blueprint - Infra (SGLang + vLLM)

This blueprint provides a configurable framework to run **offline LLM inference benchmarks** using either the SGLang or vLLM backend. It is designed for cloud GPU environments and supports automated performance benchmarking with MLflow logging.

This blueprint enables you to:
- Run inference locally on GPU nodes using pre-loaded models
- Benchmark token throughput, latency, and request performance
- Push results to MLflow for comparison and analysis

---

## Pre-Filled Samples

| Title                          | Description                                                                       |
|--------------------------------|-----------------------------------------------------------------------------------|
| Offline inference with LLaMA 3 | Benchmarks the Meta-Llama-3.1-8B model using SGLang on VM.GPU.A10.2 with 2 GPUs.  |

You can access these pre-filled samples from the OCI AI Blueprint portal.

---

## When to Use Offline Inference

Offline inference is ideal for:
- Accurate performance benchmarking (no API or network bottlenecks)
- Comparing GPU hardware performance (A10, A100, H100, MI300X)
- Evaluating backend frameworks like vLLM and SGLang

---

## Supported Backends

| Backend | Description                                                          |
|---------|----------------------------------------------------------------------|
| sglang  | Fast multi-modal LLM backend with optimized throughput               |
| vllm    | Token streaming inference engine for LLMs with speculative decoding  |
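
Both backends expose an offline (in-process) generation API, which is what "offline inference" refers to here: no HTTP serving layer sits between the benchmark and the model. As a rough illustration only, the sketch below uses vLLM's offline `LLM` API with parameters mirroring the sample config further down; it is not the benchmark code shipped in the container image, which additionally handles timing and MLflow logging.

```python
# Illustrative sketch of an offline vLLM run with settings similar to the
# sample config below; not the blueprint's actual benchmark code.
from vllm import LLM, SamplingParams

# Model weights are pre-staged on the node; in the sample recipe they are
# mounted from Object Storage under /models.
llm = LLM(
    model="/models/NousResearch/Meta-Llama-3.1-8B",
    dtype="auto",
    tensor_parallel_size=2,  # matches the 2 GPUs on VM.GPU.A10.2
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
prompts = ["Summarize the benefits of offline inference."] * 64  # num_prompts

# No API server involved: prompts are generated and executed in-process.
outputs = llm.generate(prompts, sampling)
print(outputs[0].outputs[0].text)
```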

---

## Running the Benchmark

This blueprint supports benchmark execution via a job-mode recipe driven by a YAML config file. The recipe mounts the model and config file from Object Storage, runs offline inference, and logs the resulting metrics to MLflow.

---

### Sample Recipe (Job Mode for Offline SGLang Inference)

```json
{
  "recipe_id": "offline_inference_sglang",
  "recipe_mode": "job",
  "deployment_name": "Offline Inference Benchmark",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v2",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/Z2q73uuLCAxCbGXJ99CIeTxnCTNipsE-1xHE9HYfCz0RBYPTcCbqi9KHViUEH-Wq/n/iduyx1qnmway/b/mymodels/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 500,
      "include": [
        "example_sglang.yaml",
        "NousResearch/Meta-Llama-3.1-8B"
      ]
    }
  ],
  "recipe_container_command_args": [
    "/models/example_sglang.yaml"
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 200
}
```
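
The only container argument (`/models/example_sglang.yaml`, passed via `recipe_container_command_args`) is the path to the benchmark config mounted from Object Storage. For orientation, a hypothetical entrypoint that consumes it might look like the sketch below; this is not the code inside the `llm-benchmark` image, and the runner functions are placeholder stubs.

```python
# Hypothetical entrypoint sketch: how a benchmark container could consume the
# config path passed via recipe_container_command_args. NOT the actual code
# inside the llm-benchmark image; the runner functions are placeholder stubs.
import sys

import yaml


def run_sglang_offline(cfg: dict) -> dict:
    # Stub: a real runner would drive the SGLang offline engine here.
    raise NotImplementedError


def run_vllm_offline(cfg: dict) -> dict:
    # Stub: a real runner would drive vLLM's offline LLM API here
    # (see the sketch under "Supported Backends").
    raise NotImplementedError


def main() -> None:
    config_path = sys.argv[1]  # e.g. /models/example_sglang.yaml
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    if cfg.get("benchmark_type") != "offline":
        raise ValueError("this sketch only handles offline benchmarks")

    runners = {"sglang": run_sglang_offline, "vllm": run_vllm_offline}
    metrics = runners[cfg["offline_backend"]](cfg)
    print(metrics)  # a real runner would push these to MLflow instead


if __name__ == "__main__":
    main()
```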

---

## Sample Config File (`example_sglang.yaml`)

```yaml
benchmark_type: offline
offline_backend: sglang

model_path: /models/NousResearch/Meta-Llama-3.1-8B
tokenizer_path: /models/NousResearch/Meta-Llama-3.1-8B
trust_remote_code: true
conv_template: llama-2

input_len: 128
output_len: 128
num_prompts: 64
max_seq_len: 4096
max_batch_size: 8
dtype: auto
temperature: 0.7
top_p: 0.9

mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
experiment_name: "sglang-bench-doc-test-new"
run_name: "llama3-8b-sglang-test"
```

---

## Metrics Logged

- `requests_per_second`
- `input_tokens_per_second`
- `output_tokens_per_second`
- `total_tokens_per_second`
- `elapsed_time`
- `total_input_tokens`
- `total_output_tokens`

If a dataset is provided:
- `accuracy`
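
These metrics land in the MLflow instance configured by `mlflow_uri`, `experiment_name`, and `run_name` in the YAML config. As a rough illustration of how such values can be logged with the MLflow client (not the blueprint's actual logging code), a minimal sketch looks like this; the metric numbers are placeholders.

```python
# Illustrative sketch of logging benchmark results to MLflow, using the
# tracking settings from example_sglang.yaml. Metric values are placeholders;
# the real blueprint computes them from the benchmark run.
import mlflow

mlflow.set_tracking_uri("http://mlflow-benchmarking.corrino-oci.com:5000")
mlflow.set_experiment("sglang-bench-doc-test-new")

with mlflow.start_run(run_name="llama3-8b-sglang-test"):
    mlflow.log_metrics({
        "requests_per_second": 12.3,       # placeholder value
        "input_tokens_per_second": 1570.0,
        "output_tokens_per_second": 1570.0,
        "total_tokens_per_second": 3140.0,
        "elapsed_time": 5.2,
        "total_input_tokens": 8192,
        "total_output_tokens": 8192,
    })
```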
