Commit 66ecb28: online inference readme

1 parent 011f7fe

1 file changed: docs/sample_blueprints/online-inference-infra (+104 lines, -0)
# Online Inference Blueprint (LLMPerf)

This blueprint benchmarks **online inference performance** of large language models using **LLMPerf**, a standardized benchmarking tool. It is designed to evaluate LLM APIs served behind OpenAI-compatible interfaces, including self-hosted inference endpoints.

This blueprint helps you:

- Simulate real-time request load on a running model server
- Measure end-to-end latency, throughput, and completion performance
- Push results to MLflow for visibility and tracking

---
## Pre-Filled Samples

| Title | Description |
|-------|-------------|
| Online inference on LLaMA 3 using LLMPerf | Benchmark of meta/llama3-8b-instruct via a local OpenAI-compatible endpoint |

These can be accessed directly from the OCI AI Blueprint portal.

---
## Prerequisites

Before running this blueprint:

- You **must have an inference server already running** that is compatible with the OpenAI API format.
- Ensure the endpoint and model name match what is defined in the config (a quick check is sketched below).
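
A minimal sanity check, assuming the server exposes the standard OpenAI-compatible `GET /v1/models` route (as vLLM and similar servers do). The base URL and API key mirror the sample config further down; replace them with your own values:

```python
# Quick check that the inference endpoint is up and serving the expected model.
# Assumption: the server implements the OpenAI-compatible GET /v1/models route.
import json
import urllib.request

LLM_API_BASE = "http://localhost:8001/v1"  # must match llm_api_base in the config
LLM_API_KEY = "dummy-key"                  # must match llm_api_key in the config

req = urllib.request.Request(
    f"{LLM_API_BASE}/models",
    headers={"Authorization": f"Bearer {LLM_API_KEY}"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    served_models = [m["id"] for m in json.loads(resp.read())["data"]]

# The model name used in the config (e.g., meta/llama3-8b-instruct) should appear here.
print(served_models)
```

If the model name from the config is not in the returned list, the benchmark requests will typically fail, so fix the mismatch before launching the blueprint.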
---
## Supported Scenarios

| Use Case | Description |
|-----------------------|----------------------------------------------------------|
| Local LLM APIs | Benchmark your own self-hosted models (e.g., vLLM) |
| Remote OpenAI API | Benchmark OpenAI deployments for throughput analysis |
| Multi-model endpoints | Test latency/throughput across different configurations |

---
### Sample Recipe (Job Mode for Online Benchmarking)

```json
{
  "recipe_id": "online_inference_benchmark",
  "recipe_mode": "job",
  "deployment_name": "Online Inference Benchmark",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v2",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/Z2q73uuLCAxCbGXJ99CIeTxnCTNipsE-1xHE9HYfCz0RBYPTcCbqi9KHViUEH-Wq/n/iduyx1qnmway/b/mymodels/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 100,
      "include": [
        "example_online.yaml"
      ]
    }
  ],
  "recipe_container_command_args": [
    "/models/example_online.yaml"
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100
}
```

---
## Sample Config File (`example_online.yaml`)

```yaml
benchmark_type: online

model: meta/llama3-8b-instruct
input_len: 64
output_len: 32
max_requests: 5
timeout: 300
num_concurrent: 1
results_dir: /workspace/results_on
llm_api: openai
llm_api_key: dummy-key
llm_api_base: http://localhost:8001/v1

experiment_name: local-bench
run_name: llama3-test
mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
llmperf_path: /opt/llmperf-src
metadata: test=localhost
```
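
The container image drives LLMPerf from these values. As a rough illustration only, here is a hypothetical mapping onto the `token_benchmark_ray.py` flags documented in the upstream llmperf README; the script path under `llmperf_path` and the exact flag names are assumptions, and the blueprint image may wrap the invocation differently:

```python
# Hypothetical sketch of how the YAML fields above could translate into an
# LLMPerf token_benchmark_ray.py invocation. Flag names follow the upstream
# llmperf README; verify against the version baked into the image.
import os
import subprocess

cfg = {
    "model": "meta/llama3-8b-instruct",
    "input_len": 64,
    "output_len": 32,
    "max_requests": 5,
    "timeout": 300,
    "num_concurrent": 1,
    "results_dir": "/workspace/results_on",
    "llm_api": "openai",
}

# llm_api_key / llm_api_base are passed to LLMPerf via environment variables.
env = dict(os.environ, OPENAI_API_KEY="dummy-key", OPENAI_API_BASE="http://localhost:8001/v1")

subprocess.run(
    [
        "python", "/opt/llmperf-src/token_benchmark_ray.py",  # llmperf_path + assumed script name
        "--model", cfg["model"],
        "--mean-input-tokens", str(cfg["input_len"]),
        "--mean-output-tokens", str(cfg["output_len"]),
        "--max-num-completed-requests", str(cfg["max_requests"]),
        "--timeout", str(cfg["timeout"]),
        "--num-concurrent-requests", str(cfg["num_concurrent"]),
        "--results-dir", cfg["results_dir"],
        "--llm-api", cfg["llm_api"],
    ],
    env=env,
    check=True,
)
```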

---

## Metrics Logged

- `output_tokens_per_second`
- `requests_per_minute`
- `overall_output_throughput`
- All raw metrics from the `_summary.json` output of LLMPerf
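
A minimal sketch of pulling these metrics back out of MLflow after a run, assuming the tracking URI, experiment name, and run name from the sample config above:

```python
# Retrieve the benchmark metrics logged to MLflow for the experiment above.
# Assumes the MLflow server is reachable and the experiment exists.
import mlflow

mlflow.set_tracking_uri("http://mlflow-benchmarking.corrino-oci.com:5000")

experiment = mlflow.get_experiment_by_name("local-bench")  # experiment_name from the config
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Metrics appear as "metrics.<name>" columns in the returned DataFrame;
# keep whichever of the logged metrics are present for this run.
metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["tags.mlflow.runName"] + metric_cols])
```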
---
