# Online Inference Blueprint (LLMPerf)

This blueprint benchmarks **online inference performance** of large language models using **LLMPerf**, a standardized benchmarking tool. It is designed to evaluate LLM APIs served behind OpenAI-compatible interfaces, including self-hosted LLM inference endpoints.

This blueprint helps you:
- Simulate real-time request load on a running model server
- Measure end-to-end latency, throughput, and completion performance
- Push results to MLflow for visibility and tracking

---

## Pre-Filled Samples

| Title                                     | Description                                                                  |
| ----------------------------------------- | ---------------------------------------------------------------------------- |
| Online inference on LLaMA 3 using LLMPerf | Benchmark of meta/llama3-8b-instruct via a local OpenAI-compatible endpoint  |

These can be accessed directly from the OCI AI Blueprint portal.

---

## Prerequisites

Before running this blueprint:
- You **must have an inference server already running** that is compatible with the OpenAI API format.
- Ensure the endpoint URL and model name in the config match the running server (see the connectivity check sketched below).
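
A minimal way to verify both prerequisites is to send a single chat completion to the endpoint before launching the benchmark. The sketch below is a hypothetical pre-flight check, assuming the endpoint and model name from the sample config further down (`http://localhost:8001/v1`, `meta/llama3-8b-instruct`); adjust them to your deployment.

```python
# sanity_check.py -- hypothetical pre-flight check; adjust URL, model, and key to your deployment.
import json
import urllib.request

API_BASE = "http://localhost:8001/v1"      # must match llm_api_base in example_online.yaml
MODEL = "meta/llama3-8b-instruct"          # must match the model field in example_online.yaml

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}
req = urllib.request.Request(
    f"{API_BASE}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer dummy-key"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    body = json.load(resp)
    # Any well-formed completion means the endpoint is reachable and the model name is valid.
    print(body["choices"][0]["message"]["content"])
```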

---

## Supported Scenarios

| Use Case              | Description                                              |
| --------------------- | -------------------------------------------------------- |
| Local LLM APIs        | Benchmark your own self-hosted models (e.g., vLLM)       |
| Remote OpenAI API     | Benchmark OpenAI deployments for throughput analysis     |
| Multi-model endpoints | Test latency/throughput across different configurations  |

---

### Sample Recipe (Job Mode for Online Benchmarking)

```json
{
  "recipe_id": "online_inference_benchmark",
  "recipe_mode": "job",
  "deployment_name": "Online Inference Benchmark",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v2",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/Z2q73uuLCAxCbGXJ99CIeTxnCTNipsE-1xHE9HYfCz0RBYPTcCbqi9KHViUEH-Wq/n/iduyx1qnmway/b/mymodels/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 100,
      "include": [
        "example_online.yaml"
      ]
    }
  ],
  "recipe_container_command_args": [
    "/models/example_online.yaml"
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100
}
```
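
In this recipe, `recipe_container_command_args` passes the container the path of the benchmark config, which must sit under the `mount_location` of the object-storage mount and be listed in its `include` filter. The sketch below is a hypothetical pre-submission check; the helper itself and the local file name `recipe.json` are assumptions for illustration, not part of the blueprint.

```python
# check_recipe.py -- hypothetical helper; verifies that the config file passed to the
# container is actually provided by an object-storage mount defined in the recipe.
import json
from pathlib import PurePosixPath

with open("recipe.json") as f:  # assumed local copy of the recipe shown above
    recipe = json.load(f)

config_arg = PurePosixPath(recipe["recipe_container_command_args"][0])  # e.g. /models/example_online.yaml

for mount in recipe["input_object_storage"]:
    mount_root = PurePosixPath(mount["mount_location"])
    if mount_root in config_arg.parents and config_arg.name in mount.get("include", []):
        print(f"OK: {config_arg} is provided by the mount at {mount_root}")
        break
else:
    raise SystemExit(f"Config {config_arg} is not covered by any input_object_storage mount")
```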

---

## Sample Config File (`example_online.yaml`)

```yaml
benchmark_type: online

model: meta/llama3-8b-instruct
input_len: 64
output_len: 32
max_requests: 5
timeout: 300
num_concurrent: 1
results_dir: /workspace/results_on
llm_api: openai
llm_api_key: dummy-key
llm_api_base: http://localhost:8001/v1

experiment_name: local-bench
run_name: llama3-test
mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
llmperf_path: /opt/llmperf-src
metadata: test=localhost
```
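
The container uses these values to drive LLMPerf against the endpoint. The sketch below shows one way such a config could be translated into a `token_benchmark_ray.py` invocation; the wrapper is an assumption based on the public ray-project/llmperf CLI and may not match what the blueprint image actually runs.

```python
# run_llmperf.py -- hypothetical wrapper; maps the YAML fields onto LLMPerf's
# token_benchmark_ray.py (flag names follow the public ray-project/llmperf CLI).
import os
import subprocess
import yaml  # PyYAML

with open("/models/example_online.yaml") as f:
    cfg = yaml.safe_load(f)

# LLMPerf's OpenAI-compatible client reads the endpoint and key from environment variables.
os.environ["OPENAI_API_BASE"] = cfg["llm_api_base"]
os.environ["OPENAI_API_KEY"] = cfg["llm_api_key"]

cmd = [
    "python", os.path.join(cfg["llmperf_path"], "token_benchmark_ray.py"),
    "--model", cfg["model"],
    "--llm-api", cfg["llm_api"],
    "--mean-input-tokens", str(cfg["input_len"]),
    "--mean-output-tokens", str(cfg["output_len"]),
    "--max-num-completed-requests", str(cfg["max_requests"]),
    "--num-concurrent-requests", str(cfg["num_concurrent"]),
    "--timeout", str(cfg["timeout"]),
    "--results-dir", cfg["results_dir"],
    "--metadata", cfg["metadata"],
]
subprocess.run(cmd, check=True)
```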

---

## Metrics Logged

- `output_tokens_per_second`
- `requests_per_minute`
- `overall_output_throughput`
- All raw metrics from the `_summary.json` output of LLMPerf
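
For reference, the sketch below shows how metrics like these can be pushed from an LLMPerf `_summary.json` file to the MLflow server named in the config. It is an illustrative assumption about the logging step, not the blueprint's actual code; the results path, experiment name, and run name are taken from the sample config above.

```python
# log_to_mlflow.py -- illustrative sketch; pushes LLMPerf summary metrics to MLflow.
import glob
import json
import mlflow

mlflow.set_tracking_uri("http://mlflow-benchmarking.corrino-oci.com:5000")  # mlflow_uri from the config
mlflow.set_experiment("local-bench")                                        # experiment_name from the config

# LLMPerf writes a *_summary.json file into results_dir when the run completes.
summary_path = glob.glob("/workspace/results_on/*_summary.json")[0]
with open(summary_path) as f:
    summary = json.load(f)

with mlflow.start_run(run_name="llama3-test"):                              # run_name from the config
    for key, value in summary.items():
        if isinstance(value, (int, float)):
            mlflow.log_metric(key, value)        # numeric entries become MLflow metrics
        else:
            mlflow.log_param(key, str(value))    # everything else is kept as a parameter
```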

---