# Offline Inference Blueprint - Infra (SGLang + vLLM)

This blueprint provides a configurable framework for running **offline LLM inference benchmarks** with either the SGLang or vLLM backend. It is designed for cloud GPU environments and supports automated performance benchmarking with MLflow logging.

This blueprint enables you to:
- Run inference locally on GPU nodes using pre-loaded models
- Benchmark token throughput, latency, and request performance
- Push results to MLflow for comparison and analysis

---

## Pre-Filled Samples

| Title                          | Description                                                                       |
|--------------------------------|-----------------------------------------------------------------------------------|
| Offline inference with LLaMA 3 | Benchmarks the Meta-Llama-3.1-8B model using SGLang on VM.GPU.A10.2 with 2 GPUs.  |

You can access these pre-filled samples from the OCI AI Blueprints portal.

---

## When to Use Offline Inference

Offline inference is ideal for:
- Accurate performance benchmarking (no API or network bottlenecks)
- Comparing GPU hardware performance (A10, A100, H100, MI300X)
- Evaluating backend frameworks like vLLM and SGLang

---

## Supported Backends

| Backend | Description                                                                        |
|---------|------------------------------------------------------------------------------------|
| sglang  | Fast LLM serving framework with RadixAttention prefix caching and multimodal support |
| vllm    | High-throughput LLM inference engine with PagedAttention, continuous batching, and speculative decoding |

---

## Running the Benchmark

This blueprint runs benchmarks via a job-mode recipe driven by a YAML config file. The recipe mounts the model and config file from Object Storage, runs offline inference, and logs metrics to MLflow.

---

### Sample Recipe (Job Mode for Offline SGLang Inference)

```json
{
  "recipe_id": "offline_inference_sglang",
  "recipe_mode": "job",
  "deployment_name": "Offline Inference Benchmark",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:llm-benchmark-0409-v2",
  "recipe_node_shape": "VM.GPU.A10.2",
  "input_object_storage": [
    {
      "par": "https://objectstorage.ap-melbourne-1.oraclecloud.com/p/Z2q73uuLCAxCbGXJ99CIeTxnCTNipsE-1xHE9HYfCz0RBYPTcCbqi9KHViUEH-Wq/n/iduyx1qnmway/b/mymodels/o/",
      "mount_location": "/models",
      "volume_size_in_gbs": 500,
      "include": [
        "example_sglang.yaml",
        "NousResearch/Meta-Llama-3.1-8B"
      ]
    }
  ],
  "recipe_container_command_args": [
    "/models/example_sglang.yaml"
  ],
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 200
}
```
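
Before submitting, it can help to sanity-check the recipe JSON locally. The sketch below is a minimal validator; the required-field list and constraints are inferred from the sample above rather than from an official schema, so treat them as assumptions:

```python
# validate_recipe.py -- sanity-check a recipe JSON before submitting it
# through the OCI AI Blueprints portal. The required-field list below is
# assumed from the sample recipe above, not from a published schema.
import json
import sys

REQUIRED_FIELDS = [
    "recipe_id",
    "recipe_mode",
    "recipe_image_uri",
    "recipe_node_shape",
    "input_object_storage",
    "recipe_container_command_args",
    "recipe_nvidia_gpu_count",
]

def validate(path: str) -> list[str]:
    """Return a list of problems found in the recipe file (empty if none)."""
    with open(path) as f:
        recipe = json.load(f)  # raises json.JSONDecodeError on malformed JSON

    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in recipe]

    # Offline GPU benchmarks should request at least one GPU.
    if recipe.get("recipe_nvidia_gpu_count", 0) < 1:
        problems.append("recipe_nvidia_gpu_count should be >= 1 for GPU inference")

    # Every Object Storage mount needs a PAR URL and a mount location.
    for i, mount in enumerate(recipe.get("input_object_storage", [])):
        for key in ("par", "mount_location"):
            if key not in mount:
                problems.append(f"input_object_storage[{i}] missing '{key}'")
    return problems

if __name__ == "__main__":
    issues = validate(sys.argv[1])
    for issue in issues:
        print("WARN:", issue)
    sys.exit(1 if issues else 0)
```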

---

## Sample Config File (`example_sglang.yaml`)

```yaml
benchmark_type: offline
offline_backend: sglang

model_path: /models/NousResearch/Meta-Llama-3.1-8B
tokenizer_path: /models/NousResearch/Meta-Llama-3.1-8B
trust_remote_code: true
conv_template: llama-2

input_len: 128
output_len: 128
num_prompts: 64
max_seq_len: 4096
max_batch_size: 8
dtype: auto
temperature: 0.7
top_p: 0.9

mlflow_uri: http://mlflow-benchmarking.corrino-oci.com:5000
experiment_name: "sglang-bench-doc-test-new"
run_name: "llama3-8b-sglang-test"
```
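
A quick pre-flight check of this config can catch mistakes (for example, a context-window overflow) before a GPU job is launched. The sketch below uses PyYAML; the consistency rules are this document's assumptions, not requirements published by the benchmark image:

```python
# check_config.py -- pre-flight check of the benchmark config file.
# Field names mirror example_sglang.yaml above; the constraints (positive
# counts, lengths fitting in max_seq_len) are this sketch's assumptions.
import os
import sys

import yaml  # pip install pyyaml

def check(path: str) -> None:
    with open(path) as f:
        cfg = yaml.safe_load(f)

    assert cfg["benchmark_type"] == "offline", "this blueprint runs offline benchmarks"
    assert cfg["offline_backend"] in ("sglang", "vllm"), "unsupported backend"

    # The model must be mounted at model_path when the job starts; this
    # check is only meaningful where the Object Storage mount is visible.
    if not os.path.isdir(cfg["model_path"]):
        print(f"WARN: model_path not found locally: {cfg['model_path']}")

    # Input plus output tokens must fit inside the configured context window.
    assert cfg["input_len"] + cfg["output_len"] <= cfg["max_seq_len"], \
        "input_len + output_len exceeds max_seq_len"
    assert cfg["num_prompts"] > 0 and cfg["max_batch_size"] > 0

    print("config looks consistent")

if __name__ == "__main__":
    check(sys.argv[1])
```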

---

## Metrics Logged

- `requests_per_second`
- `input_tokens_per_second`
- `output_tokens_per_second`
- `total_tokens_per_second`
- `elapsed_time`
- `total_input_tokens`
- `total_output_tokens`

If a dataset is provided:
- `accuracy`
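
Because every run logs to MLflow, results can be pulled back programmatically for comparison across shapes and backends. A minimal sketch using the MLflow client; the tracking URI and experiment name are taken from the sample config above, so adjust them to your deployment:

```python
# compare_runs.py -- pull benchmark metrics from MLflow and rank runs by
# total token throughput. URI and experiment name come from the sample
# config above; note total_tokens_per_second is roughly
# (total_input_tokens + total_output_tokens) / elapsed_time.
import mlflow

mlflow.set_tracking_uri("http://mlflow-benchmarking.corrino-oci.com:5000")

# search_runs returns a pandas DataFrame with one row per run; logged
# metrics appear as columns prefixed "metrics.".
runs = mlflow.search_runs(experiment_names=["sglang-bench-doc-test-new"])

cols = [
    "tags.mlflow.runName",
    "metrics.requests_per_second",
    "metrics.total_tokens_per_second",
    "metrics.elapsed_time",
]
print(runs[cols].sort_values("metrics.total_tokens_per_second", ascending=False))
```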