# Offline Inference Blueprint - Infra (vLLM)

#### Run offline LLM inference benchmarks using vLLM with automated performance tracking and MLflow logging.

This blueprint provides a configurable framework to run **offline LLM inference benchmarks** using vLLM as the inference engine. It is designed for cloud GPU environments and supports automated performance benchmarking with MLflow logging.

This blueprint enables you to:

- Run inference locally on GPU nodes using pre-loaded models
- Benchmark token throughput, latency, and request performance
- Push results to MLflow for comparison and analysis

---

## Pre-Filled Samples

| Feature Showcase | Title | Description | Blueprint File |
| ---------------- | ----- | ----------- | -------------- |
| Spin up GPU infrastructure via a shared node pool to avoid hardware recycle times between blueprint benchmark runs | Create VM A10 Shared Node Pool | Creates a shared node pool using the selector `shared_pool` and the `VM.GPU.A10.2` shape | [shared_node_pool_a10.json](shared_node_pool_a10.json) |
| Benchmark LLM performance using the vLLM backend with offline inference for token throughput analysis | Offline inference with Llama 3 - vLLM | Benchmarks the Meta-Llama-3.1-8B model using vLLM on VM.GPU.A10.2 with 2 GPUs. | [offline-benchmark-blueprint.json](offline-benchmark-blueprint.json) |

You can access these pre-filled samples from the OCI AI Blueprint portal.

---

## When to Use Offline Inference

Offline inference is ideal for:

- Accurate performance benchmarking (no API or network bottlenecks)
- Comparing GPU hardware performance (A10, A100, H100, MI300X)
- Evaluating backend frameworks (inference engines) like vLLM

---

## Supported Backends

| Backend | Description                                                          |
| ------- | -------------------------------------------------------------------- |
| vllm    | Token-streaming inference engine for LLMs with speculative decoding  |

---

## Running the Benchmark

To run the benchmark, you need:

- Your MLflow URL (found via the GET `workspace/` endpoint, or under the `Deployments` tab if using the portal)
- A node pool with GPU hardware (deploy the shared node pool pre-filled sample [here](shared_node_pool_a10.json))
- Model checkpoints pre-downloaded and stored in Object Storage
  - Make sure to create a PAR for the Object Storage bucket where the models are saved, with list, write, and read permissions
- A configured benchmarking blueprint - make sure to update the MLflow URL (e.g., `https://mlflow.121-158-72-41.nip.io`)

This blueprint runs the benchmark in job mode: the benchmarking container spins up, runs the benchmark, then spins down once benchmarking is complete. The recipe mounts a model from Object Storage (hence the need for a PAR link), runs offline inference, and logs metrics to MLflow.
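
Before deploying, make sure the MLflow URI in the blueprint points at your own MLflow instance. As a minimal sketch, assuming environment variables are given as `key`/`value` pairs (the pre-filled sample is authoritative for the exact layout), the relevant fields look roughly like this:

```json
{
  "recipe_mode": "job",
  "recipe_container_env": [
    { "key": "mlflow_uri", "value": "https://mlflow.121-158-72-41.nip.io" }
  ]
}
```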

---

## Metrics Logged (visible in MLflow)

- `requests_per_second`
- `input_tokens_per_second`
- `output_tokens_per_second`
- `total_tokens_per_second`
- `elapsed_time`
- `total_input_tokens`
- `total_output_tokens`

### Top-level Deployment Keys

| Key                 | Description                                                                |
| ------------------- | -------------------------------------------------------------------------- |
| `recipe_id`         | Identifier of the recipe to run; here, it's an offline vLLM benchmark job. |
| `recipe_mode`       | Specifies this is a `job`, meaning it runs to completion and exits.        |
| `deployment_name`   | Human-readable name for the job.                                           |
| `recipe_image_uri`  | Docker image containing the benchmark code and dependencies.               |
| `recipe_node_shape` | Shape of the VM or GPU node to run the job (e.g., VM.GPU.A10.2).           |
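
As an illustration, the top-level fields for this job might look like the fragment below. All values are examples; in particular, the `recipe_id` and image URI are hypothetical and should be taken from the pre-filled sample rather than from this sketch.

```json
{
  "recipe_id": "offline_inference_vllm",
  "recipe_mode": "job",
  "deployment_name": "llama3-offline-benchmark-vllm",
  "recipe_image_uri": "<region>.ocir.io/<tenancy-namespace>/<benchmark-image>:<tag>",
  "recipe_node_shape": "VM.GPU.A10.2"
}
```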

### Input Object Storage

| Key                    | Description                                              |
| ---------------------- | -------------------------------------------------------- |
| `input_object_storage` | List of inputs to mount from Object Storage.             |
| `par`                  | Pre-Authenticated Request (PAR) link to a bucket/folder. |
| `mount_location`       | Path inside the container where the files are mounted.   |
| `volume_size_in_gbs`   | Size of the mount volume in GB.                          |
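
A hypothetical `input_object_storage` entry is sketched below. The PAR placeholders are illustrative, and the `mount_location` of `/models` is an assumption; whatever path you choose here is the path the `model` environment variable should point into.

```json
"input_object_storage": [
  {
    "par": "https://objectstorage.<region>.oraclecloud.com/p/<par-token>/n/<namespace>/b/<bucket>/o/",
    "mount_location": "/models",
    "volume_size_in_gbs": 500
  }
]
```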

### Runtime & Infra Settings

| Key                                            | Description                                                                                                    |
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| `recipe_container_command_args`                | Command args passed to the benchmarking container; their values come from the `recipe_container_env` entries.  |
| `recipe_container_env`                         | Container environment variables used to configure the benchmarking run (see below for specifics).              |
| `recipe_replica_count`                         | Number of job replicas to run (usually 1 for inference).                                                       |
| `recipe_container_port`                        | Port (optional for offline mode; required if an API is exposed).                                               |
| `recipe_nvidia_gpu_count`                      | Number of GPUs allocated to this job.                                                                          |
| `recipe_node_pool_size`                        | Number of nodes in the pool (1 means 1 VM).                                                                    |
| `recipe_node_boot_volume_size_in_gbs`          | Disk size for OS + dependencies.                                                                               |
| `recipe_ephemeral_storage_size`                | Local scratch space in GB.                                                                                     |
| `recipe_shared_memory_volume_size_limit_in_mb` | Shared memory in MB (used by some inference engines).                                                          |
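
Putting the runtime and infra keys together, a fragment for a single 2-GPU VM.GPU.A10.2 node might look like the sketch below. The sizes are illustrative assumptions, not tuned recommendations; check the pre-filled sample for working values.

```json
"recipe_replica_count": 1,
"recipe_nvidia_gpu_count": 2,
"recipe_node_pool_size": 1,
"recipe_node_boot_volume_size_in_gbs": 200,
"recipe_ephemeral_storage_size": 100,
"recipe_shared_memory_volume_size_limit_in_mb": 200
```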

### Recipe Container Environment Variables

These environment variables are set in the `recipe_container_env` field and supply the values for the command args passed to the benchmarking container via the `recipe_container_command_args` field of the blueprint.

| Key                    | Description                                                                                                                                            |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `backend`              | Set to `vllm`, since vLLM is the inference engine backend.                                                                                             |
| `model`                | Name of the model; this should be the path to the model directory (already mounted via Object Storage).                                               |
| `tokenizer`            | Name of the tokenizer; almost always the same as the model path.                                                                                      |
| `input-len`            | Number of tokens in the input prompt.                                                                                                                 |
| `output-len`           | Number of tokens to generate.                                                                                                                         |
| `num-prompts`          | Total number of prompts to run (e.g., 64 prompts x 128 output tokens).                                                                                |
| `tensor-parallel-size` | Number of GPUs across which model tensors are partitioned for parallel computation. This should almost always equal the number of GPUs per node.      |
| `max-model-len`        | Largest context length (prompt plus output) allowed for the given model.                                                                              |
| `dtype`                | Precision (e.g., float16, bfloat16, auto).                                                                                                            |
| `mlflow_uri`           | MLflow server that logs performance metrics. Include `https://` before the URL, but do not include the port it listens on (e.g., `:5000`).            |
| `experiment_name`      | Experiment name used to group runs in the MLflow UI.                                                                                                  |
| `run_name`             | Custom name to identify this particular run.                                                                                                          |
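
For illustration, the sketch below shows a plausible `recipe_container_env` block for the Llama 3.1 8B run described above (64 prompts, 128-token inputs and outputs, tensor parallelism across the two A10 GPUs). The specific values, and the assumption that each entry is a `key`/`value` pair referenced by `recipe_container_command_args`, should be checked against the pre-filled sample.

```json
"recipe_container_env": [
  { "key": "backend", "value": "vllm" },
  { "key": "model", "value": "/models/Meta-Llama-3.1-8B" },
  { "key": "tokenizer", "value": "/models/Meta-Llama-3.1-8B" },
  { "key": "input-len", "value": "128" },
  { "key": "output-len", "value": "128" },
  { "key": "num-prompts", "value": "64" },
  { "key": "tensor-parallel-size", "value": "2" },
  { "key": "max-model-len", "value": "4096" },
  { "key": "dtype", "value": "auto" },
  { "key": "mlflow_uri", "value": "https://mlflow.121-158-72-41.nip.io" },
  { "key": "experiment_name", "value": "offline-vllm-benchmarks" },
  { "key": "run_name", "value": "llama3-8b-a10-tp2" }
]
```

Here `tensor-parallel-size` is set to the per-node GPU count, matching `recipe_nvidia_gpu_count: 2` on the VM.GPU.A10.2 shape.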

---