|
| 1 | +# Hugging Face Inference Endpoints |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Models compatible with vLLM can be deployed on Hugging Face Inference Endpoints, either starting from the [Hugging Face Hub](https://huggingface.co) or directly from the [Inference Endpoints](https://endpoints.huggingface.co/) interface. This allows you to serve models in a fully managed environment with GPU acceleration, auto-scaling, and monitoring, without managing the infrastructure manually. |
| 6 | + |
| 7 | +For advanced details on vLLM integration and deployment options, see [Advanced Deployment Details](#advanced-deployment-details). |
| 8 | + |
| 9 | +## Deployment Methods |
| 10 | + |
| 11 | +- [**Method 1: Deploy from the Catalog.**](#method-1-deploy-from-the-catalog) One-click deploy models from the Hugging Face Hub with ready-made optimized configurations. |
| 12 | +- [**Method 2: Guided Deployment (Transformers Models).**](#method-2-guided-deployment-transformers-models) Instantly deploy models tagged with `transformers` from the Hub UI using the **Deploy** button. |
| 13 | +- [**Method 3: Manual Deployment (Advanced Models).**](#method-3-manual-deployment-advanced-models) For models that either use custom code with the `transformers` tag, or don’t run with standard `transformers` but are supported by vLLM. This method requires manual configuration. |
| 14 | + |
| 15 | +### Method 1: Deploy from the Catalog |
| 16 | + |
| 17 | +This is the easiest way to get started with vLLM on Hugging Face Inference Endpoints. The [Inference Endpoints catalog](https://endpoints.huggingface.co/catalog) lists models with verified deployment configurations that are optimized for performance. |
| 18 | + |
| 19 | +1. Go to the [Endpoints Catalog](https://endpoints.huggingface.co/catalog) and, in the **Inference Server** options, select `vLLM`. This displays the current list of models with optimized, preconfigured options. |
| 20 | + |
| 21 | +  |
| 22 | + |
| 23 | +1. Select the desired model and click **Create Endpoint**. |
| 24 | + |
| 25 | +  |
| 26 | + |
| 27 | +1. Once the deployment is ready, you can use the endpoint. Update the `DEPLOYMENT_URL` with the URL provided in the console, remembering to append `/v1` as required. |
| 28 | + |
| 29 | + ```python |
| 30 | + # pip install openai |
| 31 | + from openai import OpenAI |
| 32 | + import os |
| 33 | + |
| 34 | + client = OpenAI( |
| 35 | + base_url = DEPLOYMENT_URL, |
| 36 | + api_key = os.environ["HF_TOKEN"] # https://huggingface.co/settings/tokens |
| 37 | + ) |
| 38 | + |
| 39 | + chat_completion = client.chat.completions.create( |
| 40 | + model = "HuggingFaceTB/SmolLM3-3B", |
| 41 | + messages = [ |
| 42 | + { |
| 43 | + "role": "user", |
| 44 | + "content": [ |
| 45 | + { |
| 46 | + "type": "text", |
| 47 | + "text": "Give me a brief explanation of gravity in simple terms." |
| 48 | + } |
| 49 | + ] |
| 50 | + } |
| 51 | + ], |
| 52 | + stream = True |
| 53 | + ) |
| 54 | + |
| 55 | + for message in chat_completion: |
| 56 | +     # delta.content can be None on some chunks; print an empty string instead |
| 56 | +     print(message.choices[0].delta.content or "", end = "") |
| 57 | + ``` |
| 58 | + |
| 59 | +!!! note |
| 60 | + The catalog provides models optimized for vLLM, including GPU settings and inference engine configurations. You can monitor the endpoint and update the **container or its configuration** from the Inference Endpoints UI. |
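The `/v1` suffix mentioned in step 3 is easy to forget when copying the URL from the console. A minimal sketch of a helper that normalizes it is shown below. The name `with_v1_suffix` is hypothetical (it is not part of the OpenAI SDK or of Inference Endpoints), and the URL is a placeholder.

```python
def with_v1_suffix(endpoint_url: str) -> str:
    """Return endpoint_url with exactly one trailing '/v1' path segment."""
    url = endpoint_url.strip().rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url


# Placeholder URL; replace it with the one shown in the Inference Endpoints console.
DEPLOYMENT_URL = with_v1_suffix("https://example.endpoints.huggingface.cloud")
print(DEPLOYMENT_URL)
```

The resulting `DEPLOYMENT_URL` can be passed as `base_url` to the `OpenAI` client shown above.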
| 61 | + |
| 62 | +### Method 2: Guided Deployment (Transformers Models) |
| 63 | + |
| 64 | +This method applies to models with the `transformers` library tag in their metadata. It allows you to deploy a model directly from the Hub UI without manual configuration. |
| 65 | + |
| 66 | +1. Navigate to a model on [Hugging Face Hub](https://huggingface.co/models). |
| 67 | + For this example we will use the [`ibm-granite/granite-docling-258M`](https://huggingface.co/ibm-granite/granite-docling-258M) model. You can verify that the model is compatible by checking the front matter in the [README](https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/README.md), where the library is declared as `library_name: transformers`. |
| 68 | + |
| 69 | +2. Locate the **Deploy** button. The button appears for models tagged with `transformers` at the top right of the [model card](https://huggingface.co/ibm-granite/granite-docling-258M). |
| 70 | + |
| 71 | +  |
| 72 | + |
| 73 | +3. Click the **Deploy** button and select **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment. |
| 74 | + |
| 75 | +  |
| 76 | + |
| 77 | +4. Select the hardware (this example uses AWS > GPU > T4) and the container configuration. Choose `vLLM` as the container type, then finalize the deployment by clicking **Create Endpoint**. |
| 78 | + |
| 79 | +  |
| 80 | + |
| 81 | +5. Use the deployed endpoint. Update the `DEPLOYMENT_URL` with the URL provided in the console, remembering to append `/v1` as required. You can then call your endpoint programmatically or via the SDK. |
| 82 | + |
| 83 | + ```python |
| 84 | + # pip install openai |
| 85 | + from openai import OpenAI |
| 86 | + import os |
| 87 | + |
| 88 | + client = OpenAI( |
| 89 | + base_url = DEPLOYMENT_URL, |
| 90 | + api_key = os.environ["HF_TOKEN"] # https://huggingface.co/settings/tokens |
| 91 | + ) |
| 92 | + |
| 93 | + chat_completion = client.chat.completions.create( |
| 94 | + model = "ibm-granite/granite-docling-258M", |
| 95 | + messages = [ |
| 96 | + { |
| 97 | + "role": "user", |
| 98 | + "content": [ |
| 99 | + { |
| 100 | + "type": "image_url", |
| 101 | + "image_url": { |
| 102 | + "url": "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png" |
| 103 | + } |
| 104 | + }, |
| 105 | + { |
| 106 | + "type": "text", |
| 107 | + "text": "Convert this page to docling." |
| 108 | + } |
| 109 | + ] |
| 110 | + } |
| 111 | + ], |
| 112 | + stream = True |
| 113 | + ) |
| 114 | + |
| 115 | + for message in chat_completion: |
| 116 | +     # delta.content can be None on some chunks; print an empty string instead |
| 116 | +     print(message.choices[0].delta.content or "", end = "") |
| 117 | + ``` |
| 118 | + |
| 119 | +!!! note |
| 120 | + This method uses best-guess defaults. You may need to adjust the configuration to fit your specific requirements. |
| 121 | + |
| 122 | +### Method 3: Manual Deployment (Advanced Models) |
| 123 | + |
| 124 | +Some models require manual deployment because they: |
| 125 | + |
| 126 | +- Use custom code with the `transformers` tag |
| 127 | +- Don't run with standard `transformers` but are supported by `vLLM` |
| 128 | + |
| 129 | +These models cannot be deployed using the **Deploy** button on the model card. |
| 130 | + |
| 131 | +In this guide, we demonstrate manual deployment using the [rednote-hilab/dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr) model, an OCR model integrated with vLLM (see vLLM [PR](https://github.com/vllm-project/vllm/pull/24645)). |
| 132 | + |
| 133 | +1. Start a new deployment. Go to [Inference Endpoints](https://endpoints.huggingface.co/) and click `New`. |
| 134 | + |
| 135 | +  |
| 136 | + |
| 137 | +2. Search for the model in the Hub. In the dialog, switch to **Hub** and search for the desired model. |
| 138 | + |
| 139 | +  |
| 140 | + |
| 141 | +3. Choose the infrastructure. On the configuration page, select the cloud provider and hardware from the available options. |
| 142 | + For this demo, we choose AWS and an L4 GPU. Adjust according to your hardware needs. |
| 143 | + |
| 144 | +  |
| 145 | + |
| 146 | +4. Configure the container. Scroll to the **Container Configuration** and select `vLLM` as the container type. |
| 147 | + |
| 148 | +  |
| 149 | + |
| 150 | +5. Create the endpoint. Click **Create Endpoint** to deploy the model. |
| 151 | + |
| 152 | + Once the endpoint is ready, you can call it through its OpenAI-compatible API using the OpenAI SDK, cURL, or other clients. Remember to append `/v1` to the deployment URL if needed. |
| 153 | + |
| 154 | +!!! note |
| 155 | + You can adjust the **container settings** (Container URI, Container Arguments) from the Inference Endpoints UI and press **Update Endpoint**. This redeploys the endpoint with the updated container configuration. Changes to the model itself require creating a new endpoint or redeploying with a different model. For example, for this demo, you may need to update the Container URI to the nightly image (`vllm/vllm-openai:nightly`) and add the `--trust-remote-code` flag in the container arguments. |
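For clients that do not use the OpenAI SDK, the same call can be made as a plain HTTP request. The sketch below builds such a request with the Python standard library without sending it. It assumes the endpoint exposes the OpenAI-compatible `/v1/chat/completions` route and accepts the Hugging Face token as a bearer token, as the SDK examples above imply. `build_chat_request` is a hypothetical helper, the URL is a placeholder, and the served model name is assumed to match the Hub repository ID.

```python
import json
import os
import urllib.request


def build_chat_request(base_url: str, token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="https://example.endpoints.huggingface.cloud/v1",  # placeholder URL
    token=os.environ.get("HF_TOKEN", "hf_placeholder"),
    model="rednote-hilab/dots.ocr",  # assumed to match the Hub repository ID
    prompt="Describe what this model does in one sentence.",
)
print(request.get_method(), request.full_url)

# To actually send it (requires a running endpoint):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```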
| 156 | + |
| 157 | +## Advanced Deployment Details |
| 158 | + |
| 159 | +With the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM’s optimized inference without additional backend modifications. |
| 160 | + |
| 161 | +Hugging Face Inference Endpoints provides a fully managed environment for serving models via vLLM. You can deploy models without configuring servers, installing dependencies, or managing clusters. Endpoints also support deployment across multiple cloud providers (AWS, Azure, GCP) without the need for separate accounts. |
| 162 | + |
| 163 | +The platform integrates seamlessly with the Hugging Face Hub, allowing you to deploy any vLLM- or `transformers`-compatible model, track usage, and update the inference engine directly. The vLLM engine comes preconfigured, enabling optimized inference and easy switching between models or engines without modifying your code. This setup simplifies production deployment: endpoints are ready in minutes, include monitoring and logging, and let you focus on serving models rather than maintaining infrastructure. |
| 164 | + |
| 165 | +## Next Steps |
| 166 | + |
| 167 | +- Explore the [Inference Endpoints](https://endpoints.huggingface.co/catalog) model catalog |
| 168 | +- Read the Inference Endpoints [documentation](https://huggingface.co/docs/inference-endpoints/en/index) |
| 169 | +- Learn about [Inference Endpoints engines](https://huggingface.co/docs/inference-endpoints/en/engines/vllm) |
| 170 | +- Understand the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html) |