Commit 099aaee

Add Hugging Face Inference Endpoints guide to Deployment docs (vllm-project#25886)
Signed-off-by: sergiopaniego
Signed-off-by: Harry Mellor
Co-authored-by: Harry Mellor
1 parent 35fe398 commit 099aaee

10 files changed: 170 additions, 0 deletions
# Hugging Face Inference Endpoints

## Overview

Models compatible with vLLM can be deployed on Hugging Face Inference Endpoints, starting either from the [Hugging Face Hub](https://huggingface.co) or directly from the [Inference Endpoints](https://endpoints.huggingface.co/) interface. This lets you serve models in a fully managed environment with GPU acceleration, auto-scaling, and monitoring, without managing the infrastructure yourself.

For advanced details on vLLM integration and deployment options, see [Advanced Deployment Details](#advanced-deployment-details).

## Deployment Methods

- [**Method 1: Deploy from the Catalog.**](#method-1-deploy-from-the-catalog) Deploy models from the Hugging Face Hub in one click, using ready-made optimized configurations.
- [**Method 2: Guided Deployment (Transformers Models).**](#method-2-guided-deployment-transformers-models) Deploy models tagged with `transformers` directly from the Hub UI using the **Deploy** button.
- [**Method 3: Manual Deployment (Advanced Models).**](#method-3-manual-deployment-advanced-models) For models that either use custom code with the `transformers` tag, or don't run with standard `transformers` but are supported by vLLM. This method requires manual configuration.

### Method 1: Deploy from the Catalog

This is the easiest way to get started with vLLM on Hugging Face Inference Endpoints. The [Inference Endpoints catalog](https://endpoints.huggingface.co/catalog) lists models with verified deployment configurations that are optimized for performance.

1. Go to the [Endpoints Catalog](https://endpoints.huggingface.co/catalog) and, in the **Inference Server** options, select `vLLM`. This displays the current list of models with optimized, preconfigured options.

    ![Endpoints Catalog](../../assets/deployment/hf-inference-endpoints-catalog.png)

1. Select the desired model and click **Create Endpoint**.

    ![Create Endpoint](../../assets/deployment/hf-inference-endpoints-create-endpoint.png)

1. Once the deployment is ready, you can use the endpoint. Set `DEPLOYMENT_URL` to the URL provided in the console, remembering to append `/v1` as required.

    ```python
    # pip install openai
    import os

    from openai import OpenAI

    # Placeholder: replace with the endpoint URL from the console, ending in /v1.
    DEPLOYMENT_URL = "https://<your-endpoint-url>/v1"

    client = OpenAI(
        base_url=DEPLOYMENT_URL,
        api_key=os.environ["HF_TOKEN"],  # https://huggingface.co/settings/tokens
    )

    chat_completion = client.chat.completions.create(
        model="HuggingFaceTB/SmolLM3-3B",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Give me a brief explanation of gravity in simple terms.",
                    }
                ],
            }
        ],
        stream=True,
    )

    # Print each piece of text as it arrives; a chunk's content can be None.
    for message in chat_completion:
        print(message.choices[0].delta.content or "", end="")
    ```

!!! note
    The catalog provides models optimized for vLLM, including GPU settings and inference engine configurations. You can monitor the endpoint and update the **container or its configuration** from the Inference Endpoints UI.

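
The streaming loop in the example above prints each piece of text as it arrives. If you would rather collect the full reply as a single string, a minimal sketch along these lines may help. It assumes streamed chunks have the shape used in the example (`choices[0].delta.content`). The names `collect_stream` and `_fake_chunk` are hypothetical, and the fake chunks only stand in for a real response so that the snippet runs without an endpoint.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # content can be None or empty on some chunks
            parts.append(text)
    return "".join(parts)


def _fake_chunk(text):
    # Hypothetical stand-in that mimics the shape of a streamed chunk.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [_fake_chunk("Gravity "), _fake_chunk(None), _fake_chunk("pulls.")]
print(collect_stream(demo))  # Gravity pulls.
```

With a live endpoint, the `chat_completion` object from the example could be passed directly, as in `collect_stream(chat_completion)`.
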
### Method 2: Guided Deployment (Transformers Models)

This method applies to models with the `transformers` library tag in their metadata. It lets you deploy a model directly from the Hub UI without manual configuration.

1. Navigate to a model on the [Hugging Face Hub](https://huggingface.co/models).
    For this example, we use the [`ibm-granite/granite-docling-258M`](https://huggingface.co/ibm-granite/granite-docling-258M) model. You can verify that the model is compatible by checking the front matter in the [README](https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/README.md), where the library is tagged as `library_name: transformers`.

2. Locate the **Deploy** button. It appears at the top right of the [model card](https://huggingface.co/ibm-granite/granite-docling-258M) for models tagged with `transformers`.

    ![Locate deploy button](../../assets/deployment/hf-inference-endpoints-locate-deploy-button.png)

3. Click the **Deploy** button and select **HF Inference Endpoints**. You are taken to the Inference Endpoints interface to configure the deployment.

    ![Click deploy button](../../assets/deployment/hf-inference-endpoints-click-deploy-button.png)

4. Select the hardware (this example uses AWS > GPU > T4) and the container configuration. Choose `vLLM` as the container type, then finalize the deployment by pressing **Create Endpoint**.

    ![Select Hardware](../../assets/deployment/hf-inference-endpoints-select-hardware.png)

5. Use the deployed endpoint. Set `DEPLOYMENT_URL` to the URL provided in the console, remembering to append `/v1`. You can then call your endpoint programmatically, for example with the OpenAI SDK.

    ```python
    # pip install openai
    import os

    from openai import OpenAI

    # Placeholder: replace with the endpoint URL from the console, ending in /v1.
    DEPLOYMENT_URL = "https://<your-endpoint-url>/v1"

    client = OpenAI(
        base_url=DEPLOYMENT_URL,
        api_key=os.environ["HF_TOKEN"],  # https://huggingface.co/settings/tokens
    )

    chat_completion = client.chat.completions.create(
        model="ibm-granite/granite-docling-258M",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png"
                        },
                    },
                    {
                        "type": "text",
                        "text": "Convert this page to docling.",
                    },
                ],
            }
        ],
        stream=True,
    )

    # Print each piece of text as it arrives; a chunk's content can be None.
    for message in chat_completion:
        print(message.choices[0].delta.content or "", end="")
    ```

!!! note
    This method uses best-guess defaults. You may need to adjust the configuration to fit your specific requirements.

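
The example above passes a public image URL. For a local file, one common approach with OpenAI-style APIs is to embed the image as a base64 data URL. The sketch below assumes the deployed vLLM container accepts data URLs in the `image_url` field, which this guide does not confirm. The helper name `image_part_from_bytes` and the sample bytes are hypothetical.

```python
import base64
import mimetypes


def image_part_from_bytes(data: bytes, filename: str) -> dict:
    """Build an `image_url` content part that embeds the image as a data URL."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"{filename!r} does not look like an image file")
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }


# Placeholder bytes; in practice, read them from disk with open(path, "rb").
part = image_part_from_bytes(b"\x89PNG", "page.png")
print(part["image_url"]["url"].split(",", 1)[0])  # data:image/png;base64
```

The returned dictionary could take the place of the `image_url` entry in the `content` list of the request above.
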
### Method 3: Manual Deployment (Advanced Models)

Some models require manual deployment because they:

- Use custom code with the `transformers` tag
- Don't run with standard `transformers` but are supported by vLLM

These models cannot be deployed using the **Deploy** button on the model card.

This guide demonstrates manual deployment using the [rednote-hilab/dots.ocr](https://huggingface.co/rednote-hilab/dots.ocr) model, an OCR model integrated with vLLM (see the vLLM [PR](https://github.com/vllm-project/vllm/pull/24645)).

1. Start a new deployment. Go to [Inference Endpoints](https://endpoints.huggingface.co/) and click **New**.

    ![New Endpoint](../../assets/deployment/hf-inference-endpoints-new-endpoint.png)

2. Search for the model on the Hub. In the dialog, switch to **Hub** and search for the desired model.

    ![Select model](../../assets/deployment/hf-inference-endpoints-select-model.png)

3. Choose the infrastructure. On the configuration page, select the cloud provider and hardware from the available options.
    For this demo, we choose AWS and an L4 GPU. Adjust according to your hardware needs.

    ![Choose Infra](../../assets/deployment/hf-inference-endpoints-choose-infra.png)

4. Configure the container. Scroll to **Container Configuration** and select `vLLM` as the container type.

    ![Configure Container](../../assets/deployment/hf-inference-endpoints-configure-container.png)

5. Create the endpoint. Click **Create Endpoint** to deploy the model.

Once the endpoint is ready, you can use it with the OpenAI-compatible API (for example, through the OpenAI SDK), cURL, or other SDKs. Remember to append `/v1` to the deployment URL if needed.

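
For cases where no SDK is available, the request can also be assembled by hand. The sketch below uses only the Python standard library and assumes the endpoint serves the OpenAI-style `/v1/chat/completions` route and accepts the Hugging Face token as a bearer token. The helper names, the `example.invalid` URL, the `hf_xxx` token, and the prompt are placeholders. The snippet only builds the request and does not send it.

```python
import json
import urllib.request


def v1_base_url(endpoint_url: str) -> str:
    """Return the endpoint URL with exactly one trailing /v1 segment."""
    url = endpoint_url.rstrip("/")
    return url if url.endswith("/v1") else url + "/v1"


def build_chat_request(endpoint_url: str, token: str, model: str, prompt: str):
    """Build (but do not send) a POST request for a single-turn chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        v1_base_url(endpoint_url) + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace with your endpoint URL and token.
req = build_chat_request(
    "https://example.invalid/", "hf_xxx", "rednote-hilab/dots.ocr", "Hello"
)
print(req.full_url)  # https://example.invalid/v1/chat/completions
```

Sending the request with `urllib.request.urlopen(req)` and decoding the JSON body would complete the call. The `v1_base_url` helper is one way to avoid appending `/v1` twice when the console URL already includes it.
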
!!! note
    You can adjust the **container settings** (Container URI, Container Arguments) from the Inference Endpoints UI and press **Update Endpoint**. This redeploys the endpoint with the updated container configuration. Changes to the model itself require creating a new endpoint or redeploying with a different model. For example, for this demo, you may need to update the Container URI to the nightly image (`vllm/vllm-openai:nightly`) and add the `--trust-remote-code` flag in the container arguments.

## Advanced Deployment Details

With the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html), vLLM now offers Day 0 support for any model compatible with `transformers`. This means you can deploy such models immediately, leveraging vLLM's optimized inference without additional backend modifications.

Hugging Face Inference Endpoints provides a fully managed environment for serving models via vLLM. You can deploy models without configuring servers, installing dependencies, or managing clusters. Endpoints also support deployment across multiple cloud providers (AWS, Azure, GCP) without the need for separate accounts.

The platform integrates seamlessly with the Hugging Face Hub, allowing you to deploy any vLLM- or `transformers`-compatible model, track usage, and update the inference engine directly. The vLLM engine comes preconfigured, enabling optimized inference and easy switching between models or engines without modifying your code. This setup simplifies production deployment: endpoints are ready in minutes, include monitoring and logging, and let you focus on serving models rather than maintaining infrastructure.

## Next Steps

- Explore the [Inference Endpoints](https://endpoints.huggingface.co/catalog) model catalog
- Read the Inference Endpoints [documentation](https://huggingface.co/docs/inference-endpoints/en/index)
- Learn about [Inference Endpoints engines](https://huggingface.co/docs/inference-endpoints/en/engines/vllm)
- Understand the [transformers backend integration](https://blog.vllm.ai/2025/04/11/transformers-backend.html)
