Consult your operating system or application platform documentation for specific
firewall configuration instructions.

## API Key Authentication Limitations

### Overview

The `--api-key` flag (or `VLLM_API_KEY` environment variable) provides authentication for vLLM's HTTP server, but **only for OpenAI-compatible API endpoints under the `/v1` path prefix**. Many other sensitive endpoints are exposed on the same HTTP server without any authentication enforcement.

**Important:** Do not rely exclusively on `--api-key` for securing access to vLLM. Additional security measures are required for production deployments.

### Protected Endpoints (Require API Key)

When `--api-key` is configured, the following `/v1` endpoints require Bearer token authentication:

- `/v1/models` - List available models
- `/v1/chat/completions` - Chat completions
- `/v1/completions` - Text completions
- `/v1/embeddings` - Generate embeddings
- `/v1/audio/transcriptions` - Audio transcription
- `/v1/audio/translations` - Audio translation
- `/v1/messages` - Anthropic-compatible messages API
- `/v1/responses` - Response management
- `/v1/score` - Scoring API
- `/v1/rerank` - Reranking API

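For illustration, a client calling one of these protected endpoints sends the key as a Bearer token. The sketch below is illustrative only; it assumes a server on `localhost:8000` that was started with `--api-key`, and uses placeholder values for the key and model name.

```python
import requests

API_KEY = "your-api-key"            # placeholder: the value passed via --api-key / VLLM_API_KEY
BASE_URL = "http://localhost:8000"  # assumed local deployment

# Protected /v1 endpoints expect the key as a Bearer token.
response = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "your-model-name",  # placeholder: the model the server is serving
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=30,
)
print(response.status_code)  # 200 with a valid key; 401 if the key is missing or wrong
```
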
### Unprotected Endpoints (No API Key Required)

The following endpoints **do not require authentication** even when `--api-key` is configured:

**Inference endpoints:**

- `/invocations` - SageMaker-compatible endpoint (routes to the same inference functions as `/v1` endpoints)
- `/inference/v1/generate` - Generate completions
- `/pooling` - Pooling API
- `/classify` - Classification API
- `/score` - Scoring API (non-`/v1` variant)
- `/rerank` - Reranking API (non-`/v1` variant)

**Operational control endpoints (always enabled):**

- `/pause` - Pause generation (causes denial of service)
- `/resume` - Resume generation
- `/scale_elastic_ep` - Trigger scaling operations

**Utility endpoints:**

- `/tokenize` - Tokenize text
- `/detokenize` - Detokenize tokens
- `/health` - Health check
- `/ping` - SageMaker health check
- `/version` - Version information
- `/load` - Server load metrics

**Tokenizer information endpoint (only when `--enable-tokenizer-info-endpoint` is set):**

This endpoint may expose sensitive information such as chat templates and tokenizer configuration:

- `/tokenizer_info` - Get comprehensive tokenizer information including chat templates and configuration

**Development endpoints (only when `VLLM_SERVER_DEV_MODE=1`):**

These endpoints are intended for development and debugging and should never be enabled in production:

- `/server_info` - Get detailed server configuration
- `/sleep` - Put engine to sleep (causes denial of service)
- `/wake_up` - Wake engine from sleep
- `/is_sleeping` - Check if engine is sleeping
- `/collective_rpc` - Execute arbitrary RPC methods on the engine (extremely dangerous)

**Profiler endpoints (only when `VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE` is set):**

These endpoints are only available when profiling is enabled and should only be used for local development:

- `/start_profile` - Start PyTorch profiler
- `/stop_profile` - Stop PyTorch profiler

**Note:** The `/invocations` endpoint is particularly concerning as it provides unauthenticated access to the same inference capabilities as the protected `/v1` endpoints.
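
To illustrate this, the sketch below sends the same request body to a protected `/v1` endpoint and to `/invocations`, both without credentials. It is a hypothetical example that assumes a server on `localhost:8000` started with `--api-key`, and that `/invocations` accepts an OpenAI-style chat payload (it routes to the same inference functions as the `/v1` endpoints); adjust the model name for your deployment.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment started with --api-key
payload = {
    "model": "your-model-name",  # placeholder
    "messages": [{"role": "user", "content": "Hello"}],
}

# Neither request sends an Authorization header.
protected = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=30)
unprotected = requests.post(f"{BASE_URL}/invocations", json=payload, timeout=30)

print("/v1/chat/completions:", protected.status_code)  # expected 401: key required
print("/invocations:", unprotected.status_code)        # expected 200: no key required
```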

### Security Implications

An attacker who can reach the vLLM HTTP server can:

1. **Bypass authentication** by using non-`/v1` endpoints like `/invocations`, `/inference/v1/generate`, `/pooling`, `/classify`, `/score`, or `/rerank` to run arbitrary inference without credentials
2. **Cause denial of service** by calling `/pause` or `/scale_elastic_ep` without a token
3. **Access operational controls** to manipulate server state (e.g., pausing generation)
4. **If `--enable-tokenizer-info-endpoint` is set:** Access sensitive tokenizer configuration including chat templates, which may reveal prompt engineering strategies or other implementation details
5. **If `VLLM_SERVER_DEV_MODE=1` is set:** Execute arbitrary RPC commands via `/collective_rpc`, reset caches, put the engine to sleep, and access detailed server configuration

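To see which of these paths a particular deployment actually exposes, you can probe a sample of the endpoints listed above without credentials and inspect the status codes. This is a minimal audit sketch, assuming the server is reachable at `localhost:8000`; run it only against deployments you operate.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment; audit only servers you operate

# A small sample of GET endpoints from the lists above; extend as needed.
PROBES = [
    "/health",       # utility endpoint, expected to answer without credentials
    "/version",      # utility endpoint, expected to answer without credentials
    "/v1/models",    # protected endpoint, expected to return 401 when --api-key is set
    "/server_info",  # expected 404 unless VLLM_SERVER_DEV_MODE=1 was (wrongly) enabled
]

for path in PROBES:
    status = requests.get(f"{BASE_URL}{path}", timeout=10).status_code
    print(f"{path}: {status}")

# A 401 means the API-key check applied; a 404 means the route is not mounted;
# anything else means the endpoint answered without credentials.
```
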
### Recommended Security Practices

#### 1. Minimize Exposed Endpoints

**CRITICAL:** Never set `VLLM_SERVER_DEV_MODE=1` in production environments. Development endpoints expose extremely dangerous functionality including:

- Arbitrary RPC execution via `/collective_rpc`
- Cache manipulation that can disrupt service
- Detailed server configuration disclosure

Similarly, never enable profiler endpoints (`VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE`) in production.

**Be cautious with `--enable-tokenizer-info-endpoint`:** Only enable the `/tokenizer_info` endpoint if you need to expose tokenizer configuration information. This endpoint reveals chat templates and tokenizer settings that may contain sensitive implementation details or prompt engineering strategies.

#### 2. Deploy Behind a Reverse Proxy

The most effective approach is to deploy vLLM behind a reverse proxy (such as nginx, Envoy, or a Kubernetes Gateway) that:

- Explicitly allowlists only the endpoints you want to expose to end users
- Blocks all other endpoints, including the unauthenticated inference and operational control endpoints
- Implements additional authentication, rate limiting, and logging at the proxy layer

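Whichever proxy you use, it is worth verifying the allowlist from outside. The sketch below is a hypothetical check, assuming the proxy is reachable at `https://llm.example.com` and is meant to expose only `/health` and `/v1/chat/completions`; adapt both lists to your own configuration.

```python
import requests

PROXY_URL = "https://llm.example.com"  # hypothetical proxy address; replace with your own

# Paths the proxy is intended to expose to end users.
ALLOWED = ["/health", "/v1/chat/completions"]
# Paths that must be blocked at the proxy layer.
BLOCKED = ["/invocations", "/pause", "/scale_elastic_ep", "/collective_rpc", "/tokenize"]

def reached_backend(path: str) -> bool:
    # GET is used so the probe never triggers an action; even a 405 (method not
    # allowed) from vLLM shows the request was proxied through to the backend.
    status = requests.get(f"{PROXY_URL}{path}", timeout=10).status_code
    return status not in (403, 404)

for path in BLOCKED:
    if reached_backend(path):
        print(f"WARNING: {path} is reachable through the proxy and should be blocked")

for path in ALLOWED:
    if not reached_backend(path):
        print(f"NOTE: {path} appears blocked but was expected to be exposed")
```
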
## Reporting Security Vulnerabilities

If you believe you have found a security vulnerability in vLLM, please report it following the project's security policy. For more information on how to report security issues and the project's security policy, please see the [vLLM Security Policy](https://github.com/vllm-project/vllm/blob/main/SECURITY.md).