
Commit b08025a

[Docs] Discuss api key limitations in security guide (vllm-project#29922)
Signed-off-by: Russell Bryant <[email protected]>
1 parent d7284a2 commit b08025a

File tree

2 files changed: +114 -0 lines changed


docs/usage/security.md

Lines changed: 110 additions & 0 deletions
@@ -108,6 +108,116 @@ networks.
Consult your operating system or application platform documentation for specific
firewall configuration instructions.

## API Key Authentication Limitations

### Overview

The `--api-key` flag (or `VLLM_API_KEY` environment variable) provides authentication for vLLM's HTTP server, but **only for OpenAI-compatible API endpoints under the `/v1` path prefix**. Many other sensitive endpoints are exposed on the same HTTP server without any authentication enforcement.

**Important:** Do not rely exclusively on `--api-key` for securing access to vLLM. Additional security measures are required for production deployments.

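To make the scope of this check concrete, here is a minimal sketch using the `requests` library. The server address and the `YOUR_API_KEY` placeholder are assumptions for a locally running server started with `--api-key`:

```python
import requests

BASE = "http://localhost:8000"                  # assumed local vLLM server
KEY = {"Authorization": "Bearer YOUR_API_KEY"}  # the value passed via --api-key

# /v1 endpoints are checked by the API-key middleware ...
print(requests.get(f"{BASE}/v1/models").status_code)               # 401 without a token
print(requests.get(f"{BASE}/v1/models", headers=KEY).status_code)  # 200 with the right token

# ... but non-/v1 endpoints are not: no Authorization header is needed at all.
print(requests.get(f"{BASE}/version").status_code)                 # 200
```
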
### Protected Endpoints (Require API Key)

When `--api-key` is configured, the following `/v1` endpoints require Bearer token authentication:

- `/v1/models` - List available models
- `/v1/chat/completions` - Chat completions
- `/v1/completions` - Text completions
- `/v1/embeddings` - Generate embeddings
- `/v1/audio/transcriptions` - Audio transcription
- `/v1/audio/translations` - Audio translation
- `/v1/messages` - Anthropic-compatible messages API
- `/v1/responses` - Response management
- `/v1/score` - Scoring API
- `/v1/rerank` - Reranking API

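For reference, a minimal sketch of an authenticated call to one of these endpoints using the official `openai` Python client; the base URL, API key, and model name are placeholders to adapt to your deployment:

```python
from openai import OpenAI

# Placeholders: point base_url at your vLLM server and use your own key/model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")

# The client sends the key as "Authorization: Bearer <api_key>" on every request,
# which is what the /v1 endpoints listed above check for.
resp = client.chat.completions.create(
    model="your-served-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```
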
### Unprotected Endpoints (No API Key Required)

The following endpoints **do not require authentication** even when `--api-key` is configured:

**Inference endpoints:**

- `/invocations` - SageMaker-compatible endpoint (routes to the same inference functions as `/v1` endpoints)
- `/inference/v1/generate` - Generate completions
- `/pooling` - Pooling API
- `/classify` - Classification API
- `/score` - Scoring API (non-`/v1` variant)
- `/rerank` - Reranking API (non-`/v1` variant)

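One way to see this in your own deployment is to probe a few routes without an `Authorization` header and compare status codes. This is a rough sketch that assumes a local server started with `--api-key`; the exact non-401 codes will vary with the request body:

```python
import requests

BASE = "http://localhost:8000"  # assumed local vLLM server started with --api-key

# A 401 means the API-key middleware rejected the request; any other status
# (200, 404, 405, 422, ...) means the route was reached without credentials.
for path in ("/v1/models", "/invocations", "/score", "/rerank", "/tokenize"):
    status = requests.post(f"{BASE}{path}", json={}).status_code
    print(f"{path:12} -> {status}")
```
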
**Operational control endpoints (always enabled):**

- `/pause` - Pause generation (causes denial of service)
- `/resume` - Resume generation
- `/scale_elastic_ep` - Trigger scaling operations

**Utility endpoints:**

- `/tokenize` - Tokenize text
- `/detokenize` - Detokenize tokens
- `/health` - Health check
- `/ping` - SageMaker health check
- `/version` - Version information
- `/load` - Server load metrics

**Tokenizer information endpoint (only when `--enable-tokenizer-info-endpoint` is set):**

This endpoint may expose sensitive information such as chat templates and tokenizer configuration:

- `/tokenizer_info` - Get comprehensive tokenizer information including chat templates and configuration

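If you do enable the flag, a quick, hedged way to see what it exposes (assuming a local server and that the route accepts a plain GET):

```python
import requests

# Only meaningful when the server was started with --enable-tokenizer-info-endpoint.
info = requests.get("http://localhost:8000/tokenizer_info").json()
print(sorted(info.keys()))  # typically includes the chat template and tokenizer settings
```
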
**Development endpoints (only when `VLLM_SERVER_DEV_MODE=1`):**

These endpoints are intended for development and debugging and should never be enabled in production:

- `/server_info` - Get detailed server configuration
- `/reset_prefix_cache` - Reset prefix cache (can disrupt service)
- `/reset_mm_cache` - Reset multimodal cache (can disrupt service)
- `/sleep` - Put engine to sleep (causes denial of service)
- `/wake_up` - Wake engine from sleep
- `/is_sleeping` - Check if engine is sleeping
- `/collective_rpc` - Execute arbitrary RPC methods on the engine (extremely dangerous)

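A simple heuristic for confirming that a deployment does not expose these routes is to request one of them and expect a 404; this sketch assumes a local server:

```python
import requests

BASE = "http://localhost:8000"  # assumed local vLLM server

# With VLLM_SERVER_DEV_MODE unset, /server_info is never registered, so a 404 is
# expected; a 200 strongly suggests dev mode is enabled on this server.
status = requests.get(f"{BASE}/server_info").status_code
print("dev endpoints appear", "ENABLED" if status == 200 else "absent", f"(HTTP {status})")
```
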
**Profiler endpoints (only when `VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE` is set):**

These endpoints are only available when profiling is enabled and should only be used for local development:

- `/start_profile` - Start PyTorch profiler
- `/stop_profile` - Stop PyTorch profiler

**Note:** The `/invocations` endpoint is particularly concerning as it provides unauthenticated access to the same inference capabilities as the protected `/v1` endpoints.

### Security Implications

An attacker who can reach the vLLM HTTP server can:

1. **Bypass authentication** by using non-`/v1` endpoints like `/invocations`, `/inference/v1/generate`, `/pooling`, `/classify`, `/score`, or `/rerank` to run arbitrary inference without credentials
2. **Cause denial of service** by calling `/pause` or `/scale_elastic_ep` without a token
3. **Access operational controls** to manipulate server state (e.g., pausing generation)
4. **If `--enable-tokenizer-info-endpoint` is set:** Access sensitive tokenizer configuration including chat templates, which may reveal prompt engineering strategies or other implementation details
5. **If `VLLM_SERVER_DEV_MODE=1` is set:** Execute arbitrary RPC commands via `/collective_rpc`, reset caches, put the engine to sleep, and access detailed server configuration

### Recommended Security Practices

#### 1. Minimize Exposed Endpoints

**CRITICAL:** Never set `VLLM_SERVER_DEV_MODE=1` in production environments. Development endpoints expose extremely dangerous functionality including:

- Arbitrary RPC execution via `/collective_rpc`
- Cache manipulation that can disrupt service
- Detailed server configuration disclosure

Similarly, never enable profiler endpoints (`VLLM_TORCH_PROFILER_DIR` or `VLLM_TORCH_CUDA_PROFILE`) in production.

**Be cautious with `--enable-tokenizer-info-endpoint`:** Only enable the `/tokenizer_info` endpoint if you need to expose tokenizer configuration information. This endpoint reveals chat templates and tokenizer settings that may contain sensitive implementation details or prompt engineering strategies.

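One lightweight guard is a pre-flight check in whatever script launches the server. This is an illustrative sketch, not a vLLM feature, and it is deliberately conservative about any value present in these variables:

```python
import os

# Illustrative pre-flight guard: refuse to launch if environment variables that
# enable dangerous endpoints are present at all (deliberately conservative).
DANGEROUS = ("VLLM_SERVER_DEV_MODE", "VLLM_TORCH_PROFILER_DIR", "VLLM_TORCH_CUDA_PROFILE")
present = [name for name in DANGEROUS if os.environ.get(name)]
if present:
    raise SystemExit(f"Refusing to start vLLM in production with: {', '.join(present)}")
```
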
#### 2. Deploy Behind a Reverse Proxy

The most effective approach is to deploy vLLM behind a reverse proxy (such as nginx, Envoy, or a Kubernetes Gateway) that:

- Explicitly allowlists only the endpoints you want to expose to end users
- Blocks all other endpoints, including the unauthenticated inference and operational control endpoints
- Implements additional authentication, rate limiting, and logging at the proxy layer

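To illustrate the allowlist idea only (a production deployment should use nginx, Envoy, or a Kubernetes Gateway), here is a hedged sketch of a tiny FastAPI/httpx pass-through that forwards nothing except the paths you explicitly allow; the upstream address and allowed prefixes are assumptions:

```python
# Illustrative allowlisting proxy, assuming vLLM listens only on 127.0.0.1:8000
# so clients cannot reach it directly. Run with: uvicorn proxy:app --port 8080
import httpx
from fastapi import FastAPI, Request, Response

UPSTREAM = "http://127.0.0.1:8000"       # assumed vLLM address (loopback only)
ALLOWED_PREFIXES = ("/v1/", "/health")   # expose only what end users need

app = FastAPI()


@app.api_route("/{path:path}", methods=["GET", "POST"])
async def proxy(path: str, request: Request) -> Response:
    full_path = "/" + path
    if not full_path.startswith(ALLOWED_PREFIXES):
        # Blocks /invocations, /pause, /collective_rpc, and every other route.
        return Response(status_code=403)
    async with httpx.AsyncClient(timeout=None) as client:
        upstream = await client.request(
            request.method,
            UPSTREAM + full_path,
            headers={k: v for k, v in request.headers.items() if k.lower() != "host"},
            content=await request.body(),
        )
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("content-type"),
    )
```

Query strings and streaming responses are intentionally not handled here; those are exactly the details a production-grade proxy already gets right.
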
## Reporting Security Vulnerabilities

If you believe you have found a security vulnerability in vLLM, please report it following the project's security policy. For more information on how to report security issues and the project's security policy, please see the [vLLM Security Policy](https://github.com/vllm-project/vllm/blob/main/SECURITY.md).

vllm/entrypoints/cli/openai.py

Lines changed: 4 additions & 0 deletions
@@ -109,6 +109,10 @@ def _add_query_options(parser: FlexibleArgumentParser) -> FlexibleArgumentParser
         help=(
             "API key for OpenAI services. If provided, this api key "
             "will overwrite the api key obtained through environment variables."
+            " It is important to note that this option only applies to the "
+            "OpenAI-compatible API endpoints and NOT other endpoints that may "
+            "be present in the server. See the security guide in the vLLM docs "
+            "for more details."
         ),
     )
     return parser
