`solutions/security/ai/connect-to-vLLM.md` — 8 additions & 4 deletions
````diff
@@ -44,8 +44,9 @@ The process involves four main steps:
 1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, you need to create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
     1. Log in to your Hugging Face account.
     2. Navigate to **Settings > Access Tokens**.
-    3. Create a new token with at least `read` permissions. Copy it to a secure location.
+    3. Create a new token with at least `read` permissions. Save it in a secure location.
 2. Create an OpenAI-compatible secret token. Generate a strong, random string and save it in a secure location. You need the secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy.
+3. Install any necessary GPU drivers.
 
 ## Step 2: Run your vLLM container
 
````
````diff
@@ -72,9 +73,11 @@ vllm/vllm-openai:v0.9.1 \
 --tensor-parallel-size 2
 ```
 
-::::{admonition} Explanation of command
+.**Click to expand an explanation of the command**
+[%collapsible]
+=====
 `--gpus all`: Exposes all available GPUs to the container.
-`--name`: Set predefined name for the container, otherwise it’s going to be generated
+`--name`: Defines a name for the container.
 `-v /root/.cache/huggingface:/root/.cache/huggingface`: Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`).
 `-e HUGGING_FACE_HUB_TOKEN`: Sets the environment variable for your Hugging Face token (only required for gated models).
 `--env VLLM_API_KEY`: vLLM API Key used for authentication between {{ecloud}} and vLLM.
````
````diff
@@ -86,4 +89,5 @@ vllm/vllm-openai:v0.9.1 \
 `-enable-auto-tool-choice`: Enables automatic function calling.
 `--gpu-memory-utilization 0.90`: Limits max GPU used by vLLM (may vary depending on the machine resources available).
 `--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
````
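Assembled only from the flags explained in the hunks above, the `docker run` invocation looks roughly like the following. This is a sketch, not the document's full command: the container name and the token environment variables are placeholders, and the full command in the changed file includes additional options not visible in this diff.

```shell
# Sketch of the vLLM container launch, using only the flags the diff explains.
# HF_TOKEN and VLLM_API_KEY are assumed to be set in the environment.
docker run --gpus all \
  --name vllm \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  --env VLLM_API_KEY="$VLLM_API_KEY" \
  vllm/vllm-openai:v0.9.1 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2
```

Note that everything after the image name (`vllm/vllm-openai:v0.9.1`) is passed to the vLLM server itself rather than to Docker, which is why `--gpu-memory-utilization` and `--tensor-parallel-size` follow it.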