Commit 4ce881b

Update connect-to-vLLM.md
1 parent e526500 commit 4ce881b

File tree: 1 file changed, +61 −2 lines


solutions/security/ai/connect-to-vLLM.md

Lines changed: 61 additions & 2 deletions
@@ -41,7 +41,7 @@ The process involves four main steps:
 
 ## Step 1: Configure your host server
 
-1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, you need to create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
+1. (Optional) If you plan to use a gated model (such as Llama 3.1) or a private model, create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens).
 1. Log in to your Hugging Face account.
 2. Navigate to **Settings > Access Tokens**.
 3. Create a new token with at least `read` permissions. Save it in a secure location.
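If you created a token, it is typically passed to the vLLM container at launch so it can download gated or private weights. A hedged sketch, not part of the diff above, assuming the `HUGGING_FACE_HUB_TOKEN` environment variable (the name the Hugging Face Hub libraries read) and reusing the image, port, GPU, and model settings that appear elsewhere in this guide:

```shell
# Sketch only: supply the Hugging Face token to the container.
# Flag values are taken from elsewhere in this guide; adjust to your setup.
export HF_TOKEN="<your Hugging Face token>"

docker run --gpus all \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  -p 8000:8000 \
  vllm/vllm-openai:v0.9.1 \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --tensor-parallel-size 2
```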
@@ -73,7 +73,7 @@ vllm/vllm-openai:v0.9.1 \
 --tensor-parallel-size 2
 ```
 
-.**Click to expand an explanation of the command**
+.**Click to expand a full explanation of the command**
 [%collapsible]
 =====
 `--gpus all`: Exposes all available GPUs to the container.
@@ -91,3 +91,62 @@ vllm/vllm-openai:v0.9.1 \
 `--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
 =====
 
+3. Verify the containers were created by running `docker ps -a`. The output should show the value you specified for the `--name` parameter.
+
+## Step 3: Expose the API with a reverse proxy
+
+This example uses Nginx to create a reverse proxy. This improves stability and enables monitoring by means of Elastic's native Nginx integration. The following example configuration forwards traffic to the vLLM container and uses a secret token for authentication.
+
+1. Install Nginx on your server.
+2. Create a configuration file, for example at `/etc/nginx/sites-available/default`. Give it the following content:
+
+```
+server {
+    listen 80;
+    server_name <yourdomainname.com>;
+    return 301 https://$server_name$request_uri;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name <yourdomainname.com>;
+
+    ssl_certificate /etc/letsencrypt/live/<yourdomainname.com>/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/<yourdomainname.com>/privkey.pem;
+
+    location / {
+        if ($http_authorization != "Bearer <secret token>") {
+            return 401;
+        }
+        proxy_pass http://localhost:8000/;
+    }
+}
+```
+
+3. Enable and restart Nginx to apply the configuration.
+
+:::{note}
+For quick testing, you can use [ngrok](https://ngrok.com/) as an alternative to Nginx, but it is not recommended for production use.
+:::
+
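The `$http_authorization` guard in the configuration above can be smoke-tested from any machine. A minimal sketch, assuming the placeholders are filled in and that vLLM's OpenAI-compatible `/v1/models` endpoint is reachable behind the proxy:

```shell
# Without the token, the proxy itself should answer 401.
curl -s -o /dev/null -w "%{http_code}\n" https://<yourdomainname.com>/v1/models

# With the token, the request should reach vLLM (200 once the model is up).
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer <secret token>" \
  https://<yourdomainname.com>/v1/models
```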
+## Step 4: Configure the connector in your Elastic deployment
+
+Finally, create the connector within your Elastic deployment to link it to your vLLM instance.
+
+1. Log in to {{kib}}.
+2. Navigate to the **Connectors** page, click **Create Connector**, and select **OpenAI**.
+3. Give the connector a descriptive name, such as `vLLM - Mistral Small 3.2`.
+4. In **Connector settings**, configure the following:
+    * For **Select an OpenAI provider**, select **Other (OpenAI Compatible Service)**.
+    * For **URL**, enter your server's public URL followed by `/v1/chat/completions`.
+5. For **Default Model**, enter `mistralai/Mistral-Small-3.2-24B-Instruct-2506` or the model ID you used during setup.
+6. For **Authentication**, configure the following:
+    * For **API key**, enter the secret token you created in Step 1 and specified in your Nginx configuration file.
+    * If your chosen model supports tool use, turn on **Enable native function calling**.
+7. Click **Save**.
+
+Setup is now complete. The model served by your vLLM container can now power Elastic's generative AI features, such as the AI Assistant.
+
+:::{note}
+To run a different model, stop the current container and run a new one with an updated `--model` parameter.
+:::
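Once the connector is saved, the same endpoint it targets can be exercised directly, which helps separate proxy problems from connector problems. A minimal sketch, assuming the placeholders from the Nginx configuration and the model ID from Step 4:

```shell
# Sketch only: send one chat completion through the reverse proxy,
# exactly as the OpenAI connector will. Replace the placeholders first.
curl -s https://<yourdomainname.com>/v1/chat/completions \
  -H "Authorization: Bearer <secret token>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

A JSON response with a `choices` array indicates the full path (TLS, token check, proxy, vLLM) is working.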
