|
| 1 | +--- |
| 2 | +applies_to: |
| 3 | + stack: all |
| 4 | + serverless: |
| 5 | + security: all |
| 6 | +products: |
| 7 | + - id: security |
| 8 | + - id: cloud-serverless |
| 9 | +--- |
| 10 | + |
| 11 | +# Connect to your own LLM using vLLM (air gapped environments) |
| 12 | +This guide shows you how to run an OpenAI-compatible large language model with [vLLM](https://docs.vllm.ai/en/latest/) and connect it to Elastic. The setup runs inside Docker or Podman, is served through an Nginx reverse proxy, and does not require any outbound network access. This makes it a safe option for air-gapped environments or deployments with strict network controls. |
| 13 | + |
| 14 | +The steps below show one example configuration, but you can use any model supported by vLLM, including private and gated models on Hugging Face. |
| 15 | + |
| 16 | +## Prerequisites |
| 17 | + |
| 18 | +* To set up the necessary {{kib}} connector, the `Actions and connectors: all` [{{kib}} privilege](/deploy-manage/users-roles/cluster-or-deployment-auth/kibana-privileges.md). |
| 19 | +* Admin access to a sufficiently powerful server. |
| 20 | + |
| 21 | +## Connect vLLM to {{kib}} |
| 22 | + |
| 23 | +:::::{stepper} |
| 24 | + |
| 25 | +::::{step} Configure your host server |
| 26 | + |
| 27 | +To support this use case, you need a powerful server. For example, we tested a server with the following specifications: |
| 28 | + |
| 29 | +* Operating system: Ubuntu 24.10 |
| 30 | +* Machine type: a2-ultragpu-2g |
| 31 | +* vCPU: 24 (12 cores) |
| 32 | +* Architecture: x86/64 |
| 33 | +* CPU Platform: Intel Cascade Lake |
| 34 | +* Memory: 340GB |
| 35 | +* Accelerator: 2 x NVIDIA A100 80GB GPUs |
| 36 | + |
| 37 | +Set up your server then install all necessary GPU drivers. |
| 38 | + |
| 39 | +:::: |
| 40 | + |
| 41 | +::::{step} Generate auth tokens |
| 42 | + |
| 43 | + |
| 44 | +1. (Optional) Create a Hugging Face user token. If you plan to use a gated model (such as Llama 3.1) or a private model, create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens). |
| 45 | + 1. Log in to your Hugging Face account. |
| 46 | + 2. Navigate to **Settings > Access Tokens**. |
| 47 | + 3. Create a new token with at least `read` permissions. Save it in a secure location. |
| 48 | + |
| 49 | +2. Create an OpenAI-compatible secret token. Generate a strong, random string and save it in a secure location. You need the secret token to authenticate communication between Elastic and your reverse proxy. |
| 50 | + |
| 51 | +:::: |
| 52 | + |
| 53 | +::::{step} Run your vLLM container |
| 54 | + |
| 55 | +To pull and run your chosen vLLM image: |
| 56 | + |
| 57 | +1. Connect to your server using SSH. |
| 58 | +2. Run the following terminal command to start the vLLM server, download the model, and expose it on port 8000: |
| 59 | + |
| 60 | +```bash |
| 61 | +docker run \ |
| 62 | + --name [YOUR_MODEL_ID] \ <1> |
| 63 | + --gpus all \ <2> |
| 64 | + -v /root/.cache/huggingface:/root/.cache/huggingface \ <3> |
| 65 | + --env HUGGING_FACE_HUB_TOKEN=xxxx \ <4> |
| 66 | + --env VLLM_API_KEY=xxxx \ <5> |
| 67 | + -p 8000:8000 \ <6> |
| 68 | + --ipc=host \ <7> |
| 69 | + vllm/vllm-openai:v0.9.1 \ <8> |
| 70 | + --model mistralai/[YOUR_MODEL_ID] \ <9> |
| 71 | + --tool-call-parser mistral \ <10> |
| 72 | + --tokenizer-mode mistral \ <11> |
| 73 | + --config-format mistral \ <12> |
| 74 | + --load-format mistral \ <13> |
| 75 | + --enable-auto-tool-choice \ <14> |
| 76 | + --gpu-memory-utilization 0.90 \ <15> |
| 77 | + --tensor-parallel-size 2 <16> |
| 78 | +``` |
| 79 | +1. Defines a name for the container. |
| 80 | +2. Exposes all available GPUs to the container. |
| 81 | +3. Sets the Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`). |
| 82 | +4. Sets the environment variable for your Hugging Face token (only required for gated models). |
| 83 | +5. vLLM API key used for authentication between {{ecloud}} and vLLM. |
| 84 | +6. Maps port 8000 on the host to port 8000 in the container. |
| 85 | +7. Enables sharing memory between host and container. |
| 86 | +8. Specifies the official vLLM OpenAI-compatible image, version 0.9.1. This is the version of vLLM we recommend. |
| 87 | +9. ID of the Hugging Face model you wish to serve. |
| 88 | +10. Mistral-specific tool call parser. Refer to the Hugging Face model card for recommended values. |
| 89 | +11. Mistral-specific tokenizer mode. Refer to the Hugging Face model card for recommended values. |
| 90 | +12. Mistral-specific configuration format. Refer to the Hugging Face model card for recommended values. |
| 91 | +13. Mistral-specific load format. Refer to the Hugging Face model card for recommended values. |
| 92 | +14. Enables automatic function calling. |
| 93 | +15. Limits max GPU used by vLLM (may vary depending on the machine resources available). |
| 94 | +16. This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems. |
| 95 | + |
| 96 | +:::{important} |
| 97 | +Verify the container's status by running the `docker ps -a` command. The output should show the value you specified for the `--name` parameter. |
| 98 | +::: |
| 99 | + |
| 100 | +:::: |
| 101 | + |
| 102 | +::::{step} Expose the API with a reverse proxy |
| 103 | + |
| 104 | +Using a reverse proxy improves stability for this use case. This example uses Nginx, which supports monitoring by means of Elastic's native Nginx integration. The example Nginx configuration forwards traffic to the vLLM container and uses a secret token for authentication. |
| 105 | + |
| 106 | +1. Install Nginx on your server. |
| 107 | +2. Create a configuration file, for example at `/etc/nginx/sites-available/default`. Give it the following content: |
| 108 | + |
| 109 | +``` |
| 110 | +server { |
| 111 | + listen 80; |
| 112 | + server_name <yourdomainname.com>; |
| 113 | + return 301 https://$server_name$request_uri; |
| 114 | +} |
| 115 | +
|
| 116 | +server { |
| 117 | + listen 443 ssl http2; |
| 118 | + server_name <yourdomainname.com>; |
| 119 | +
|
| 120 | + ssl_certificate /etc/letsencrypt/live/<yourdomainname.com>/fullchain.pem; |
| 121 | + ssl_certificate_key /etc/letsencrypt/live/<yourdomainname.com>/privkey.pem; |
| 122 | +
|
| 123 | + location / { |
| 124 | + if ($http_authorization != "Bearer <secret token>") { |
| 125 | + return 401; |
| 126 | + } |
| 127 | + proxy_pass http://localhost:8000/; |
| 128 | + } |
| 129 | +} |
| 130 | +``` |
| 131 | + |
| 132 | +3. Enable and restart Nginx to apply the configuration. |
| 133 | + |
| 134 | +:::{note} |
| 135 | +For quick testing, you can use [ngrok](https://ngrok.com/) as an alternative to Nginx, but it is not recommended for production use. |
| 136 | +::: |
| 137 | + |
| 138 | +:::: |
| 139 | + |
| 140 | +::::{step} Configure the connector in your Elastic deployment |
| 141 | + |
| 142 | +Create the connector within your Elastic deployment to link it to your vLLM instance. |
| 143 | + |
| 144 | +1. In {{kib}}, navigate to the **Connectors** page, click **Create Connector**, and select **OpenAI**. |
| 145 | +2. Give the connector a descriptive name, such as `vLLM - Mistral Small 3.2`. |
| 146 | +3. In **Connector settings**, configure the following: |
| 147 | + * For **Select an OpenAI provider**, select **Other (OpenAI Compatible Service)**. |
| 148 | + * For **URL**, enter your server's public URL followed by `/v1/chat/completions`. |
| 149 | +4. For **Default Model**, enter `mistralai/[YOUR_MODEL_ID]`. |
| 150 | +5. For **Authentication**, configure the following: |
| 151 | + * For **API key**, enter the secret token you created in Step 1 and specified in your Nginx configuration file. |
| 152 | + * If your chosen model supports tool use, then turn on **Enable native function calling**. |
| 153 | +6. Click **Save** |
| 154 | +7. To enable the connector to work with AI Assistant for Security, add the following to your `config/kibana.yml` file: |
| 155 | + ``` |
| 156 | + feature_flags.overrides: |
| 157 | + securitySolution.inferenceChatModelDisabled: true |
| 158 | + ``` |
| 159 | +8. Finally, open the **AI Assistant for Security** page using the navigation menu or the [global search field](/explore-analyze/find-and-organize/find-apps-and-objects.md). |
| 160 | + * On the **Conversations** tab, turn off **Streaming**. |
| 161 | + * If your model supports tool use, then on the **System prompts** page, create a new system prompt with a variation of the following prompt, to prevent your model from returning tool calls in AI Assistant conversations: |
| 162 | + |
| 163 | + ```markdown |
| 164 | + You are a model running under OpenAI-compatible tool calling mode. |
| 165 | + |
| 166 | + Rules: |
| 167 | + 1. When you want to invoke a tool, never describe the call in text. |
| 168 | + 2. Always return the invocation in the `tool_calls` field. |
| 169 | + 3. The `content` field must remain empty for any assistant message that performs a tool call. |
| 170 | + 4. Only use tool calls defined in the "tools" parameter. |
| 171 | + ``` |
| 172 | +:::: |
| 173 | +::::: |
| 174 | +
|
| 175 | +Setup is now complete. The model served by your vLLM container can now power Elastic's generative AI features. |
| 176 | +
|
| 177 | +:::{note} |
| 178 | +To run a different model: |
| 179 | +* Stop the current container and run a new one with an updated `--model` parameter. |
| 180 | +* Update your {{kib}} connector's **Default model** parameter to match the new model ID. |
| 181 | +::: |
| 182 | +
|
| 183 | +## Next steps |
| 184 | +
|
| 185 | +With your vLLM connector set up, you can use it to power features including: |
| 186 | +
|
| 187 | +* [AI Assistant for Security](/solutions/security/ai/ai-assistant.md): Interact with an agent designed to assist with {{elastic-sec}} tasks. |
| 188 | +* [Attack Discovery](/solutions/security/ai/attack-discovery.md): Use AI to quickly correlate and triage security alerts. |
| 189 | +* [Automatic import](/solutions/security/get-started/automatic-import.md): Use AI to create custom integrations for third-party data sources. |
| 190 | +* [AI Assistant for Observability and Search](/solutions/observability/observability-ai-assistant.md): Interact with an agent designed to assist with {{observability}} and Search tasks. |
| 191 | +
|
| 192 | +You can also learn how to [set up other types of LLM connectors](/solutions/security/ai/set-up-connectors-for-large-language-models-llm.md). |
0 commit comments