Merged

Changes from 2 commits (of 29 total)

Commits:
82a7cf7: Creates vLLM connection guide (benironside, Nov 7, 2025)
ada1c84: Update connect-to-vLLM.md (benironside, Nov 7, 2025)
e526500: adds collapsible explanation section (benironside, Nov 7, 2025)
4ce881b: Update connect-to-vLLM.md (benironside, Nov 7, 2025)
7fbdd2c: Adds final setup steps (benironside, Nov 7, 2025)
bd93ac1: fixes formatting (benironside, Nov 13, 2025)
ab58841: Merge branch 'main' into 3474-vLLM-guide (benironside, Nov 13, 2025)
02cd1e8: removes collapsible block (benironside, Nov 13, 2025)
f5d9acd: Merge branch '3474-vLLM-guide' of https://github.com/elastic/docs-con… (benironside, Nov 13, 2025)
9aed97e: Incorporates Patryk's review (benironside, Nov 24, 2025)
a2a08a2: Update connect-to-vLLM.md (benironside, Nov 24, 2025)
5024b23: specifies use-case for each custom llm guide (benironside, Nov 24, 2025)
49eb86d: minor edit (benironside, Nov 24, 2025)
6742fd3: First pass incorporating Brandon's review (benironside, Nov 25, 2025)
3e5cd12: Additional edits inspired by Brandon's review (benironside, Nov 25, 2025)
58f1a0b: implements stepper (benironside, Nov 25, 2025)
5bd6224: Adds prereqs and fixes stepper (benironside, Nov 25, 2025)
480535d: additional fixes (benironside, Nov 25, 2025)
2ddbf8c: adds next steps (benironside, Nov 25, 2025)
08e3a9e: fixes list in step 1 (benironside, Nov 25, 2025)
b9eec65: experiments with indentation (benironside, Nov 25, 2025)
420f528: typo (benironside, Nov 25, 2025)
1edfc67: Update connect-to-vLLM.md (benironside, Nov 25, 2025)
56d3b8a: Update connect-to-vLLM.md (benironside, Nov 25, 2025)
a080779: Update solutions/security/ai/connect-to-vLLM.md (benironside, Nov 25, 2025)
9c79a20: Update solutions/security/ai/connect-to-vLLM.md (benironside, Nov 25, 2025)
8377140: Merge branch 'main' into 3474-vLLM-guide (benironside, Nov 25, 2025)
9fe1e5b: Merge branch 'main' into 3474-vLLM-guide (benironside, Nov 25, 2025)
abee578: Merge branch 'main' into 3474-vLLM-guide (benironside, Nov 25, 2025)
2 changes: 1 addition & 1 deletion solutions/security/ai/connect-to-own-local-llm.md
@@ -11,7 +11,7 @@ products:
- id: cloud-serverless
---

# Connect to your own local LLM
# Connect to your own local LLM using LM Studio

This page provides instructions for setting up a connector to a large language model (LLM) of your choice using LM Studio. This allows you to use your chosen model within {{elastic-sec}}. You’ll first need to set up a reverse proxy to communicate with {{elastic-sec}}, then set up LM Studio on a server, and finally configure the connector in your Elastic deployment. [Learn more about the benefits of using a local LLM](https://www.elastic.co/blog/ai-assistant-locally-hosted-models).

89 changes: 89 additions & 0 deletions solutions/security/ai/connect-to-vLLM.md
@@ -0,0 +1,89 @@
---
applies_to:
stack: all
serverless:
security: all
products:
- id: security
- id: cloud-serverless
---

# Connect to your own LLM using vLLM (air-gapped environments)

This page provides an example of how to connect to a self-hosted, open-source large language model (LLM) using the [vLLM inference engine](https://docs.vllm.ai/en/latest/) running in a Docker or Podman container.

Using this approach, you can power Elastic's AI features with an LLM of your choice, deployed and managed on infrastructure you control, without granting external network access. This is particularly useful for air-gapped environments and organizations with strict network security policies.

## Requirements

* Docker or Podman.
* Necessary GPU drivers.
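
If you're using NVIDIA GPUs (as in the example server below), you can optionally confirm that the drivers are installed and that your container runtime can access them before continuing. This is a minimal sketch assuming NVIDIA hardware and the NVIDIA Container Toolkit; adjust for your environment:

```bash
# Confirm the host can see its GPUs
nvidia-smi

# Confirm containers can access the GPUs (assumes the NVIDIA Container Toolkit is installed)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```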

## Server used in this example

This example uses a GCP server configured as follows:

* Operating system: Ubuntu 24.10
* Machine type: a2-ultragpu-2g
* vCPU: 24 (12 cores)
* Architecture: x86/64
* CPU Platform: Intel Cascade Lake
* Memory: 340GB
* Accelerator: 2 x NVIDIA A100 80GB GPUs
* Reverse Proxy: Nginx

## Outline

The process involves four main steps:

1. Configure your host server with the necessary GPU resources.
2. Run the desired model in a vLLM container.
3. Use a reverse proxy like Nginx to securely expose the endpoint to {{ecloud}}.
   > **benironside (author):** Is it just Elastic Cloud that this works with? Not other deployment types?
4. Configure the OpenAI connector in your Elastic deployment.

## Step 1: Configure your host server

1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens):
   1. Log in to your Hugging Face account.
   2. Navigate to **Settings > Access Tokens**.
   3. Create a new token with at least `read` permissions. Copy it to a secure location.
2. Create an OpenAI-compatible secret token: generate a strong, random string and save it in a secure location, as shown in the example below. You need this secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy.
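
For example, you could generate a suitable random string with a command like the following (any strong random-string generator works; `openssl` is assumed to be available on your system):

```bash
# Generate a 64-character hexadecimal string to use as the secret token
openssl rand -hex 32
```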

## Step 2: Run your vLLM container

To pull and run your chosen vLLM image:

1. Connect to your server using SSH.
2. Run the following terminal command to start the vLLM server, download the model, and expose it on port 8000:

```bash
docker run --name Mistral-Small-3.2-24B --gpus all \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  --env HUGGING_FACE_HUB_TOKEN=xxxx \
  --env VLLM_API_KEY=xxxx \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.9.1 \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2
```

> **Review comment:** Is this something we will be able to update shortly? I mean, we should avoid recommending Mistral-Small-3.2-24B as it has a lot of issues with Security Assistant tool calling.
>
> **benironside (author):** We can update this any time. For now, since this model isn't recommended, I replaced it with [YOUR_MODEL_ID]. Make sense to you?
>
> **Review comment:** I think it's going to be less confusing if we stay with the previous version and just update it with a new model, because the list of params depends on the model id.

::::{admonition} Explanation of command
* `--gpus all`: Exposes all available GPUs to the container.
* `--name`: Sets a predefined name for the container; otherwise, a name is generated automatically.
* `-v /root/.cache/huggingface:/root/.cache/huggingface`: Mounts the Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`).
* `--env HUGGING_FACE_HUB_TOKEN`: Sets the environment variable for your Hugging Face token (only required for gated models).
* `--env VLLM_API_KEY`: vLLM API key used for authentication between {{ecloud}} and vLLM.
* `-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container.
* `--ipc=host`: Enables sharing memory between the host and the container.
* `vllm/vllm-openai:v0.9.1`: Specifies the official vLLM OpenAI-compatible image, version 0.9.1. This is the version of vLLM we recommend.
* `--model`: ID of the Hugging Face model you wish to serve. In this example, it is the `Mistral-Small-3.2-24B` model.
* `--tool-call-parser mistral`, `--tokenizer-mode mistral`, `--config-format mistral`, and `--load-format mistral`: Mistral-specific parameters; refer to the Hugging Face model card for recommended values.
* `--enable-auto-tool-choice`: Enables automatic function calling.
* `--gpu-memory-utilization 0.90`: Limits the maximum share of GPU memory used by vLLM (may vary depending on the machine resources available).
* `--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
::::
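
Once the container starts, you may want to confirm that the model has loaded and that the OpenAI-compatible API responds before moving on. Here is a minimal sketch, assuming the container name, port, and `VLLM_API_KEY` value from the command above:

```bash
# Follow the container logs until vLLM reports that the server is running
docker logs -f Mistral-Small-3.2-24B

# List the models served by the OpenAI-compatible endpoint
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer <YOUR_VLLM_API_KEY>"
```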
@@ -33,9 +33,10 @@ Follow these guides to connect to one or more third-party LLM providers:
* [OpenAI](/solutions/security/ai/connect-to-openai.md)
* [Google Vertex](/solutions/security/ai/connect-to-google-vertex.md)

## Connect to a custom local LLM
## Connect to a self-managed LLM

You can [connect to LM Studio](/solutions/security/ai/connect-to-own-local-llm.md) to use a custom LLM deployed and managed by you.
- You can [connect to LM Studio](/solutions/security/ai/connect-to-own-local-llm.md) to use a custom LLM deployed and managed by you.
- For air-gapped environments, you can [connect to vLLM](/solutions/security/ai/connect-to-vLLM.md).

## Preconfigured connectors

1 change: 1 addition & 0 deletions solutions/toc.yml
@@ -575,6 +575,7 @@ toc:
- file: security/ai/connect-to-openai.md
- file: security/ai/connect-to-google-vertex.md
- file: security/ai/connect-to-own-local-llm.md
- file: security/ai/connect-to-vLLM.md
- file: security/ai/use-cases.md
children:
- file: security/ai/triage-alerts.md