---
applies_to:
  stack: all
  serverless:
    security: all
products:
  - id: security
  - id: cloud-serverless
---
# Connect to your own LLM using vLLM (air-gapped environments)

This page provides an example of how to connect to a self-hosted, open-source large language model (LLM) using the [vLLM inference engine](https://docs.vllm.ai/en/latest/) running in a Docker or Podman container.
Using this approach, you can power Elastic's AI features with an LLM of your choice, deployed and managed on infrastructure you control, without granting external network access. This is particularly useful for air-gapped environments and organizations with strict network security policies.
## Requirements

* Docker or Podman.
* The necessary GPU drivers for your hardware (you can verify GPU access as shown below).
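
For NVIDIA GPUs, one quick way to confirm that the host drivers are installed and that containers can access the GPUs is sketched below. This assumes an NVIDIA setup with the NVIDIA Container Toolkit installed; the CUDA image tag is only an example.

```bash
# Confirm the host can see the GPUs and the driver version
nvidia-smi

# Confirm containers can access the GPUs (example CUDA image tag)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
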
## Server used in this example

This example uses a GCP server configured as follows:

* Operating system: Ubuntu 24.10
* Machine type: a2-ultragpu-2g
* vCPU: 24 (12 cores)
* Architecture: x86/64
* CPU platform: Intel Cascade Lake
* Memory: 340 GB
* Accelerator: 2 x NVIDIA A100 80GB GPUs
* Reverse proxy: Nginx

## Outline

The process involves four main steps:

1. Configure your host server with the necessary GPU resources.
2. Run the desired model in a vLLM container.
3. Use a reverse proxy like Nginx to securely expose the endpoint to {{ecloud}}.
4. Configure the OpenAI connector in your Elastic deployment.

## Step 1: Configure your host server

1. (Optional) If you plan to use a gated model (like Llama 3.1) or a private model, create a [Hugging Face user access token](https://huggingface.co/docs/hub/en/security-tokens):
    1. Log in to your Hugging Face account.
    2. Navigate to **Settings > Access Tokens**.
    3. Create a new token with at least `read` permissions. Copy it to a secure location.
2. Create an OpenAI-compatible secret token. Generate a strong, random string (one way to do this is shown below) and save it in a secure location. You need the secret token to authenticate communication between {{ecloud}} and your Nginx reverse proxy.
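
For example, a random string suitable for use as the secret token can be generated with OpenSSL. This is just one option; any sufficiently long, cryptographically random string works.

```bash
# Generate a 64-character hexadecimal secret token
openssl rand -hex 32
```
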
## Step 2: Run your vLLM container

To pull and run your chosen vLLM image:

1. Connect to your server using SSH.
2. Run the following terminal command to start the vLLM server, download the model, and expose it on port 8000:

```bash
docker run --name Mistral-Small-3.2-24B --gpus all \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  --env HUGGING_FACE_HUB_TOKEN=xxxx \
  --env VLLM_API_KEY=xxxx \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.9.1 \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2
```

::::{admonition} Explanation of command
* `--gpus all`: Exposes all available GPUs to the container.
* `--name`: Sets a predefined name for the container; otherwise, a random name is generated.
* `-v /root/.cache/huggingface:/root/.cache/huggingface`: Mounts the Hugging Face cache directory (optional if used with `HUGGING_FACE_HUB_TOKEN`).
* `--env HUGGING_FACE_HUB_TOKEN`: Sets the environment variable for your Hugging Face token (only required for gated models).
* `--env VLLM_API_KEY`: Sets the vLLM API key used for authentication between {{ecloud}} and vLLM.
* `-p 8000:8000`: Maps port 8000 on the host to port 8000 in the container.
* `--ipc=host`: Enables sharing memory between the host and the container.
* `vllm/vllm-openai:v0.9.1`: Specifies the official vLLM OpenAI-compatible image, version 0.9.1, which is the version of vLLM we recommend.
* `--model`: ID of the Hugging Face model you wish to serve. In this example, it is the `Mistral-Small-3.2-24B` model.
* `--tool-call-parser mistral`, `--tokenizer-mode mistral`, `--config-format mistral`, and `--load-format mistral`: Mistral-specific parameters; refer to the Hugging Face model card for recommended values.
* `--enable-auto-tool-choice`: Enables automatic function calling.
* `--gpu-memory-utilization 0.90`: Limits the maximum fraction of GPU memory used by vLLM (adjust based on the machine resources available).
* `--tensor-parallel-size 2`: This value should match the number of available GPUs (in this case, 2). This is critical for performance on multi-GPU systems.
::::
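
Once the container is running and the model has finished loading, you can check that the OpenAI-compatible endpoint responds. The commands below are a sketch that assumes the server is listening on port 8000 and that you replace the placeholder with the `VLLM_API_KEY` value from the command above.

```bash
# List the models served by vLLM
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer <your-vllm-api-key>"

# Send a minimal chat completion request to confirm the model responds
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-vllm-api-key>" \
  -d '{
    "model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```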