
Add vLLM backend support#745

Merged
sozercan merged 1 commit into main from add-vllm-backend on Mar 9, 2026

Conversation

@sozercan
Member

@sozercan sozercan commented Mar 9, 2026

Summary

  • Adds vLLM as a third backend option (alongside llama-cpp and diffusers) for high-throughput GPU inference with HuggingFace safetensors models
  • vLLM is CUDA-only, amd64-only — validated end-to-end on NVIDIA A100 80GB with Qwen2.5-0.5B-Instruct
  • Includes build-time workarounds for the pre-built LocalAI vLLM backend image (flash_attn ABI fix, AsyncLLM API patch)

Changes

  • pkg/utils/const.go — BackendVLLM constant
  • pkg/aikit2llb/inference/backend.go — Backend tag/name/alias resolution, dependency installation, compatibility patches for vLLM
  • pkg/aikit2llb/inference/vllm.go — Python base deps + gcc/libc6-dev for Triton JIT
  • pkg/aikit2llb/inference/convert.go — CUDA apt packages for vLLM
  • pkg/aikit2llb/inference/download.go — HuggingFace repo-level downloads (huggingface://namespace/model)
  • pkg/build/build.go — Validation (vLLM requires CUDA, amd64)
  • test/aikitfile-vllm.yaml — Test aikitfile (runtime model download)
  • .github/workflows/test-docker-gpu.yaml — vLLM in GPU test matrix
  • website/docs/ — Specs and GPU docs updated
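
The repo-level download scheme above maps a huggingface://namespace/model source to the HuggingFace repo ID that an hf CLI download step would consume. A minimal sketch of that parsing — the helper name and error handling here are illustrative, not the actual pkg/aikit2llb/inference/download.go code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseHuggingFaceRef is a hypothetical helper: it turns a
// "huggingface://namespace/model" source into the "namespace/model"
// repo ID a repo-level download step would fetch.
func parseHuggingFaceRef(src string) (string, error) {
	const scheme = "huggingface://"
	if !strings.HasPrefix(src, scheme) {
		return "", fmt.Errorf("not a huggingface source: %s", src)
	}
	repo := strings.TrimPrefix(src, scheme)
	parts := strings.Split(repo, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", fmt.Errorf("expected huggingface://namespace/model, got %s", src)
	}
	return repo, nil
}

func main() {
	repo, err := parseHuggingFaceRef("huggingface://Qwen/Qwen2.5-0.5B-Instruct")
	if err != nil {
		panic(err)
	}
	fmt.Println(repo) // Qwen/Qwen2.5-0.5B-Instruct
}
```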

Test plan

  • make test — 148 tests pass (including new vLLM cases for backend tag/name/alias, build validation, ARM64 rejection)
  • make lint — 0 issues
  • Build AIKit image (docker buildx build .)
  • Build vLLM model image (docker buildx build -f test/aikitfile-vllm.yaml)
  • Run inference on GPU VM: curl /v1/chat/completions with Qwen2.5-0.5B-Instruct — success

Copilot AI review requested due to automatic review settings March 9, 2026 20:29
Contributor

Copilot AI left a comment


Pull request overview

Adds vLLM as an additional LocalAI backend option to support high-throughput CUDA GPU inference with HuggingFace safetensors models, including build-time validation and HuggingFace repo-level model downloads.

Changes:

  • Introduces vllm backend constant plus backend tag/name/alias resolution and install flow (including compatibility patching).
  • Adds HuggingFace repo-level downloads (huggingface://namespace/model) via an hf-cli helper image.
  • Extends build validation, tests, docs, and GPU CI workflow matrix to cover vLLM (CUDA-only, amd64-only).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
website/docs/specs-inference.md Documents vllm as an allowed backend option.
website/docs/gpu.md Adds vLLM usage guidance and example config for GPU inference.
test/aikitfile-vllm.yaml Adds a vLLM test aikitfile that downloads the model at runtime.
pkg/utils/const.go Adds BackendVLLM constant.
pkg/build/build.go Enforces vLLM runtime requirement (runtime: cuda) and supports vLLM in backend allowlist.
pkg/build/build_test.go Adds test cases for vLLM validation and arm64 rejection.
pkg/aikit2llb/inference/backend.go Adds vLLM tag/name/alias logic, dependency install hook, and compatibility patching for the backend image.
pkg/aikit2llb/inference/backend_test.go Adds backend tag/name/alias tests for vLLM on CUDA/amd64.
pkg/aikit2llb/inference/vllm.go Adds vLLM dependency installation (Python base deps + gcc/libc6-dev).
pkg/aikit2llb/inference/vllm_test.go Adds a basic non-panicking test for vLLM dependency install wiring.
pkg/aikit2llb/inference/convert.go Ensures CUDA libs are installed when vLLM backend is selected.
pkg/aikit2llb/inference/download.go Adds HuggingFace repo-level download support using hf CLI image + optional HF token secret.
.github/workflows/test-docker-gpu.yaml Adds vLLM to GPU test matrix and validates a chat completion response.
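
The build.go constraints from the table (vLLM requires runtime: cuda, amd64-only) can be sketched as a standalone check; the function name and string arguments here are illustrative, not the actual build.go signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// validateVLLM is a hypothetical stand-in for the build-time checks:
// the vllm backend is only accepted with the CUDA runtime on amd64.
func validateVLLM(backend, runtime, arch string) error {
	if backend != "vllm" {
		return nil // constraints only apply to the vllm backend
	}
	if runtime != "cuda" {
		return errors.New("vllm backend requires runtime: cuda")
	}
	if arch != "amd64" {
		return fmt.Errorf("vllm backend is amd64-only, got %s", arch)
	}
	return nil
}

func main() {
	fmt.Println(validateVLLM("vllm", "cuda", "amd64")) // <nil>
	fmt.Println(validateVLLM("vllm", "cuda", "arm64")) // vllm backend is amd64-only, got arm64
}
```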

Comment on lines +10 to +20
func installVLLMDependencies(s llb.State, merge llb.State) llb.State {
	merge = installPythonBaseDependencies(s, merge)

	savedState := s
	s = s.Run(utils.Sh("apt-get update && apt-get install --no-install-recommends -y gcc libc6-dev && apt-get clean"),
		llb.WithCustomName("Installing C compiler for vLLM Triton JIT"),
	).Root()

	diff := llb.Diff(savedState, s)
	return llb.Merge([]llb.State{merge, diff})
}

Copilot AI Mar 9, 2026


installVLLMDependencies installs Python deps via installPythonBaseDependencies(s, merge) (which produces a diff from the current state), but then computes the gcc/libc6-dev diff from the original state s again. Merging two diffs that both modify dpkg/apt state (e.g., /var/lib/dpkg/status) can lead to lost updates or non-deterministic results. Consider running the gcc install step on top of the state returned by installPythonBaseDependencies (or performing both installs in a single Run and taking one Diff) so dpkg changes are applied sequentially instead of merged.
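
The hazard described here can be illustrated without BuildKit. The toy model below treats a filesystem state as a map and shows that merging two diffs taken from the same base keeps only one side's write to a shared file (like /var/lib/dpkg/status), while applying the steps sequentially keeps both; the names and semantics are simplified stand-ins, not the llb API:

```go
package main

import "fmt"

// State models a filesystem as path -> contents.
type State map[string]string

// diff returns the paths whose contents changed between base and next.
func diff(base, next State) State {
	d := State{}
	for k, v := range next {
		if base[k] != v {
			d[k] = v
		}
	}
	return d
}

// apply overlays a diff onto a state; later writes win, roughly like
// the last layer in a merge.
func apply(s State, d State) State {
	out := State{}
	for k, v := range s {
		out[k] = v
	}
	for k, v := range d {
		out[k] = v
	}
	return out
}

func main() {
	base := State{"/var/lib/dpkg/status": ""}

	// Two installs branched from the same base, each rewriting dpkg status.
	pythonInstall := apply(base, State{"/var/lib/dpkg/status": "python"})
	gccInstall := apply(base, State{"/var/lib/dpkg/status": "gcc"})

	// Merging the two diffs: the second overwrites the first,
	// so the python entry is lost.
	merged := apply(apply(base, diff(base, pythonInstall)), diff(base, gccInstall))
	fmt.Println("merged:", merged["/var/lib/dpkg/status"]) // merged: gcc

	// Sequential: run the gcc install on top of the python state instead.
	sequential := apply(pythonInstall, State{"/var/lib/dpkg/status": pythonInstall["/var/lib/dpkg/status"] + "+gcc"})
	fmt.Println("sequential:", sequential["/var/lib/dpkg/status"]) // sequential: python+gcc
}
```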

Comment on lines +98 to +101
choices=$(echo "$result" | jq '.choices')
if [ -z "$choices" ]; then
  exit 1
fi

Copilot AI Mar 9, 2026


This test only checks that .choices is a non-empty string. If the API returns { "choices": null } or { "choices": [] }, jq '.choices' outputs null/[] (both non-empty strings) and the step will still pass. Consider using jq -e with a predicate that asserts choices is an array with length > 0 (and fail otherwise) to avoid false positives in the vLLM GPU test.
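
A predicate along the lines the review suggests would be something like jq -e '.choices | type == "array" and length > 0'. The same check can be expressed as a self-contained Go sketch, showing why null or empty choices must fail while a non-empty array passes:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasChoices reports whether an API response body contains a non-empty
// .choices array — the predicate the review suggests, as opposed to
// "the jq output string is non-empty".
func hasChoices(body []byte) bool {
	var resp struct {
		Choices []json.RawMessage `json:"choices"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return false
	}
	return len(resp.Choices) > 0
}

func main() {
	fmt.Println(hasChoices([]byte(`{"choices": null}`)))           // false
	fmt.Println(hasChoices([]byte(`{"choices": []}`)))             // false
	fmt.Println(hasChoices([]byte(`{"choices": [{"index": 0}]}`))) // true
}
```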

Add vLLM as a third backend option alongside llama-cpp and diffusers,
enabling HuggingFace safetensors model inference via vLLM on NVIDIA GPUs.

- Add BackendVLLM constant and wire through backend selection, OCI tag
  resolution, alias mapping, and metadata generation
- Install Python base dependencies + gcc/libc6-dev for Triton JIT compilation
- Install CUDA apt packages (libcublas, cuda-cudart) for vLLM runtime
- Support HuggingFace repo-level downloads (huggingface://namespace/model)
  in addition to existing single-file downloads
- Add build-time patches for pre-built vLLM backend image compatibility
  (flash_attn ABI fix, AsyncLLM API update)
- Add validation: vLLM requires CUDA runtime, amd64-only
- Add test aikitfile, unit tests, GPU CI workflow matrix entry, and docs

Validated end-to-end on NVIDIA A100 80GB with Qwen2.5-0.5B-Instruct.
@sozercan sozercan merged commit 109b14f into main Mar 9, 2026
38 of 40 checks passed
@sozercan sozercan deleted the add-vllm-backend branch March 9, 2026 23:43
@sozercan sozercan linked an issue Mar 10, 2026 that may be closed by this pull request
1 task

Development

Successfully merging this pull request may close these issues.

[REQ] add vllm backend