
Add vLLM backend support#745

Merged
sozercan merged 1 commit into main from add-vllm-backend on Mar 9, 2026

Conversation

@sozercan
Member

@sozercan sozercan commented Mar 9, 2026

Summary

  • Adds vLLM as a third backend option (alongside llama-cpp and diffusers) for high-throughput GPU inference with HuggingFace safetensors models
  • vLLM is CUDA-only, amd64-only — validated end-to-end on NVIDIA A100 80GB with Qwen2.5-0.5B-Instruct
  • Includes build-time workarounds for the pre-built LocalAI vLLM backend image (flash_attn ABI fix, AsyncLLM API patch)

Changes

  • pkg/utils/const.go — BackendVLLM constant
  • pkg/aikit2llb/inference/backend.go — Backend tag/name/alias resolution, dependency installation, compatibility patches for vLLM
  • pkg/aikit2llb/inference/vllm.go — Python base deps + gcc/libc6-dev for Triton JIT
  • pkg/aikit2llb/inference/convert.go — CUDA apt packages for vLLM
  • pkg/aikit2llb/inference/download.go — HuggingFace repo-level downloads (huggingface://namespace/model)
  • pkg/build/build.go — Validation (vLLM requires CUDA, amd64)
  • test/aikitfile-vllm.yaml — Test aikitfile (runtime model download)
  • .github/workflows/test-docker-gpu.yaml — vLLM in GPU test matrix
  • website/docs/ — Specs and GPU docs updated
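
The repo-level download scheme above maps a huggingface://namespace/model source to the HuggingFace repo ID that an hf CLI download step would consume. A minimal sketch of that parsing — the helper name and error handling here are illustrative, not the actual pkg/aikit2llb/inference/download.go code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseHuggingFaceRef is a hypothetical helper: it turns a
// "huggingface://namespace/model" source into the "namespace/model"
// repo ID a repo-level download step would fetch.
func parseHuggingFaceRef(src string) (string, error) {
	const scheme = "huggingface://"
	if !strings.HasPrefix(src, scheme) {
		return "", fmt.Errorf("not a huggingface source: %s", src)
	}
	repo := strings.TrimPrefix(src, scheme)
	parts := strings.Split(repo, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", fmt.Errorf("expected huggingface://namespace/model, got %s", src)
	}
	return repo, nil
}

func main() {
	repo, err := parseHuggingFaceRef("huggingface://Qwen/Qwen2.5-0.5B-Instruct")
	if err != nil {
		panic(err)
	}
	fmt.Println(repo) // Qwen/Qwen2.5-0.5B-Instruct
}
```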

Test plan

  • make test — 148 tests pass (including new vLLM cases for backend tag/name/alias, build validation, ARM64 rejection)
  • make lint — 0 issues
  • Build AIKit image (docker buildx build .)
  • Build vLLM model image (docker buildx build -f test/aikitfile-vllm.yaml)
  • Run inference on GPU VM: curl /v1/chat/completions with Qwen2.5-0.5B-Instruct — success

Copilot AI review requested due to automatic review settings March 9, 2026 20:29
Contributor

Copilot AI left a comment


Pull request overview

Adds vLLM as an additional LocalAI backend option to support high-throughput CUDA GPU inference with HuggingFace safetensors models, including build-time validation and HuggingFace repo-level model downloads.

Changes:

  • Introduces vllm backend constant plus backend tag/name/alias resolution and install flow (including compatibility patching).
  • Adds HuggingFace repo-level downloads (huggingface://namespace/model) via an hf-cli helper image.
  • Extends build validation, tests, docs, and GPU CI workflow matrix to cover vLLM (CUDA-only, amd64-only).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
website/docs/specs-inference.md Documents vllm as an allowed backend option.
website/docs/gpu.md Adds vLLM usage guidance and example config for GPU inference.
test/aikitfile-vllm.yaml Adds a vLLM test aikitfile that downloads the model at runtime.
pkg/utils/const.go Adds BackendVLLM constant.
pkg/build/build.go Enforces vLLM runtime requirement (runtime: cuda) and supports vLLM in backend allowlist.
pkg/build/build_test.go Adds test cases for vLLM validation and arm64 rejection.
pkg/aikit2llb/inference/backend.go Adds vLLM tag/name/alias logic, dependency install hook, and compatibility patching for the backend image.
pkg/aikit2llb/inference/backend_test.go Adds backend tag/name/alias tests for vLLM on CUDA/amd64.
pkg/aikit2llb/inference/vllm.go Adds vLLM dependency installation (Python base deps + gcc/libc6-dev).
pkg/aikit2llb/inference/vllm_test.go Adds a basic non-panicking test for vLLM dependency install wiring.
pkg/aikit2llb/inference/convert.go Ensures CUDA libs are installed when vLLM backend is selected.
pkg/aikit2llb/inference/download.go Adds HuggingFace repo-level download support using hf CLI image + optional HF token secret.
.github/workflows/test-docker-gpu.yaml Adds vLLM to GPU test matrix and validates a chat completion response.
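
The build.go constraints from the table (vLLM requires runtime: cuda, amd64-only) can be sketched as a standalone check; the function name and string arguments here are illustrative, not the actual build.go signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// validateVLLM is a hypothetical stand-in for the build-time checks:
// the vllm backend is only accepted with the CUDA runtime on amd64.
func validateVLLM(backend, runtime, arch string) error {
	if backend != "vllm" {
		return nil // constraints only apply to the vllm backend
	}
	if runtime != "cuda" {
		return errors.New("vllm backend requires runtime: cuda")
	}
	if arch != "amd64" {
		return fmt.Errorf("vllm backend is amd64-only, got %s", arch)
	}
	return nil
}

func main() {
	fmt.Println(validateVLLM("vllm", "cuda", "amd64")) // <nil>
	fmt.Println(validateVLLM("vllm", "cuda", "arm64")) // vllm backend is amd64-only, got arm64
}
```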

Comment on lines +10 to +20
func installVLLMDependencies(s llb.State, merge llb.State) llb.State {
	merge = installPythonBaseDependencies(s, merge)

	savedState := s
	s = s.Run(utils.Sh("apt-get update && apt-get install --no-install-recommends -y gcc libc6-dev && apt-get clean"),
		llb.WithCustomName("Installing C compiler for vLLM Triton JIT"),
	).Root()

	diff := llb.Diff(savedState, s)
	return llb.Merge([]llb.State{merge, diff})
}

Copilot AI Mar 9, 2026


installVLLMDependencies installs Python deps via installPythonBaseDependencies(s, merge) (which produces a diff from the current state), but then computes the gcc/libc6-dev diff from the original state s again. Merging two diffs that both modify dpkg/apt state (e.g., /var/lib/dpkg/status) can lead to lost updates or non-deterministic results. Consider running the gcc install step on top of the state returned by installPythonBaseDependencies (or performing both installs in a single Run and taking one Diff) so dpkg changes are applied sequentially instead of merged.
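
The hazard described here can be illustrated without BuildKit. The toy model below treats a filesystem state as a map and shows that merging two diffs taken from the same base keeps only one side's write to a shared file (like /var/lib/dpkg/status), while applying the steps sequentially keeps both; the names and semantics are simplified stand-ins, not the llb API:

```go
package main

import "fmt"

// State models a filesystem as path -> contents.
type State map[string]string

// diff returns the paths whose contents changed between base and next.
func diff(base, next State) State {
	d := State{}
	for k, v := range next {
		if base[k] != v {
			d[k] = v
		}
	}
	return d
}

// apply overlays a diff onto a state; later writes win, roughly like
// the last layer in a merge.
func apply(s State, d State) State {
	out := State{}
	for k, v := range s {
		out[k] = v
	}
	for k, v := range d {
		out[k] = v
	}
	return out
}

func main() {
	base := State{"/var/lib/dpkg/status": ""}

	// Two installs branched from the same base, each rewriting dpkg status.
	pythonInstall := apply(base, State{"/var/lib/dpkg/status": "python"})
	gccInstall := apply(base, State{"/var/lib/dpkg/status": "gcc"})

	// Merging the two diffs: the second overwrites the first,
	// so the python entry is lost.
	merged := apply(apply(base, diff(base, pythonInstall)), diff(base, gccInstall))
	fmt.Println("merged:", merged["/var/lib/dpkg/status"]) // merged: gcc

	// Sequential: run the gcc install on top of the python state instead.
	sequential := apply(pythonInstall, State{"/var/lib/dpkg/status": pythonInstall["/var/lib/dpkg/status"] + "+gcc"})
	fmt.Println("sequential:", sequential["/var/lib/dpkg/status"]) // sequential: python+gcc
}
```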

Comment on lines +98 to +101
choices=$(echo "$result" | jq '.choices')
if [ -z "$choices" ]; then
  exit 1
fi

Copilot AI Mar 9, 2026


This test only checks that .choices is a non-empty string. If the API returns { "choices": null } or { "choices": [] }, jq '.choices' outputs null/[] (both non-empty strings) and the step will still pass. Consider using jq -e with a predicate that asserts choices is an array with length > 0 (and fail otherwise) to avoid false positives in the vLLM GPU test.
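
A predicate along the lines the review suggests would be something like jq -e '.choices | type == "array" and length > 0'. The same check can be expressed as a self-contained Go sketch, showing why null or empty choices must fail while a non-empty array passes:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasChoices reports whether an API response body contains a non-empty
// .choices array — the predicate the review suggests, as opposed to
// "the jq output string is non-empty".
func hasChoices(body []byte) bool {
	var resp struct {
		Choices []json.RawMessage `json:"choices"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return false
	}
	return len(resp.Choices) > 0
}

func main() {
	fmt.Println(hasChoices([]byte(`{"choices": null}`)))           // false
	fmt.Println(hasChoices([]byte(`{"choices": []}`)))             // false
	fmt.Println(hasChoices([]byte(`{"choices": [{"index": 0}]}`))) // true
}
```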

Add vLLM as a third backend option alongside llama-cpp and diffusers,
enabling HuggingFace safetensors model inference via vLLM on NVIDIA GPUs.

- Add BackendVLLM constant and wire through backend selection, OCI tag
  resolution, alias mapping, and metadata generation
- Install Python base dependencies + gcc/libc6-dev for Triton JIT compilation
- Install CUDA apt packages (libcublas, cuda-cudart) for vLLM runtime
- Support HuggingFace repo-level downloads (huggingface://namespace/model)
  in addition to existing single-file downloads
- Add build-time patches for pre-built vLLM backend image compatibility
  (flash_attn ABI fix, AsyncLLM API update)
- Add validation: vLLM requires CUDA runtime, amd64-only
- Add test aikitfile, unit tests, GPU CI workflow matrix entry, and docs

Validated end-to-end on NVIDIA A100 80GB with Qwen2.5-0.5B-Instruct.
@sozercan sozercan merged commit 109b14f into main Mar 9, 2026
38 of 40 checks passed
@sozercan sozercan deleted the add-vllm-backend branch March 9, 2026 23:43
@sozercan sozercan linked an issue Mar 10, 2026 that may be closed by this pull request
1 task

Development

Successfully merging this pull request may close these issues.

[REQ] add vllm backend