Pull request overview
Adds vLLM as an additional LocalAI backend option to support high-throughput CUDA GPU inference with HuggingFace safetensors models, including build-time validation and HuggingFace repo-level model downloads.
Changes:
- Introduces a `vllm` backend constant plus backend tag/name/alias resolution and install flow (including compatibility patching).
- Adds HuggingFace repo-level downloads (`huggingface://namespace/model`) via an `hf-cli` helper image.
- Extends build validation, tests, docs, and the GPU CI workflow matrix to cover vLLM (CUDA-only, amd64-only).
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `website/docs/specs-inference.md` | Documents `vllm` as an allowed backend option. |
| `website/docs/gpu.md` | Adds vLLM usage guidance and an example config for GPU inference. |
| `test/aikitfile-vllm.yaml` | Adds a vLLM test aikitfile that downloads the model at runtime. |
| `pkg/utils/const.go` | Adds the `BackendVLLM` constant. |
| `pkg/build/build.go` | Enforces the vLLM runtime requirement (`runtime: cuda`) and adds vLLM to the backend allowlist. |
| `pkg/build/build_test.go` | Adds test cases for vLLM validation and arm64 rejection. |
| `pkg/aikit2llb/inference/backend.go` | Adds vLLM tag/name/alias logic, a dependency install hook, and compatibility patching for the backend image. |
| `pkg/aikit2llb/inference/backend_test.go` | Adds backend tag/name/alias tests for vLLM on CUDA/amd64. |
| `pkg/aikit2llb/inference/vllm.go` | Adds vLLM dependency installation (Python base deps + gcc/libc6-dev). |
| `pkg/aikit2llb/inference/vllm_test.go` | Adds a basic non-panicking test for vLLM dependency install wiring. |
| `pkg/aikit2llb/inference/convert.go` | Ensures CUDA libs are installed when the vLLM backend is selected. |
| `pkg/aikit2llb/inference/download.go` | Adds HuggingFace repo-level download support using the `hf` CLI image + optional HF token secret. |
| `.github/workflows/test-docker-gpu.yaml` | Adds vLLM to the GPU test matrix and validates a chat completion response. |
```go
func installVLLMDependencies(s llb.State, merge llb.State) llb.State {
	merge = installPythonBaseDependencies(s, merge)

	savedState := s
	s = s.Run(utils.Sh("apt-get update && apt-get install --no-install-recommends -y gcc libc6-dev && apt-get clean"),
		llb.WithCustomName("Installing C compiler for vLLM Triton JIT"),
	).Root()

	diff := llb.Diff(savedState, s)
	return llb.Merge([]llb.State{merge, diff})
}
```
installVLLMDependencies installs Python deps via installPythonBaseDependencies(s, merge) (which produces a diff from the current state), but then computes the gcc/libc6-dev diff from the original state s again. Merging two diffs that both modify dpkg/apt state (e.g., /var/lib/dpkg/status) can lead to lost updates or non-deterministic results. Consider running the gcc install step on top of the state returned by installPythonBaseDependencies (or performing both installs in a single Run and taking one Diff) so dpkg changes are applied sequentially instead of merged.
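The lost-update hazard described here can be sketched with a toy model (plain Go, no BuildKit): treat each diff as a map of changed paths, and merging two diffs that were both computed from the same base drops one side's changes to any file both touch, such as `/var/lib/dpkg/status`. The file contents below are made up for illustration.

```go
package main

import "fmt"

// merge is a toy stand-in for layer merging: each diff is a map of
// path -> content, and a later diff simply overwrites an earlier one
// for any path both contain.
func merge(diffs ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, d := range diffs {
		for path, content := range d {
			out[path] = content // later diff wins; earlier write is lost
		}
	}
	return out
}

func main() {
	base := "pkg-a installed"
	// Both diffs were computed from the same base state, so each carries
	// its own full copy of the dpkg status file.
	pythonDiff := map[string]string{"/var/lib/dpkg/status": base + "\npython3 installed"}
	gccDiff := map[string]string{"/var/lib/dpkg/status": base + "\ngcc installed"}

	merged := merge(pythonDiff, gccDiff)
	// python3's entry is gone: the gcc diff never saw the python install,
	// and its copy of the status file clobbers the other on merge.
	fmt.Println(merged["/var/lib/dpkg/status"])
}
```

Running the installs sequentially (or in a single `Run` with one `Diff`) avoids this, because the second install's view of the status file already contains the first install's entries.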
```shell
choices=$(echo "$result" | jq '.choices')
if [ -z "$choices" ]; then
  exit 1
fi
```
This test only checks that .choices is a non-empty string. If the API returns { "choices": null } or { "choices": [] }, jq '.choices' outputs null/[] (both non-empty strings) and the step will still pass. Consider using jq -e with a predicate that asserts choices is an array with length > 0 (and fail otherwise) to avoid false positives in the vLLM GPU test.
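A stricter check along the lines suggested could look like this (a sketch assuming `jq` is available; `jq -e` exits nonzero when the filter's result is `false` or `null`, so `null` and `[]` responses now fail):

```shell
# Return success only when .choices is a non-empty array.
validate_choices() {
  echo "$1" | jq -e '.choices | type == "array" and length > 0' > /dev/null
}

validate_choices '{"choices": [{"message": {"content": "hi"}}]}' && echo "valid"
validate_choices '{"choices": null}' || echo "rejected null"
validate_choices '{"choices": []}'   || echo "rejected empty"
```

In the workflow step, `validate_choices "$result" || exit 1` would then reject the false-positive cases the comment describes.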
Add vLLM as a third backend option alongside llama-cpp and diffusers, enabling HuggingFace safetensors model inference via vLLM on NVIDIA GPUs.

- Add `BackendVLLM` constant and wire through backend selection, OCI tag resolution, alias mapping, and metadata generation
- Install Python base dependencies + gcc/libc6-dev for Triton JIT compilation
- Install CUDA apt packages (libcublas, cuda-cudart) for the vLLM runtime
- Support HuggingFace repo-level downloads (`huggingface://namespace/model`) in addition to existing single-file downloads
- Add build-time patches for pre-built vLLM backend image compatibility (`flash_attn` ABI fix, `AsyncLLM` API update)
- Add validation: vLLM requires CUDA runtime, amd64-only
- Add a test aikitfile, unit tests, a GPU CI workflow matrix entry, and docs

Validated end-to-end on an NVIDIA A100 80GB with Qwen2.5-0.5B-Instruct.
Summary

- Adds vLLM as a third backend option (alongside `llama-cpp` and `diffusers`) for high-throughput GPU inference with HuggingFace safetensors models
- Adds build-time compatibility patches for the pre-built vLLM backend image (`flash_attn` ABI fix, `AsyncLLM` API patch)

Changes
- `pkg/utils/const.go` — `BackendVLLM` constant
- `pkg/aikit2llb/inference/backend.go` — backend tag/name/alias resolution, dependency installation, compatibility patches for vLLM
- `pkg/aikit2llb/inference/vllm.go` — Python base deps + gcc/libc6-dev for Triton JIT
- `pkg/aikit2llb/inference/convert.go` — CUDA apt packages for vLLM
- `pkg/aikit2llb/inference/download.go` — HuggingFace repo-level downloads (`huggingface://namespace/model`)
- `pkg/build/build.go` — validation (vLLM requires CUDA, amd64)
- `test/aikitfile-vllm.yaml` — test aikitfile (runtime model download)
- `.github/workflows/test-docker-gpu.yaml` — vLLM in the GPU test matrix
- `website/docs/` — specs and GPU docs updated

Test plan
- `make test` — 148 tests pass (including new vLLM cases for backend tag/name/alias, build validation, ARM64 rejection)
- `make lint` — 0 issues
- Image build (`docker buildx build .`)
- vLLM test build (`docker buildx build -f test/aikitfile-vllm.yaml`)
- `curl /v1/chat/completions` with Qwen2.5-0.5B-Instruct — success
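Based on the pieces this PR describes (`runtime: cuda`, a `vllm` backend value, and the `huggingface://namespace/model` download scheme), a vLLM aikitfile might look roughly like the sketch below. Only those three elements come from the PR; the overall layout and the `apiVersion`, `backends`, and `models` field names are assumptions based on typical aikitfile structure, and the model name/source are illustrative.

```yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
runtime: cuda          # vLLM requires CUDA (amd64-only), per the build validation
backends:
  - vllm
models:
  # Repo-level download: fetches the whole HuggingFace repo (safetensors
  # weights, tokenizer, config) rather than a single file.
  - name: qwen2.5-0.5b-instruct
    source: huggingface://Qwen/Qwen2.5-0.5B-Instruct
```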