Add vLLM backend support for high-throughput GPU inference
Add vLLM as a third backend option alongside llama-cpp and diffusers,
enabling inference of HuggingFace safetensors models on NVIDIA GPUs via vLLM.
- Add BackendVLLM constant and wire through backend selection, OCI tag
resolution, alias mapping, and metadata generation
- Install Python base dependencies + gcc/libc6-dev for Triton JIT compilation
- Install CUDA apt packages (libcublas, cuda-cudart) for vLLM runtime
- Support HuggingFace repo-level downloads (huggingface://namespace/model)
in addition to existing single-file downloads
- Add build-time patches for pre-built vLLM backend image compatibility
(flash_attn ABI fix, AsyncLLM API update)
- Add validation: vLLM requires CUDA runtime, amd64-only
- Add test aikitfile, unit tests, GPU CI workflow matrix entry, and docs
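The repo-level download bullet above can be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: `parseHuggingFaceSource` is an invented helper name, and the assumed convention is that a bare `huggingface://namespace/model` URI triggers a repo-level download while additional path segments select a single file, matching the existing single-file behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseHuggingFaceSource is a hypothetical sketch: it splits a
// huggingface:// URI into a repo ID and an optional file path.
// An empty file path signals a repo-level (whole-snapshot) download.
func parseHuggingFaceSource(src string) (repo, file string, err error) {
	const scheme = "huggingface://"
	if !strings.HasPrefix(src, scheme) {
		return "", "", fmt.Errorf("not a huggingface source: %s", src)
	}
	parts := strings.Split(strings.TrimPrefix(src, scheme), "/")
	if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected namespace/model, got %s", src)
	}
	repo = parts[0] + "/" + parts[1]
	if len(parts) > 2 {
		// Extra segments name a single file inside the repo.
		file = strings.Join(parts[2:], "/")
	}
	return repo, file, nil
}

func main() {
	repo, file, err := parseHuggingFaceSource("huggingface://Qwen/Qwen2.5-0.5B-Instruct")
	fmt.Println(repo, file, err)
}
```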
Validated end-to-end on NVIDIA A100 80GB with Qwen2.5-0.5B-Instruct.
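The validation rule (CUDA runtime required, amd64-only) could look roughly like the sketch below. `validateVLLM` and its string arguments are assumptions for illustration; the real code presumably checks the aikitfile's configured runtime and the build platform.

```go
package main

import "fmt"

// validateVLLM is a hypothetical sketch of the new validation rule:
// the vLLM backend only supports the CUDA runtime on amd64.
func validateVLLM(runtime, arch string) error {
	if runtime != "cuda" {
		return fmt.Errorf("vllm backend requires the cuda runtime, got %q", runtime)
	}
	if arch != "amd64" {
		return fmt.Errorf("vllm backend is amd64-only, got %q", arch)
	}
	return nil
}

func main() {
	fmt.Println(validateVLLM("cuda", "amd64"))
	fmt.Println(validateVLLM("cuda", "arm64"))
}
```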