
MCV: Investigate vLLM CPU backend cache support #149

@maryamtahhan

Description


Background

The vLLM V1 architecture supports a CPU backend for inference, which is now fully functional (🟢 status). The CPU backend also uses torch.compile like the GPU backend, so it likely generates similar compiled cache artifacts that could benefit from MCV's cache packaging and distribution capabilities.

However, MCV currently focuses on GPU-based caches (Triton kernels for AMD/NVIDIA GPUs). We need to investigate whether the CPU backend generates cacheable artifacts and if MCV should support packaging and distributing CPU-optimized caches.

vLLM CPU Backend Overview

Status: Fully supported in V1 architecture (see https://docs.vllm.ai/en/latest/usage/v1_guide.html)

Supported CPU Architectures:

  • Intel/AMD x86
  • ARM AArch64
  • Apple Silicon
  • IBM Z (S390X)

Key Environment Variables:

  • VLLM_CPU_KVCACHE_SPACE: KV cache size in GiB (default: 4 GiB)
  • VLLM_CPU_OMP_THREADS_BIND: CPU core binding for OpenMP threads
  • VLLM_CACHE_ROOT: Base cache directory (default: ~/.cache/vllm)


Questions to Investigate

  1. Cache Directory Structure
  • Does CPU backend use the same ~/.cache/vllm/torch_compile_cache/ structure as GPU backend?
  • Are there CPU-specific subdirectories or naming conventions?
  • Does it generate a Triton cache, or does it use different compilation backends?
  2. Compiled Artifacts
  • What compilation backends does CPU use? (e.g., Inductor, IPEX, OpenVINO?)
  • Are compiled kernels architecture-specific (x86 vs ARM vs Apple Silicon)?
  • Do CPU caches include binary artifacts similar to GPU Triton/Inductor caches?
  3. Cache Portability
  • Can CPU caches be transferred between machines with the same architecture?
  • Are caches tied to specific CPU features (AVX, AVX2, AVX-512, NEON, etc.)?
  • Do caches depend on specific library versions (Intel MKL, OpenBLAS, etc.)?
  4. Hardware Detection
  • How should MCV detect CPU backend usage vs GPU backend?
  • What CPU-specific metadata should be captured in manifests?
  • Should we track CPU architecture, instruction sets, NUMA topology?
  5. Use Cases
  • Is there demand for distributing CPU-optimized vLLM caches?
  • What deployment scenarios benefit most (edge devices, ARM servers, x86 clusters)?
  • How does cache reuse impact startup time on CPU vs GPU?

Investigation Tasks

  • Test CPU backend cache generation
    • Run vLLM on CPU backend and observe cache directory structure
    • Identify what gets cached and where
    • Compare to GPU backend cache structure
  • Analyze compilation artifacts
    • Determine compilation backend used (Inductor, IPEX, etc.)
    • Identify binary vs source artifacts
    • Check architecture-specific optimizations
  • Test cache portability
    • Generate cache on one CPU system
    • Transfer to another CPU system with same architecture
    • Verify cache reuse and performance impact
  • Review vLLM source code
    • vllm/platforms/cpu.py - CPU platform implementation
    • vllm/v1/attention/backends/cpu_attn.py - CPU attention backend
    • Check for CPU-specific cache handling
  • Assess MCV requirements
    • Determine if CPU cache support is needed
    • Identify required changes to MCV architecture
    • Evaluate effort vs benefit

Potential MCV Changes (if CPU support is needed)

  1. Detection Logic (pkg/accelerator/devices/)
  • Add CPU device detection
  • Distinguish between CPU-only and GPU+CPU deployments
  2. Cache Detection (pkg/cache/vllm.go)
  • Detect CPU backend caches
  • Handle architecture-specific metadata
  • Support mixed GPU/CPU cache scenarios
  3. Manifest Generation
  • Include CPU architecture information (x86, ARM, etc.)
  • Capture instruction set capabilities (AVX, NEON, etc.)
  • Document NUMA topology if relevant
  4. Preflight Checks (pkg/preflightcheck/)
  • Verify CPU architecture compatibility
  • Check instruction set availability
  • Validate library dependencies
  5. Documentation
  • Add CPU backend examples
  • Document CPU-specific considerations
  • Provide deployment scenarios


Success Criteria

  • Clear understanding of CPU backend cache structure and portability
  • Decision on whether MCV should support CPU caches
  • If yes: Implementation plan with effort estimates
  • If no: Documented reasoning for exclusion

Notes

  • CPU backend is production-ready in V1 (🟢 status)
  • Unlike the GPU backend, the CPU backend does not ship specialized GPU kernels
  • May use different compilation strategies (IPEX, ONNX, etc.)
  • Architecture-specific optimizations may limit portability
  • Consider edge computing and ARM server use cases

This investigation will help determine if MCV should expand beyond GPU cache support to include CPU-optimized deployments.
