
Commit bbd37ef

mcv: AOT cache support
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
1 parent 6073411 commit bbd37ef

File tree

2 files changed: +322 -12 lines changed


mcv/README.md

Lines changed: 9 additions & 5 deletions
```diff
@@ -88,15 +88,19 @@ Kernel/vLLM model. The details can be found in
 
 ### vLLM Binary Cache Support
 
-MCV supports both legacy (triton cache) and new (binary cache) vLLM formats:
+MCV supports all vLLM cache formats:
 
 1. **vLLM Triton Cache Format** (legacy) - Stores `triton_cache/` and
    `inductor_cache/` inside rank directories
-2. **vLLM Binary Cache Format** (new) - Stores prefix directories
-   (e.g., `backbone/`) inside rank directories
+2. **vLLM Binary Cache Format** (default) - Stores compiled artifacts in prefix
+   directories (e.g., `backbone/`) with embedded Triton kernels
+3. **vLLM AOT Cache Format** (advanced) - Uses `VLLM_USE_MEGA_AOT_ARTIFACT=true`
+   for fully self-contained portable artifacts
 
-For detailed information about vLLM binary cache support, see:
-[vllm-binary-cache.md](./docs/vllm-binary-cache.md)
+Both binary and AOT formats use identical structure and are automatically detected.
+
+For detailed information about vLLM cache formats, torch.compile architecture,
+and best practices, see [vllm-binary-cache.md](./docs/vllm-binary-cache.md)
 
 ### Triton Cache Example
 
```
mcv/docs/vllm-binary-cache.md

Lines changed: 313 additions & 7 deletions
```diff
@@ -2,14 +2,16 @@
 
 ## Overview
 
-MCV supports two vLLM cache formats:
+MCV supports three vLLM cache formats:
 
 1. **vLLM Triton Cache Format** (legacy) - Stores `triton_cache/` and
    `inductor_cache/` inside rank directories
-2. **vLLM Binary Cache Format** (new) - Stores prefix directories
-   (e.g., `backbone/`) inside rank directories
+2. **vLLM Binary Cache Format** (default) - Stores compiled artifacts in prefix
+   directories with embedded Triton kernels
+3. **vLLM AOT Cache Format** (advanced) - Uses `VLLM_USE_MEGA_AOT_ARTIFACT=true`
+   for fully self-contained portable artifacts
 
-Both formats share the same top-level structure:
+All formats share the same top-level structure:
 `torch_compile_cache/{hash}/rank_{rank}_{dp_rank}/`
 
 The key differences are **inside the rank directory**:
```
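The format distinction above lends itself to a simple directory check. Below is a minimal sketch of such detection logic (the function name, the heuristics, and the exact JSON layout of `cache_key_factors.json` are illustrative assumptions, not MCV's actual code):

```python
import json
import tempfile
from pathlib import Path


def detect_cache_format(rank_dir: Path) -> str:
    """Guess which vLLM cache format a rank directory holds.

    Illustrative heuristics:
    - triton_cache/ or inductor_cache/ present -> legacy Triton format
    - a prefix dir (e.g. backbone/) containing cache_key_factors.json ->
      binary format, or AOT if the mega-AOT flag is set in that file.
    """
    if (rank_dir / "triton_cache").is_dir() or (rank_dir / "inductor_cache").is_dir():
        return "triton"
    for prefix in rank_dir.iterdir():
        factors = prefix / "cache_key_factors.json"
        if prefix.is_dir() and factors.is_file():
            data = json.loads(factors.read_text())
            # Assumed key placement; real files may nest this differently.
            if data.get("VLLM_USE_MEGA_AOT_ARTIFACT") is True:
                return "aot"
            return "binary"
    return "unknown"


# Tiny self-test on a synthetic layout:
base = Path(tempfile.mkdtemp())
(base / "triton_rank" / "triton_cache").mkdir(parents=True)
aot_prefix = base / "aot_rank" / "backbone"
aot_prefix.mkdir(parents=True)
(aot_prefix / "cache_key_factors.json").write_text(
    json.dumps({"VLLM_USE_MEGA_AOT_ARTIFACT": True})
)

legacy = detect_cache_format(base / "triton_rank")
aot = detect_cache_format(base / "aot_rank")
```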
````diff
@@ -18,10 +20,87 @@ The key differences are **inside the rank directory**:
   subdirectories with unpacked artifacts
 - **Binary format**: Contains prefix directories
   (e.g., `backbone/`, `eagle_head/`) with `cache_key_factors.json`
-  and artifacts that can be either binary files or unpacked directories
+  and binary artifacts containing embedded Triton kernels
+- **AOT format**: Identical structure to binary format, but uses PyTorch's
+  `AOTCompiledArtifact` serialization (indicated by `VLLM_USE_MEGA_AOT_ARTIFACT: true`
+  in `cache_key_factors.json`)
 
-This document describes the **vLLM Binary Cache Format** introduced in recent
-versions of vLLM.
+This document describes the **vLLM Binary and AOT Cache Formats** and how
+torch.compile caching works with MCV.
+
+## Torch Compile Architecture
+
+### How vLLM Uses torch.compile
+
+When vLLM is configured with `VLLM_TORCH_COMPILE_LEVEL=1`, it uses PyTorch's
+`torch.compile` with the TorchInductor backend to optimize model execution:
+
+```
+Model Code → torch.compile → TorchInductor → Triton/CUDA Kernels → GPU Execution
+```
+
+**First Run (Compilation)**:
+1. vLLM traces the model with Dynamo
+2. TorchInductor compiles the graph
+3. Triton generates optimized GPU kernels → `/tmp/torchinductor_root/`
+4. vLLM saves artifacts using `standalone_compile().save(format="binary")`
+5. **PyTorch bundles the Triton kernels into the artifacts**
+6. The complete cache is saved to `~/.cache/vllm/torch_compile_cache/`
+
+**Subsequent Runs (Cache Hit)**:
+1. vLLM loads artifacts from `~/.cache/vllm/torch_compile_cache/`
+2. **PyTorch extracts the embedded Triton kernels → `/tmp/torchinductor_root/`**
+3. Execution resumes using the extracted kernels (~10-20 s vs. 3-5 min of compilation)
+
+### Binary vs AOT Formats
+
+Both binary and AOT formats bundle Triton kernels in the artifacts, but differ
+in serialization:
+
+**Binary Format** (default):
+- Uses PyTorch's `standalone_compile().save(format="binary")`
+- Environment: `VLLM_USE_MEGA_AOT_ARTIFACT=false` (default)
+- Good for same-PyTorch-version deployments
+- Typical size: ~95 MB for small models
+
+**AOT Format** (advanced):
+- Uses PyTorch's `AOTCompiledArtifact.serialize()`
+- Environment: `VLLM_USE_MEGA_AOT_ARTIFACT=true`
+- More portable across PyTorch versions (requires 2.10+)
+- Includes the bundled AOT autograd cache
+- Typical size: ~92 MB for small models
+
+**Important**: From MCV's perspective, both formats are **structurally identical**
+and use the same detection and packaging logic.
+
+### The /tmp Cache Directory
+
+During compilation and execution, PyTorch creates temporary files:
+
+```
+/tmp/torchinductor_root/
+├── triton/0/{hash}/
+│   ├── triton_.cubin    # Compiled GPU binary (ELF)
+│   ├── triton_.source   # Triton source code
+│   ├── triton_.ttir     # Triton IR
+│   └── triton_.ptx      # PTX assembly
+├── o7/, dp/, .../       # Python kernel cache
+└── aotautograd/         # AOT autograd cache
+```
+
+**Size**: ~16 MB for small models
+
+**Lifecycle**:
+- **First run**: Created during compilation
+- **Cache hit**: Extracted from the embedded artifacts
+- **Cleanup**: Cleared on reboot (tmpfs) or by manual deletion
+- **Recreation**: Automatic on every vLLM start
+
+**Key Insight**: This directory is **NOT needed for cache portability**.
+The Triton kernels are already embedded in the binary artifacts (verified by
+finding 42 ELF headers in a 5.3 MB artifact file).
+
+**MCV does NOT capture `/tmp`** - kernels auto-extract at runtime (~2 seconds).
 
 ## Binary Cache Format
 
````
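The "embedded kernels" claim can be spot-checked by counting ELF magic bytes (`\x7fELF`) inside an artifact file, since each compiled `.cubin` is an ELF object. A rough sketch (demoed here on a synthetic blob rather than a real artifact):

```python
import tempfile
from pathlib import Path

ELF_MAGIC = b"\x7fELF"


def count_elf_headers(path: Path) -> int:
    """Count occurrences of the ELF magic inside a binary artifact.

    Each hit is likely an embedded compiled GPU kernel (.cubin)."""
    return path.read_bytes().count(ELF_MAGIC)


# Synthetic demo: a blob with two embedded "ELF" payloads.
blob = b"junk" + ELF_MAGIC + b"kernel-1" + b"padding" + ELF_MAGIC + b"kernel-2"
tmp = Path(tempfile.mkstemp()[1])
tmp.write_bytes(blob)
n = count_elf_headers(tmp)
```

Run against a real `artifact_*` file, a nonzero count is a quick sanity check that kernels travelled with the cache.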
@@ -439,6 +518,233 @@ To migrate from vLLM triton cache format to vLLM binary cache format:

4. Package new cache with MCV (automatically detected)
5. Both vLLM cache formats are supported, no breaking changes

## Practical Guide

### Generating a Cache

**Environment Setup**:
```bash
export VLLM_TORCH_COMPILE_MODE=vllm-compile
export VLLM_TORCH_COMPILE_LEVEL=1

# For binary format (default):
export VLLM_COMPILE_CACHE_SAVE_FORMAT=binary
export VLLM_USE_MEGA_AOT_ARTIFACT=false  # or omit (default)

# For AOT format (more portable):
export VLLM_COMPILE_CACHE_SAVE_FORMAT=binary
export VLLM_USE_MEGA_AOT_ARTIFACT=true   # requires PyTorch 2.10+
```

**Run vLLM Warmup**:
```bash
vllm serve my-model --tensor-parallel-size 1

# Make sample requests to trigger compilation:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 100}'
```

**Verify Cache**:
```bash
ls -lh ~/.cache/vllm/torch_compile_cache/
# Should show a 10-char hash directory (e.g., 8d0a361fbc)

# Check cache contents:
find ~/.cache/vllm/torch_compile_cache/ -type f | head
```

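Locating the hash directory to package can be scripted. Below is a sketch that finds the most recently modified 10-character hash directory under the cache root (the helper name and the "10 characters means a hash" heuristic are assumptions for illustration):

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_hash_dir(cache_root: Path) -> Optional[Path]:
    """Return the most recently modified 10-char hash directory, if any."""
    candidates = [
        d for d in cache_root.iterdir()
        if d.is_dir() and len(d.name) == 10
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.stat().st_mtime)


# Demo on a synthetic cache root:
root = Path(tempfile.mkdtemp())
(root / "8d0a361fbc").mkdir()       # looks like a vLLM cache hash dir
(root / "not-a-hash-dir").mkdir()   # should be ignored
found = latest_hash_dir(root)
```

In practice `cache_root` would be `~/.cache/vllm/torch_compile_cache/` and the result would be passed to `mcv -c -d`.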
### Packaging with MCV

**Create Container Image**:
```bash
mcv -c \
  -d ~/.cache/vllm/torch_compile_cache/{hash} \
  -i quay.io/myorg/my-model-cache:v1
```

**Verify Image Labels**:
```bash
skopeo inspect containers-storage:quay.io/myorg/my-model-cache:v1 \
  | jq '.Labels'

# Expected labels:
# {
#   "cache.vllm.image/cache-size-bytes": "95000000",
#   "cache.vllm.image/entry-count": "1",
#   "cache.vllm.image/format": "binary",
#   "cache.vllm.image/summary": "{\"targets\":[{\"backend\":\"cuda\",...}]}"
# }
```

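The label map above can also be validated in code, e.g. in a CI step after `skopeo inspect`. A sketch under stated assumptions: the label names come from the expected output above, while the set of accepted `format` values and the checks themselves are illustrative, not MCV's specification:

```python
import json


def check_labels(labels: dict) -> list:
    """Return a list of problems found in an MCV image label map."""
    problems = []
    # "binary" appears in the doc; "triton" is an assumed legacy value.
    if labels.get("cache.vllm.image/format") not in ("binary", "triton"):
        problems.append("unexpected or missing format label")
    try:
        size = int(labels.get("cache.vllm.image/cache-size-bytes", "0"))
    except ValueError:
        size = 0
    if size <= 0:
        problems.append("cache-size-bytes is missing or non-positive")
    try:
        json.loads(labels.get("cache.vllm.image/summary", ""))
    except json.JSONDecodeError:
        problems.append("summary label is not valid JSON")
    return problems


sample = {
    "cache.vllm.image/cache-size-bytes": "95000000",
    "cache.vllm.image/entry-count": "1",
    "cache.vllm.image/format": "binary",
    "cache.vllm.image/summary": '{"targets":[{"backend":"cuda"}]}',
}
issues = check_labels(sample)
```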
### Using a Cached Image

**Extract Cache**:
```bash
mcv -e -i quay.io/myorg/my-model-cache:v1

# MCV extracts to: ~/.cache/vllm/torch_compile_cache/{hash}/
```

**Start vLLM**:
```bash
# vLLM automatically detects and uses the cache
vllm serve my-model --tensor-parallel-size 1

# Look for log message:
# INFO: Directly load the compiled graph(s) from the cache, took X.X s
```

### Cache Compatibility

A cache is compatible if:
1. **GPU architecture** matches (check: `nvidia-smi --query-gpu=compute_cap`)
2. **CUDA/ROCm version** is compatible (check: `nvcc --version` or `rocm-smi`)
3. **PyTorch version** is compatible
4. **Model code** is unchanged (the code hash must match)
5. **vLLM configuration** matches (TP size, compile level, etc.)

**Check Compatibility**:
```bash
# View cache metadata:
cat ~/.cache/vllm/torch_compile_cache/*/rank_0_0/*/cache_key_factors.json \
  | jq '{target: .env.VLLM_TARGET_DEVICE, cuda: .env.VLLM_MAIN_CUDA_VERSION}'

# Compare with system:
nvidia-smi
# or
rocm-smi
```

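The manual comparison above can be partially automated by diffing selected fields of `cache_key_factors.json` against what the host expects. A minimal sketch, assuming the nested `env` layout implied by the `jq` filter above (the `expected` values are illustrative):

```python
def incompatibilities(factors: dict, expected: dict) -> list:
    """Compare selected env factors from cache_key_factors.json with the
    values expected on this host; return a description of each mismatch."""
    env = factors.get("env", {})
    return [
        f"{key}: cache has {env.get(key)!r}, host expects {want!r}"
        for key, want in expected.items()
        if env.get(key) != want
    ]


# Demo: a CUDA-version mismatch between cache and host.
cached = {"env": {"VLLM_TARGET_DEVICE": "cuda", "VLLM_MAIN_CUDA_VERSION": "12.8"}}
mismatches = incompatibilities(
    cached,
    {"VLLM_TARGET_DEVICE": "cuda", "VLLM_MAIN_CUDA_VERSION": "12.4"},
)
```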
## Troubleshooting

### Cache Not Being Used

**Symptom**: vLLM recompiles on every start despite having a cache

**Common Causes**:
1. **Hash mismatch** - Configuration or environment changed
2. **Incompatible GPU** - Different architecture (e.g., sm_75 vs sm_80)
3. **PyTorch version** - The binary format is sensitive to the PyTorch version
4. **Model code changed** - The code hash no longer matches

**Debug Steps**:
```bash
# 1. Check if the cache exists
ls ~/.cache/vllm/torch_compile_cache/

# 2. Enable debug logging
export VLLM_LOGGING_LEVEL=DEBUG

# 3. Check for hash mismatches in the logs
grep "cache" vllm.log | grep -i "hash\|miss"

# 4. Verify GPU compatibility
python -c "import torch; print(torch.cuda.get_device_capability())"
```

### Slow Startup with Cache

**Symptom**: vLLM takes 20+ seconds to start with a cache

**Normal Behavior**: 10-20 seconds for kernel extraction from the artifacts is expected

**If Slower**:
- Check disk I/O performance: `iostat -x 1`
- Verify `/tmp` is not on slow storage (NFS, etc.)
- Consider using `tmpfs` for `/tmp`: `df -h /tmp`

### Missing Kernels Error

**Symptom**: Runtime errors about missing Triton kernels

**Causes**:
1. Corrupted artifacts
2. Incomplete cache (warmup didn't cover all batch sizes)
3. Disk space issues during generation

**Solutions**:
```bash
# 1. Delete and regenerate the cache
rm -rf ~/.cache/vllm/torch_compile_cache/*

# 2. Verify disk space
df -h ~/.cache/vllm/

# 3. Check artifact integrity
file ~/.cache/vllm/torch_compile_cache/*/rank_0_0/*/artifact_*
# Should show: "data" (binary format)
```

### AOT Format Issues

**Symptom**: AOT artifacts fail to load

**Requirements**:
- PyTorch 2.10.0 or later
- `VLLM_USE_MEGA_AOT_ARTIFACT=true`
- `VLLM_USE_STANDALONE_COMPILE=true`

**Verify**:
```bash
# Check PyTorch version
python -c "import torch; print(torch.__version__)"

# Verify AOT flag in cache
grep "VLLM_USE_MEGA_AOT_ARTIFACT" \
  ~/.cache/vllm/torch_compile_cache/*/rank_0_0/*/cache_key_factors.json
```

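The PyTorch 2.10+ requirement can be gated in a warmup or deployment script by parsing the version string. A small sketch that tolerates local suffixes such as `+cu128` or `.dev...` builds:

```python
def meets_min_version(version: str, minimum=(2, 10)) -> bool:
    """True if a torch.__version__-style string is at least `minimum`.

    Strips local/dev suffixes such as '+cu128' or '.dev20250101'."""
    core = version.split("+")[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # hit a non-numeric component like 'dev20250101'
        parts.append(int(digits))
    return tuple(parts) >= minimum


ok_aot = meets_min_version("2.10.0+cu128")   # new enough for AOT
too_old = meets_min_version("2.9.1")         # below the 2.10 requirement
```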
## Advanced Topics

### Multi-GPU Caching

For tensor parallelism or pipeline parallelism:

```
torch_compile_cache/{hash}/
├── rank_0_0/    # First tensor parallel rank
├── rank_0_1/    # Second tensor parallel rank
├── rank_1_0/    # First pipeline parallel rank
└── rank_1_1/    # Second pipeline + tensor parallel rank
```

MCV captures all rank directories. Extract the entire hash directory for
multi-GPU deployments.

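A deployment check can enumerate the rank directories it expects from the parallelism config. A tiny sketch assuming the `rank_{rank}_{dp_rank}` naming shown in the tree above (parameter names are illustrative):

```python
def expected_rank_dirs(world_size: int, dp_size: int = 1) -> list:
    """Expected rank_{rank}_{dp_rank} directory names for a deployment."""
    return [
        f"rank_{rank}_{dp}"
        for rank in range(world_size)
        for dp in range(dp_size)
    ]


# A 2x2 deployment matches the four directories in the tree above.
dirs = expected_rank_dirs(world_size=2, dp_size=2)
```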
### Multiple Model Components

Models with speculative decoding have multiple components:

```
rank_0_0/
├── backbone/      # Main model
│   └── artifact_*
└── eagle_head/    # Draft model for speculation
    └── artifact_*
```

MCV captures all prefix directories automatically.

### Cache Size Optimization

**Typical Sizes**:
- Small models (< 1B params): 50-100 MB
- Medium models (1-10B params): 100-500 MB
- Large models (10B+ params): 500 MB - 2 GB

**Factors Affecting Size**:
- Number of compiled ranges (batch sizes)
- Number of layers
- Triton kernel count
- Autotune configurations

**Reduce Size**:
- Use fewer compile ranges: `VLLM_COMPILE_RANGES=[128,512]` vs default
- Binary format is smaller than unpacked
- AOT format is similar to binary

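To see where the bytes go, the per-component sizes can be tallied with a short directory walk over a rank directory. A minimal sketch (demoed on a synthetic layout):

```python
import tempfile
from pathlib import Path


def size_by_prefix(rank_dir: Path) -> dict:
    """Total file bytes under each prefix directory (backbone/, eagle_head/, ...)."""
    sizes = {}
    for prefix in rank_dir.iterdir():
        if prefix.is_dir():
            sizes[prefix.name] = sum(
                f.stat().st_size for f in prefix.rglob("*") if f.is_file()
            )
    return sizes


# Demo on a synthetic rank directory with one 1 KiB artifact:
rank = Path(tempfile.mkdtemp())
(rank / "backbone").mkdir()
(rank / "backbone" / "artifact_0").write_bytes(b"x" * 1024)
sizes = size_by_prefix(rank)
```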
## See Also

- [spec-compat.md](./spec-compat.md) - OCI image specification
