## Overview

MCV supports three vLLM cache formats:

1. **vLLM Triton Cache Format** (legacy) - Stores `triton_cache/` and
   `inductor_cache/` inside rank directories
2. **vLLM Binary Cache Format** (default) - Stores compiled artifacts in prefix
   directories with embedded Triton kernels
3. **vLLM AOT Cache Format** (advanced) - Uses `VLLM_USE_MEGA_AOT_ARTIFACT=true`
   for fully self-contained portable artifacts

All formats share the same top-level structure:
`torch_compile_cache/{hash}/rank_{rank}_{dp_rank}/`

The key differences are **inside the rank directory**:
- **Triton format**: Contains `triton_cache/` and `inductor_cache/`
  subdirectories with unpacked artifacts
- **Binary format**: Contains prefix directories
  (e.g., `backbone/`, `eagle_head/`) with `cache_key_factors.json`
  and binary artifacts containing embedded Triton kernels
- **AOT format**: Identical structure to the binary format, but uses PyTorch's
  `AOTCompiledArtifact` serialization (indicated by `VLLM_USE_MEGA_AOT_ARTIFACT: true`
  in `cache_key_factors.json`)

This document describes the **vLLM Binary and AOT Cache Formats** and how
torch.compile caching works with MCV.
30+
## Torch Compile Architecture

### How vLLM Uses torch.compile

When vLLM is configured with `VLLM_TORCH_COMPILE_LEVEL=1`, it uses PyTorch's
`torch.compile` with the TorchInductor backend to optimize model execution:

```
Model Code → torch.compile → TorchInductor → Triton/CUDA Kernels → GPU Execution
```

**First Run (Compilation)**:
1. vLLM traces the model with Dynamo
2. TorchInductor compiles the graph
3. Triton generates optimized GPU kernels → `/tmp/torchinductor_root/`
4. vLLM saves artifacts using `standalone_compile().save(format="binary")`
5. **PyTorch bundles the Triton kernels into the artifacts**
6. Complete cache saved to `~/.cache/vllm/torch_compile_cache/`

**Subsequent Runs (Cache Hit)**:
1. vLLM loads artifacts from `~/.cache/vllm/torch_compile_cache/`
2. **PyTorch extracts embedded Triton kernels → `/tmp/torchinductor_root/`**
3. Execution resumes using the extracted kernels (~10-20 s vs 3-5 min compilation)
54+
### Binary vs AOT Formats

Both binary and AOT formats bundle Triton kernels in the artifacts, but differ
in serialization:

**Binary Format** (default):
- Uses PyTorch `standalone_compile().save(format="binary")`
- Environment: `VLLM_USE_MEGA_AOT_ARTIFACT=false` (default)
- Good for same-PyTorch-version deployments
- Typical size: ~95 MB for small models

**AOT Format** (advanced):
- Uses PyTorch `AOTCompiledArtifact.serialize()`
- Environment: `VLLM_USE_MEGA_AOT_ARTIFACT=true`
- More portable across PyTorch versions (requires 2.10+)
- Includes bundled AOT autograd cache
- Typical size: ~92 MB for small models

**Important**: From MCV's perspective, both formats are **structurally identical**
and use the same detection and packaging logic.
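
Because the three layouts differ only inside the rank directory, a single routine can classify them. The sketch below is illustrative, not MCV's actual code; the helper name is hypothetical, and whether the AOT flag appears at the top level of `cache_key_factors.json` (rather than nested) is an assumption beyond what the document states:

```python
import json
from pathlib import Path

def detect_cache_format(rank_dir: str) -> str:
    """Classify a rank directory as 'triton', 'aot', or 'binary'.

    Hypothetical helper; real MCV detection logic may differ.
    """
    rank = Path(rank_dir)
    # Legacy layout stores unpacked triton_cache/ and inductor_cache/ dirs.
    if (rank / "triton_cache").is_dir() or (rank / "inductor_cache").is_dir():
        return "triton"
    # Binary/AOT layouts store prefix dirs containing cache_key_factors.json.
    for factors_file in rank.glob("*/cache_key_factors.json"):
        factors = json.loads(factors_file.read_text())
        # Assumes the AOT flag is recorded as a top-level key factor.
        if str(factors.get("VLLM_USE_MEGA_AOT_ARTIFACT", "")).lower() == "true":
            return "aot"
    return "binary"
```

Since binary and AOT only diverge in serialization, the classification matters for reporting, not for how the directory is packaged.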
75+
### The /tmp Cache Directory

During compilation and execution, PyTorch creates temporary files:

```
/tmp/torchinductor_root/
├── triton/0/{hash}/
│   ├── triton_.cubin    # Compiled GPU binary (ELF)
│   ├── triton_.source   # Triton source code
│   ├── triton_.ttir     # Triton IR
│   └── triton_.ptx      # PTX assembly
├── o7/, dp/, .../       # Python kernel cache
└── aotautograd/         # AOT autograd cache
```

**Size**: ~16 MB for small models

**Lifecycle**:
- **First run**: Created during compilation
- **Cache hit**: Extracted from embedded artifacts
- **Cleanup**: Cleared on reboot (tmpfs) or manual deletion
- **Recreation**: Automatic on every vLLM start

**Key Insight**: This directory is **NOT needed for cache portability**.
The Triton kernels are already embedded in the binary artifacts (verified by
finding 42 ELF headers in a 5.3 MB artifact file).

**MCV does NOT capture `/tmp`** - kernels auto-extract at runtime (~2 seconds).
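
The embedded-kernel claim can be spot-checked by scanning an artifact for ELF magic bytes (`\x7fELF`, the header of every compiled `.cubin`). A minimal sketch, not an MCV feature; the function name is hypothetical:

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # magic bytes at the start of every ELF binary

def count_elf_headers(artifact_path: str) -> int:
    """Count occurrences of the ELF magic in a binary artifact.

    A nonzero count suggests GPU binaries are embedded in the artifact.
    """
    data = Path(artifact_path).read_bytes()
    count, pos = 0, data.find(ELF_MAGIC)
    while pos != -1:
        count += 1
        pos = data.find(ELF_MAGIC, pos + 1)
    return count
```

Running this against an `artifact_*` file in a prefix directory should report a count roughly matching the number of compiled Triton kernels.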

## Binary Cache Format

To migrate from vLLM triton cache format to vLLM binary cache format:

4. Package new cache with MCV (automatically detected)
5. Both vLLM cache formats are supported, no breaking changes

## Practical Guide

### Generating a Cache

**Environment Setup**:
```bash
export VLLM_TORCH_COMPILE_MODE=vllm-compile
export VLLM_TORCH_COMPILE_LEVEL=1

# For binary format (default):
export VLLM_COMPILE_CACHE_SAVE_FORMAT=binary
export VLLM_USE_MEGA_AOT_ARTIFACT=false  # or omit (default)

# For AOT format (more portable):
export VLLM_COMPILE_CACHE_SAVE_FORMAT=binary
export VLLM_USE_MEGA_AOT_ARTIFACT=true   # requires PyTorch 2.10+
```

**Run vLLM Warmup**:
```bash
vllm serve my-model --tensor-parallel-size 1

# Make sample requests to trigger compilation:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 100}'
```

**Verify Cache**:
```bash
ls -lh ~/.cache/vllm/torch_compile_cache/
# Should show a 10-char hash directory (e.g., 8d0a361fbc)

# Check cache contents:
find ~/.cache/vllm/torch_compile_cache/ -type f | head
```
557+
### Packaging with MCV

**Create Container Image**:
```bash
mcv -c \
  -d ~/.cache/vllm/torch_compile_cache/{hash} \
  -i quay.io/myorg/my-model-cache:v1
```

**Verify Image Labels**:
```bash
skopeo inspect containers-storage:quay.io/myorg/my-model-cache:v1 \
  | jq '.Labels'

# Expected labels:
# {
#   "cache.vllm.image/cache-size-bytes": "95000000",
#   "cache.vllm.image/entry-count": "1",
#   "cache.vllm.image/format": "binary",
#   "cache.vllm.image/summary": "{\"targets\":[{\"backend\":\"cuda\",...}]}"
# }
```
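
The labels above can also be consumed programmatically, e.g. in a CI check. A sketch that parses saved `skopeo inspect` output; it relies only on the label keys shown above, and the function name is an invention for illustration:

```python
import json

def summarize_cache_image(inspect_json: str) -> str:
    """Render a one-line summary from MCV's OCI labels.

    inspect_json is the raw JSON emitted by `skopeo inspect`.
    """
    labels = json.loads(inspect_json).get("Labels", {})
    size_mb = int(labels["cache.vllm.image/cache-size-bytes"]) / 1e6
    fmt = labels["cache.vllm.image/format"]
    entries = labels["cache.vllm.image/entry-count"]
    return f"format={fmt} entries={entries} size={size_mb:.0f}MB"
```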
580+
### Using a Cached Image

**Extract Cache**:
```bash
mcv -e -i quay.io/myorg/my-model-cache:v1

# MCV extracts to: ~/.cache/vllm/torch_compile_cache/{hash}/
```

**Start vLLM**:
```bash
# vLLM automatically detects and uses the cache
vllm serve my-model --tensor-parallel-size 1

# Look for log message:
# INFO: Directly load the compiled graph(s) from the cache, took X.X s
```
598+
### Cache Compatibility

A cache is compatible if:
1. **GPU architecture** matches (check: `nvidia-smi --query-gpu=compute_cap`)
2. **CUDA/ROCm version** is compatible (check: `nvcc --version` or `rocm-smi`)
3. **PyTorch version** is compatible
4. **Model code** is unchanged (code hash must match)
5. **vLLM configuration** matches (TP size, compile level, etc.)

**Check Compatibility**:
```bash
# View cache metadata:
cat ~/.cache/vllm/torch_compile_cache/*/rank_0_0/*/cache_key_factors.json \
  | jq '{target: .env.VLLM_TARGET_DEVICE, cuda: .env.VLLM_MAIN_CUDA_VERSION}'

# Compare with system:
nvidia-smi
# or
rocm-smi
```
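
The manual comparison above can be partially automated by diffing the recorded key factors against values gathered from the live system. A sketch under the assumption that the caller supplies both dictionaries (e.g. the `env` object from `cache_key_factors.json` and the matching probes from `nvidia-smi`/`torch`); the function name is hypothetical:

```python
def find_mismatches(cache_factors: dict, system_factors: dict) -> list[str]:
    """Return human-readable mismatches between cached and live environments.

    Only keys present on both sides are compared (e.g. VLLM_TARGET_DEVICE,
    VLLM_MAIN_CUDA_VERSION); keys the system side cannot probe are skipped.
    """
    problems = []
    for key, cached in sorted(cache_factors.items()):
        live = system_factors.get(key)
        if live is not None and str(live) != str(cached):
            problems.append(f"{key}: cache={cached} system={live}")
    return problems
```

An empty result means none of the probed factors conflict; it does not prove compatibility, since the code hash and vLLM configuration must still match.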
619+
## Troubleshooting

### Cache Not Being Used

**Symptom**: vLLM recompiles on every start despite having a cache

**Common Causes**:
1. **Hash mismatch** - Configuration or environment changed
2. **Incompatible GPU** - Different architecture (e.g., sm_75 vs sm_80)
3. **PyTorch version** - Binary format is sensitive to the PyTorch version
4. **Model code changed** - Code hash no longer matches

**Debug Steps**:
```bash
# 1. Check if cache exists
ls ~/.cache/vllm/torch_compile_cache/

# 2. Enable debug logging
export VLLM_LOGGING_LEVEL=DEBUG

# 3. Check for hash mismatch in logs
grep "cache" vllm.log | grep -i "hash\|miss"

# 4. Verify GPU compatibility
python -c "import torch; print(torch.cuda.get_device_capability())"
```

### Slow Startup with Cache

**Symptom**: vLLM takes 20+ seconds to start with a cache

**Normal Behavior**: 10-20 seconds for kernel extraction from artifacts is expected

**If Slower**:
- Check disk I/O performance: `iostat -x 1`
- Verify `/tmp` is not on slow storage (NFS, etc.)
- Consider using `tmpfs` for `/tmp`: `df -h /tmp`

### Missing Kernels Error

**Symptom**: Runtime errors about missing Triton kernels

**Causes**:
1. Corrupted artifacts
2. Incomplete cache (warmup didn't cover all batch sizes)
3. Disk space issues during generation

**Solutions**:
```bash
# 1. Delete and regenerate the cache
rm -rf ~/.cache/vllm/torch_compile_cache/*

# 2. Verify disk space
df -h ~/.cache/vllm/

# 3. Check artifact integrity
file ~/.cache/vllm/torch_compile_cache/*/rank_0_0/*/artifact_*
# Should show: "data" (binary format)
```

### AOT Format Issues

**Symptom**: AOT artifacts fail to load

**Requirements**:
- PyTorch 2.10.0 or later
- `VLLM_USE_MEGA_AOT_ARTIFACT=true`
- `VLLM_USE_STANDALONE_COMPILE=true`

**Verify**:
```bash
# Check PyTorch version
python -c "import torch; print(torch.__version__)"

# Verify the AOT flag in the cache
grep "VLLM_USE_MEGA_AOT_ARTIFACT" \
  ~/.cache/vllm/torch_compile_cache/*/rank_0_0/*/cache_key_factors.json
```
698+
## Advanced Topics

### Multi-GPU Caching

For tensor parallelism or pipeline parallelism:

```
torch_compile_cache/{hash}/
├── rank_0_0/   # First tensor parallel rank
├── rank_0_1/   # Second tensor parallel rank
├── rank_1_0/   # First pipeline parallel rank
└── rank_1_1/   # Second pipeline + tensor parallel rank
```

MCV captures all rank directories. Extract the entire hash directory for
multi-GPU deployments.
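
Before starting a multi-GPU deployment against an extracted cache, it is worth confirming that every expected rank directory is present. A convenience sketch assuming the `rank_{rank}_{dp_rank}` naming shown above; the function name and the rank/dp_rank enumeration are illustrative:

```python
from pathlib import Path

def missing_rank_dirs(hash_dir: str, ranks: int, dp_ranks: int) -> list[str]:
    """List rank_{r}_{d} directories absent from an extracted hash directory."""
    base = Path(hash_dir)
    return [
        f"rank_{r}_{d}"
        for r in range(ranks)
        for d in range(dp_ranks)
        if not (base / f"rank_{r}_{d}").is_dir()
    ]
```

A nonempty result means the cache was generated with a different parallelism layout (or the extraction was incomplete), and vLLM will recompile for the missing ranks.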
715+
### Multiple Model Components

Models with speculative decoding have multiple components:

```
rank_0_0/
├── backbone/      # Main model
│   └── artifact_*
└── eagle_head/    # Draft model for speculation
    └── artifact_*
```

MCV captures all prefix directories automatically.

### Cache Size Optimization

**Typical Sizes**:
- Small models (<1B params): 50-100 MB
- Medium models (1-10B params): 100-500 MB
- Large models (10B+ params): 500 MB - 2 GB

**Factors Affecting Size**:
- Number of compiled ranges (batch sizes)
- Number of layers
- Triton kernel count
- Autotune configurations

**Reduce Size**:
- Use fewer compile ranges: `VLLM_COMPILE_RANGES=[128,512]` vs the default
- Binary format is smaller than unpacked
- AOT format is similar in size to binary

## See Also

- [spec-compat.md](./spec-compat.md) - OCI image specification