Mcv binary cache #166

Merged

maryamtahhan merged 14 commits into redhat-et:main from maryamtahhan:mcv-binary-cache
Feb 25, 2026
Conversation

@maryamtahhan
Collaborator

maryamtahhan commented Feb 9, 2026

Enable vllm binary cache support for MCV

fixes: #147

@maryamtahhan maryamtahhan force-pushed the mcv-binary-cache branch 3 times, most recently from c602a2a to 89b9145 Compare February 9, 2026 13:16
@maryamtahhan maryamtahhan marked this pull request as ready for review February 9, 2026 13:17
@maryamtahhan maryamtahhan requested a review from Billy99 February 10, 2026 10:43
@maryamtahhan
Collaborator Author

TODO - add torch_inductor dir

@maryamtahhan maryamtahhan removed the request for review from Billy99 February 10, 2026 16:07
@maryamtahhan maryamtahhan marked this pull request as draft February 10, 2026 16:07
@maryamtahhan maryamtahhan marked this pull request as ready for review February 23, 2026 11:13
@maryamtahhan
Collaborator Author

Without pre-cache:

(EngineCore_DP0 pid=22) INFO 02-23 01:35:37 [backends.py:812] Using cache directory: /root/.cache/vllm/torch_compile_cache/8d0a361fbc/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=22) INFO 02-23 01:35:37 [backends.py:872] Dynamo bytecode transform time: 28.30 s
(EngineCore_DP0 pid=22) [rank0]:W0223 01:35:45.613000 22 torch/_inductor/utils.py:1613] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=22) INFO 02-23 01:35:55 [backends.py:302] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=22) INFO 02-23 01:36:01 [backends.py:319] Compiling a graph for compile range (1, 2048) takes 18.20 s
(EngineCore_DP0 pid=22) INFO 02-23 01:36:01 [monitor.py:34] torch.compile takes 46.50 s in total

With pre-cache:

(EngineCore_DP0 pid=22) INFO 02-23 03:12:47 [backends.py:812] Using cache directory: /root/.cache/vllm/torch_compile_cache/8d0a361fbc/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=22) INFO 02-23 03:12:47 [backends.py:872] Dynamo bytecode transform time: 7.85 s
(EngineCore_DP0 pid=22) INFO 02-23 03:12:54 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.273 s
(EngineCore_DP0 pid=22) INFO 02-23 03:12:54 [monitor.py:34] torch.compile takes 9.12 s in total
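For reference, the two torch.compile totals reported above work out to roughly a 5x reduction:

```python
# Quick arithmetic on the torch.compile totals from the logs above.
no_cache = 46.50   # seconds, without pre-cache
pre_cache = 9.12   # seconds, with pre-cache
print(f"speedup: {no_cache / pre_cache:.1f}x")  # speedup: 5.1x
```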

Collaborator

@Billy99 Billy99 left a comment


Looks good, only minor comments.
One part I wasn't sure about is the GPU detection. I feel like there is a limited (small) number of GPUs that we detect, or am I missing something?

@maryamtahhan
Collaborator Author

> Looks good, only minor comments.
> One part I wasn't sure about is the GPU detection. I feel like there is a limited (small) number of GPUs that we detect, or am I missing something?

ATM it's using CUDA or ROCm, so it should detect all NVIDIA or AMD GPUs?

This is something I plan on changing and doing through kube moving forward.
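As a rough sketch of what CUDA/ROCm-based platform detection could look like (hypothetical code, not the actual MCV implementation; it probes for the vendor SMI tools on PATH):

```python
# Hypothetical sketch of CUDA/ROCm platform detection, not the actual MCV code.
import shutil
from typing import Optional

def detect_gpu_platform() -> Optional[str]:
    """Return 'cuda' for NVIDIA, 'rocm' for AMD, or None if neither stack is found."""
    if shutil.which("nvidia-smi"):   # NVIDIA driver tooling present
        return "cuda"
    if shutil.which("rocm-smi"):     # AMD ROCm tooling present
        return "rocm"
    return None
```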

maryamtahhan and others added 12 commits February 25, 2026 14:57
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Bumps [github.com/sigstore/fulcio](https://github.com/sigstore/fulcio) from 1.8.3 to 1.8.5.
- [Release notes](https://github.com/sigstore/fulcio/releases)
- [Changelog](https://github.com/sigstore/fulcio/blob/main/CHANGELOG.md)
- [Commits](sigstore/fulcio@v1.8.3...v1.8.5)

---
updated-dependencies:
- dependency-name: github.com/sigstore/fulcio
  dependency-version: 1.8.5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
_has_artifact_compile_range_with_triton() only checked for a triton/
subdirectory, which cannot exist when the artifact is a packed binary
file. Recognize binary artifact_compile_range_* files as valid vLLM
cache indicators so detect_cache_mode() returns 'vllm' instead of
falling through to 'triton'.

Also:
- sync requirements.txt with pyproject.toml (typer[all], structlog)
- silence pylint R0903 on Pydantic data models
- disable pylint import-error for declared but not-installed deps
Signed-off-by: Alessandro Sangiorgi <asangior@redhat.com>
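A minimal sketch of the detection logic the commit message describes (hypothetical code: `detect_cache_mode` and the artifact/triton layout follow the description above, everything else is assumed):

```python
# Hypothetical sketch of the fix described above, not the actual MCV code.
from pathlib import Path

def detect_cache_mode(cache_dir: str) -> str:
    """Classify a torch.compile cache as 'vllm', 'triton', or 'unknown'."""
    root = Path(cache_dir)
    # Packed binary artifacts (artifact_compile_range_*) mark a vLLM cache
    # even when no triton/ subdirectory exists.
    if any(root.rglob("artifact_compile_range_*")):
        return "vllm"
    # Fall back to the older heuristic: a triton/ subdirectory.
    if any(p.is_dir() for p in root.rglob("triton")):
        return "triton"
    return "unknown"
```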
@maryamtahhan maryamtahhan merged commit e59d588 into redhat-et:main Feb 25, 2026
7 checks passed

Successfully merging this pull request may close these issues:

MCV: Add support for vllm binary cache (labelling)