
Bdellabe/shared tensor caching #2367

Merged
brian-dellabetta merged 2 commits into kylesayrs/shared_tensor_caching from
bdellabe/shared_tensor_caching
Feb 14, 2026

Conversation

@brian-dellabetta
Collaborator

SUMMARY:
"please provide a brief summary"

TEST PLAN:
"please outline how the changes were tested"

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @brian-dellabetta, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant memory optimizations for the intermediate values cache by implementing a shared tensor caching mechanism. It ensures that tensors referenced multiple times are offloaded only once, reducing overall memory footprint. Additionally, the cache size calculation has been refined to accurately reflect memory usage, and a debugging print statement was added to monitor activation memory during pipeline execution.

Highlights

  • Shared Tensor Caching: Implemented a mechanism within IntermediatesCache to cache offloaded shared tensors, preventing redundant memory allocations when multiple references point to the same tensor. This uses torch.hash_tensor as a key for weak references to offloaded tensors.
  • Accurate Cache Size Calculation: Modified the size method in IntermediatesCache to correctly account for shared tensors by using a memoization set, ensuring that the memory footprint of shared tensors is not overcounted.
  • Memory Monitoring: Added a debug print statement in the sequential pipeline to monitor the CPU activation memory usage, aiding in performance analysis and debugging.


Changelog
  • src/llmcompressor/pipelines/cache.py
    • Imported WeakKeyDictionary, ReferenceType, and ref for weak reference handling.
    • Added offload_values dictionary to IntermediatesCache to store weak references to offloaded tensors, keyed by tensor hash.
    • Modified the size method to use a memo set, preventing shared tensors from being counted multiple times when calculating cache size.
    • Updated the _offload_value method to check the offload_values cache before offloading a tensor, reusing an existing offloaded tensor if a hash match is found.
  • src/llmcompressor/pipelines/sequential/pipeline.py
    • Added a print statement to output the CPU activation size in gigabytes during pipeline execution, likely for debugging or monitoring purposes.
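The changelog above describes a weak-reference dedup pattern: offloaded tensors are cached under a key, and a repeated offload of the same tensor reuses the cached copy instead of allocating again. Below is a minimal, torch-free sketch of that pattern; `Blob`, `offload`, and `_offload_cache` are illustrative names, not the PR's API, and a hypothetical `Blob` object stands in for an offloaded tensor:

```python
import weakref

class Blob:
    """Stand-in for an offloaded tensor (hypothetical; the real cache stores
    torch.Tensor objects keyed by a tensor hash)."""
    def __init__(self, key, nbytes):
        self.key = key
        self.nbytes = nbytes

# Weak-value mapping: an entry disappears automatically once the offloaded
# value is no longer strongly referenced anywhere else.
_offload_cache = weakref.WeakValueDictionary()

def offload(key, nbytes):
    """Return the cached offloaded value for `key`, creating it on a miss
    or when the previous weak reference has expired."""
    cached = _offload_cache.get(key)
    if cached is None:
        cached = Blob(key, nbytes)
        _offload_cache[key] = cached
    return cached

a = offload("t1", 4096)
b = offload("t1", 4096)  # a second reference reuses the same offloaded object
assert a is b
```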
Activity
  • No human activity has been recorded on this pull request yet.

@brian-dellabetta brian-dellabetta changed the base branch from main to kylesayrs/shared_tensor_caching February 14, 2026 01:03
@mergify
Contributor

mergify bot commented Feb 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @brian-dellabetta.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 14, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces shared tensor caching to optimize memory usage during offloading. While the intent is good, there are several critical issues in the implementation, including a guaranteed crash in the size() method due to unhashable tensors being added to a set, and a potential NoneType error when retrieving values from the weak reference cache. Additionally, the use of a class-level dictionary for caching may lead to memory leaks of keys and unexpected behavior across different cache instances. I have also noted a debug print statement that should be removed.

I was unable to create individual review comments; the feedback is listed below.

src/llmcompressor/pipelines/cache.py (172-174)

critical

This code will raise a TypeError: unhashable type: 'Tensor' because torch.Tensor objects are not hashable and cannot be added to a set. Additionally, even if it were hashable, value not in memo would perform a value-based comparison which is ambiguous for tensors. To correctly track unique tensor objects for memory calculation, use id(value).

                    if id(value) not in memo:
                        sizes[value.device] += value.nbytes
                        memo.add(id(value))
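The suggested fix can be exercised outside of PyTorch. Here plain Python lists stand in for tensors, since they are likewise unusable as set members, and `id()` tracks object identity so a shared buffer is counted only once (the names below are illustrative, not the PR's):

```python
from collections import defaultdict

buffers = [[0] * 1024, [0] * 2048]
values = [buffers[0], buffers[1], buffers[0]]  # first buffer referenced twice

sizes = defaultdict(int)
memo = set()
for value in values:
    if id(value) not in memo:       # lists are unhashable, so track id(),
        sizes["cpu"] += len(value)  # not the object itself
        memo.add(id(value))

assert sizes["cpu"] == 1024 + 2048  # the shared buffer is counted only once
```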

src/llmcompressor/pipelines/cache.py (258-263)

critical

The weak reference retrieval cls.offload_values[value_hash]() can return None if the tensor has been garbage collected. This will cause the code to return an IntermediateValue with value=None, leading to crashes during onloading. You must check if the retrieved value is None and re-offload if necessary.

                offloaded = cls.offload_values.get(value_hash, lambda: None)()
                if offloaded is None:
                    # move to offload if no hit or reference expired
                    offloaded = value.to(device=offload_device)
                    cls.offload_values[value_hash] = ref(offloaded)

src/llmcompressor/pipelines/cache.py (51)

high

Defining offload_values as a class attribute makes the cache global across all instances of IntermediatesCache. This can lead to unexpected side effects and memory leaks if multiple models or independent runs are processed in the same process. It is generally safer to make this an instance attribute initialized in __init__. Furthermore, using a standard dict to store ref objects will leak the ref objects (and their keys) even after the tensors are collected. Consider using weakref.WeakValueDictionary instead.
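The leak the comment describes is easy to demonstrate: a plain dict keeps its `ref` objects (and keys) forever, while a `weakref.WeakValueDictionary` drops an entry as soon as its value is collected. A small sketch, not the PR's code:

```python
import gc
import weakref

class Payload:
    """Stand-in for an offloaded tensor."""

plain = {}                            # keeps dead ref objects and their keys
auto = weakref.WeakValueDictionary()  # entries vanish with their values

obj = Payload()
plain["k"] = weakref.ref(obj)
auto["k"] = obj

del obj
gc.collect()

assert plain["k"]() is None  # the ref is dead, but the entry lingers
assert "k" not in auto       # the WeakValueDictionary cleaned itself up
```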

src/llmcompressor/pipelines/cache.py (257)

high

torch.hash_tensor is not a standard PyTorch API and will likely raise an AttributeError. If the intention is to identify tensors with the same content, a custom hashing function is needed, but note that hashing large tensors is computationally expensive. If the goal is to identify shared storage, consider using value.untyped_storage().data_ptr().
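The shared-storage idea can be illustrated without PyTorch: two `memoryview`s over one `bytearray` play the role of tensor views over one untyped storage, and the owning buffer's identity stands in for `tensor.untyped_storage().data_ptr()`. A sketch under that analogy:

```python
# Two memoryviews over one bytearray share storage, analogous to two tensor
# views sharing one untyped storage in PyTorch.
buf = bytearray(16)
v1 = memoryview(buf)
v2 = memoryview(buf)[4:12]         # different view, same underlying buffer
other = memoryview(bytearray(16))  # same content, different storage

def storage_key(view):
    """Identity of the owning buffer; stands in for
    tensor.untyped_storage().data_ptr()."""
    return id(view.obj)

assert storage_key(v1) == storage_key(v2)
assert storage_key(v1) != storage_key(other)
```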

src/llmcompressor/pipelines/sequential/pipeline.py (139)

medium

This print statement appears to be left over from debugging. Please remove it or replace it with a proper logging call.

@mergify
Contributor

mergify bot commented Feb 14, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@brian-dellabetta brian-dellabetta marked this pull request as ready for review February 14, 2026 01:05
…ensor_caching

Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
@brian-dellabetta brian-dellabetta merged commit f2e6efc into kylesayrs/shared_tensor_caching Feb 14, 2026
2 of 3 checks passed
@brian-dellabetta brian-dellabetta deleted the bdellabe/shared_tensor_caching branch February 14, 2026 01:08