
Bdellabe/shared tensor caching #2367

Merged
brian-dellabetta merged 2 commits into kylesayrs/shared_tensor_caching from
bdellabe/shared_tensor_caching
Feb 14, 2026

Conversation

@brian-dellabetta
Collaborator

SUMMARY:
"please provide a brief summary"

TEST PLAN:
"please outline how the changes were tested"

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @brian-dellabetta, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant memory optimizations for the intermediate values cache by implementing a shared tensor caching mechanism. It ensures that tensors referenced multiple times are offloaded only once, reducing overall memory footprint. Additionally, the cache size calculation has been refined to accurately reflect memory usage, and a debugging print statement was added to monitor activation memory during pipeline execution.

Highlights

  • Shared Tensor Caching: Implemented a mechanism within IntermediatesCache to cache offloaded shared tensors, preventing redundant memory allocations when multiple references point to the same tensor. This uses torch.hash_tensor as a key for weak references to offloaded tensors.
  • Accurate Cache Size Calculation: Modified the size method in IntermediatesCache to correctly account for shared tensors by using a memoization set, ensuring that the memory footprint of shared tensors is not overcounted.
  • Memory Monitoring: Added a debug print statement in the sequential pipeline to monitor the CPU activation memory usage, aiding in performance analysis and debugging.


Changelog
  • src/llmcompressor/pipelines/cache.py
    • Imported WeakKeyDictionary, ReferenceType, and ref for weak reference handling.
    • Added offload_values dictionary to IntermediatesCache to store weak references to offloaded tensors, keyed by tensor hash.
    • Modified the size method to use a memo set, preventing shared tensors from being counted multiple times when calculating cache size.
    • Updated the _offload_value method to check the offload_values cache before offloading a tensor, reusing an existing offloaded tensor if a hash match is found.
  • src/llmcompressor/pipelines/sequential/pipeline.py
    • Added a print statement to output the CPU activation size in gigabytes during pipeline execution, likely for debugging or monitoring purposes.
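The changelog above describes a weak-reference dedup pattern: offloaded tensors are cached under a key, and a repeated offload of the same tensor reuses the cached copy instead of allocating again. Below is a minimal, torch-free sketch of that pattern; `Blob`, `offload`, and `_offload_cache` are illustrative names, not the PR's API, and a hypothetical `Blob` object stands in for an offloaded tensor:

```python
import weakref

class Blob:
    """Stand-in for an offloaded tensor (hypothetical; the real cache stores
    torch.Tensor objects keyed by a tensor hash)."""
    def __init__(self, key, nbytes):
        self.key = key
        self.nbytes = nbytes

# Weak-value mapping: an entry disappears automatically once the offloaded
# value is no longer strongly referenced anywhere else.
_offload_cache = weakref.WeakValueDictionary()

def offload(key, nbytes):
    """Return the cached offloaded value for `key`, creating it on a miss
    or when the previous weak reference has expired."""
    cached = _offload_cache.get(key)
    if cached is None:
        cached = Blob(key, nbytes)
        _offload_cache[key] = cached
    return cached

a = offload("t1", 4096)
b = offload("t1", 4096)  # a second reference reuses the same offloaded object
assert a is b
```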
Activity
  • No human activity has been recorded on this pull request yet.

@brian-dellabetta brian-dellabetta changed the base branch from main to kylesayrs/shared_tensor_caching February 14, 2026 01:03
@mergify
Contributor

mergify bot commented Feb 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @brian-dellabetta.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 14, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces shared tensor caching to optimize memory usage during offloading. While the intent is good, there are several critical issues in the implementation, including a guaranteed crash in the size() method due to unhashable tensors being added to a set, and a potential NoneType error when retrieving values from the weak reference cache. Additionally, the use of a class-level dictionary for caching may lead to memory leaks of keys and unexpected behavior across different cache instances. I have also noted a debug print statement that should be removed.

I was unable to create individual review comments; the feedback is listed below.

src/llmcompressor/pipelines/cache.py (172-174)

critical

This code will raise a TypeError: unhashable type: 'Tensor' because torch.Tensor objects are not hashable and cannot be added to a set. Additionally, even if it were hashable, value not in memo would perform a value-based comparison which is ambiguous for tensors. To correctly track unique tensor objects for memory calculation, use id(value).

                    if id(value) not in memo:
                        sizes[value.device] += value.nbytes
                        memo.add(id(value))
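The suggested fix can be exercised outside of PyTorch. Here plain Python lists stand in for tensors, since they are likewise unusable as set members, and `id()` tracks object identity so a shared buffer is counted only once (the names below are illustrative, not the PR's):

```python
from collections import defaultdict

buffers = [[0] * 1024, [0] * 2048]
values = [buffers[0], buffers[1], buffers[0]]  # first buffer referenced twice

sizes = defaultdict(int)
memo = set()
for value in values:
    if id(value) not in memo:       # lists are unhashable, so track id(),
        sizes["cpu"] += len(value)  # not the object itself
        memo.add(id(value))

assert sizes["cpu"] == 1024 + 2048  # the shared buffer is counted only once
```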

src/llmcompressor/pipelines/cache.py (258-263)

critical

The weak reference retrieval cls.offload_values[value_hash]() can return None if the tensor has been garbage collected. This will cause the code to return an IntermediateValue with value=None, leading to crashes during onloading. You must check if the retrieved value is None and re-offload if necessary.

                offloaded = cls.offload_values.get(value_hash, lambda: None)()
                if offloaded is None:
                    # move to offload if no hit or reference expired
                    offloaded = value.to(device=offload_device)
                    cls.offload_values[value_hash] = ref(offloaded)

src/llmcompressor/pipelines/cache.py (51)

high

Defining offload_values as a class attribute makes the cache global across all instances of IntermediatesCache. This can lead to unexpected side effects and memory leaks if multiple models or independent runs are processed in the same process. It is generally safer to make this an instance attribute initialized in __init__. Furthermore, using a standard dict to store ref objects will leak the ref objects (and their keys) even after the tensors are collected. Consider using weakref.WeakValueDictionary instead.
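The leak the comment describes is easy to demonstrate: a plain dict keeps its `ref` objects (and keys) forever, while a `weakref.WeakValueDictionary` drops an entry as soon as its value is collected. A small sketch, not the PR's code:

```python
import gc
import weakref

class Payload:
    """Stand-in for an offloaded tensor."""

plain = {}                            # keeps dead ref objects and their keys
auto = weakref.WeakValueDictionary()  # entries vanish with their values

obj = Payload()
plain["k"] = weakref.ref(obj)
auto["k"] = obj

del obj
gc.collect()

assert plain["k"]() is None  # the ref is dead, but the entry lingers
assert "k" not in auto       # the WeakValueDictionary cleaned itself up
```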

src/llmcompressor/pipelines/cache.py (257)

high

torch.hash_tensor is not a standard PyTorch API and will likely raise an AttributeError. If the intention is to identify tensors with the same content, a custom hashing function is needed, but note that hashing large tensors is computationally expensive. If the goal is to identify shared storage, consider using value.untyped_storage().data_ptr().
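The shared-storage idea can be illustrated without PyTorch: two `memoryview`s over one `bytearray` play the role of tensor views over one untyped storage, and the owning buffer's identity stands in for `tensor.untyped_storage().data_ptr()`. A sketch under that analogy:

```python
# Two memoryviews over one bytearray share storage, analogous to two tensor
# views sharing one untyped storage in PyTorch.
buf = bytearray(16)
v1 = memoryview(buf)
v2 = memoryview(buf)[4:12]         # different view, same underlying buffer
other = memoryview(bytearray(16))  # same content, different storage

def storage_key(view):
    """Identity of the owning buffer; stands in for
    tensor.untyped_storage().data_ptr()."""
    return id(view.obj)

assert storage_key(v1) == storage_key(v2)
assert storage_key(v1) != storage_key(other)
```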

src/llmcompressor/pipelines/sequential/pipeline.py (139)

medium

This print statement appears to be left over from debugging. Please remove it or replace it with a proper logging call.

@mergify
Contributor

mergify bot commented Feb 14, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@brian-dellabetta brian-dellabetta marked this pull request as ready for review February 14, 2026 01:05
…ensor_caching

Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
@brian-dellabetta brian-dellabetta merged commit f2e6efc into kylesayrs/shared_tensor_caching Feb 14, 2026
2 of 3 checks passed
@brian-dellabetta brian-dellabetta deleted the bdellabe/shared_tensor_caching branch February 14, 2026 01:08