Bdellabe/shared tensor caching #2367
brian-dellabetta merged 2 commits into kylesayrs/shared_tensor_caching
Conversation
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @brian-dellabetta, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces significant memory optimizations for the intermediate values cache by implementing a shared tensor caching mechanism. It ensures that tensors referenced multiple times are offloaded only once, reducing the overall memory footprint. Additionally, the cache size calculation has been refined to accurately reflect memory usage, and a debugging print statement was added to monitor activation memory during pipeline execution.
Activity
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
The pull request introduces shared tensor caching to optimize memory usage during offloading. While the intent is good, there are several critical issues in the implementation, including a guaranteed crash in the size() method due to unhashable tensors being added to a set, and a potential NoneType error when retrieving values from the weak reference cache. Additionally, the use of a class-level dictionary for caching may lead to memory leaks of keys and unexpected behavior across different cache instances. I have also noted a debug print statement that should be removed.
I am having trouble creating individual review comments, so my feedback is listed below.
src/llmcompressor/pipelines/cache.py (172-174)
Tracking raw tensors in a set is fragile: torch.Tensor overrides __eq__ to compare elementwise, so the membership test value not in memo can raise ("Boolean value of Tensor ... is ambiguous") or perform value-based comparison rather than the identity tracking intended here. To correctly track unique tensor objects for memory calculation, use id(value).
if id(value) not in memo:
sizes[value.device] += value.nbytes
memo.add(id(value))
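A minimal, framework-free sketch of the suggested identity-based deduplication (plain objects stand in for tensors; the `nbytes` attribute and `FakeTensor` class are illustrative, not the torch API):

```python
def total_size(values):
    """Sum nbytes once per unique object, tracking identity via id()."""
    total = 0
    memo = set()
    for value in values:
        if id(value) not in memo:  # identity comparison, not value comparison
            total += value.nbytes
            memo.add(id(value))
    return total

class FakeTensor:
    """Hypothetical stand-in for torch.Tensor with only an nbytes attribute."""
    def __init__(self, nbytes):
        self.nbytes = nbytes

shared = FakeTensor(16)
# the same object referenced twice is counted only once: 16 + 8 = 24
print(total_size([shared, shared, FakeTensor(8)]))  # 24
```

Two equal-valued but distinct tensors still count separately under this scheme, which matches the goal of measuring actual allocated memory.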
src/llmcompressor/pipelines/cache.py (258-263)
The weak reference retrieval cls.offload_values[value_hash]() can return None if the tensor has been garbage collected. This will cause the code to return an IntermediateValue with value=None, leading to crashes during onloading. You must check if the retrieved value is None and re-offload if necessary.
offloaded = cls.offload_values.get(value_hash, lambda: None)()
if offloaded is None:
# move to offload if no hit or reference expired
offloaded = value.to(device=offload_device)
cls.offload_values[value_hash] = ref(offloaded)
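The failure mode is easy to reproduce with the standard `weakref` module alone (no torch needed; `Blob` is a hypothetical stand-in for an offloaded tensor):

```python
import weakref

class Blob:
    """Hypothetical stand-in for an offloaded tensor."""

cache = {}
obj = Blob()
cache["h"] = weakref.ref(obj)
assert cache["h"]() is obj      # strong reference still alive: cache hit

del obj                         # CPython's refcounting collects it here
assert cache["h"]() is None     # the dead ref now silently returns None

# the suggested pattern: treat an expired ref as a miss and rebuild
offloaded = cache.get("h", lambda: None)()
if offloaded is None:
    offloaded = Blob()          # re-offload in the real code
    cache["h"] = weakref.ref(offloaded)
assert cache["h"]() is offloaded
```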
src/llmcompressor/pipelines/cache.py (51)
Defining offload_values as a class attribute makes the cache global across all instances of IntermediatesCache. This can lead to unexpected side effects and memory leaks if multiple models or independent runs are processed in the same process. It is generally safer to make this an instance attribute initialized in __init__. Furthermore, using a standard dict to store ref objects will leak the ref objects (and their keys) even after the tensors are collected. Consider using weakref.WeakValueDictionary instead.
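For contrast, a short sketch (again with a plain stand-in class) of how `weakref.WeakValueDictionary` evicts entries automatically, while a plain dict of refs keeps dead entries and their keys alive:

```python
import weakref

class Blob:
    """Hypothetical stand-in for an offloaded tensor."""

plain = {}
weak = weakref.WeakValueDictionary()

obj = Blob()
plain["h"] = weakref.ref(obj)
weak["h"] = obj
del obj  # CPython collects the object immediately via refcounting

assert "h" in plain and plain["h"]() is None  # dead ref and key linger
assert "h" not in weak                        # entry evicted automatically
```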
src/llmcompressor/pipelines/cache.py (257)
torch.hash_tensor is not a standard PyTorch API and will likely raise an AttributeError. If the intention is to identify tensors with the same content, a custom hashing function is needed, but note that hashing large tensors is computationally expensive. If the goal is to identify shared storage, consider using value.untyped_storage().data_ptr().
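If shared storage is the goal, a sketch of keying on the storage pointer (requires torch; `storage_key` is an illustrative helper, not part of this PR), so views of the same buffer map to a single cache entry:

```python
import torch

def storage_key(t: torch.Tensor) -> int:
    """Key tensors by the base address of their underlying storage."""
    return t.untyped_storage().data_ptr()

base = torch.zeros(4, 4)
view = base[1:3]       # a slice is a view sharing base's storage
clone = base.clone()   # a clone allocates fresh storage

assert storage_key(view) == storage_key(base)
assert storage_key(clone) != storage_key(base)
```

Note this detects aliasing, not equal content: two independently allocated tensors with identical values get different keys.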
src/llmcompressor/pipelines/sequential/pipeline.py (139)
This print statement appears to be left over from debugging. Please remove it or replace it with a proper logging call.
The quality checks have failed. Please run …
…ensor_caching Signed-off-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Merged commit f2e6efc into kylesayrs/shared_tensor_caching.