[fix](kt-kernel): Drop Python-loaded weights after C++ weight loading in LlamafileMoEWrapper #1695
base: main
Conversation
… operation in C++
Summary of Changes

Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a memory optimization within the `LlamafileMoEWrapper`.
Code Review
This pull request correctly addresses a memory management issue in LlamafileMoEWrapper by releasing references to large weight tensors after they have been loaded into the C++ backend. This prevents unnecessary memory retention. My review includes a suggestion to further simplify the code by removing the member variable used for lifetime management, which would make the implementation cleaner and more idiomatic.
# Drop original weights after loading
self.weights_to_keep = None
While setting `self.weights_to_keep = None` correctly frees up the memory, a cleaner approach would be to remove the `self.weights_to_keep` member variable entirely.

The tensors (`gate_data`, `up_data`, `down_data`) are local to the `load_weights` method and will remain in scope until the method returns. Since `self.cpu_infer.sync()` blocks until the C++ copy is complete, the lifetime of these local variables is sufficient to ensure the memory is valid for the C++ backend.

This makes `self.weights_to_keep` redundant. You could consider the following refactoring to simplify the code:
- Remove `self.weights_to_keep = None` from the `__init__` method (line 135).
- Remove the assignment `self.weights_to_keep = (gate_data, up_data, down_data)` on line 182.
- Remove this newly added code block.
This would make the lifetime management implicit and rely on standard Python garbage collection, resulting in cleaner, more idiomatic code.
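For concreteness, here is a minimal sketch of the suggested refactoring, assuming a simplified constructor and an asynchronous `submit_load` call; both are hypothetical stand-ins, not the actual kt-kernel API:

```python
# Minimal sketch of the suggested refactoring. `submit_load` and the
# constructor signature are hypothetical stand-ins for the real kt-kernel API.
import torch


class LlamafileMoEWrapperSketch:
    def __init__(self, cpu_infer):
        self.cpu_infer = cpu_infer
        # No self.weights_to_keep: tensor lifetime is handled by the locals
        # inside load_weights instead of a member variable.

    def load_weights(self, gate_data: torch.Tensor, up_data: torch.Tensor,
                     down_data: torch.Tensor) -> None:
        # The local references keep the tensor buffers alive for the entire
        # call, so the C++ backend can safely read them.
        self.cpu_infer.submit_load(gate_data, up_data, down_data)  # assumed API
        self.cpu_infer.sync()  # blocks until the C++ copy has completed
        # When the method returns, the locals go out of scope and CPython's
        # reference counting releases the Python-side tensors automatically.
```

The design point is that the blocking `sync()` call is the synchronization barrier: once it returns, the C++ backend has its own copy, so nothing on the Python side needs to outlive the method call.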
What does this PR do?
In the `LlamafileMoEWrapper`'s `load_weights` method, weights loaded by the Python loader are retained in `weights_to_keep` for TP splitting and copying to the C++ backend. However, these weights are no longer needed once the C++ backend completes the loading process. This PR ensures that the Python-loaded weights are explicitly dropped after the C++ backend finishes its operation, freeing memory and preventing unnecessary retention of large weight tensors.
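As a rough illustration of the resulting flow (not the actual implementation), the following sketch uses `submit_load` as a hypothetical placeholder for the real asynchronous C++ weight-loading call:

```python
# Rough sketch of the fixed load_weights flow. `wrapper` stands for a
# LlamafileMoEWrapper-like object; `submit_load` is a hypothetical placeholder
# for the real asynchronous C++ weight-loading call.
def load_weights_sketch(wrapper, gate_data, up_data, down_data):
    # Keep references so the (TP-split) tensors outlive the asynchronous C++ copy.
    wrapper.weights_to_keep = (gate_data, up_data, down_data)
    wrapper.cpu_infer.submit_load(*wrapper.weights_to_keep)  # assumed API
    wrapper.cpu_infer.sync()  # wait for the C++ backend to finish loading
    # Drop original weights after loading: the C++ backend now owns its copy,
    # so the Python-side tensors can be garbage collected.
    wrapper.weights_to_keep = None
```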
Fixes # (issue)
- Reset `self.weights_to_keep` at the end of `LlamafileMoEWrapper.load_weights`.

Before submitting