3 changes: 3 additions & 0 deletions kt-kernel/python/utils/llamafile.py

```diff
@@ -217,3 +217,6 @@ def load_weights(self, physical_to_logical_map_cpu: Optional[torch.Tensor] = None
         # Load weights
         self.cpu_infer.submit(self.moe.load_weights_task(physical_to_logical_map_cpu.data_ptr()))
         self.cpu_infer.sync()
+
+        # Drop original weights after loading
+        self.weights_to_keep = None
```
Comment on lines +221 to +222 (Contributor, severity: medium):

While setting `self.weights_to_keep = None` correctly frees the memory, a cleaner approach would be to remove the `self.weights_to_keep` member variable entirely.

The tensors (`gate_data`, `up_data`, `down_data`) are local to the `load_weights` method and will remain in scope until the method returns. Since `self.cpu_infer.sync()` blocks until the C++ copy is complete, the lifetime of these local variables is sufficient to ensure the memory remains valid for the C++ backend.

This makes `self.weights_to_keep` redundant. You could consider refactoring to simplify the code:

  1. Remove `self.weights_to_keep = None` from the `__init__` method (line 135).
  2. Remove the assignment `self.weights_to_keep = (gate_data, up_data, down_data)` on line 182.
  3. Remove this newly added code block.

This would make the lifetime management implicit, relying on standard Python garbage collection and resulting in cleaner, more idiomatic code; a sketch of the refactored method follows.
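A minimal sketch of what the refactored `load_weights` could look like, shown in the context of the existing class from the diff. The helper `_prepare_weight_tensors` is a hypothetical placeholder for the existing code that builds `gate_data`, `up_data`, and `down_data`; only the `submit`/`sync` calls are taken from the diff above.

```python
from typing import Optional

import torch


class LlamafileExperts:  # hypothetical stand-in for the class in llamafile.py
    def load_weights(self, physical_to_logical_map_cpu: Optional[torch.Tensor] = None) -> None:
        # Build the weight tensors as locals. _prepare_weight_tensors is a
        # hypothetical placeholder for the existing preparation code.
        gate_data, up_data, down_data = self._prepare_weight_tensors()

        # No self.weights_to_keep assignment: the locals live until this
        # method returns, and sync() below blocks until the C++ side has
        # finished copying, so their lifetime already covers the copy.

        # Load weights
        self.cpu_infer.submit(self.moe.load_weights_task(physical_to_logical_map_cpu.data_ptr()))
        self.cpu_infer.sync()

        # On return, gate_data, up_data, and down_data go out of scope and
        # are reclaimed by normal Python garbage collection; no explicit
        # `= None` is needed.
```

The one behavioral assumption here is that nothing on the C++ side retains pointers into the weight tensors past the `sync()` call, which is exactly what the comment above argues is already guaranteed.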