-
-
Notifications
You must be signed in to change notification settings - Fork 11.9k
Fix LLMEngine.del dp_group cleanup condition #29954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
`LLMEngine.__del__` previously used the walrus operator together with `and`, which caused `dp_group` to be assigned the result of the boolean expression instead of the actual DP group object due to operator precedence. As a result, `stateless_destroy_torch_distributed_process_group` could receive `True`/`False` instead of a process group instance. Signed-off-by: Yongtao Huang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly fixes a bug in the LLMEngine.__del__ method where incorrect operator precedence with the walrus operator could lead to a crash during resource cleanup. The fix is sound. However, I have also added a comment regarding the use of __del__ for resource management, as it is not a reliable mechanism and can lead to resource leaks. I've suggested a more robust design pattern for consideration.
| def __del__(self): | ||
| if ( | ||
| dp_group := getattr(self, "dp_group", None) | ||
| and not self.external_launcher_dp | ||
| ): | ||
| dp_group = getattr(self, "dp_group", None) | ||
| if dp_group is not None and not self.external_launcher_dp: | ||
| stateless_destroy_torch_distributed_process_group(dp_group) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this change correctly fixes the immediate bug, relying on __del__ for critical resource cleanup like destroying a process group is unreliable. The __del__ method is not guaranteed to be called in all circumstances, for example, if the object is part of a reference cycle. This can lead to resource leaks, which can be particularly problematic in a distributed environment. A more robust approach would be to implement an explicit close() or shutdown() method that handles this cleanup. Ideally, LLMEngine could be made a context manager (by implementing __enter__ and __exit__) to ensure deterministic cleanup using a with statement.
markmc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks correct to me
For reference, this was introduced by #24899
Signed-off-by: Yongtao Huang <[email protected]>
Signed-off-by: Yongtao Huang <[email protected]> Signed-off-by: Xingyu Liu <[email protected]>
Signed-off-by: Yongtao Huang <[email protected]>
Purpose
LLMEngine.__del__previously used the walrus operator together withand, which causeddp_groupto be assigned the result of the boolean expression instead of the actual DP group object due to operator precedence. As a result,stateless_destroy_torch_distributed_process_groupcould receiveTrue/Falseinstead of a process group instance.Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.