
Address teardown issue #638

Merged
Binyang2014 merged 3 commits into main from binyli/fix
Sep 25, 2025
Conversation

@Binyang2014

Ignore CUDA runtime (cuda*) and driver (cu*) API errors during teardown. Some pointers may already be invalid at this point.
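For context on why cleanup code swallows these errors: a destructor that runs during teardown may hold a handle whose owning context is already gone, and throwing from a destructor would terminate the process. The following is a minimal sketch of that pattern, not the code from this PR. `FakeStatus`, `fakeFree`, `ScopedBuffer`, and the two globals are hypothetical stand-ins so the example compiles without CUDA.

```cpp
#include <cassert>

// Stand-in for a CUDA-style status code (hypothetical, not cudaError_t).
enum class FakeStatus { Success, InvalidValue };

// Simulated runtime state: once torn down, every free call fails.
static bool g_runtimeAlive = true;
// Counts failures that the cleanup path chose to ignore.
static int g_ignoredErrors = 0;

// Hypothetical free function: reports failure through a status code.
FakeStatus fakeFree(int /*handle*/) {
  return g_runtimeAlive ? FakeStatus::Success : FakeStatus::InvalidValue;
}

// RAII wrapper whose destructor never throws, even if the free call fails.
class ScopedBuffer {
 public:
  explicit ScopedBuffer(int handle) : handle_(handle) { assert(handle >= 0); }
  ScopedBuffer(const ScopedBuffer&) = delete;
  ScopedBuffer& operator=(const ScopedBuffer&) = delete;

  ~ScopedBuffer() noexcept {
    // During teardown the handle may already be invalid; record and move on.
    if (fakeFree(handle_) != FakeStatus::Success) ++g_ignoredErrors;
  }

 private:
  int handle_;
};
```

Destroying a `ScopedBuffer` after `g_runtimeAlive` is set to `false` increments `g_ignoredErrors` instead of raising, which is the behavior the description asks for on teardown paths.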


Copilot AI left a comment

Pull Request Overview

This PR addresses teardown issues by modifying error handling during cleanup operations. The changes improve robustness by ignoring certain CUDA/CU errors that may occur when pointers become invalid during teardown.

  • Adds new CUDA error types to the teardown error detection functions
  • Introduces a new macro to completely ignore CUDA driver API errors during cleanup
  • Updates memory test to properly manage process group lifecycle

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
test/torch/memory_report.py | Improves process group management by using distinct variable names and explicitly destroying groups
src/gpu_utils.cc | Enhances error handling during teardown by expanding recognized teardown errors and adding ignore-only macro



@caiomcbr caiomcbr left a comment

LGTM


@caiomcbr caiomcbr left a comment

LGTM

@Binyang2014 Binyang2014 merged commit 5ac4276 into main Sep 25, 2025
14 checks passed
@Binyang2014 Binyang2014 deleted the binyli/fix branch September 25, 2025 19:12
3 participants