-
Notifications
You must be signed in to change notification settings - Fork 75
[TEST_DEBUG] Enable test device assert #5395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
There is no updates from pytorch/pytorch#142135 for a while, do you know what's the ETA for that? |
To my knowledge, sometime next year. I'll send you an internal ticket for that. |
python/test/unit/test_debug.py
Outdated
|
|
||
| if should_fail: | ||
| abort_or_runtime_error = ( | ||
| result.returncode == 1 or # RuntimeError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what cases do we get a RuntimeError? At first glance, I would expect all errors of SIGABRT type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is needed for other devices, then you can leave the old implementation for devices that are not XPU: if not is_xpu() and add a separate branch for us with what you wrote, perhaps this can simplify the code. This is optional because I don't know if it will result in less code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe easier to resolve merge conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving the old implementation would result in many redundant parts.
I agree however that RuntimeError should not be in the assert condition for xpu as only SIGABRTs are thrown for us.
Ill create a seperate branch for that - that way when the pytorch bug is fixed this test will start failing and we can then go back to the common implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the new commit I used if device == 'xpu' as is_xpu() fails with "Cannot re-initialize XPU in forked subprocess." due to @pytest.mark.forked
python/test/unit/test_debug.py
Outdated
| mask_str = "None" if mask is None else str(mask) | ||
| opt_flag_str = "None" if opt_flag is None else str(opt_flag) | ||
|
|
||
| result = subprocess.run([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can set environment variables directly when calling subprocess.run, for example for TRITON_DEBUG. This way you can pass fewer parameters into test_debug_kernels.py.
anmyachev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I only have minor comments.
Comes from #2755.
This PR is a temporary fix leveraging subprocess for catching the abort signal, enabling test_device_assert until pytorch/pytorch#142135 is resolved.