Commit 9ca7e14

mm: discard async errors from pinning failures (#10738)
Pretty much every error cudaHostRegister can throw also queues the same error on the async GPU queue. This was already fixed for the repinning error case, but the bad-mmap and plain ENOMEM cases are harder to detect. Do some dummy GPU work to clear the error state.
1 parent 8fd0717 commit 9ca7e14
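
Sketched standalone below is the pattern the patch adds, with the reasoning spelled out in comments. This is an illustrative rendering rather than the committed code: it assumes a CUDA build of PyTorch recent enough to expose torch.AcceleratorError, and it uses a local flush_stale_async_error name with an explicit device argument instead of the repo's get_torch_device() helper.

import torch

def flush_stale_async_error(device):
    # A failed cudaHostRegister reports its error through the synchronous
    # return path, but CUDA also leaves the same error queued to be returned
    # by a later runtime call. Touching the device with throwaway work and
    # synchronizing forces that queued error to surface here, where it can be
    # swallowed, instead of on the next unrelated kernel launch or copy.
    try:
        a = torch.tensor([1], dtype=torch.uint8, device=device)
        b = torch.tensor([1], dtype=torch.uint8, device=device)
        _ = a + b                  # trivial kernel to exercise the stream
        torch.cuda.synchronize()   # drains pending work and raises the error
    except torch.AcceleratorError:
        # Already reported synchronously by the pinning call; drop the async copy.
        pass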

File tree

1 file changed: +14 -0

comfy/model_management.py

Lines changed: 14 additions & 0 deletions
@@ -1126,6 +1126,16 @@ def cast_to_device(tensor, device, dtype, copy=False):
 
 PINNING_ALLOWED_TYPES = set(["Parameter", "QuantizedTensor"])
 
+def discard_cuda_async_error():
+    try:
+        a = torch.tensor([1], dtype=torch.uint8, device=get_torch_device())
+        b = torch.tensor([1], dtype=torch.uint8, device=get_torch_device())
+        _ = a + b
+        torch.cuda.synchronize()
+    except torch.AcceleratorError:
+        # Dump it! We already know about it from the synchronous return
+        pass
+
 def pin_memory(tensor):
     global TOTAL_PINNED_MEMORY
     if MAX_PINNED_MEMORY <= 0:
@@ -1158,6 +1168,8 @@ def pin_memory(tensor):
         PINNED_MEMORY[ptr] = size
         TOTAL_PINNED_MEMORY += size
         return True
+    else:
+        discard_cuda_async_error()
 
     return False
 
@@ -1186,6 +1198,8 @@ def unpin_memory(tensor):
         if len(PINNED_MEMORY) == 0:
             TOTAL_PINNED_MEMORY = 0
         return True
+    else:
+        discard_cuda_async_error()
 
     return False
 
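For completeness, a hypothetical caller-side sketch (none of this is from the commit; the weight tensor, the comfy.model_management import alias, and the non_blocking copy are assumptions): pin_memory() signals failure through its return value, and the discard_cuda_async_error() call in the new else branches keeps the leftover async error from resurfacing on the next, unrelated transfer.

import torch
from comfy import model_management as mm  # assumed import path

weight = torch.empty(1024, dtype=torch.float16)  # hypothetical host tensor

if not mm.pin_memory(weight):
    # Pinning failed (e.g. ENOMEM or an mmap-backed buffer); the stale async
    # error has already been flushed, so we continue with pageable memory.
    pass

# This later transfer no longer trips over the error left by the failed pin.
weight_gpu = weight.to(mm.get_torch_device(), non_blocking=True)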