Replies: 1 comment 1 reply
Did you try debugging with `CUDA_LAUNCH_BLOCKING=1`, or by running on the CPU instead of the GPU?
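For reference, a minimal sketch of both approaches. It assumes the `model`, `inputs`, `msk`, `labels`, and `loss_function` objects from the snippet in your question are already defined:

```python
import os

# Force synchronous kernel launches so the device-side assert is raised
# at the call that actually triggered it. This must be set before CUDA
# is initialized, i.e. ideally before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Alternative: run one batch on the CPU. CPU ops raise ordinary Python
# exceptions with precise messages instead of opaque device-side asserts.
device = torch.device("cpu")
model = model.to(device)
inputs, msk, labels = inputs.to(device), msk.to(device), labels.to(device)

outputs = model(torch.cat((inputs, msk), dim=1))
loss = loss_function(outputs, labels)  # on CPU, this line reports the real error
loss.backward()
```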
Dear all,
I just tried to run the MONAI classification model but ran into the problem below; it seems some gradient calculation went wrong. I hope someone can help.
Posts I found online suspect wrong labels, but I checked them: only 0 and 1 are sent to the model, so the labels are correct (a sketch of such a check follows the traceback).
```
RuntimeError                              Traceback (most recent call last)
Input In [8], in <cell line: 7>()
     19 outputs = model(torch.cat((inputs,msk),dim=1))
     20 loss = loss_function(outputs,labels)
---> 21 loss.backward()
     22 optimizer.step()
     23 epoch_loss +=loss.item()

File ~/anaconda3/lib/python3.9/site-packages/torch/_tensor.py:355, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    308 r"""Computes the gradient of current tensor w.r.t. graph leaves.
    309
    310 The graph is differentiated using the chain rule. If the tensor is
   (...)
    352     used to compute the :attr:`tensors`.
    353 """
    354 if has_torch_function_unary(self):
--> 355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
    358         self,
    359         gradient=gradient,
    360         retain_graph=retain_graph,
    361         create_graph=create_graph,
    362         inputs=inputs)
    363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File ~/anaconda3/lib/python3.9/site-packages/torch/overrides.py:1394, in handle_torch_function(public_api, relevant_args, *args, **kwargs)
   1388 warnings.warn("Defining your `__torch_function__` as a plain method is deprecated and "
   1389               "will be an error in PyTorch 1.11, please define it as a classmethod.",
   1390               DeprecationWarning)
   1392 # Use `public_api` instead of `implementation` so __torch_function__
   1393 # implementations can do equality/identity comparisons.
-> 1394 result = torch_func_method(public_api, types, args, kwargs)
   1396 if result is not NotImplemented:
   1397     return result

File ~/anaconda3/lib/python3.9/site-packages/monai/data/meta_tensor.py:249, in MetaTensor.__torch_function__(cls, func, types, args, kwargs)
    247 if kwargs is None:
    248     kwargs = {}
--> 249 ret = super().__torch_function__(func, types, args, kwargs)
    250 # if `out` has been used as argument, metadata is not copied, nothing to do.
    251 # if "out" in kwargs:
    252 #     return ret
    253 # we might have 1 or multiple outputs. Might be MetaTensor, might be something
    254 # else (e.g., `__repr__` returns a string).
    255 # Convert to list (if necessary), process, and at end remove list if one was added.
    256 if (
    257     hasattr(torch, "return_types")
    258     and hasattr(func, "__name__")
   (...)
    262 ):
    263     # for torch.max(torch.tensor(1.0), dim=0), the return type is named-tuple like

File ~/anaconda3/lib/python3.9/site-packages/torch/_tensor.py:1142, in Tensor.__torch_function__(cls, func, types, args, kwargs)
   1139     return NotImplemented
   1141 with _C.DisableTorchFunction():
-> 1142     ret = func(*args, **kwargs)
   1143 if func in get_default_nowrap_functions():
   1144     return ret

File ~/anaconda3/lib/python3.9/site-packages/torch/_tensor.py:363, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    354 if has_torch_function_unary(self):
    355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
   (...)
    361         create_graph=create_graph,
    362         inputs=inputs)
--> 363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File ~/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    168     retain_graph = create_graph
    170 # The reason we repeat same the comment below is that
    171 # some Python versions print out the first line of a multi-line function
    172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174     tensors, grad_tensors, retain_graph, create_graph, inputs,
    175     allow_unreachable=True, accumulate_grad=True)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
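A device-side assert in a classification pipeline is very often an out-of-range target index, so here is a minimal sketch of the label check described above. It assumes a `CrossEntropyLoss`-style loss; `num_classes = 2` matches the 0/1 labels mentioned but is an assumption, and `labels` is the batch tensor from the training loop:

```python
import torch

num_classes = 2  # assumption: binary classification with labels 0 and 1

# CrossEntropyLoss expects integer class indices in [0, num_classes - 1];
# an index outside that range triggers exactly this kind of device-side assert.
assert labels.dtype == torch.long, "labels must be integer class indices"
assert int(labels.min()) >= 0 and int(labels.max()) < num_classes, (
    f"labels must lie in [0, {num_classes - 1}], "
    f"got [{int(labels.min())}, {int(labels.max())}]"
)
```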