Replies: 1 comment 1 reply
Did you try debugging with `CUDA_LAUNCH_BLOCKING=1`, or by running on the CPU instead of the GPU?
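For reference, a minimal sketch of both approaches. It assumes the `model`, `inputs`, `msk`, `labels`, and `loss_function` objects from the snippet in your question are already defined:

```python
import os

# Force synchronous kernel launches so the device-side assert is raised
# at the call that actually triggered it. This must be set before CUDA
# is initialized, i.e. ideally before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Alternative: run one batch on the CPU. CPU ops raise ordinary Python
# exceptions with precise messages instead of opaque device-side asserts.
device = torch.device("cpu")
model = model.to(device)
inputs, msk, labels = inputs.to(device), msk.to(device), labels.to(device)

outputs = model(torch.cat((inputs, msk), dim=1))
loss = loss_function(outputs, labels)  # on CPU, this line reports the real error
loss.backward()
```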
Dear all,
I just tried to run the MONAI classification model but ran into the problem below; it seems some gradient calculation went wrong. I hope someone can help.
Posts I found online suspect wrong labels, but I checked them: only 0 and 1 are sent to the model, so the labels are correct (a sketch of such a check follows the traceback).
```
RuntimeError                              Traceback (most recent call last)
Input In [8], in <cell line: 7>()
     19 outputs = model(torch.cat((inputs,msk),dim=1))
     20 loss = loss_function(outputs,labels)
---> 21 loss.backward()
     22 optimizer.step()
     23 epoch_loss +=loss.item()

File ~/anaconda3/lib/python3.9/site-packages/torch/_tensor.py:355, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    308 r"""Computes the gradient of current tensor w.r.t. graph leaves.
    309
    310 The graph is differentiated using the chain rule. If the tensor is
   (...)
    352     used to compute the :attr:`tensors`.
    353 """
    354 if has_torch_function_unary(self):
--> 355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
    358         self,
    359         gradient=gradient,
    360         retain_graph=retain_graph,
    361         create_graph=create_graph,
    362         inputs=inputs)
    363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File ~/anaconda3/lib/python3.9/site-packages/torch/overrides.py:1394, in handle_torch_function(public_api, relevant_args, *args, **kwargs)
   1388 warnings.warn("Defining your `__torch_function__` as a plain method is deprecated and "
   1389               "will be an error in PyTorch 1.11, please define it as a classmethod.",
   1390               DeprecationWarning)
   1392 # Use `public_api` instead of `implementation` so __torch_function__
   1393 # implementations can do equality/identity comparisons.
-> 1394 result = torch_func_method(public_api, types, args, kwargs)
   1396 if result is not NotImplemented:
   1397     return result

File ~/anaconda3/lib/python3.9/site-packages/monai/data/meta_tensor.py:249, in MetaTensor.__torch_function__(cls, func, types, args, kwargs)
    247 if kwargs is None:
    248     kwargs = {}
--> 249 ret = super().__torch_function__(func, types, args, kwargs)
    250 # if `out` has been used as argument, metadata is not copied, nothing to do.
    251 # if "out" in kwargs:
    252 #     return ret
    253 # we might have 1 or multiple outputs. Might be MetaTensor, might be something
    254 # else (e.g., `__repr__` returns a string).
    255 # Convert to list (if necessary), process, and at end remove list if one was added.
    256 if (
    257     hasattr(torch, "return_types")
    258     and hasattr(func, "__name__")
   (...)
    262 ):
    263     # for torch.max(torch.tensor(1.0), dim=0), the return type is named-tuple like

File ~/anaconda3/lib/python3.9/site-packages/torch/_tensor.py:1142, in Tensor.__torch_function__(cls, func, types, args, kwargs)
   1139     return NotImplemented
   1141 with _C.DisableTorchFunction():
-> 1142     ret = func(*args, **kwargs)
   1143 if func in get_default_nowrap_functions():
   1144     return ret

File ~/anaconda3/lib/python3.9/site-packages/torch/_tensor.py:363, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    354 if has_torch_function_unary(self):
    355     return handle_torch_function(
    356         Tensor.backward,
    357         (self,),
   (...)
    361         create_graph=create_graph,
    362         inputs=inputs)
--> 363 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File ~/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    168     retain_graph = create_graph
    170 # The reason we repeat same the comment below is that
    171 # some Python versions print out the first line of a multi-line function
    172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174     tensors, grad_tensors, retain_graph, create_graph, inputs,
    175     allow_unreachable=True, accumulate_grad=True)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
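A device-side assert in a classification pipeline is very often an out-of-range target index, so here is a minimal sketch of the label check described above. It assumes a `CrossEntropyLoss`-style loss; `num_classes = 2` matches the 0/1 labels mentioned but is an assumption, and `labels` is the batch tensor from the training loop:

```python
import torch

num_classes = 2  # assumption: binary classification with labels 0 and 1

# CrossEntropyLoss expects integer class indices in [0, num_classes - 1];
# an index outside that range triggers exactly this kind of device-side assert.
assert labels.dtype == torch.long, "labels must be integer class indices"
assert int(labels.min()) >= 0 and int(labels.max()) < num_classes, (
    f"labels must lie in [0, {num_classes - 1}], "
    f"got [{int(labels.min())}, {int(labels.max())}]"
)
```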