Hello, I am currently using MONAI with AWS SageMaker DataParallel. I have a validation loop that uses something like:
I've noticed that under a few different conditions I get the following error:
For instance, this occurs if my batch_size = 1 rather than 2 or 4. I've also added some debugging statements that write info to files, and these have triggered it as well. Question: is there some sort of timeout value that I can adjust to prevent this error? I'm not sure this is necessarily a MONAI issue, but I figured this would be the best place to start. I see something similar discussed in #6120, but I'm not sure it's the same issue, and the solutions proposed there do not appear to help me. Any assistance would be greatly appreciated. MONAI v1.1.0
Replies: 1 comment 2 replies
Hi @csheaff, I just noticed that the error only happens when batch_size=1. If you are only using a single data sample, I'm not sure how DDP would apply here; maybe just remove DDP for now, and add it back once you are using more data.
A condition where this timeout occurs: monitoring model weights and biases using
These are getting saved to S3 as TensorBoard metrics. I wonder if the network activity is somehow blocking the retrieval of the monai metric.
Update: reducing the rate at which I'm logging data during the validation step appears to resolve this.
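For anyone landing here later, this is a rough sketch of one way to reduce the logging rate, not the exact code from my training script. `ThrottledLogger`, `write_fn`, and `every_n` are hypothetical names made up for illustration; in practice `write_fn` would wrap whatever actually writes the TensorBoard data, and the metric value below is only a placeholder.

```python
from typing import Callable, Dict


class ThrottledLogger:
    """Forward metrics to `write_fn` only on every `every_n`-th call.

    Hypothetical helper: `write_fn` stands in for whatever function
    actually writes the metrics (e.g. a TensorBoard writer wrapper).
    """

    def __init__(self, write_fn: Callable[[int, Dict[str, float]], None], every_n: int = 50):
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self.write_fn = write_fn
        self.every_n = every_n
        self.calls = 0

    def log(self, step: int, metrics: Dict[str, float]) -> bool:
        """Return True if the metrics were written, False if skipped."""
        self.calls += 1
        if self.calls % self.every_n != 0:
            return False  # skip this call to cut down on write/network activity
        self.write_fn(step, metrics)
        return True


# Usage: collect writes in a list instead of hitting real storage.
written = []
logger = ThrottledLogger(lambda step, m: written.append((step, m)), every_n=4)
for step in range(10):
    logger.log(step, {"val_metric": 0.0})  # placeholder value, not a real result
```

With `every_n=4`, only every fourth validation step reaches the writer, so the other steps do no I/O at all. Whether this avoids the timeout in a given setup is something I have only observed empirically.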