Another question is that: for example, setting |
It is hard for me to give good advice from the snippet above. Is there a particular reason you are using a |
Hi,
For the PyTorch Lightning dataset in this library, only ddp_spawn is supported for multi-GPU training. I have a dataset of large point clouds. To sample evenly distributed points during training, I need a shared buffer that records which parts have already been sampled; it is updated dynamically while the data is being fetched. An example can be found in KPConv.
I use `LightningDataset` to wrap the dataloader and `DDPSpawnPlugin` as the strategy. However, during training, the lock causes a segmentation fault:
ERROR: Unexpected segmentation fault encountered in worker.
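A minimal sketch of the pattern described above, using only the Python standard library and not the actual dataset code. The names `pick_least_sampled`, `worker`, and `run` are hypothetical, and the per-region counter array stands in for the shared sampling buffer. It assumes the lock and the buffer are both created from the same "spawn" context and handed to the child processes as arguments, which is how the standard library expects such objects to be shared.

```python
import multiprocessing as mp


def pick_least_sampled(counts, lock):
    """Atomically pick the least-sampled region and record the pick."""
    with lock:
        # Ties resolve to the lowest index.
        idx = min(range(len(counts)), key=lambda i: counts[i])
        counts[idx] += 1
    return idx


def worker(counts, lock, n_draws):
    """Hypothetical worker body: draw n_draws regions from the shared buffer."""
    for _ in range(n_draws):
        pick_least_sampled(counts, lock)


def run(n_regions=4, n_workers=2, n_draws=6):
    """Create the shared state from a spawn context and pass it to the workers."""
    ctx = mp.get_context("spawn")
    counts = ctx.Array("i", n_regions, lock=False)  # shared int buffer, zero-initialised
    lock = ctx.Lock()  # one explicit lock from the same context
    procs = [
        ctx.Process(target=worker, args=(counts, lock, n_draws))
        for _ in range(n_workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(counts)
```

With the spawn start method, `run()` should be called from under an `if __name__ == "__main__":` guard in an importable script, because each child re-imports the main module. Whether this layout avoids the segmentation fault inside a DataLoader worker is not something the sketch can confirm.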
I found the following in the PyTorch documentation:
Best practices and tips
Avoiding and fighting deadlocks
There are a lot of things that can go wrong when a new process is spawned, with the most common cause of deadlocks being background threads. If there’s any thread that holds a lock or imports a module, and fork is called, it’s very likely that the subprocess will be in a corrupted state and will deadlock or fail in a different way. Note that even if you don’t, Python built in libraries do - no need to look further than multiprocessing. multiprocessing.Queue is actually a very complex class, that spawns multiple threads used to serialize, send and receive objects, and they can cause aforementioned problems too. If you find yourself in such situation try using a multiprocessing.queues.SimpleQueue, that doesn’t use any additional threads.
We’re trying our best to make it easy for you and ensure these deadlocks don’t happen but some things are out of our control. If you have any issues you can’t cope with for a while, try reaching out on forums, and we’ll see if it’s an issue we can fix.
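The SimpleQueue suggestion in the quoted passage could be applied roughly as follows. This is only a sketch under assumptions: instead of every worker locking a shared buffer, workers report which region they sampled through a queue and a single owner keeps the tally. The helper names `make_simple_queue`, `report_sampled`, and `tally` are hypothetical, and nothing here is taken from this library or from KPConv.

```python
import multiprocessing as mp


def make_simple_queue():
    """Create a SimpleQueue from the spawn context (no background feeder thread)."""
    ctx = mp.get_context("spawn")
    return ctx.SimpleQueue()


def report_sampled(queue, region_idx):
    """Worker side: report that region_idx was just sampled."""
    queue.put(region_idx)


def tally(queue, n_messages, n_regions):
    """Owner side: read n_messages reports and count samples per region."""
    counts = [0] * n_regions
    for _ in range(n_messages):
        counts[queue.get()] += 1  # get() blocks until a report arrives
    return counts
```

The trade-off is that workers no longer see each other's updates immediately; they only learn the current counts if the owner sends them back, so this fits cases where slightly stale sampling statistics are acceptable.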
Is there a solution or workaround for this problem?
Thanks,
Han