
Conversation

@javak87
Contributor

@javak87 javak87 commented Dec 29, 2025

Description

The issue described in #1399 suggests there’s a bottleneck when transferring data from CPU to GPU. Experiments show that the batch is not pinned even though the DataLoader is configured with pin_memory=True. Therefore, manual pinning is necessary to improve performance.
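
As a minimal sketch of the pattern this PR adds (the function and variable names below are illustrative, not the actual WeatherGenerator code):

import torch

def transfer(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Tensor.pin_memory() returns a copy of the batch in page-locked host memory.
    pinned = batch.pin_memory()
    # non_blocking=True can only overlap the copy with GPU compute when the source is pinned.
    return pinned.to(device, non_blocking=True)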

Here are the profiling results after manual pinning on JWB:

[screenshot: pinned_mem]

Here are the profiling results after manual pinning on Santis:

[screenshot: pinned_mem]

Before manual pinning, the data transfer throughput was ~6 GB/s on JWB and 248 MB/s on Santis (!!!). As shown in the profiling, throughput on JWB increased to 17 GB/s, and on Santis it jumped to 360 GB/s. Achieving 360 GB/s is close to what we expect from the CPU–GPU NVLink on Santis. The maximum theoretical throughput on Santis is 450 GB/s.
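
For reference, a rough way to sanity-check host-to-device bandwidth with and without pinning (an illustrative micro-benchmark, not the profiler that produced the numbers above):

import time

import torch

def h2d_bandwidth_gbs(size_mb: int = 1024, pinned: bool = True) -> float:
    # Allocate a size_mb buffer on the host, optionally pin it, and time one copy to the GPU.
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8)
    if pinned:
        x = x.pin_memory()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    x.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return (size_mb / 1024) / (time.perf_counter() - t0)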

To verify the manual pinning behavior, the code was also run with:

../WeatherGenerator-private/hpc/launch-slurm.py --time 180 --nodes=1
Here are the training time results:

run_id    HPC     PR                                                    Ingested Samples per GPU
kb9uki4x  Santis  develop (1 node, 180 mins)                            9366
lptxb12a  Santis  javad/dev/manual-mem-pinning-1399 (1 node, 180 mins)  10320

The above performance check was run against the develop branch as of 23 Dec. 2025.

As shown, Santis performance improved by ~10% (with throughput increasing from 248 MB/s to 360 GB/s), while there was no noticeable change on JWB.

Issue Number

Closes #1399

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)
Then the batch is completed but it's still running in parallel.

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

CPU RAM consumed (without pinning) for batches 0–10:

Batch 0: RAM = 1694.7 MB
Batch 1: RAM = 3297.4 MB
Batch 2: RAM = 3371.8 MB
Batch 3: RAM = 3371.8 MB
Batch 4: RAM = 3374.2 MB
Batch 5: RAM = 3376.3 MB
Batch 6: RAM = 3376.3 MB
Batch 7: RAM = 3377.8 MB
Batch 8: RAM = 3379.9 MB
Batch 9: RAM = 3382.1 MB
Batch 10: RAM = 3383.6 MB

CPU RAM consumed (with pinning) for batches 0–10:

Batch 0: RAM = 1093.4 MB
Batch 1: RAM = 3435.7 MB
Batch 2: RAM = 3441.2 MB
Batch 3: RAM = 3441.2 MB
Batch 4: RAM = 3443.7 MB
Batch 5: RAM = 3443.7 MB
Batch 6: RAM = 3445.7 MB
Batch 7: RAM = 3447.2 MB
Batch 8: RAM = 3449.6 MB
Batch 9: RAM = 3452.6 MB
Batch 10: RAM = 3454.5 MB

Pinning increases CPU RAM usage by approximately 80 MB.
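
For reference, a minimal sketch of how such per-batch RAM numbers can be collected (psutil is an assumption here; the comment does not state which tool was used):

import psutil

def log_cpu_ram(batches, limit: int = 10) -> None:
    # Print the resident set size (RSS) of the current process after each batch.
    proc = psutil.Process()
    for i, _batch in enumerate(batches):
        rss_mb = proc.memory_info().rss / 1024**2
        print(f"Batch {i}: RAM = {rss_mb:.1f} MB")
        if i >= limit:
            break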

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code?)

@tjhunter
Collaborator

Thanks @javak87 for the investigation. It is not entirely surprising since the tensors sent to the GPU are heavily fragmented due to all the transforms. Pinning forces assembling them first in a single memory-aligned page.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code.)

Regarding linting the code with ruff, I did it, but because I need to import torch in packages/common/src/weathergen/common/io.py, I'm getting the following error:

 WARN /home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml: Extra keys found in config: ignores
ERROR Could not find import of `torch` [import-error]
  --> packages/common/src/weathergen/common/io.py:19:8
   |
19 | import torch
   |        ^^^^^
   |
  Looked in these locations (from config in `/home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml`):

I think packages/common/pyproject.toml should be changed.

@tjhunter do you have any idea how to solve this import error?

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)


The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.
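
For illustration, a minimal sketch of guarding against this constraint (maybe_pin is a hypothetical helper; torch.utils.data.get_worker_info() is the standard way to detect whether code is running inside a DataLoader worker):

from torch.utils.data import get_worker_info

def maybe_pin(batch):
    # DataLoader workers are forked before CUDA is initialized, so pin only in
    # the main process; inside a worker, return the batch as-is (still pageable).
    if get_worker_info() is not None:
        return batch
    return batch.pin_memory()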

Collaborator

@tjhunter tjhunter left a comment

@javak87 this looks very helpful, and you can make your life much easier. All you need is to traverse the data structures to trigger a side effect (we are not dealing with async optimizations yet). Also, the Protocol concept in Python is exactly for that purpose.

Define the protocol and the traversal function in a pin.py module:

from typing import Protocol, runtime_checkable
import torch
from weathergen.common.io import IOReaderData

@runtime_checkable
class Pinnable(Protocol):
    """
    Protocol that allows the pytorch content of this data structure 
    to be pinned to the memory of the current accelerator.

    This extends the pin_memory() capability of a torch Tensor 
    to other classes.

    It is blocking.
    """
    def pin_memory(self): ...


def pin_object(obj: Pinnable | torch.Tensor | IOReaderData | list | dict | None):
    if obj is None:
        return
    elif isinstance(obj, torch.Tensor) and obj.numel() > 0:
        obj.pin_memory()
    elif isinstance(obj, Pinnable):
        obj.pin_memory()
    elif isinstance(obj, IOReaderData):
        # Special case for that class because it is in common
        # Should not be the case, it is a numpy array
        pin_object(obj.coords)
        ...
    elif isinstance(obj, list):
        # Assume the list is a list of potentially pinnable objects and traverse it.
        for e in obj:
            pin_object(e)
    elif isinstance(obj, dict):
        # Assume the values are pinnable.
        for e in obj.values():
            pin_object(e)

and then the changes in each class are very tiny:

from weathergen.datasets.pin import Pinnable, pin_object
...
class Sample(Pinnable):
...
    def pin_memory(self):
        pin_object(self.streams_data)
        pin_object(self.meta_info)

No need to do more checks for attributes etc., this is all done for you by the protocol.
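
For illustration, a self-contained example of the runtime_checkable protocol idea (MetaInfo is a hypothetical container, not a WeatherGenerator class; the Pinnable definition is repeated from the sketch above):

from typing import Protocol, runtime_checkable

import torch

@runtime_checkable
class Pinnable(Protocol):
    def pin_memory(self): ...

class MetaInfo:
    def __init__(self, times: torch.Tensor):
        self.times = times

    def pin_memory(self):
        # Tensor.pin_memory() returns a pinned copy, so reassign the attribute.
        self.times = self.times.pin_memory()
        return self

m = MetaInfo(torch.zeros(4))
print(isinstance(m, Pinnable))  # True: the protocol is satisfied structurally, no inheritance needed
# The calls below need a machine with a CUDA device (pinning requires the CUDA runtime).
m.pin_memory()
print(m.times.is_pinned())      # True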

@tjhunter
Collaborator

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

@tjhunter
Collaborator

The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside iter), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.

Interesting, that sounds like a bug on the torch side (at least regarding pinning the CPU memory; I can imagine that they take shortcuts and mix CPU and GPU logic).

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)

.
The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly. I think the initial setup was correct.

Yes, that's what I would have expected (since CUDA doesn't work in the parallel worker processes), but it was worth a try. But then we should still do it as early as possible in Trainer.train() and Trainer.validate().

torch should at least generate a warning.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

@clessig
Since pinning is now done manually and the DataLoader's pin_memory=True isn't working with the current setup, I think it's better to remove that option from the DataLoader.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

Thanks for your suggestion — it’s pretty convenient.

@sophie-xhonneux
Contributor

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

Everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2).

@clessig
Collaborator

clessig commented Jan 6, 2026

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2)

Do you know where it is hanging? Is there any log? Eventually it should time out and point you to the location where it hangs.

@sophie-xhonneux
Contributor

I waited for over 10 minutes and got no error

@javak87
Contributor Author

javak87 commented Jan 6, 2026

I waited for over 10 minutes and got no error

Could you share the configuration you’re using to run the code?

@sophie-xhonneux
Contributor

So I ran FSDP2 with DINOv2 on develop and I get the same hanging behaviour (the error below in case anyone is wondering), so please proceed with merging this, though I would still advocate for it coming with a flag to toggle it (even if it just makes debugging later easier).

[rank2]:[E112 15:47:18.728729935 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=ALLREDUCE, NumelIn=4096, NumelOut=4096, Timeout(ms)=600000) ran for 600045 milliseconds before timing out.
[rank2]:[E112 15:47:18.729000049 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1119, last completed work: 1084
[rank2]:[E112 15:47:18.729013681 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E112 15:47:18.763162126 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=ALLREDUCE, NumelIn=4096, NumelOut=4096, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
[rank3]:[E112 15:47:18.763461617 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1119, last completed work: 1084
[rank3]:[E112 15:47:18.763478129 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E112 15:47:19.142036236 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank0]:[E112 15:47:19.142320846 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1100, last completed work: 1084
[rank0]:[E112 15:47:19.142336846 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E112 15:47:19.144332829 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank1]:[E112 15:47:19.144484862 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1100, last completed work: 1084
[rank1]:[E112 15:47:19.144495230 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E112 15:47:20.734418267 ProcessGroupNCCL.cpp:681] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E112 15:47:20.734432987 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E112 15:47:20.734446107 ProcessGroupNCCL.cpp:695] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E112 15:47:20.734459099 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E112 15:47:20.735249281 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x40007770a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x40004180abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x40004180cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x40004180d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x400030c3d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40002e63875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40002e89ff2c in /lib64/libc.so.6)

[rank0]:[E112 15:47:20.735249057 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x400058c3a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x400022d3abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x400022d3cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x400022d3d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x40001216d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40000fb6875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40000fdcff2c in /lib64/libc.so.6)

terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::DistBackendErrorc10::DistBackendError'
'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x40007770a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x40004180abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x40004180cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x40004180d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x400030c3d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40002e63875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40002e89ff2c in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x40007770a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1465870 (0x4000417c5870 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x464 (0x40004180daf4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdd88c (0x400030c3d88c in /usr/lib64/libstdc++.so.6)
frame #4: <unknown function> + 0x875c (0x40002e63875c in /lib64/libpthread.so.0)
frame #5: <unknown function> + 0xdff2c (0x40002e89ff2c in /lib64/libc.so.6)

  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x400058c3a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x400022d3abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x400022d3cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x400022d3d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x40001216d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40000fb6875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40000fdcff2c in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x400058c3a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1465870 (0x400022cf5870 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x464 (0x400022d3daf4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdd88c (0x40001216d88c in /usr/lib64/libstdc++.so.6)
frame #4: <unknown function> + 0x875c (0x40000fb6875c in /lib64/libpthread.so.0)
frame #5: <unknown function> + 0xdff2c (0x40000fdcff2c in /lib64/libc.so.6)

W0112 15:47:25.476000 284287 /capstor/store/cscs/userlab/ch17/uv_cache_shared/archive-v0/HiFXmMqcadA7ZHtmfpaDI/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 284494 closing signal SIGTERM
W0112 15:47:25.484000 284287 /capstor/store/cscs/userlab/ch17/uv_cache_shared/archive-v0/HiFXmMqcadA7ZHtmfpaDI/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 284495 closing signal SIGTERM
E0112 15:47:25.490000 284287 /capstor/store/cscs/userlab/ch17/uv_cache_shared/archive-v0/HiFXmMqcadA7ZHtmfpaDI/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 284492) of binary: /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/bin/python
Traceback (most recent call last):
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
src/weathergen/run_train.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2026-01-12_15:47:25
  host      : nid005232
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 284493)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 284493
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-01-12_15:47:25
  host      : nid005232
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 284492)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 284492
=======================================================

@javak87
Contributor Author

javak87 commented Jan 13, 2026

So I ran FSDP2 with DINOv2 on develop and I get the same hanging behaviour (the error below in case anyone is wondering), so please proceed with merging this, though I would still advocate for it coming with a flag to toggle it (even if it just makes debugging later easier).

I added a memory_pinning flag to the config and to the training/validation loops. However, checking this flag on every batch is a bit ugly. We could move the check outside the loop, but we’d still have to iterate over all batches in the dataset, which is much slower.
I think in the future it might be better to remove this flag.

@sophie-xhonneux
Contributor

Alternatively, you can overwrite the pin_memory() function if the flag is off
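
For illustration, a minimal sketch of that idea (apply_memory_pinning_flag is a hypothetical helper; ModelBatch and the memory_pinning config entry appear in the diff further down):

def apply_memory_pinning_flag(batch_cls, memory_pinning: bool):
    # If the flag is off, replace pin_memory() with an identity method so the
    # training loop can call batch.pin_memory() unconditionally.
    if not memory_pinning:
        batch_cls.pin_memory = lambda self: self
    return batch_cls

# Hypothetical usage, applied once at startup instead of branching on every batch:
# apply_memory_pinning_flag(ModelBatch, cf.data_loading.memory_pinning)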

@javak87
Contributor Author

javak87 commented Jan 13, 2026

Alternatively, you can overwrite the pin_memory() function if the flag is off

Yep. For now, I don’t think it’s a big deal and it’s not affecting performance too much. We can merge this and maybe create another PR for it later.

@clessig
Collaborator

clessig commented Jan 14, 2026

Through the rebasing, one change was necessary, and memory_pinning should be a param in data_loading. I cannot commit to your branch, so here are the diffs. Can you please make sure it's all working? Then we can merge.

diff --git a/config/default_config.yml b/config/default_config.yml
index f8f377e6..4a871bb8 100644
--- a/config/default_config.yml
+++ b/config/default_config.yml
@@ -83,9 +83,6 @@ latent_noise_use_additive_noise: False
 latent_noise_deterministic_latents: True 
 
 
-# It’s possible that enabling memory_pinning with FSDP2 + DINOv2 can cause the job to hang and trigger a PyTorch timeout error.
-# If this happens, you can disable the flag, but performance will drop on GH200.
-memory_pinning: True
 freeze_modules: ""
 
 norm_type: "LayerNorm"
@@ -128,6 +125,11 @@ data_loading :
   rng_seed: ???
   repeat_data_in_mini_epoch : False
 
+  # pin host (CPU) memory for faster CPU-GPU transfer; it is possible that enabling memory_pinning with 
+  # FSDP2 + DINOv2 can cause the job to hang and trigger a PyTorch timeout error.
+  # If this happens, you can disable the flag, but performance will drop on GH200.
+  memory_pinning: True
+
 
 # config for training
 training_config:
diff --git a/config/config_physical_jepa.yml b/config/config_physical_jepa.yml
index 12eddf65..82ab0e0d 100644
--- a/config/config_physical_jepa.yml
+++ b/config/config_physical_jepa.yml
@@ -126,6 +126,11 @@ data_loading :
   num_workers: 12
   rng_seed: ???
 
+  # pin host (CPU) memory for faster CPU-GPU transfer; it is possible that enabling memory_pinning with 
+  # FSDP2 + DINOv2 can cause the job to hang and trigger a PyTorch timeout error.
+  # If this happens, you can disable the flag, but performance will drop on GH200.
+  memory_pinning: True
+
 
 # config for training
 training_config:
diff --git a/src/weathergen/datasets/batch.py b/src/weathergen/datasets/batch.py
index da797e00..e0dc59d6 100644
--- a/src/weathergen/datasets/batch.py
+++ b/src/weathergen/datasets/batch.py
@@ -175,6 +175,19 @@ class BatchSamples:
         """
         return self.device
 
+    def pin_memory(self):
+        """Pin all tensors in this batch to CPU pinned memory"""
+
+        # pin all samples
+        for sample in self.samples:
+            sample.pin_memory()
+
+        # pin source_tokens_lens
+        if isinstance(self.tokens_lens, torch.Tensor):
+            self.tokens_lens = self.tokens_lens.pin_memory()
+
+        return self
+
 
 class ModelBatch:
     """
@@ -208,17 +221,11 @@ class ModelBatch:
     def pin_memory(self):
         """Pin all tensors in this batch to CPU pinned memory"""
 
-        # Pin all source samples
-        for sample in self.source_samples:
-            sample.pin_memory()
-
-        # Pin all target samples
-        for sample in self.target_samples:
-            sample.pin_memory()
+        # pin source samples
+        self.source_samples.pin_memory()
 
-        # Pin source_tokens_lens
-        if isinstance(self.source_tokens_lens, torch.Tensor):
-            self.source_tokens_lens = self.source_tokens_lens.pin_memory()  # pylint: disable=attribute-defined-outside-init
+        # pin target samples
+        self.target_samples.pin_memory()
 
         return self
diff --git a/src/weathergen/train/trainer.py b/src/weathergen/train/trainer.py
index 1bee60e7..7cb480af 100644
--- a/src/weathergen/train/trainer.py
+++ b/src/weathergen/train/trainer.py
@@ -396,7 +396,7 @@ class Trainer(TrainerBase):
        # training loop
        self.t_start = time.time()
        for bidx, batch in enumerate(dataset_iter):
-            if cf.memory_pinning:
+            if cf.data_loading.get("memory_pinning", False):
                # pin memory for faster CPU-GPU transfer
                batch = batch.pin_memory()

@@ -514,7 +514,7 @@ class Trainer(TrainerBase):
            # print progress bar but only in interactive mode, i.e. when without ddp
            with tqdm.tqdm(total=mode_cfg.samples_per_mini_epoch, disable=self.cf.with_ddp) as pbar:
                for bidx, batch in enumerate(dataset_val_iter):
-                    if cf.memory_pinning:
+                    if cf.data_loading.get("memory_pinning", False):
                        # pin memory for faster CPU-GPU transfer
                        batch = batch.pin_memory()

@clessig clessig mentioned this pull request Jan 15, 2026
@clessig
Collaborator

clessig commented Jan 15, 2026

Created #1615 with the changes

@javak87
Contributor Author

javak87 commented Jan 16, 2026

Created #1615 with the changes

Thanks. I checked #1615 and updated this PR based on it.

@clessig
Collaborator

clessig commented Jan 19, 2026

Closed via #1615
