
Conversation

@javak87
Contributor

@javak87 javak87 commented Dec 29, 2025

Description

The issue described in #1399 suggests there’s a bottleneck when transferring data from CPU to GPU. Experiments show that the batch is not pinned even though the DataLoader is configured with pin_memory=True. Therefore, manual pinning is necessary to improve performance.
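
As a minimal sketch of the pattern this PR adds (the function and variable names below are illustrative, not the actual WeatherGenerator code):

import torch

def transfer(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Tensor.pin_memory() returns a copy of the batch in page-locked host memory.
    pinned = batch.pin_memory()
    # non_blocking=True can only overlap the copy with GPU compute when the source is pinned.
    return pinned.to(device, non_blocking=True)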

Here are the profiling results after manual pinning on JWB:

[screenshot: pinned_mem]

Here are the profiling results after manual pinning on Santis:

[screenshot: pinned_mem]

Before manual pinning, the data transfer throughput was ~6 GB/s on JWB and 248 MB/s on Santis (!!!). As shown in the profiling, throughput on JWB increased to 17 GB/s, and on Santis it jumped to 360 GB/s. Achieving 360 GB/s is close to what we expect from the CPU–GPU NVLink on Santis. The maximum theoretical throughput on Santis is 450 GB/s.
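
For reference, a rough way to sanity-check host-to-device bandwidth with and without pinning (an illustrative micro-benchmark, not the profiler that produced the numbers above):

import time

import torch

def h2d_bandwidth_gbs(size_mb: int = 1024, pinned: bool = True) -> float:
    # Allocate a size_mb buffer on the host, optionally pin it, and time one copy to the GPU.
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8)
    if pinned:
        x = x.pin_memory()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    x.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return (size_mb / 1024) / (time.perf_counter() - t0)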

To verify the manual pinning behavior, the code was also run with:

../WeatherGenerator-private/hpc/launch-slurm.py --time 180 --nodes=1
Here are the training time results:

run_id    HPC     PR                                                    Ingested Samples per GPU
kb9uki4x  Santis  develop (1 node, 180 mins)                            9366
lptxb12a  Santis  javad/dev/manual-mem-pinning-1399 (1 node, 180 mins)  10320

The above performance check was run against the develop branch as of 23 Dec. 2025.

As shown, Santis performance improved by ~10% (with throughput increasing from 248 MB/s to 360 GB/s), while there was no noticeable change on JWB.

Issue Number

Closes #1399

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

@clessig
Collaborator

clessig commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.

You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)
Then the batch is completed but it's still running in parallel.

@javak87
Contributor Author

javak87 commented Dec 29, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

CPU RAM consumed (without pinning) for batches 0–10:

Batch 0: RAM = 1694.7 MB
Batch 1: RAM = 3297.4 MB
Batch 2: RAM = 3371.8 MB
Batch 3: RAM = 3371.8 MB
Batch 4: RAM = 3374.2 MB
Batch 5: RAM = 3376.3 MB
Batch 6: RAM = 3376.3 MB
Batch 7: RAM = 3377.8 MB
Batch 8: RAM = 3379.9 MB
Batch 9: RAM = 3382.1 MB
Batch 10: RAM = 3383.6 MB

CPU RAM consumed (with pinning) for batches 0–10:

Batch 0: RAM = 1093.4 MB
Batch 1: RAM = 3435.7 MB
Batch 2: RAM = 3441.2 MB
Batch 3: RAM = 3441.2 MB
Batch 4: RAM = 3443.7 MB
Batch 5: RAM = 3443.7 MB
Batch 6: RAM = 3445.7 MB
Batch 7: RAM = 3447.2 MB
Batch 8: RAM = 3449.6 MB
Batch 9: RAM = 3452.6 MB
Batch 10: RAM = 3454.5 MB

Pinning increases CPU RAM usage by approximately 80 MB.
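
For reference, a minimal sketch of how such per-batch RAM numbers can be collected (psutil is an assumption here; the comment does not state which tool was used):

import psutil

def log_cpu_ram(batches, limit: int = 10) -> None:
    # Print the resident set size (RSS) of the current process after each batch.
    proc = psutil.Process()
    for i, _batch in enumerate(batches):
        rss_mb = proc.memory_info().rss / 1024**2
        print(f"Batch {i}: RAM = {rss_mb:.1f} MB")
        if i >= limit:
            break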

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code?)

@tjhunter
Collaborator

Thanks @javak87 for the investigation. It is not entirely surprising since the tensors sent to the GPU are heavily fragmented due to all the transforms. Pinning forces assembling them first in a single memory-aligned page.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

Done.

And does this change the performance behaviour? (Can you also lint the code.)

Regarding linting the code with ruff, I did it, but because I need to import torch in packages/common/src/weathergen/common/io.py, I'm getting the following error:

 WARN /home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml: Extra keys found in config: ignores
ERROR Could not find import of `torch` [import-error]
  --> packages/common/src/weathergen/common/io.py:19:8
   |
19 | import torch
   |        ^^^^^
   |
  Looked in these locations (from config in `/home/runner/work/WeatherGenerator/WeatherGenerator/packages/common/pyproject.toml`):

I think packages/common/pyproject.toml should be changed.

@tjhunter do you have any idea how to solve this import error?

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)


The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.
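
For illustration, a minimal sketch of guarding against this constraint (maybe_pin is a hypothetical helper; torch.utils.data.get_worker_info() is the standard way to detect whether code is running inside a DataLoader worker):

from torch.utils.data import get_worker_info

def maybe_pin(batch):
    # DataLoader workers are forked before CUDA is initialized, so pin only in
    # the main process; inside a worker, return the batch as-is (still pageable).
    if get_worker_info() is not None:
        return batch
    return batch.pin_memory()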

Collaborator

@tjhunter tjhunter left a comment

@javak87 this looks very helpful, and you can make your life much easier. All you need is to traverse the data structures to trigger a side effect (we are not dealing with async optimizations yet). Also, the Protocol concept in Python is exactly for that purpose.

Define the protocol and the traversal function in a pin.py module:

from typing import Protocol, runtime_checkable
import torch
from weathergen.common.io import IOReaderData

@runtime_checkable
class Pinnable(Protocol):
    """
    Protocol that allows the pytorch content of this data structure 
    to be pinned to the memory of the current accelerator.

    This extends the pin_memory() capability of a torch Tensor 
    to other classes.

    It is blocking.
    """
    def pin_memory(self): ...


def pin_object(obj: Pinnable | torch.Tensor | IOReaderData | list | dict | None):
    if obj is None:
        return
    elif isinstance(obj, torch.Tensor) and obj.numel() > 0:
        obj.pin_memory()
    elif isinstance(obj, Pinnable):
        obj.pin_memory()
    elif isinstance(obj, IOReaderData):
        # Special case for that class because it is in common
        # Should not be the case, it is a numpy array
        pin_object(obj.coords)
        ...
    elif isinstance(obj, list):
        # Assume the list is a list of potentially pinnable objects and traverse it.
        for e in obj:
            pin_object(e)
    elif isinstance(obj, dict):
        # Assume the values are pinnable.
        for e in obj.values():
            pin_object(e)

and then the changes in each class are very tiny:

from weathergen.datasets.pin import Pinnable, pin_object
...
class Sample(Pinnable):
...
    def pin_memory(self):
        pin_object(self.streams_data)
        pin_object(self.meta_info)

No need to do more checks for attributes etc., this is all done for you by the protocol.
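
For illustration, a self-contained example of the runtime_checkable protocol idea (MetaInfo is a hypothetical container, not a WeatherGenerator class; the Pinnable definition is repeated from the sketch above):

from typing import Protocol, runtime_checkable

import torch

@runtime_checkable
class Pinnable(Protocol):
    def pin_memory(self): ...

class MetaInfo:
    def __init__(self, times: torch.Tensor):
        self.times = times

    def pin_memory(self):
        # Tensor.pin_memory() returns a pinned copy, so reassign the attribute.
        self.times = self.times.pin_memory()
        return self

m = MetaInfo(torch.zeros(4))
print(isinstance(m, Pinnable))  # True: the protocol is satisfied structurally, no inheritance needed
# The calls below need a machine with a CUDA device (pinning requires the CUDA runtime).
m.pin_memory()
print(m.times.is_pinned())      # True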

@tjhunter
Collaborator

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

@tjhunter
Collaborator

The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside iter), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly.
I think the initial setup was correct.

Interesting, that sounds like a bug on the torch side (at least regarding pinning the CPU memory; I can imagine that they take shortcuts and mix CPU and GPU logic).

@clessig
Collaborator

clessig commented Dec 30, 2025

Thanks @javak87, this looks interesting. Wouldn't it be more sensible to move to pinned memory already in the MultiStreamDataSampler when we convert to torch.tensor() or when we complete the batch (we cannot move it to the GPU there but maybe to pinned memory). Also, one problem with pinned memory is that this reduces the CPU RAM. Did you observe a reduced available CPU RAM?

Regarding pinning, I thought about this too. I tried pinning earlier, but some objects in the stream (or some tensors) were still pageable and not actually pinned. It’s recommended to first assemble the full batch and then pin the batch afterward.
You’re right that this increases the CPU RAM footprint, but I haven’t measured it yet. With the current setup, GPU memory usage is around ~12 GB, and compared to a node with 512 GB of RAM, I don’t think it will have a significant impact.

Can you try to pin here:

batch = self._get_batch(idx, forecast_dt)

. Then the batch is completed but it's still running in parallel.

I made a mistake and ran a different branch. If I add pinning after this line in multi_stream_data_sampler.py, I get an error:

batch = self._get_batch(idx, forecast_dt)

.
The issue is that worker processes are forked from the main process before CUDA is initialized, and CUDA does not support forking after initialization. When I call .pin_memory() in the worker process (inside __iter__), it attempts to access CUDA, but the CUDA context hasn’t been properly initialized in that worker. Therefore, pinning needs to be done in the main process, where CUDA is already initialized correctly. I think the initial setup was correct.

Yes, that's what I would have expected (since CUDA doesn't work in the parallel worker processes), but it was worth a try. But then we should still do it as early as possible in Trainer.train() and Trainer.validate().

torch should at least generate a warning.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

@clessig
Since pinning is now done manually and the DataLoader's pin_memory=True isn't working with the current setup, I think it's better to remove that option from the DataLoader.

@javak87
Contributor Author

javak87 commented Dec 30, 2025

Also, using a protocol has the advantage of clearly documenting all the classes which deal with memory pinning.

Thanks for your suggestion — it’s pretty convenient.

@sophie-xhonneux
Contributor

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

Everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2).

@clessig
Collaborator

clessig commented Jan 6, 2026

DINOv2 hangs with FSDP2 and your memory pinning, but I don't know why.

everything else worked in my testing (integration tests, JEPA, Physical modelling with and without FSDP2)

Do you know where it is hanging? Is there any log? Eventually it should time out and point you to the location where it hangs.

@sophie-xhonneux
Contributor

I waited for over 10 minutes and got no error

@javak87
Contributor Author

javak87 commented Jan 6, 2026

I waited for over 10 minutes and got no error

Could you share the configuration you’re using to run the code?

@sophie-xhonneux
Contributor

So I ran FSDP2 with DINOv2 on develop and I get the same hanging behaviour (the error below in case anyone is wondering), so please proceed with merging this, though I would still advocate for it coming with a flag to toggle it (even if it just makes debugging later easier).

[rank2]:[E112 15:47:18.728729935 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=ALLREDUCE, NumelIn=4096, NumelOut=4096, Timeout(ms)=600000) ran for 600045 milliseconds before timing out.
[rank2]:[E112 15:47:18.729000049 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1119, last completed work: 1084
[rank2]:[E112 15:47:18.729013681 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E112 15:47:18.763162126 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=ALLREDUCE, NumelIn=4096, NumelOut=4096, Timeout(ms)=600000) ran for 600080 milliseconds before timing out.
[rank3]:[E112 15:47:18.763461617 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 3]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1119, last completed work: 1084
[rank3]:[E112 15:47:18.763478129 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E112 15:47:19.142036236 ProcessGroupNCCL.cpp:629] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank0]:[E112 15:47:19.142320846 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 0]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1100, last completed work: 1084
[rank0]:[E112 15:47:19.142336846 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E112 15:47:19.144332829 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank1]:[E112 15:47:19.144484862 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 1085 PG status: last enqueued work: 1100, last completed work: 1084
[rank1]:[E112 15:47:19.144495230 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E112 15:47:20.734418267 ProcessGroupNCCL.cpp:681] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E112 15:47:20.734432987 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E112 15:47:20.734446107 ProcessGroupNCCL.cpp:695] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E112 15:47:20.734459099 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E112 15:47:20.735249281 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x40007770a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x40004180abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x40004180cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x40004180d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x400030c3d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40002e63875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40002e89ff2c in /lib64/libc.so.6)

[rank0]:[E112 15:47:20.735249057 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x400058c3a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x400022d3abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x400022d3cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x400022d3d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x40001216d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40000fb6875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40000fdcff2c in /lib64/libc.so.6)

terminate called after throwing an instance of 'terminate called after throwing an instance of 'c10::DistBackendErrorc10::DistBackendError'
'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x40007770a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x40004180abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x40004180cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x40004180d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x400030c3d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40002e63875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40002e89ff2c in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x40007770a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1465870 (0x4000417c5870 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x464 (0x40004180daf4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdd88c (0x400030c3d88c in /usr/lib64/libstdc++.so.6)
frame #4: <unknown function> + 0x875c (0x40002e63875c in /lib64/libpthread.so.0)
frame #5: <unknown function> + 0xdff2c (0x40002e89ff2c in /lib64/libc.so.6)

  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1085, OpType=_ALLGATHER_BASE, NumelIn=4194304, NumelOut=16777216, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x400058c3a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f0 (0x400022d3abe0 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x824 (0x400022d3cd24 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x114 (0x400022d3d7a4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdd88c (0x40001216d88c in /usr/lib64/libstdc++.so.6)
frame #5: <unknown function> + 0x875c (0x40000fb6875c in /lib64/libpthread.so.0)
frame #6: <unknown function> + 0xdff2c (0x40000fdcff2c in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x400058c3a9e4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1465870 (0x400022cf5870 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x464 (0x400022d3daf4 in /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdd88c (0x40001216d88c in /usr/lib64/libstdc++.so.6)
frame #4: <unknown function> + 0x875c (0x40000fb6875c in /lib64/libpthread.so.0)
frame #5: <unknown function> + 0xdff2c (0x40000fdcff2c in /lib64/libc.so.6)

W0112 15:47:25.476000 284287 /capstor/store/cscs/userlab/ch17/uv_cache_shared/archive-v0/HiFXmMqcadA7ZHtmfpaDI/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 284494 closing signal SIGTERM
W0112 15:47:25.484000 284287 /capstor/store/cscs/userlab/ch17/uv_cache_shared/archive-v0/HiFXmMqcadA7ZHtmfpaDI/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 284495 closing signal SIGTERM
E0112 15:47:25.490000 284287 /capstor/store/cscs/userlab/ch17/uv_cache_shared/archive-v0/HiFXmMqcadA7ZHtmfpaDI/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 284492) of binary: /users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/bin/python
Traceback (most recent call last):
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/sxhonneu/projects/clean-testbed/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
src/weathergen/run_train.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2026-01-12_15:47:25
  host      : nid005232
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 284493)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 284493
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-01-12_15:47:25
  host      : nid005232
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 284492)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 284492
=======================================================

@javak87
Contributor Author

javak87 commented Jan 13, 2026

So I ran FSDP2 with DINOv2 on develop and I get the same hanging behaviour (the error below in case anyone is wondering), so please proceed with merging this, though I would still advocate for it coming with a flag to toggle it (even if it just makes debugging later easier).

I added a memory_pinning flag to the config and to the training/validation loops. However, checking this flag on every batch is a bit ugly. We could move the check outside the loop, but we’d still have to iterate over all batches in the dataset, which is much slower.
I think in the future it might be better to remove this flag.

@sophie-xhonneux
Contributor

Alternatively, you can overwrite the pin_memory() function if the flag is off
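
For illustration, a minimal sketch of that idea (apply_memory_pinning_flag is a hypothetical helper; ModelBatch and the memory_pinning config entry appear in the diff further down):

def apply_memory_pinning_flag(batch_cls, memory_pinning: bool):
    # If the flag is off, replace pin_memory() with an identity method so the
    # training loop can call batch.pin_memory() unconditionally.
    if not memory_pinning:
        batch_cls.pin_memory = lambda self: self
    return batch_cls

# Hypothetical usage, applied once at startup instead of branching on every batch:
# apply_memory_pinning_flag(ModelBatch, cf.data_loading.memory_pinning)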

@javak87
Contributor Author

javak87 commented Jan 13, 2026

Alternatively, you can overwrite the pin_memory() function if the flag is off

Yep. For now, I don’t think it’s a big deal and it’s not affecting performance too much. We can merge this and maybe create another PR for it later.

@clessig
Collaborator

clessig commented Jan 14, 2026

Through the rebasing, one change was necessary, and memory_pinning should be a param in data_loading. I cannot commit to your branch, so here are the diffs. Can you please make sure it's all working? Then we can merge.

diff --git a/config/default_config.yml b/config/default_config.yml
index f8f377e6..4a871bb8 100644
--- a/config/default_config.yml
+++ b/config/default_config.yml
@@ -83,9 +83,6 @@ latent_noise_use_additive_noise: False
 latent_noise_deterministic_latents: True 
 
 
-# It’s possible that enabling memory_pinning with FSDP2 + DINOv2 can cause the job to hang and trigger a PyTorch timeout error.
-# If this happens, you can disable the flag, but performance will drop on GH200.
-memory_pinning: True
 freeze_modules: ""
 
 norm_type: "LayerNorm"
@@ -128,6 +125,11 @@ data_loading :
   rng_seed: ???
   repeat_data_in_mini_epoch : False
 
+  # pin host (CPU) memory for faster CPU-GPU transfer; it is possible that enabling memory_pinning with 
+  # FSDP2 + DINOv2 can cause the job to hang and trigger a PyTorch timeout error.
+  # If this happens, you can disable the flag, but performance will drop on GH200.
+  memory_pinning: True
+
 
 # config for training
 training_config:
diff --git a/config/config_physical_jepa.yml b/config/config_physical_jepa.yml
index 12eddf65..82ab0e0d 100644
--- a/config/config_physical_jepa.yml
+++ b/config/config_physical_jepa.yml
@@ -126,6 +126,11 @@ data_loading :
   num_workers: 12
   rng_seed: ???
 
+  # pin host (CPU) memory for faster CPU-GPU transfer; it is possible that enabling memory_pinning with 
+  # FSDP2 + DINOv2 can cause the job to hang and trigger a PyTorch timeout error.
+  # If this happens, you can disable the flag, but performance will drop on GH200.
+  memory_pinning: True
+
 
 # config for training
 training_config:
diff --git a/src/weathergen/datasets/batch.py b/src/weathergen/datasets/batch.py
index da797e00..e0dc59d6 100644
--- a/src/weathergen/datasets/batch.py
+++ b/src/weathergen/datasets/batch.py
@@ -175,6 +175,19 @@ class BatchSamples:
         """
         return self.device
 
+    def pin_memory(self):
+        """Pin all tensors in this batch to CPU pinned memory"""
+
+        # pin all samples
+        for sample in self.samples:
+            sample.pin_memory()
+
+        # pin source_tokens_lens
+        if isinstance(self.tokens_lens, torch.Tensor):
+            self.tokens_lens = self.tokens_lens.pin_memory()
+
+        return self
+
 
 class ModelBatch:
     """
@@ -208,17 +221,11 @@ class ModelBatch:
     def pin_memory(self):
         """Pin all tensors in this batch to CPU pinned memory"""
 
-        # Pin all source samples
-        for sample in self.source_samples:
-            sample.pin_memory()
-
-        # Pin all target samples
-        for sample in self.target_samples:
-            sample.pin_memory()
+        # pin source samples
+        self.source_samples.pin_memory()
 
-        # Pin source_tokens_lens
-        if isinstance(self.source_tokens_lens, torch.Tensor):
-            self.source_tokens_lens = self.source_tokens_lens.pin_memory()  # pylint: disable=attribute-defined-outside-init
+        # pin target samples
+        self.target_samples.pin_memory()
 
         return self
diff --git a/src/weathergen/train/trainer.py b/src/weathergen/train/trainer.py
index 1bee60e7..7cb480af 100644
--- a/src/weathergen/train/trainer.py
+++ b/src/weathergen/train/trainer.py
@@ -396,7 +396,7 @@ class Trainer(TrainerBase):
        # training loop
        self.t_start = time.time()
        for bidx, batch in enumerate(dataset_iter):
-            if cf.memory_pinning:
+            if cf.data_loading.get("memory_pinning", False):
                # pin memory for faster CPU-GPU transfer
                batch = batch.pin_memory()

@@ -514,7 +514,7 @@ class Trainer(TrainerBase):
            # print progress bar but only in interactive mode, i.e. when without ddp
            with tqdm.tqdm(total=mode_cfg.samples_per_mini_epoch, disable=self.cf.with_ddp) as pbar:
                for bidx, batch in enumerate(dataset_val_iter):
-                    if cf.memory_pinning:
+                    if cf.data_loading.get("memory_pinning", False):
                        # pin memory for faster CPU-GPU transfer
                        batch = batch.pin_memory()

@clessig clessig mentioned this pull request Jan 15, 2026
@clessig
Collaborator

clessig commented Jan 15, 2026

Created #1615 with the changes

@javak87
Contributor Author

javak87 commented Jan 16, 2026

Created #1615 with the changes

Thanks. I checked #1615 and updated this PR based on it.

@clessig
Collaborator

clessig commented Jan 19, 2026

Closed via #1615
