Enable setting persistent_workers through CLI #126
Merged
This PR addresses a memory accumulation issue observed during training with the ResGatedDynamicGNI model when `persistent_workers=True` is enabled in the `DataLoader`.

⚙️ Existing Problem
The ResGatedDynamicGNI model performs per-forward random feature initialization for both node and edge features (`new_x`, `new_edge_attr`) on the GPU. When combined with persistent DataLoader workers, these per-batch random allocations are not released properly.
While setting `persistent_workers=True` can improve performance when input features remain constant throughout training (as noted in the Lightning documentation), it becomes problematic when input features are dynamically initialized in each forward pass. In such cases, the DataLoader workers retain these transient tensors in memory, expecting reuse across epochs. Since they are never reused, this leads to progressive GPU memory accumulation and can eventually cause out-of-memory (OOM) errors. See the related issue and logs here.
Refer: https://lightning.ai/docs/pytorch/stable/advanced/speed.html#persistent-workers
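For context, here is a minimal sketch of the pattern described above. The class and attribute names (`DynamicInitModel`, `hidden_dim`, the toy dataset) are illustrative stand-ins, not the actual ResGatedDynamicGNI source:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

# Toy dataset: a handful of small random graphs (illustrative only).
dataset = [
    Data(x=torch.randn(10, 8), edge_index=torch.randint(0, 10, (2, 20)))
    for _ in range(100)
]

class DynamicInitModel(torch.nn.Module):
    """Illustrative stand-in for a model that re-initializes features every forward."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.lin = torch.nn.Linear(hidden_dim, hidden_dim)

    def forward(self, batch):
        # Fresh random node and edge features are allocated on every call
        # (mirroring new_x / new_edge_attr above); they are never reused
        # across batches or epochs.
        new_x = torch.randn(batch.num_nodes, self.hidden_dim, device=batch.x.device)
        new_edge_attr = torch.randn(batch.num_edges, self.hidden_dim, device=batch.x.device)
        return self.lin(new_x)

# persistent_workers=True keeps the worker subprocesses alive between epochs,
# which is the setting under which the memory growth was observed.
loader = DataLoader(dataset, batch_size=32, num_workers=4, persistent_workers=True)
```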
🧠 Root Cause
`persistent_workers=True` keeps worker subprocesses alive between epochs, retaining CUDA contexts and cached memory allocations for the features that the ResGatedDynamicGNI model reinitializes on each forward pass.

🔧 Fix Implemented
This PR enables setting `persistent_workers=False` for all `DataLoader`s through the CLI for ResGatedDynamicGNI model training. This ensures that:
- The default remains `True` as before, ensuring no disruption to other existing pipelines.
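A rough sketch of how such a flag could be wired through to the loaders, using plain argparse for illustration; the actual CLI plumbing, flag name, and defaults in this PR may differ:

```python
import argparse
from torch.utils.data import DataLoader, Dataset

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Hypothetical flag name; defaults to True so existing pipelines are unaffected.
    parser.add_argument(
        "--persistent-workers",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Keep DataLoader workers alive between epochs; pass "
             "--no-persistent-workers for models that re-initialize features "
             "on every forward pass.",
    )
    return parser.parse_args()

def build_loader(dataset: Dataset, args: argparse.Namespace) -> DataLoader:
    # The same value is forwarded to every DataLoader used during training.
    return DataLoader(
        dataset,
        batch_size=32,
        num_workers=4,
        persistent_workers=args.persistent_workers,
    )
```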