How to train in a multi-node environment? #18423
Unanswered
rahaazad2 asked this question in DDP / multi-GPU / multi-node
I want to train a PyTorch Lightning model on a cluster of 6 nodes (1 GPU per node). Here's the training code:
```python
import argparse
import json
import os

import pytorch_lightning as pl
import src.data_loaders as module_data
import torch
from pytorch_lightning.callbacks import ModelCheckpoint
from src.utils import get_model_and_tokenizer
from torch.nn import functional as F
from torch.utils.data import DataLoader


class MyClassifier(pl.LightningModule):
    def __init__(self, config):
        super().__init__()
        self.save_hyperparameters()
        self.num_classes = config["arch"]["args"]["num_classes"]
        self.model_args = config["arch"]["args"]
        self.model, self.tokenizer = get_model_and_tokenizer(**self.model_args)
        self.bias_loss = False


def cli_main():
    pl.seed_everything(1234)


if __name__ == "__main__":
    cli_main()
```
It works fine on a single node with 4 GPUs, but in the multi-node setting the behavior is no different from a single-node run. Specifically, the logs of the nodes show that RANK is set correctly (i.e., RANK 0 for the master and RANK 1 to 5 for the workers). However, there are two issues:
I run the code using this command:

```
python train.py --config PATH
```
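For context: with a bare `python train.py` launch on each node, Lightning learns the cluster topology only from the Trainer arguments and a few environment variables. Below is a minimal sketch of a 6-node, 1-GPU-per-node setup, assuming Lightning 2.x and this launch style; it is illustrative, not the asker's actual script, which does not show its Trainer construction:

```python
import pytorch_lightning as pl


def cli_main():
    pl.seed_everything(1234)
    # Tell the Trainer about the cluster explicitly; without num_nodes,
    # each node trains as if it were alone.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,      # GPUs per node
        num_nodes=6,    # total number of nodes in the cluster
        strategy="ddp",
    )
    # trainer.fit(model, train_dataloader)  # model/data elided, as above


if __name__ == "__main__":
    # Each node must also export rendezvous variables before launching, e.g.:
    #   MASTER_ADDR=<rank-0 host>, MASTER_PORT=<free port>, NODE_RANK=<0..5>
    cli_main()
```

With that wiring in place, the same `python train.py --config PATH` command is run once per node, and Lightning joins the processes into a single DDP job.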
Replies: 2 comments

- hi @rahaazad2 👋! are you changing the …

- I had a similar problem, which was caused by an interaction between argparse and PL's multi-GPU instantiation. To solve it, I adjusted the parsing: … (see the sketch below)
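The comment above is truncated, but a common adjustment of this kind is to let the parser ignore argv entries it does not recognize, since a DDP launcher can re-invoke the script with extra arguments that a strict parser rejects. A minimal sketch of that pattern (an assumption about the fix, not the commenter's actual code):

```python
import argparse


def parse_config():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True)
    # parse_known_args() returns (known, unknown) instead of erroring on
    # unrecognized arguments, so launcher-injected argv does not crash parsing.
    args, _unknown = parser.parse_known_args()
    return args
```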