Script Killed with DDP + wandb? #11727

dmandair · 2022-02-03T20:20:35Z

dmandair
Feb 3, 2022

Hi all, appreciate the help on this! I've been having some issues with DDP in pytorch lightning while also using wandb. As a preface, I tested my script on a development node with a small batch size using only one GPU and it ran just fine (just slow) - the issues come up when I start to use multiple GPUs (4) using DDP with pytorch lightening. Unfortunately, I don't have much of an error to give - the script runs but ends quickly with the log file only reporting 'Killed barlow.py'. I thought initially this might be a memory issue and have since decreased my batch size significantly - from the gpu memory usage report, however, I don't think it's a memory issue. Is there something that pops at as wrong with how I'm initializing/using wandb along with pytorch lightening in these runs? Any thoughts on what exactly is going on? Also of note, when I log validation losses, I use the flag 'sync_dist' = True. A portion of my code is reproduced below. Happy to show any other parts of my code if helpful as well.

    # Setting default run config for wandb
    default_config = {'LR': 1e-4,
                'WD': 1e-6, 'Z_DIM': 8192, 'LAMBDA': 5e-3
                }
    #added start method 'thread'
    wandb.init(config=default_config, settings=wandb.Settings(start_method='thread'), mode="offline", project="Barlow")
    # Config parameters are automatically set by W&B sweep agent
    config = wandb.config
    
    
    def main(config):
        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        encoder = resnet50(zero_init_residual=True)
    
        # replace classification fc layer of Resnet to obtain representations from the backbone
        encoder.fc = nn.Identity()
        encoder_out_dim = 2048
        BATCH_SIZE = 256
        MAX_EPOCHS = 1000
    
        run_tags = [
            f'ENCODER OUT DIM: {encoder_out_dim}',
            f'Z DIM: {config.Z_DIM}']
    
        # ------------------------
        # 2 WANDB LOGGER
        # ------------------------
        wandb_logger = WandbLogger(tags=run_tags)
    
        model = BarlowTwins(
            encoder=encoder,
            encoder_out_dim=encoder_out_dim,
            learning_rate = config.LR,
            weight_decay = config.WD,
            num_training_samples=262144,
            batch_size=BATCH_SIZE,
            z_dim=config.Z_DIM,
            lambda_coeff = config.LAMBDA,
            max_epochs=MAX_EPOCHS
        )
    
        checkpoint_callback = ModelCheckpoint(dirpath='/wynton/protected/home/ichs/dmandair/BRCAness/datasets/train/pcam/', monitor = 'val_loss', every_n_epochs=50, save_top_k=1)
    
        trainer = pl.Trainer(
            max_epochs=MAX_EPOCHS,
            precision=16 if torch.cuda.device_count() > 0 else 32,
            callbacks=[checkpoint_callback],
            logger = wandb_logger,
            gpus=4,
            strategy='ddp'
        )
    
        trainer.fit(model)
    
    
    if __name__ == '__main__':
        # print(f'Starting a run with {config}')
        main(config)
        wandb.finish()
        #added wandb finish so job doesn't lag

Just so you have it, the debug.log file from wandb is here:

              2022-02-03 00:03:10,481 INFO    MainThread:35223 [wandb_setup.py:_flush():71] setting env: {}
              2022-02-03 00:03:10,542 INFO    MainThread:35223 [wandb_init.py:_log_setup():371] Logging user logs to /wynton/protected/home/ichs/dmandair/BRCA/wandb/offline-run-20220203_000310-eikdsoq4/logs/debug.log
              2022-02-03 00:03:10,543 INFO    MainThread:35223 [wandb_init.py:_log_setup():372] Logging internal logs to /wynton/protected/home/ichs/dmandair/BRCA/wandb/offline-run-20220203_000310-eikdsoq4/logs/debug-internal.log
              2022-02-03 00:03:10,543 INFO    MainThread:35223 [wandb_init.py:init():404] calling init triggers
              2022-02-03 00:03:10,544 INFO    MainThread:35223 [wandb_init.py:init():409] wandb.init called with sweep_config: {}
              config: {'LR': 0.0001, 'WD': 1e-06, 'Z_DIM': 8192, 'LAMBDA': 0.005}
              2022-02-03 00:03:10,544 INFO    MainThread:35223 [wandb_init.py:init():460] starting backend
              2022-02-03 00:03:10,544 INFO    MainThread:35223 [backend.py:_multiprocessing_setup():99] multiprocessing start_methods=fork,spawn,forkserver, using: fork
              2022-02-03 00:03:10,550 INFO    MainThread:35223 [backend.py:ensure_launched():216] starting backend process...
              2022-02-03 00:03:10,559 INFO    MainThread:35223 [backend.py:ensure_launched():221] started backend process with pid: 35314
              2022-02-03 00:03:10,561 INFO    MainThread:35223 [wandb_init.py:init():469] backend started and connected
              2022-02-03 00:03:10,562 INFO    MainThread:35314 [internal.py:wandb_internal():87] W&B internal server running at pid: 35314, started at: 2022-02-03 00:03:10.561082
              2022-02-03 00:03:10,568 INFO    MainThread:35223 [wandb_init.py:init():533] updated telemetry
              2022-02-03 00:03:10,569 INFO    MainThread:35223 [wandb_init.py:init():606] starting run threads in backend
              2022-02-03 00:03:10,570 INFO    WriterThread:35314 [datastore.py:open_for_write():77] open: /wynton/protected/home/ichs/dmandair/BRCA/wandb/offline-run-20220203_000310-eikdsoq4/run-eikdsoq4.wandb
              2022-02-03 00:03:15,574 INFO    MainThread:35223 [wandb_run.py:_console_start():1810] atexit reg
              2022-02-03 00:03:15,600 INFO    MainThread:35223 [wandb_run.py:_redirect():1684] redirect: SettingsConsole.REDIRECT
              2022-02-03 00:03:15,600 INFO    MainThread:35223 [wandb_run.py:_redirect():1689] Redirecting console.
              2022-02-03 00:03:15,602 INFO    MainThread:35223 [wandb_run.py:_redirect():1745] Redirects installed.
              2022-02-03 00:03:15,603 INFO    MainThread:35223 [wandb_init.py:init():633] run started, returning control to user process

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Script Killed with DDP + wandb? #11727

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Script Killed with DDP + wandb? #11727

Uh oh!

Uh oh!

dmandair Feb 3, 2022

Replies: 0 comments

dmandair
Feb 3, 2022