how to load dataset only once on the same machine? #8112
-
My dataset is large, with a total CPU memory usage of 20 GB. I train on 2 nodes with 8 GPUs, and I use Slurm to launch the job. But I found that each process consumes 20 GB of memory, which is equivalent to 80 GB per node. That's not what I want: I want a node to consume only 20 GB in total. Is there a way to do that?

```python
import numpy as np
import torch
import pytorch_lightning as pl
from pytorch_lightning import LightningDataModule
from pytorch_lightning.loggers import TensorBoardLogger
from torch.utils.data import DataLoader, TensorDataset, random_split


class DataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
        self.batch_size = 1
        # Each array has shape (7000, 1, 512, 512) and is loaded fully into RAM.
        self.CT_dataset = torch.from_numpy(np.load("./CT_dataset.npy")).float()
        self.MR_dataset = torch.from_numpy(np.load("./MR_dataset.npy")).float()
        self.train_dataset, self.test_dataset = random_split(
            TensorDataset(self.MR_dataset, self.CT_dataset),
            [len(self.CT_dataset) - 100, 100],
        )

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size)


model = CycleGAN()  # my LightningModule, defined elsewhere
ds = DataModule()
logger = TensorBoardLogger(save_dir="./run")
trainer = pl.Trainer(
    max_epochs=1,
    fast_dev_run=False,
    profiler="pytorch",
    overfit_batches=8,
    gpus=4,
    logger=logger,
    accelerator="ddp",
    num_nodes=2,
    auto_scale_batch_size="power",
    weights_summary="full",
)
trainer.fit(model, ds)
trainer.test(model, datamodule=ds)
```

My code will raise
-
Since your data is in one single binary file, it won't be possible to reduce the memory footprint this way. Each DDP process is independent from the others; there is no shared memory. You will have to save each dataset sample individually, so that each process can access only a subset of these samples through the dataloader and sampler.
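
A minimal sketch of that approach, assuming the array shapes from the question. The names `export_samples` and `PairedSliceDataset`, and the per-sample directory layout, are made up for illustration; they are not part of the original thread:

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset


def export_samples(npy_path, out_dir):
    """One-time conversion: write each sample of a big array to its own .npy file."""
    os.makedirs(out_dir, exist_ok=True)
    # Memory-map the source file so the conversion itself never loads all 20 GB.
    data = np.load(npy_path, mmap_mode="r")
    for i in range(len(data)):
        np.save(os.path.join(out_dir, f"{i}.npy"), data[i])


class PairedSliceDataset(Dataset):
    """Loads one (MR, CT) pair from disk per __getitem__ call, instead of
    keeping the full arrays resident in every process."""

    def __init__(self, mr_dir, ct_dir, length):
        self.mr_dir = mr_dir
        self.ct_dir = ct_dir
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        mr = torch.from_numpy(np.load(os.path.join(self.mr_dir, f"{idx}.npy"))).float()
        ct = torch.from_numpy(np.load(os.path.join(self.ct_dir, f"{idx}.npy"))).float()
        return mr, ct
```

You would run `export_samples` once per modality before training, then build the `train_dataset`/`test_dataset` in your `DataModule` from a `PairedSliceDataset` instead of the in-memory `TensorDataset`. With `accelerator='ddp'`, Lightning inserts a `DistributedSampler` automatically, so each process only ever reads its own shard of indices, and resident memory per process stays roughly at what the dataloader workers are currently prefetching rather than the full 20 GB.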