Dataset/dataloader sharing (CPU or GPU) #6804
-
The reason to move the dataset to the device is to avoid the time-consuming
CPU-to-GPU copying overhead. I have an A6000 with 48 GB of memory, so
why not put the entire dataset on the GPU? This is what I always do, as long as
both the model and the data fit into GPU memory.
…On Tue, Feb 28, 2023, 5:45 PM Matthias Fey ***@***.***> wrote:
I see. Is there any reason you need to move the dataset to each device
beforehand? I would advise keeping it on CPU (in which case it should get
correctly shared), and then just move sampled mini-batches to the GPU.
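In plain PyTorch terms, this approach looks roughly like the sketch below (the sizes and tensors are illustrative, not the actual dataset from this thread). Once the tensors live on the GPU, the `DataLoader` only indexes them, so no host-to-device copy happens per batch; `num_workers` has to stay at 0, because CUDA tensors and forked loader workers don't mix:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = 'cuda:0'
# Entire (hypothetical) dataset lives in GPU memory from the start.
X = torch.rand(10_000, 10, device=device)
y = torch.randint(2, (10_000,), device=device)

# num_workers must stay 0: CUDA tensors cannot be indexed from forked workers.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True, num_workers=0)

for xb, yb in loader:
    # xb and yb are already on the GPU; no per-batch .to(device) copy is needed.
    pass
```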
-
In the code I provided above, the sharing is between _threads_, not processes. Am I right that threads support memory sharing, but processes don't? That was my idea for using threads instead of `torch.multiprocessing`.
… On 28 Feb 2023, at 19:58, Matthias Fey ***@***.***> wrote:
Ok, but in this case you can no longer share the data between processes AFAIK.
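For reference, processes can also share a tensor's memory, provided its storage is placed in shared memory first (for CUDA tensors, `torch.multiprocessing` uses CUDA IPC when tensors are sent through a queue). A minimal CPU-side sketch, separate from the code in this thread:

```python
import torch
import torch.multiprocessing as mp

def child(t):
    # This write is visible in the parent, because t's storage lives in a
    # shared-memory segment rather than in a private per-process copy.
    t[0] = 42.0

if __name__ == '__main__':
    t = torch.zeros(4)
    t.share_memory_()  # move the storage into shared memory
    p = mp.Process(target=child, args=(t,))
    p.start()
    p.join()
    print(t)  # tensor([42.,  0.,  0.,  0.])
```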
-
@rusty1s It seems that with the code below, the dataset tensors in each process point to the same location in GPU memory, but the models (each of which is created inside a separate process) also point to the same (!) address. How is that possible if they are created inside independent processes?

```python
import torch, time, sys, os, copy

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
sys.path.append('../')

import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset
from termcolor import cprint

queue = mp.Queue()


# Define your model
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(10, 2)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x


# Define a function to train a single copy of the model
def train_model(rank, queue, DEVICE):
    # Set the random seed for reproducibility
    torch.manual_seed(rank)
    X, y = queue.get()
    cprint(f'Rank: {rank}, X data_ptr: {X.data_ptr()}', color='yellow')
    # Load your dataset
    dataset = TensorDataset(X, y)
    # Set the device to the current process's device
    model = MyModel().to(DEVICE)
    cprint(f'Rank: {rank}, model data_ptr: {list(model.parameters())[0].data_ptr()}', color='blue')
    # Create a DataLoader for your dataset
    dataloader = DataLoader(dataset, batch_size=32, shuffle=False)
    # Define the loss function and optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Train the model
    for epoch in range(100):
        for i, (inputs, labels) in enumerate(dataloader):
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            if (i + 1) % 10 == 0:
                print(
                    f"Process {rank} Epoch [{epoch + 1}/{100}], Step [{i + 1}/{len(dataloader)}], Loss: {loss.item():.4f}"
                )
    cprint(f'{rank} finished!', color='yellow')


# Spawn a separate process for each copy of the model
# mp.set_start_method('spawn')  # must be not fork, but spawn
NUM_MODEL_COPIES = 10
DEVICE = 'cuda:0'
processes = []
for rank in range(NUM_MODEL_COPIES):
    process = mp.Process(target=train_model, args=(rank, queue, DEVICE))
    process.start()
    processes.append(process)
time.sleep(2)

X = torch.rand(size=(10000, 10)).to(DEVICE)
y = torch.randint(2, size=(10000,)).to(DEVICE)
for rank in range(NUM_MODEL_COPIES):
    queue.put((X, y))

# Wait for all processes to finish
for process in processes:
    process.join()
```
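A likely explanation (my reading, not stated explicitly in this thread): `data_ptr()` is a virtual address that is only meaningful inside the process that owns the CUDA context, and processes that allocate in the same order often get handed the same first addresses. Identical pointer values across processes therefore do not imply shared memory; only tensors exported via CUDA IPC (such as those sent through the queue) actually alias the same device memory. A small sketch that tends to reproduce the effect, assuming a fresh model per process:

```python
import torch
import torch.multiprocessing as mp

def report(rank):
    # A fresh allocation in a fresh CUDA context: the virtual address is
    # per-process, so matching values across ranks do not mean sharing.
    w = torch.nn.Linear(10, 10).to('cuda:0').weight
    print(rank, hex(w.data_ptr()))

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    procs = [mp.Process(target=report, args=(r,)) for r in range(3)]
    for p in procs: p.start()
    for p in procs: p.join()
```

The printed addresses will often coincide even though each process owns a distinct piece of device memory; writing through one of them (as the follow-up below does with the weight tensor) confirms the other processes never see the change.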
-
I figured it out (almost everything).
One remaining concern is that a small model inside each process allocates 1 GB of GPU memory. I thought that I would be able to train hundreds of small models on one GPU in parallel, but now that seems impossible (or is it?). Below is the complete code snippet to reproduce:

```python
import torch, time, sys, os, copy

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
sys.path.append('../')

import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset
from termcolor import cprint

# Spawn a separate process for each copy of the model
# mp.set_start_method('spawn')  # must be not fork, but spawn
queue = mp.Queue()


# Define your model
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(10, 2)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x


# Define a function to train a single copy of the model
def train_model(rank, queue, DEVICE):
    # Set the random seed for reproducibility
    torch.manual_seed(rank)
    X, y, bias = queue.get()
    cprint(f'Rank: {rank}, X data_ptr: {X.data_ptr()}', color='yellow')
    # Load your dataset
    dataset = TensorDataset(X, y)
    # Set the device to the current process's device
    with torch.no_grad():
        model = MyModel().to(DEVICE)
        model.fc1.bias = torch.nn.Parameter(bias)
        if rank == 0:
            # Changing a weight in one process doesn't affect the weights in the
            # models of other processes, because the weight tensors are not shared
            model.fc1.weight[0][0] = -33.0
            # But changing the bias (which is a shared tensor) should affect
            # the biases in the other processes
            model.fc1.bias *= 4
            cprint(f'RANK: {rank} | {list(model.parameters())[0][0,0]}', color='magenta')
        if rank == 8:
            cprint(f'RANK: {rank} | {list(model.parameters())[0][0,0]}', color='red')
            cprint(f'RANK: {rank} | BIAS: {model.fc1.bias}', color='red')
    ptr = model.fc1.weight[0][0].storage().data_ptr()
    cprint(f'Rank: {rank}, model data_ptr: {ptr}', color='blue')
    # Create a DataLoader for your dataset
    dataloader = DataLoader(dataset, batch_size=32, shuffle=False)
    # Define the loss function and optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Train the model
    for epoch in range(100):
        for i, (inputs, labels) in enumerate(dataloader):
            if rank == 0:
                cprint(f'RANK: {rank} | {list(model.parameters())[0][0,0]}', color='magenta')
                cprint(f'RANK: {rank} | BIAS: {model.fc1.bias}', color='magenta')
            if rank == 8:
                cprint(f'RANK: {rank} | {list(model.parameters())[0][0,0]}', color='red')
                cprint(f'RANK: {rank} | BIAS: {model.fc1.bias}', color='red')
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            # optimizer.step()
            if (i + 1) % 10 == 0:
                print(
                    f"Process {rank} Epoch [{epoch + 1}/{100}], Step [{i + 1}/{len(dataloader)}], Loss: {loss.item():.4f}"
                )
    cprint(f'{rank} finished!', color='yellow')


NUM_MODEL_COPIES = 10
DEVICE = 'cuda:0'
processes = []
for rank in range(NUM_MODEL_COPIES):
    process = mp.Process(target=train_model, args=(rank, queue, DEVICE))
    process.start()
    processes.append(process)
time.sleep(2)

X = torch.rand(size=(10000, 10)).to(DEVICE)
y = torch.randint(2, size=(10000,)).to(DEVICE)
shared_bias = torch.ones(size=(10,), device=DEVICE)
for rank in range(NUM_MODEL_COPIES):
    queue.put((X, y, shared_bias))

# Wait for all processes to finish
for process in processes:
    process.join()
```
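On the ~1 GB per process: most of that is very likely the fixed per-process CUDA context (plus cuDNN/cuBLAS handles and the caching allocator's reserve), not the tiny model itself, which is why it doesn't shrink with model size. A rough way to see the split, as a sketch rather than something from this thread:

```python
import torch

device = 'cuda:0'
model = torch.nn.Linear(10, 10).to(device)  # tiny model, a few hundred bytes of parameters

# Memory actually held by tensors vs. memory cached by PyTorch's allocator.
print('allocated:', torch.cuda.memory_allocated(device), 'bytes')
print('reserved: ', torch.cuda.memory_reserved(device), 'bytes')
# The remaining usage shown by nvidia-smi is the per-process CUDA context,
# which is paid once per process regardless of how small the model is.
```

Because the context is per process, running many small models as threads (or batching them) inside a single process avoids paying that cost repeatedly.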
-
I have a small GNN (~2K parameters) and a rather large (~1.5 GB) dataset. I want to train multiple instances of my model in parallel with different HPs (and initializations). I could do it with an MLP using `torch.multiprocessing`, but it didn't work with a PyG model and a PyG dataset/dataloader object. Even if `torch.multiprocessing` did work, I suspect the dataset would be copied for each process (and I want each process to use one single copy of the dataset). Any suggestions?
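A minimal sketch of the pattern suggested in the replies above: keep one shared CPU copy of the data, spawn one process per hyperparameter setting, and move only mini-batches to the GPU. The model, tensor shapes, and learning rates below are placeholders, not the actual GNN or dataset from this thread:

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset


# Placeholder model; the real setup would use a PyG model and dataset.
class TinyNet(torch.nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(10, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, 2))

    def forward(self, x):
        return self.net(x)


def train(rank, X, y, lr, device='cuda:0'):
    # X and y live in shared CPU memory: every worker reads the same storage.
    loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)
    model = TinyNet(hidden=16).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)  # only the mini-batch is copied to the GPU
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
    print(f'worker {rank} (lr={lr}) done, last loss {loss.item():.4f}')


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    X = torch.rand(100_000, 10).share_memory_()    # one CPU copy, shared by all workers
    y = torch.randint(2, (100_000,)).share_memory_()
    lrs = [1e-1, 1e-2, 1e-3]                       # one hyperparameter setting per process
    procs = [mp.Process(target=train, args=(r, X, y, lr)) for r, lr in enumerate(lrs)]
    for p in procs: p.start()
    for p in procs: p.join()
```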