Hi,
I am running this on a dataset with volumes of shape ~(200, 160, 100); my entire training dataset fits into the VRAM of a single GPU. Without caching to the GPU and a regular DataLoader(num_workers=4), an epoch takes around 200s for me. With caching to the GPU and a ThreadDataLoader(num_workers=0), an epoch takes roughly 500s. My expectation was that with GPU caching there would be near-constant GPU usage and very little load on the CPU. Instead, I still see very long breaks between very short bursts of GPU usage, and I still see 900-1100% CPU usage.
The concrete code sections can be seen below. More or less, I follow the GPU-caching example in the fast_training_tutorial.ipynb:

import numpy as np
import torch

import monai
from monai.data import CacheDataset, ThreadDataLoader
from monai.losses import DiceCELoss
from monai.metrics import DiceMetric
from monai.transforms import (
    Compose,
    EnsureChannelFirstd,
    EnsureTyped,
    LoadImaged,
    Rand3DElasticd,
    RandAdjustContrastd,
    RandFlipd,
    RandGaussianNoised,
    ScaleIntensityd,
    SpatialPadd,
)

device = torch.device("cuda:0")
train_transforms_gpucached = Compose(
    [
        LoadImaged(keys=["img", "seg"], image_only=True),
        EnsureChannelFirstd(keys=["img", "seg"]),
        SpatialPadd(keys=["img", "seg"], spatial_size=[208, 160, 112]),
        ScaleIntensityd(keys="img"),
        # everything up to and including EnsureTyped is deterministic and gets
        # cached by CacheDataset; device=device keeps the cache on the GPU
        EnsureTyped(keys=["img", "seg"],
                    device=device,
                    track_meta=False),
        # the random transforms below run on the cached tensors every epoch
        RandAdjustContrastd(keys="img", prob=0.9, gamma=(0.3, 1.5)),
        RandGaussianNoised(keys="img", prob=0.5),
        RandFlipd(keys=["img", "seg"], prob=0.5, spatial_axis=[0]),
        Rand3DElasticd(
            keys=["img", "seg"],
            mode=("bilinear", "nearest"),
            prob=0.75,
            sigma_range=(5, 8),
            magnitude_range=(5, 100),
            translate_range=np.array([30, 20, 10]),
            rotate_range=(0.3, 0.3, 0.3),
            scale_range=(0.15, 0.15, 0.15),
            padding_mode="border",
        ),
    ]
)
# create a cached dataset
train_ds = CacheDataset(
    data=data_train,
    transform=train_transforms_gpucached,
    cache_rate=1.0,
    num_workers=4,
)
train_loader = ThreadDataLoader(
    train_ds,
    num_workers=0,
    batch_size=4,
    shuffle=True,
)
# create UNet, DiceCELoss, DiceMetric and Adam optimizer
model = monai.networks.nets.UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
    dropout=0.5,
    num_res_units=2,
).to(device)
loss_function = DiceCELoss(include_background=False)
optimizer = torch.optim.Adam(model.parameters(), 1e-4)
dice_metric = DiceMetric(include_background=False, reduction="mean")
# training loop
for epoch in range(n_epochs):
    # ...
    for batch_data in train_loader:
        inputs, labels = batch_data["img"].to(device), batch_data["seg"].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()
    # ...
Hi again, I looked into the source code of the Rand3DElasticd transform …
Overall a speedup of a factor of ~5.8x, and no more long GPU dead time, which is what I was looking for. One last note: I thought that specifying the … Bottom line: GPU transforms are easy to achieve and really fun! :) The final recipe:
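A minimal sketch of that recipe (the device argument on Rand3DElasticd is the key detail, explained in the full reply below; everything else, including the imports, device and data_train, is unchanged from the pipeline in the question):

# cache the deterministic transforms on the GPU and keep the random transforms there too
train_transforms_gpucached = Compose(
    [
        # ... LoadImaged / EnsureChannelFirstd / SpatialPadd / ScaleIntensityd as before ...
        EnsureTyped(keys=["img", "seg"], device=device, track_meta=False),
        # ... RandAdjustContrastd / RandGaussianNoised / RandFlipd as before ...
        Rand3DElasticd(
            keys=["img", "seg"],
            mode=("bilinear", "nearest"),
            prob=0.75,
            sigma_range=(5, 8),
            magnitude_range=(5, 100),
            translate_range=np.array([30, 20, 10]),
            rotate_range=(0.3, 0.3, 0.3),
            scale_range=(0.15, 0.15, 0.15),
            padding_mode="border",
            device=device,  # create the resampling grid on the GPU as well
        ),
    ]
)
train_ds = CacheDataset(data=data_train, transform=train_transforms_gpucached,
                        cache_rate=1.0, num_workers=4)
train_loader = ThreadDataLoader(train_ds, num_workers=0, batch_size=4, shuffle=True)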
Hi again,
so, I was able to solve this myself. I will document it here though, in case others come across this problem.
The "recipe" above is correct, but certain transforms need extra care. Simple transforms like ScaleIntensityd are no problem, as they operate on the tensor in place, but more complex transforms like Rand3DElasticd require a bit more attention. I looked into the source code of the Rand3DElasticd transform (which is the main difference to the fast_training_tutorial.ipynb), and saw that the resampling grid by default does get initialized on the CPU, even if the tensors are cached to the GPU beforehand via EnsureTyped(..., device=device). Reading the API more carefully, I could have g…
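The Rand3DElasticd API does expose a device argument, so the grid can be created on the GPU directly instead of on the CPU. A minimal sketch of that change (the variable name is just for illustration; the required ranges and device are the ones from the question above):

# By default the resampling grid is built on the CPU, which is presumably what
# kept the CPU busy and the GPU waiting; passing device moves that work to the GPU.
from monai.transforms import Rand3DElasticd

rand_elastic_gpu = Rand3DElasticd(  # illustrative name, not from the thread
    keys=["img", "seg"],
    sigma_range=(5, 8),
    magnitude_range=(5, 100),
    prob=0.75,
    device=device,  # the one extra argument compared to the original pipeline
    # ... mode / translate_range / rotate_range / scale_range / padding_mode as before ...
)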