Hi,
I am running this on a dataset with volumes of shape ~(200, 160, 100); my entire training dataset fits into the VRAM of a single GPU. Without caching to the GPU and a regular DataLoader(num_workers=4), an epoch takes around 200s for me. With caching to the GPU and a ThreadDataLoader(num_workers=0), an epoch takes roughly 500s. My expectation was that with GPU caching there would be near-constant GPU usage and very little load on the CPU. Instead, I still see very long breaks between very short bursts of GPU usage, and I still see 900-1100% CPU usage.
The concrete code sections can be seen below. More or less, I follow the GPU-caching example in the fast_training_tutorial.ipynb:

import numpy as np
import torch

import monai
from monai.data import CacheDataset, ThreadDataLoader
from monai.losses import DiceCELoss
from monai.metrics import DiceMetric
from monai.transforms import (
    Compose,
    EnsureChannelFirstd,
    EnsureTyped,
    LoadImaged,
    Rand3DElasticd,
    RandAdjustContrastd,
    RandFlipd,
    RandGaussianNoised,
    ScaleIntensityd,
    SpatialPadd,
)

device = torch.device("cuda:0")
train_transforms_gpucached = Compose(
    [
        LoadImaged(keys=["img", "seg"], image_only=True),
        EnsureChannelFirstd(keys=["img", "seg"]),
        SpatialPadd(keys=["img", "seg"], spatial_size=[208, 160, 112]),
        ScaleIntensityd(keys="img"),
        # everything up to and including EnsureTyped is deterministic and gets
        # cached by CacheDataset; device=device keeps the cache on the GPU
        EnsureTyped(keys=["img", "seg"],
                    device=device,
                    track_meta=False),
        # the random transforms below run on the cached tensors every epoch
        RandAdjustContrastd(keys="img", prob=0.9, gamma=(0.3, 1.5)),
        RandGaussianNoised(keys="img", prob=0.5),
        RandFlipd(keys=["img", "seg"], prob=0.5, spatial_axis=[0]),
        Rand3DElasticd(
            keys=["img", "seg"],
            mode=("bilinear", "nearest"),
            prob=0.75,
            sigma_range=(5, 8),
            magnitude_range=(5, 100),
            translate_range=np.array([30, 20, 10]),
            rotate_range=(0.3, 0.3, 0.3),
            scale_range=(0.15, 0.15, 0.15),
            padding_mode="border",
        ),
    ]
)
# create a cached dataset
train_ds = CacheDataset(
    data=data_train,
    transform=train_transforms_gpucached,
    cache_rate=1.0,
    num_workers=4,
)
train_loader = ThreadDataLoader(
    train_ds,
    num_workers=0,
    batch_size=4,
    shuffle=True,
)
# create UNet, DiceCELoss, DiceMetric and Adam optimizer
model = monai.networks.nets.UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
    dropout=0.5,
    num_res_units=2,
).to(device)
loss_function = DiceCELoss(include_background=False)
optimizer = torch.optim.Adam(model.parameters(), 1e-4)
dice_metric = DiceMetric(include_background=False, reduction="mean")
# training loop
for epoch in range(n_epochs):
    # ...
    for batch_data in train_loader:
        inputs, labels = batch_data["img"].to(device), batch_data["seg"].to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimizer.step()
    # ...
Hi again, I looked into the source code of the Rand3DElasticd transform …
Overall a speedup of a factor of ~5.8x, and no more long GPU dead time, which is what I was looking for. One last note: I thought that specifying the … Bottom line: GPU transforms are easy to achieve and really fun! :) The final recipe:
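A minimal sketch of that recipe (the device argument on Rand3DElasticd is the key detail, explained in the full reply below; everything else, including the imports, device and data_train, is unchanged from the pipeline in the question):

# cache the deterministic transforms on the GPU and keep the random transforms there too
train_transforms_gpucached = Compose(
    [
        # ... LoadImaged / EnsureChannelFirstd / SpatialPadd / ScaleIntensityd as before ...
        EnsureTyped(keys=["img", "seg"], device=device, track_meta=False),
        # ... RandAdjustContrastd / RandGaussianNoised / RandFlipd as before ...
        Rand3DElasticd(
            keys=["img", "seg"],
            mode=("bilinear", "nearest"),
            prob=0.75,
            sigma_range=(5, 8),
            magnitude_range=(5, 100),
            translate_range=np.array([30, 20, 10]),
            rotate_range=(0.3, 0.3, 0.3),
            scale_range=(0.15, 0.15, 0.15),
            padding_mode="border",
            device=device,  # create the resampling grid on the GPU as well
        ),
    ]
)
train_ds = CacheDataset(data=data_train, transform=train_transforms_gpucached,
                        cache_rate=1.0, num_workers=4)
train_loader = ThreadDataLoader(train_ds, num_workers=0, batch_size=4, shuffle=True)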
Hi again,
so, I was able to solve this myself. I will document it here though, in case others come across this problem.
The "recipe" above is correct, but certain transforms need extra care. Simple transforms like ScaleIntensityd are no problem, as they operate on the tensor in place, but more complex transforms like Rand3DElasticd require a bit more attention. I looked into the source code of the Rand3DElasticd transform (which is the main difference to the fast_training_tutorial.ipynb), and saw that the resampling grid by default does get initialized on the CPU, even if the tensors are cached to the GPU beforehand via EnsureTyped(..., device=device). Reading the API more carefully, I could have g…
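The Rand3DElasticd API does expose a device argument, so the grid can be created on the GPU directly instead of on the CPU. A minimal sketch of that change (the variable name is just for illustration; the required ranges and device are the ones from the question above):

# By default the resampling grid is built on the CPU, which is presumably what
# kept the CPU busy and the GPU waiting; passing device moves that work to the GPU.
from monai.transforms import Rand3DElasticd

rand_elastic_gpu = Rand3DElasticd(  # illustrative name, not from the thread
    keys=["img", "seg"],
    sigma_range=(5, 8),
    magnitude_range=(5, 100),
    prob=0.75,
    device=device,  # the one extra argument compared to the original pipeline
    # ... mode / translate_range / rotate_range / scale_range / padding_mode as before ...
)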