Training is slow on GPU #14917
Unanswered
mtomic123 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
I built a Temporal Fusion Transformer model with PyTorch Forecasting, following the guide here:
https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/stallion.html

I used my own data, a time series with 62k samples. I set training to run on the GPU by passing `accelerator="gpu"` to `pl.Trainer`. The issue is that training is quite slow considering the dataset is not that large. I first ran the training on my laptop GPU (a GTX 1650 Ti), then on an A100 40GB, and got only a 2x uplift. An A100 is many times faster than a laptop GPU, so the uplift should be far bigger than 2x. I have the NVIDIA drivers, cuDNN, and everything else installed (the A100 is on Google Cloud, which comes with all of that preinstalled). GPU utilisation is low (10-15%), yet I can see that the data has been loaded into GPU memory.
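For reference, the trainer is set up roughly like this. It's a minimal sketch: `max_epochs` and `gradient_clip_val` below are placeholders, not my exact values.

```python
import pytorch_lightning as pl

# Minimal sketch of the trainer setup; max_epochs and gradient_clip_val
# are placeholder values, not the exact ones used.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=30,
    gradient_clip_val=0.1,
)
```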
Things I tried:
- Setting `num_workers` to 8 in the dataloader (shown in the Dataloaders snippet below)
Is there some other bottleneck in my model? Below are the results from the profiler and snippets of my model configuration.

Dataloaders
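Roughly as follows; this is a sketch following the stallion tutorial's pattern, with a toy frame standing in for my real data and placeholder values for `batch_size` and the window lengths:

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# Toy frame standing in for the real 62k-sample series; columns are illustrative.
data = pd.DataFrame(
    {
        "series": ["a"] * 100,
        "time_idx": list(range(100)),
        "value": [float(i) for i in range(100)],
    }
)

# Placeholder window lengths, not the exact ones used.
training = TimeSeriesDataSet(
    data[data.time_idx <= 80],
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    max_encoder_length=24,
    max_prediction_length=6,
    time_varying_unknown_reals=["value"],
)
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True)

batch_size = 128  # placeholder
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=8)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=8)
```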
Model Configuration
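Again a sketch, continuing from the snippets above and following the stallion tutorial; every hyperparameter shown is a placeholder rather than my exact setting:

```python
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# Continues from the Dataloaders snippet; all hyperparameters are placeholders.
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    loss=QuantileLoss(),
    reduce_on_plateau_patience=4,
)

# `trainer` is the pl.Trainer shown earlier.
trainer.fit(tft, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)
```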
Profiler (Only the most intensive processes)
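The timings come from Lightning's built-in profiling, switched on through the Trainer; the sketch below uses the simple profiler, which reports aggregate time per hook:

```python
import pytorch_lightning as pl

# "simple" aggregates time spent in each training hook;
# "advanced" would give per-function cProfile output instead.
trainer = pl.Trainer(accelerator="gpu", devices=1, profiler="simple")
```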
