torch.Tensor.grad spike in compute time on Gaudi #1

daniel-de-leon-user293 · 2024-12-05T23:46:13Z

Summary

This PR demonstrates the spikes in compute time that occur when calling torch.Tensor.grad on Gaudi. The plot below shows these spikes over a full run of the benchmark:

This benchmark uses GradCAM to explain ResNet50 image classifications on the Intel Image Classification dataset. I've isolated the spike in compute time to the input_img.grad execution at guided_backprop.py:104. The screenshot below shows the output for three individual examples collected in the middle of the benchmark, with Gaudi metrics collected for each execution of input_img.grad. The first and third examples show regular compute times with no spike; the second shows a spike. Highlighted in blue are the graph compilation metrics, which show that it took 1.341 seconds to compile a graph during the gradient calculation. In the other two examples no graph was compiled, and the autograd time was only 0.0044 seconds.
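For reference, here is a minimal sketch of how per-call timings like these could be collected and the slow calls flagged. The helper names time_calls and find_spikes and the 10x-median threshold are illustrative assumptions, not code from this branch. On HPU the timed function would presumably also need to wait for the device to finish before the clock is read, which this sketch leaves to the caller.

```python
import statistics
import time


def time_calls(fn, n_calls):
    """Return wall-clock durations (seconds) for n_calls invocations of fn."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()  # assumption: fn blocks until the device work has finished
        durations.append(time.perf_counter() - start)
    return durations


def find_spikes(durations, factor=10.0):
    """Return indices of calls slower than factor times the median duration."""
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > factor * median]


# 0.0044 s is the autograd time reported for the two regular examples.
# 1.341 s is the reported graph compile time, used here only as a
# stand-in for the duration of the slow call.
reported = [0.0044, 1.341, 0.0044]
print(find_spikes(reported))  # prints [1]
```

A median-based threshold is used so that a handful of slow calls does not shift the baseline the way a mean would.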

What could be causing this sporadic graph recompilation, and is there a way to reduce or avoid it?

Steps to reproduce

Clone this branch.

git clone -b daniel/ht_core https://github.com/daniel-de-leon-user293/pytorch-grad-cam.git

Download the Intel Image Classification dataset from Kaggle (https://www.kaggle.com/datasets/puneet6060/intel-image-classification) and save it in the root of pytorch-grad-cam. Only seg_pred is used for this benchmark.
Run an interactive Gaudi container.

cd pytorch-grad-cam

docker run -d --rm -it --name gradcam-bm \
-v ${PWD}:/workdir \
--runtime=habana \
-e http_proxy=$http_proxy \
-e https_proxy=$https_proxy \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--net=host \
--ipc=host \
vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest

Exec into container.

docker exec -it gradcam-bm bash

Navigate to the repo and run the benchmark. cam_multiple_images.py is a modification of cam.py that makes multiple GradCAM calls on a Gaudi machine.

cd /workdir
pip install -e .
python cam_multiple_images.py --image-path intel_image_classification_dataset/seg_pred/seg_pred/ --device hpu

The examples in the first minute take longer than usual because of warmup, but spikes similar to those in the screenshot are already visible during that time.
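When summarizing a run, the warmup samples can be set aside before looking for spikes. A small sketch, assuming each measurement is stored as a hypothetical (seconds_since_start, duration) pair and that 60 seconds is a reasonable cutoff for the first minute. Neither the data layout nor the helper name drop_warmup comes from cam_multiple_images.py.

```python
def drop_warmup(samples, warmup_seconds=60.0):
    """Keep only (elapsed, duration) samples recorded after the warmup window."""
    return [(t, d) for t, d in samples if t >= warmup_seconds]


# Made-up measurements for illustration only, not benchmark results.
samples = [(5.0, 2.1), (30.0, 1.7), (61.0, 0.0044), (62.0, 1.3), (63.0, 0.0044)]
steady = drop_warmup(samples)
print(len(steady))  # prints 3
```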

Signed-off-by: Daniel Deleon <[email protected]>

daniel-de-leon-user293 added 12 commits October 25, 2024 16:30

add Gaudi optimization

6703c97

Signed-off-by: Daniel Deleon <[email protected]>

initial benchmarks

29637e9

add print

ed0c472

update device config

bf83a4b

change list to dict

b71e88b

nb update

94561ff

add arg

dbc08d3

new results

45f5297

gaudi metrics ouput

1aa814a

isolate further

9122596

variable name

e1c80b7

Merge branch 'master' into daniel/ht_core

b694739
