
Profiler unable to record CUDA activities in Tensorboard. #3028

@GKG1312

Description


I am trying to run the PyTorch Profiler with TensorBoard tutorial from pytorch/tutorials on Windows 11 in a conda environment, with the following versions:

python=3.12.4
pytorch=2.4.0
torch-tb-profiler=0.4.3
cuda-version=12.5

The code executes with only a single warning message: [W904 11:50:36.000000000 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event. However, TensorBoard shows only CPU as the device and the DataLoader time as 0.
[Screenshot (2024-09-04): TensorBoard Overview page showing only the CPU device]

I am not able to figure out whether this is a bug or caused by a version mismatch.
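
To help isolate whether the problem is in trace collection or only in the TensorBoard plugin view, a minimal check along these lines (a sketch, assuming the same environment and a visible GPU) can confirm whether the profiler records any CUDA kernels at all:

import torch
from torch.profiler import profile, ProfilerActivity

# Confirm that PyTorch was built with CUDA support and can see the GPU.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

x = torch.randn(1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    (x @ x).sum().item()

# If CUDA activities are recorded, GPU kernel rows appear in this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

If this table shows no CUDA time either, the CUDA activities are not being collected on this setup at all, independent of torch-tb-profiler.
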
Simplified code to replicate the error:

import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

######################################################################
# Then prepare the input data. For this tutorial, we use the CIFAR10 dataset.
# Transform it to the desired format and use ``DataLoader`` to load each batch.

transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

######################################################################
# Next, create the ResNet model, loss function, and optimizer objects.
# To run on the GPU, move the model and loss to the GPU device.

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()


######################################################################
# Define the training step for each batch of input data.

def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


with torch.profiler.profile(
        activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # Need to call this at each step to notify the profiler of the step boundary.
        if step >= 1 + 1 + 3:
            break
        train(batch_data)
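
For reference, the trace files written by tensorboard_trace_handler can also be inspected directly. The following is a sketch (assuming the ./log/resnet18 directory above and the usual *.pt.trace.json file naming) that counts GPU-side events in the generated traces, to check whether they contain any CUDA activity that the TensorBoard view is simply not displaying:

import glob
import json

for path in glob.glob('./log/resnet18/*.pt.trace.json'):
    with open(path) as f:
        trace = json.load(f)
    events = trace.get('traceEvents', [])
    # Kineto typically tags GPU work with categories such as 'kernel' and 'gpu_memcpy'.
    gpu_events = [e for e in events
                  if str(e.get('cat', '')).lower() in ('kernel', 'gpu_memcpy', 'gpu_memset')]
    print(path, '-', len(events), 'events,', len(gpu_events), 'GPU events')
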

cc @aaronenyeshi @chaekit @jcarreiro
