Training fails with tensor size mismatch (5 vs 10 channels) after migrating project from 1.4.x to 1.5 #2496

@hlanovoi

Bug description

Training fails with RuntimeError: The size of tensor a (5) must match the size of tensor b (10) on a project migrated from SLEAP 1.4.x to 1.5. The model is correctly configured for 5 channels (matching the 5-node skeleton), but the training data loader reports 10 channels, causing a tensor broadcast error.

Expected behaviour

Training should succeed with a 5-channel model matching the 5-node skeleton in the project.

Actual behaviour

Training fails immediately with:

RuntimeError: The size of tensor a (5) must match the size of tensor b (10) at non-singleton dimension 1

The error occurs in sleap_nn/training/lightning_modules.py, line 537 during loss computation. The model head correctly outputs 5 channels, but the target tensor from the data pipeline has 10 channels.
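For reference, the shape check behind this error can be reproduced without PyTorch. The following is a minimal sketch of standard broadcasting rules (shapes are aligned from the trailing dimension, and two sizes are compatible only if they are equal or one of them is 1). The helper name `broadcast_conflicts` is hypothetical and is not part of PyTorch or sleap-nn.

```python
def broadcast_conflicts(shape_a, shape_b):
    """Return the indices of dimensions that cannot be broadcast together."""
    # Broadcasting aligns shapes from the trailing dimension, so the shorter
    # shape is left-padded with 1s before comparing.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    # A dimension conflicts when the sizes differ and neither of them is 1.
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y and 1 not in (x, y)]


# Shapes taken from the warning in the logs: model output vs. target.
pred_shape = (4, 5, 240, 320)
target_shape = (4, 10, 240, 320)
print(broadcast_conflicts(pred_shape, target_shape))  # [1]
```

Dimension 1 is the channel axis, which matches the "non-singleton dimension 1" in the error message: 5 and 10 differ and neither is 1, so the two tensors cannot be broadcast.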

Your personal set up

  • OS: macOS (Apple Silicon - M-series chip)
  • Version(s): SLEAP 1.5.x, Python 3.13
  • SLEAP installation method: uv tool install "sleap[nn]"

/Users/HL801/.local/share/uv/tools/sleap/lib/python3.13/site-packages/torch/nn/modules/loss.py:616: UserWarning: Using a target size (torch.Size([4, 10, 240, 320])) that is different to the input size (torch.Size([4, 5, 240, 320])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)

Traceback (most recent call last):
  File "/Users/HL801/.local/share/uv/tools/sleap/lib/python3.13/site-packages/sleap_nn/training/lightning_modules.py", line 537, in training_step
    loss = nn.MSELoss()(y_preds, y)
  File "/Users/HL801/.local/share/uv/tools/sleap/lib/python3.13/site-packages/torch/nn/modules/loss.py", line 616, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/Users/HL801/.local/share/uv/tools/sleap/lib/python3.13/site-packages/torch/nn/functional.py", line 3868, in mse_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "/Users/HL801/.local/share/uv/tools/sleap/lib/python3.13/site-packages/torch/functional.py", line 77, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (5) must match the size of tensor b (10) at non-singleton dimension 1

**Training output shows:**
- Model head: `Conv2d(32, 5, kernel_size=(1, 1))` 
- Target tensor: `torch.Size([4, 10, 240, 320])` 

## Screenshots
N/A

## How to reproduce
1. Open a SLEAP project that was created in v1.4.x (now opened in v1.5)
2. Project has a single skeleton with 5 nodes
3. Attempt to train a single-instance model from the SLEAP GUI (Predict → Run Training → Single Instance)
4. Training fails immediately with tensor size mismatch error

**Additional diagnostic steps I've run through:**
- Verified only 1 skeleton with 5 nodes exists using `sleap_io.load_slp()` 
- Exported clean package with only hand-labeled frames (Predict → Export Labels Package → "Labeled frames")
- Created new project from clean package and attempted training
- Error persists even with fresh export
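
A sketch of the kind of check used for the skeleton verification above, extended to also count nodes per labeled instance. It assumes the object returned by `sleap_io.load_slp()` exposes `skeletons`, `labeled_frames`, `instances`, and `skeleton.nodes` as attributes. Those attribute names are assumptions about the sleap-io API, and `summarize_node_counts` is a hypothetical helper, not an existing function.

```python
from collections import Counter


def summarize_node_counts(labels):
    """Count nodes per skeleton and per labeled instance in a labels object."""
    # One entry per skeleton stored in the project.
    skeleton_sizes = [len(skel.nodes) for skel in labels.skeletons]
    # Tally how many instances reference a skeleton of each size.
    instance_sizes = Counter(
        len(inst.skeleton.nodes)
        for lf in labels.labeled_frames
        for inst in lf.instances
    )
    return skeleton_sizes, dict(instance_sizes)


# Hypothetical usage (requires sleap-io; the path is a placeholder):
# import sleap_io
# labels = sleap_io.load_slp("project.slp")
# print(summarize_node_counts(labels))
```

If any instance references a skeleton whose size differs from the 5 nodes listed under `skeletons`, that would be worth including in this report.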
