Crash during training of the stft-tcnn model for speech enhancement #37

I wanted to run the training script for the speech enhancement use case over an SSH connection, but I couldn't get it to start. The script crashes as soon as it starts the first loss calculation.

Information on the server's OS:

- Distributor ID:	Debian
- Description:	Debian GNU/Linux 13 (trixie)
- Release:	13
- Codename:	trixie

I am using uv to set up the Python environment. The Python version and the versions of the packages used are in the attached file.

[python_environment.txt](https://github.com/user-attachments/files/24257423/python_environment.txt)

I am using the default training config file (speech_enhancement/src/config_file_examples/training_config.yaml) with the Valentini dataset. The only change I made was setting the device to "cpu".

I get the following warnings:
```
/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/munch/__init__.py:24: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
```
```
2025/12/19 12:45:45 WARNING mlflow.utils.autologging_utils: MLflow pytorch autologging is known to be compatible with 1.9.0 <= torch <= 2.5.1, but the installed version is 2.5.1+cu124. If you encounter errors during autologging, try upgrading / downgrading torch to a compatible version, or try upgrading MLflow.
```

```
2025-12-19 12:45:45.783206: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2025-12-19 12:45:45.783226: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
```

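Here is the exception, preceded by a warning. The input to the loss (`pred_frames.real`) has size 1052 along the last dimension, while the target (`clean_signal.real`) has size 706: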

```
/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/loss.py:608: UserWarning: Using a target size (torch.Size([16, 257, 706])) that is different to the input size (torch.Size([16, 257, 1052])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
  0%|                                                            | 0/1436 [00:00<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/stm32ai_main.py", line 251, in main
    _process_mode(cfg)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/stm32ai_main.py", line 38, in _process_mode
    onnx_model_path, _ = train(cfg)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/train.py", line 160, in train
    model, best_model = _train(model=model,
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/train.py", line 62, in _train
    model, best_model = trainer.train(n_epochs=n_epochs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/base.py", line 203, in train
    self._run_train_epoch(epoch)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/spec.py", line 147, in _run_train_epoch
    batch_loss = self._run_train_batch(batch)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/spec.py", line 189, in _run_train_batch
    loss_r = self.loss_function(pred_frames.real, clean_signal.real)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 608, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/functional.py", line 3791, in mse_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/functional.py", line 76, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)  # type: ignore[attr-defined]
RuntimeError: The size of tensor a (1052) must match the size of tensor b (706) at non-singleton dimension 2
```
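
To make the mismatch concrete, here is a small stdlib-only sketch of the broadcasting rule that the failing `torch.broadcast_tensors` call enforces, applied to the two shapes from the warning. The helper name `broadcast_mismatch` is hypothetical and only for illustration; it is not part of the model zoo or PyTorch. It reports the dimension that cannot be broadcast, which here is the last one (presumably the time-frame axis).

```python
from itertools import zip_longest


def broadcast_mismatch(shape_a, shape_b):
    """Return (dim, size_a, size_b) for the trailing-most dimension that
    cannot be broadcast, or None if the two shapes are compatible."""
    # Broadcasting aligns shapes from the last dimension backwards;
    # missing leading dimensions are treated as size 1.
    pairs = list(zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1))
    ndim = len(pairs)
    for offset, (a, b) in enumerate(pairs):
        # Two sizes are compatible only if they are equal or one of them is 1.
        if a != b and a != 1 and b != 1:
            return (ndim - 1 - offset, a, b)
    return None


# Shapes copied from the UserWarning above (input vs. target of the MSE loss).
pred_shape = (16, 257, 1052)
clean_shape = (16, 257, 706)

print(broadcast_mismatch(pred_shape, clean_shape))  # (2, 1052, 706)
```

The result matches the `RuntimeError`: sizes 1052 and 706 differ at dimension 2 and neither is 1, so the loss cannot be computed unless the prediction and the target have the same length along that axis.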
