Crash during training of the stft-tcnn model for speech enhancement #37

@Guguitino

I wanted to run the training script for the speech enhancement use case over an SSH connection, but I couldn't get it to start. The script crashes as soon as it reaches the first loss calculation.

Information on the server's OS:

  • Distributor ID: Debian
  • Description: Debian GNU/Linux 13 (trixie)
  • Release: 13
  • Codename: trixie

I am using uv to set up the Python environment. Details of the Python version and the versions of the packages used are attached.

python_environment.txt

I am using the default training config file (speech_enhancement/src/config_file_examples/training_config.yaml) with the Valentini dataset. The only change I made was setting the device to "cpu".

I get the following warnings:

/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/munch/__init__.py:24: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
2025/12/19 12:45:45 WARNING mlflow.utils.autologging_utils: MLflow pytorch autologging is known to be compatible with 1.9.0 <= torch <= 2.5.1, but the installed version is 2.5.1+cu124. If you encounter errors during autologging, try upgrading / downgrading torch to a compatible version, or try upgrading MLflow.
2025-12-19 12:45:45.783206: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2025-12-19 12:45:45.783226: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Here is the exception, preceded by a warning:

/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/loss.py:608: UserWarning: Using a target size (torch.Size([16, 257, 706])) that is different to the input size (torch.Size([16, 257, 1052])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
  0%|                                                            | 0/1436 [00:00<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/stm32ai_main.py", line 251, in main
    _process_mode(cfg)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/stm32ai_main.py", line 38, in _process_mode
    onnx_model_path, _ = train(cfg)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/train.py", line 160, in train
    model, best_model = _train(model=model,
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/train.py", line 62, in _train
    model, best_model = trainer.train(n_epochs=n_epochs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/base.py", line 203, in train
    self._run_train_epoch(epoch)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/spec.py", line 147, in _run_train_epoch
    batch_loss = self._run_train_batch(batch)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/spec.py", line 189, in _run_train_batch
    loss_r = self.loss_function(pred_frames.real, clean_signal.real)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 608, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/functional.py", line 3791, in mse_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/functional.py", line 76, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)  # type: ignore[attr-defined]
RuntimeError: The size of tensor a (1052) must match the size of tensor b (706) at non-singleton dimension 2
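
To make the failure easier to read, here is a minimal sketch in plain Python of the broadcasting rule that the traceback ends in: two sizes in the same trailing position are compatible only when they are equal or one of them is 1. The helper name `first_broadcast_mismatch` is hypothetical (it is not part of PyTorch or of this repository), and the shapes are copied from the warning above. It only locates the offending dimension; it does not explain why the predicted and clean spectrograms end up with a different number of frames.

```python
from itertools import zip_longest


def first_broadcast_mismatch(shape_a, shape_b):
    """Return (dim, size_a, size_b) for the trailing-most dimension that
    cannot be broadcast, or None when the two shapes are compatible.

    Hypothetical helper for illustration only. Sizes are compared from the
    last dimension backwards; a pair is compatible when both sizes are
    equal or when one of them is 1.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Walk both shapes right to left, padding the shorter one with 1s.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for offset, (a, b) in enumerate(pairs):
        if a != b and a != 1 and b != 1:
            return (ndim - 1 - offset, a, b)
    return None


# Shapes copied from the UserWarning: model output vs. clean target.
pred_shape = (16, 257, 1052)
clean_shape = (16, 257, 706)

mismatch = first_broadcast_mismatch(pred_shape, clean_shape)
print(mismatch)  # (2, 1052, 706)
```

The result matches the RuntimeError: dimension 2 holds 1052 in the prediction and 706 in the target, and neither is 1, so the two tensors cannot be broadcast together. The first two dimensions (16 and 257) agree.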
