Description
I tried to run the training script for the speech enhancement use case over an SSH connection, but training never gets going. The script crashes as soon as it starts the first loss calculation.
Information on the server's OS:
- Distributor ID: Debian
- Description: Debian GNU/Linux 13 (trixie)
- Release: 13
- Codename: trixie
I am using uv to set up the Python environment; details of the Python version and the versions of the packages used are attached.
I am using the default training config file (speech_enhancement/src/config_file_examples/training_config.yaml) with the Valentini dataset. The only change I made was setting the device to "cpu".
I get the following warnings:
/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/munch/__init__.py:24: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
2025/12/19 12:45:45 WARNING mlflow.utils.autologging_utils: MLflow pytorch autologging is known to be compatible with 1.9.0 <= torch <= 2.5.1, but the installed version is 2.5.1+cu124. If you encounter errors during autologging, try upgrading / downgrading torch to a compatible version, or try upgrading MLflow.
2025-12-19 12:45:45.783206: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2025-12-19 12:45:45.783226: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Here is the exception, preceded by a warning:
/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/loss.py:608: UserWarning: Using a target size (torch.Size([16, 257, 706])) that is different to the input size (torch.Size([16, 257, 1052])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)
0%| | 0/1436 [00:00<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
return task_function(a_config, *a_args, **a_kwargs)
File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/stm32ai_main.py", line 251, in main
_process_mode(cfg)
File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/stm32ai_main.py", line 38, in _process_mode
onnx_model_path, _ = train(cfg)
File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/train.py", line 160, in train
model, best_model = _train(model=model,
File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/train.py", line 62, in _train
model, best_model = trainer.train(n_epochs=n_epochs)
File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/base.py", line 203, in train
self._run_train_epoch(epoch)
File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/spec.py", line 147, in _run_train_epoch
batch_loss = self._run_train_batch(batch)
File "/home/agugliel/ST/stm32ai-modelzoo-services/speech_enhancement/src/trainers/spec.py", line 189, in _run_train_batch
loss_r = self.loss_function(pred_frames.real, clean_signal.real)
File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 608, in forward
return F.mse_loss(input, target, reduction=self.reduction)
File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/nn/functional.py", line 3791, in mse_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/agugliel/ST/stm32ai-modelzoo-services/st_zoo/lib/python3.10/site-packages/torch/functional.py", line 76, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore[attr-defined]
RuntimeError: The size of tensor a (1052) must match the size of tensor b (706) at non-singleton dimension 2
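For reference, the RuntimeError follows from the broadcasting rule: two sizes along a dimension are compatible only if they are equal or one of them is 1. The sketch below applies that rule in plain Python to the two shapes reported in the warning, so it runs without PyTorch. The helper name `broadcast_conflicts` is hypothetical and is not part of PyTorch or the model zoo; it only illustrates why dimension 2 (1052 vs 706) is rejected and does not identify where the length mismatch originates in the training pipeline.

```python
from itertools import zip_longest


def broadcast_conflicts(shape_a, shape_b):
    """Return the dimension indices at which two shapes cannot be broadcast.

    Hypothetical helper for illustration only: it mirrors the usual
    broadcasting rule (sizes must be equal, or one of them must be 1).
    """
    # Align the shapes from the trailing dimension, padding the shorter with 1s.
    pairs = list(zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1))
    pairs.reverse()
    return [i for i, (a, b) in enumerate(pairs) if a != b and a != 1 and b != 1]


# Shapes copied from the UserWarning in the log above.
pred_shape = (16, 257, 1052)   # input size
clean_shape = (16, 257, 706)   # target size

conflicts = broadcast_conflicts(pred_shape, clean_shape)
print(conflicts)  # dimension indices that cannot be broadcast
```

The batch and frequency dimensions (16 and 257) agree, so only the last dimension conflicts: the predicted and clean spectrograms have a different number of frames.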