Description
Notice: In order to resolve issues more efficiently, please raise issue following the template.
🐛 Bug
Every time training finishes, I see the same error in the log file; the checkpoint cleanup step appears to fail:
average_checkpoints: ['./outputs2/model.pt.ep1', './outputs2/model.pt.ep2', './outputs2/model.pt.ep3', './outputs2/model.pt.ep4', './outputs2/model.pt.ep5', './outputs2/model.pt.ep6', './outputs2/model.pt.ep7', './outputs2/model.pt.ep8', './outputs2/model.pt.ep9', './outputs2/model.pt.ep10', './outputs2/model.pt.ep11', './outputs2/model.pt.ep12', './outputs2/model.pt.ep13', './outputs2/model.pt.ep14', './outputs2/model.pt.ep15', './outputs2/model.pt.ep16', './outputs2/model.pt.ep17', './outputs2/model.pt.ep18', './outputs2/model.pt.ep19', './outputs2/model.pt.ep20']
Checkpoint file ./outputs2/model.pt.ep1 not found.
Checkpoint file ./outputs2/model.pt.ep2 not found.
Checkpoint file ./outputs2/model.pt.ep3 not found.
Checkpoint file ./outputs2/model.pt.ep4 not found.
Checkpoint file ./outputs2/model.pt.ep5 not found.
Checkpoint file ./outputs2/model.pt.ep6 not found.
Checkpoint file ./outputs2/model.pt.ep7 not found.
Checkpoint file ./outputs2/model.pt.ep8 not found.
Checkpoint file ./outputs2/model.pt.ep9 not found.
Checkpoint file ./outputs2/model.pt.ep10 not found.
Checkpoint file ./outputs2/model.pt.ep11 not found.
Checkpoint file ./outputs2/model.pt.ep12 not found.
Checkpoint file ./outputs2/model.pt.ep13 not found.
Checkpoint file ./outputs2/model.pt.ep14 not found.
Checkpoint file ./outputs2/model.pt.ep15 not found.
Checkpoint file ./outputs2/model.pt.ep16 not found.
Checkpoint file ./outputs2/model.pt.ep17 not found.
Checkpoint file ./outputs2/model.pt.ep18 not found.
Checkpoint file ./outputs2/model.pt.ep19 not found.
[W106 21:41:32.566797026 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Exception ignored in atexit callback: <function FileWriter.__init__.<locals>.cleanup at 0x7f8065a77060>
Traceback (most recent call last):
  File "/www/ai/ocr/asr/FunASR/myFunASR/lib/python3.12/site-packages/tensorboardX/writer.py", line 122, in cleanup
    self.event_writer.close()
  File "/www/ai/ocr/asr/FunASR/myFunASR/lib/python3.12/site-packages/tensorboardX/event_file_writer.py", line 154, in close
    self._worker.stop()
  File "/www/ai/ocr/asr/FunASR/myFunASR/lib/python3.12/site-packages/tensorboardX/event_file_writer.py", line 185, in stop
    self._queue.put(self._shutdown_signal)
  File "/root/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/queues.py", line 94, in put
    self._start_thread()
  File "/root/.pyenv/versions/3.12.0/lib/python3.12/multiprocessing/queues.py", line 177, in _start_thread
    self._thread.start()
  File "/root/.pyenv/versions/3.12.0/lib/python3.12/threading.py", line 971, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't create new thread at interpreter shutdown
Exception ignored in atexit callback: <function FileWriter.__init__.<locals>.cleanup at 0x7f03cec9af20>
Traceback (most recent call last):
  (identical traceback to the one above, ending in)
RuntimeError: can't create new thread at interpreter shutdown
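The "not found" messages suggest the averaging step builds its checkpoint list and then tries to open files that the cleanup step has already deleted. A minimal sketch of how the averaging could guard against already-removed files; `filter_existing_checkpoints` is a hypothetical helper for illustration, not FunASR's actual code:

```python
import os

def filter_existing_checkpoints(paths):
    """Hypothetical guard: drop checkpoint paths that no longer exist on
    disk (e.g. because a cleanup step already deleted them), reporting
    each missing file in the same style as the log above."""
    existing = []
    for p in paths:
        if os.path.isfile(p):
            existing.append(p)
        else:
            print(f"Checkpoint file {p} not found.")
    return existing

# With the paths from the log, none of which exist here, every file is reported:
kept = filter_existing_checkpoints(
    [f"./outputs2/model.pt.ep{i}" for i in range(1, 21)]
)
```

Averaging would then proceed over `kept` (and could fail loudly if it is empty) instead of crashing or silently warning per file.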
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
- Run cmd:
  funasr-train ++model=paraformer-zh ++train_data_set_list=train.jsonl ++valid_data_set_list=test.jsonl ++output_dir="./outputs" &> log.txt &
- See error
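To confirm that the per-epoch checkpoints really were removed before the averaging ran, one can inspect the output directory after training finishes (directory name taken from the log above; adjust to your run):

```shell
# List any remaining per-epoch checkpoints; if the cleanup step already
# deleted them, nothing matches and the fallback message is printed.
ls -l ./outputs2/model.pt.ep* 2>/dev/null || echo "no epoch checkpoints left"
```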
Code sample
No code; just the single command above.
Expected behavior
Train finishes with no error or warning.
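The NCCL warning and the tensorboardX tracebacks both point to cleanup being deferred to atexit, when the interpreter can no longer create threads. A possible workaround sketch; `shutdown_cleanly` is a hypothetical helper, not verified against FunASR internals, which flushes the writer and tears down the process group explicitly before exit:

```python
def shutdown_cleanly(writer=None):
    """Hypothetical helper: close the tensorboardX writer and destroy the
    torch.distributed process group before interpreter shutdown, so atexit
    hooks no longer need to spawn threads or touch NCCL state."""
    if writer is not None:
        writer.close()  # flushes pending events while threads can still start
    try:
        import torch.distributed as dist
    except ImportError:
        return  # torch not installed; nothing to tear down
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```

Calling something like this at the end of the training entry point should silence the destroy_process_group() warning; whether the tensorboardX traceback also disappears depends on where FunASR creates its writers.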
Environment
- OS (e.g., Linux): CentOS Linux release 8.5.2111
- FunASR Version (e.g., 1.0.0): 1.2.9
- ModelScope Version (e.g., 1.11.0): 1.33.0
- PyTorch Version (e.g., 2.0.0): 2.9.1
- How you installed funasr (pip, source): pip
- Python version: 3.11
- GPU (e.g., V100M32): NVIDIA 4090
- CUDA/cuDNN version (e.g., cuda11.7): 12.8
- Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1): no
- Any other relevant information: none
Additional context
none