Running train_ms.py on WSL2 fails with "CUDA error: device-side assert triggered" #224

@ramune64

Hi, I tried to train a Japanese multi-speaker model on WSL2,
but I got the following logs and error message.
I'm a beginner, so please tell me how to solve it.

==================================================================
......
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [82,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [83,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [84,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [85,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [86,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [87,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [88,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [89,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [90,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [91,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [92,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [93,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [94,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [95,0,0] Assertion srcIndex < srcSelectDimSize failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9403b6c446 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9403b166e4 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9403f1ba18 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1021c88 (0x7f93b987fc88 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x102a735 (0x7f93b9888735 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x5faf70 (0x7f940299af70 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x6f69f (0x7f9403b4d69f in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f9403b4637b in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f9403b46529 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: + 0x8c1a98 (0x7f9402c61a98 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f9402c61de6 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: /home/test/venv/bin/python() [0x504334]
frame #12: /home/test/venv/bin/python() [0x5102aa]
frame #13: /home/test/venv/bin/python() [0x600b4a]
frame #14: _PyEval_EvalFrameDefault + 0x5dd8 (0x51a858 in /home/test/venv/bin/python)
frame #15: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x32d (0x514dad in /home/test/venv/bin/python)
frame #17: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x302b (0x517aab in /home/test/venv/bin/python)
frame #19: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x302b (0x517aab in /home/test/venv/bin/python)
frame #21: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x734 (0x5151b4 in /home/test/venv/bin/python)
frame #23: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x734 (0x5151b4 in /home/test/venv/bin/python)
frame #25: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x32d (0x514dad in /home/test/venv/bin/python)
frame #27: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x1451 (0x515ed1 in /home/test/venv/bin/python)
frame #29: /home/test/venv/bin/python() [0x5c9dd5]
frame #30: PyEval_EvalCode + 0x80 (0x5c9d30 in /home/test/venv/bin/python)
frame #31: /home/test/venv/bin/python() [0x5fea7c]
frame #32: /home/test/venv/bin/python() [0x5fa616]
frame #33: PyRun_StringFlags + 0x82 (0x5f03a2 in /home/test/venv/bin/python)
frame #34: PyRun_SimpleStringFlags + 0x42 (0x5f01c2 in /home/test/venv/bin/python)
frame #35: Py_RunMain + 0x3c4 (0x5ef6e4 in /home/test/venv/bin/python)
frame #36: Py_BytesMain + 0x2d (0x5bd16d in /home/test/venv/bin/python)
frame #37: + 0x2a1ca (0x7f940473a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: __libc_start_main + 0x8b (0x7f940473a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #39: _start + 0x25 (0x5bd065 in /home/test/venv/bin/python)

Traceback (most recent call last):
  File "/home/kense/vits/train_ms.py", line 297, in <module>
    main()
  File "/home/kense/vits/train_ms.py", line 52, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/test/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/test/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/home/test/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

===================================================

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 1234,
    "epochs": 10000,
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 32,
    "fp16_run": true,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0
  },
  "data": {
    "training_files":"filelists/train.txt.cleaned",
    "validation_files":"filelists/val.txt.cleaned",
    "text_cleaners":["basic_cleaners"],
    "max_wav_value": 32768.0,
    "sampling_rate": 22050,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 12,
    "cleaned_text": true
  },
  "model": {
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
    "upsample_rates": [8,8,2,2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16,16,4,4],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  }
}

===================================================

python==3.10.16
torch==2.5.1+cu124
If you need more information about my environment or logs, please tell me.

The dataset has 13 speakers (IDs 0 to 12).
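
One thing that may be worth checking: the config above sets "n_speakers": 12, while the speaker IDs run from 0 to 12, which is 13 distinct values. The assertion `srcIndex < srcSelectDimSize` means that some index passed to a lookup was not smaller than the table size, so a speaker ID of 12 against a table of 12 entries would be one plausible trigger. The sketch below is only an illustration under assumptions: it assumes each filelist line has the layout `wav_path|speaker_id|text`, and the function name and sample lines are hypothetical, not part of train_ms.py.

```python
def find_out_of_range_speakers(filelist_lines, n_speakers):
    """Return the sorted speaker IDs that fall outside [0, n_speakers)."""
    bad_ids = set()
    for line in filelist_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout (not confirmed by the issue): wav_path|speaker_id|text
        speaker_id = int(line.split("|")[1])
        if not 0 <= speaker_id < n_speakers:
            bad_ids.add(speaker_id)
    return sorted(bad_ids)


# Hypothetical sample lines standing in for filelists/train.txt.cleaned
sample = [
    "wavs/a.wav|0|konnichiwa",
    "wavs/b.wav|11|arigatou",
    "wavs/c.wav|12|sayounara",
]

print(find_out_of_range_speakers(sample, 12))  # [12]
print(find_out_of_range_speakers(sample, 13))  # []
```

If a check like this reports ID 12 for the real filelists, raising "n_speakers" to 13 (or more) would be the first thing to try. The same assertion can also come from text symbol IDs that exceed the symbol table, which this sketch does not cover.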

I tried several fixes suggested for similar issues,
but none of them helped.
Please help, and thank you for reading despite my poor English.
