train_ms.py on WSL2 fails with "CUDA error: device-side assert triggered" #224

Hi, I tried to train a Japanese multi-speaker model on WSL2,
but training aborted with the following logs and error message.
I'm a beginner, so please tell me how to solve it.

==================================================================
......
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1308: indexSelectLargeIndex: block: [44,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9403b6c446 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9403b166e4 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9403f1ba18 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1021c88 (0x7f93b987fc88 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x102a735 (0x7f93b9888735 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x5faf70 (0x7f940299af70 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6f69f (0x7f9403b4d69f in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f9403b4637b in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f9403b46529 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x8c1a98 (0x7f9402c61a98 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f9402c61de6 in /home/test/venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: /home/test/venv/bin/python() [0x504334]
frame #12: /home/test/venv/bin/python() [0x5102aa]
frame #13: /home/test/venv/bin/python() [0x600b4a]
frame #14: _PyEval_EvalFrameDefault + 0x5dd8 (0x51a858 in /home/test/venv/bin/python)
frame #15: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x32d (0x514dad in /home/test/venv/bin/python)
frame #17: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x302b (0x517aab in /home/test/venv/bin/python)
frame #19: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x302b (0x517aab in /home/test/venv/bin/python)
frame #21: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x734 (0x5151b4 in /home/test/venv/bin/python)
frame #23: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x734 (0x5151b4 in /home/test/venv/bin/python)
frame #25: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x32d (0x514dad in /home/test/venv/bin/python)
frame #27: _PyFunction_Vectorcall + 0x75 (0x525775 in /home/test/venv/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x1451 (0x515ed1 in /home/test/venv/bin/python)
frame #29: /home/test/venv/bin/python() [0x5c9dd5]
frame #30: PyEval_EvalCode + 0x80 (0x5c9d30 in /home/test/venv/bin/python)
frame #31: /home/test/venv/bin/python() [0x5fea7c]
frame #32: /home/test/venv/bin/python() [0x5fa616]
frame #33: PyRun_StringFlags + 0x82 (0x5f03a2 in /home/test/venv/bin/python)
frame #34: PyRun_SimpleStringFlags + 0x42 (0x5f01c2 in /home/test/venv/bin/python)
frame #35: Py_RunMain + 0x3c4 (0x5ef6e4 in /home/test/venv/bin/python)
frame #36: Py_BytesMain + 0x2d (0x5bd16d in /home/test/venv/bin/python)
frame #37: <unknown function> + 0x2a1ca (0x7f940473a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: __libc_start_main + 0x8b (0x7f940473a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #39: _start + 0x25 (0x5bd065 in /home/test/venv/bin/python)

Traceback (most recent call last):
  File "/home/kense/vits/train_ms.py", line 297, in <module>
    main()
  File "/home/kense/vits/train_ms.py", line 52, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/test/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/test/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/home/test/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

===================================================
<config>
{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 1234,
    "epochs": 10000,
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 32,
    "fp16_run": true,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45,
    "c_kl": 1.0
  },
  "data": {
    "training_files":"filelists/train.txt.cleaned",
    "validation_files":"filelists/val.txt.cleaned",
    "text_cleaners":["basic_cleaners"],
    "max_wav_value": 32768.0,
    "sampling_rate": 22050,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 12,
    "cleaned_text": true
  },
  "model": {
    "inter_channels": 192,
    "hidden_channels": 192,
    "filter_channels": 768,
    "n_heads": 2,
    "n_layers": 6,
    "kernel_size": 3,
    "p_dropout": 0.1,
    "resblock": "1",
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
    "upsample_rates": [8,8,2,2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16,16,4,4],
    "n_layers_q": 3,
    "use_spectral_norm": false,
    "gin_channels": 256
  }
}

===================================================

python==3.10.16
torch==2.5.1+cu124
If you need more information about my environment or logs, please tell me.

My dataset has 13 speakers (IDs 0 to 12).
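
One thing that may be worth checking: the assertion `srcIndex < srcSelectDimSize` fires when an index lookup receives an index that is at or beyond the size of the table being indexed. The config above sets "n_speakers": 12, while the speaker IDs run from 0 to 12, so ID 12 could be out of range if the speaker embedding is sized by n_speakers. The following is a minimal sketch of such a check, not a confirmed diagnosis. It assumes each filelist line has the form `path|speaker_id|text`; the function names `max_speaker_id` and `check_speakers` are hypothetical and are not part of train_ms.py. An out-of-range text symbol ID could trigger the same assertion, and this sketch does not cover that case.

```python
def max_speaker_id(filelist_lines, sid_column=1, sep="|"):
    """Return the largest speaker ID found in the given filelist lines.

    Assumption: each non-empty line looks like `path|speaker_id|text`.
    """
    ids = [
        int(line.split(sep)[sid_column])
        for line in filelist_lines
        if line.strip()  # skip blank lines
    ]
    return max(ids)


def check_speakers(filelist_lines, n_speakers):
    """Return (ok, top_id); ok is True when every ID fits the table.

    A lookup table with n_speakers rows accepts IDs 0 .. n_speakers - 1.
    """
    top_id = max_speaker_id(filelist_lines)
    return top_id < n_speakers, top_id


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, read them from
    # filelists/train.txt.cleaned and filelists/val.txt.cleaned.
    sample = ["wavs/a.wav|0|text a", "wavs/b.wav|12|text b"]
    ok, top_id = check_speakers(sample, n_speakers=12)
    print("highest speaker ID:", top_id, "fits n_speakers=12:", ok)
```

If the check reports that the highest ID does not fit, raising "n_speakers" to at least the highest ID plus one (13 for IDs 0 to 12) would be the first thing to try.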

I tried several fixes suggested for similar errors,
but none of them helped.
Please help me, and thank you for reading despite my poor English.
