
Commit 1e22d46

feat(audio): use PyAV instead of ffmpeg (#31)
* feat(audio): use PyAV instead of ffmpeg. Replaced usage of ffmpeg in favor of PyAV (`av`).
* refactor(audio): store all of the audio-related functions in `infer.lib.audio`. Refactors the previous commit to have a single function for each task, all located in `infer.lib.audio`.
* fix(audio): remove downsample_audio from mdxnet.py. It is no longer needed, since it is imported from infer.lib.audio.
* docs: remove every ffmpeg mention in the documentation to avoid confusion.
* chore(requirements): remove ffmpeg-python and ffmpy from all requirements.
* fix(audio): fix loading for UVR. Wrapped the gathering of META info from the stream into a function; fixes loading for UVR.
* fix(audio): use np.frombuffer() instead of direct conversion of the resampled frames. This fixes a traceback during preprocessing.
* feat(audio): pre-allocate the decoded_audio array in the load_audio function. This should improve performance, even if just a little.
* Revert "docs: remove every ffmpeg mention in the documentation to avoid confusion". This reverts commit 1e05bbc.
* chore(format): run black on dev.
* fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile.
* Revert "fix(requirements): revert removal of ffmpeg in unitest.yml and Dockerfile". This reverts commit e28a0ee.
* feat(audio): pre-allocate a numpy array to store the AudioFrame data in an ndarray of dtype float32.
* chore(format): run black on dev.
* fix(audio): fix the decoded_audio size estimation in estimated_total_samples. We multiply by `sr` instead of `container.streams.audio[0].rate`, since we want to estimate the size of the OUTPUT file, not the input one. Also added dynamic resizing in case something goes wrong and the size of decoded_audio is estimated incorrectly, and fixed `load_audio` for the case where the input audio's sample rate does not match the desired sample rate (`sr`).
* chore(format): run black on dev.
* refactor(audio): remove the `clean_path()` function, as it serves no purpose anymore.
* docs: remove everything related to ffmpeg. This covers everything except the format-support note in the training_tips docs, since that note is about which audio formats are supported (all the ones that ffmpeg supports!) rather than about what ffmpeg does/did.
* docs: fix the order of the preparation steps in the READMEs.

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
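The commit message describes the new PyAV decoding path only in prose. As a rough illustration of the approach it outlines, the sketch below shows what a PyAV-based `load_audio` along these lines could look like. It is not the code from this commit: the exact signature, the duration-based size estimate, and the buffer-growing strategy are assumptions, and it assumes PyAV >= 9, where `AudioResampler.resample()` returns a list of frames.

```python
import av
import numpy as np


def load_audio(file: str, sr: int) -> np.ndarray:
    """Decode `file` to a mono float32 ndarray at `sr` Hz using PyAV (no ffmpeg binary)."""
    with av.open(file) as container:
        stream = container.streams.audio[0]
        # Resample every decoded frame to packed float32, mono, at the target rate.
        resampler = av.AudioResampler(format="flt", layout="mono", rate=sr)

        # Pre-allocate the output buffer from the stream duration, multiplying by the
        # target `sr` because we are estimating the size of the OUTPUT, not the input.
        if stream.duration is not None:
            estimated_total_samples = int(stream.duration * stream.time_base * sr)
        else:
            estimated_total_samples = 0
        decoded_audio = np.zeros(max(estimated_total_samples, sr), dtype=np.float32)

        offset = 0
        for frame in container.decode(stream):
            for out in resampler.resample(frame):
                # Packed mono float32: one plane, `out.samples` valid samples.
                chunk = np.frombuffer(bytes(out.planes[0]), dtype=np.float32)[: out.samples]
                end = offset + len(chunk)
                if end > decoded_audio.shape[0]:
                    # Dynamic resizing in case the estimate was too low.
                    decoded_audio = np.resize(decoded_audio, end + sr)
                decoded_audio[offset:end] = chunk
                offset = end
        # (A final resampler flush is omitted here for brevity.)

    return decoded_audio[:offset]
```

Called as, say, `wav = load_audio("sample.m4a", 16000)`, such a function returns a 16 kHz mono array regardless of the input container or sample rate, which is the behaviour the size-estimation and resampling fixes above are concerned with.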
1 parent aec56ec commit 1e22d46

28 files changed (+233, -366 lines changed)

.github/workflows/unitest.yml

Lines changed: 0 additions & 1 deletion
@@ -18,7 +18,6 @@ jobs:
       - name: Install dependencies
         run: |
           sudo apt update
-          sudo apt -y install ffmpeg
           wget https://github.com/fumiama/RVC-Models-Downloader/releases/download/v0.2.3/rvcmd_linux_amd64.deb
           sudo apt -y install ./rvcmd_linux_amd64.deb
           python -m pip install --upgrade pip

.gitignore

Lines changed: 0 additions & 2 deletions
@@ -12,5 +12,3 @@ xcuserdata
 /logs

 /assets/weights/*
-ffmpeg.*
-ffprobe.*

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ WORKDIR /app

 # Install dependenceis to add PPAs
 RUN apt-get update && \
-    apt-get install -y -qq ffmpeg aria2 && apt clean && \
+    apt-get install -y -qq aria2 && apt clean && \
     apt-get install -y software-properties-common && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*

README.md

Lines changed: 2 additions & 22 deletions
@@ -128,26 +128,7 @@ If you want to use the v2 version of the model, you need to download additional
 rvcmd assets/v2 # RVC-Models-Downloader command
 ```

-### 2. Install ffmpeg tool
-If `ffmpeg` and `ffprobe` have already been installed, you can skip this step.
-#### Ubuntu/Debian
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS
-```bash
-brew install ffmpeg
-```
-#### Windows
-After downloading, place it in the root directory.
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
-- [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
-- [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. Download the required files for the rmvpe vocal pitch extraction algorithm
+### 2. Download the required files for the rmvpe vocal pitch extraction algorithm

 If you want to use the latest RMVPE vocal pitch extraction algorithm, you need to download the pitch extraction model parameters and place them in `assets/rmvpe`.

@@ -163,7 +144,7 @@ If you want to use the latest RMVPE vocal pitch extraction algorithm, you need t
 rvcmd assets/rmvpe # RVC-Models-Downloader command
 ```

-### 4. AMD ROCM (optional, Linux only)
+### 3. AMD ROCM (optional, Linux only)

 If you want to run RVC on a Linux system based on AMD's ROCM technology, please first install the required drivers [here](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html).

@@ -207,7 +188,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)

docs/cn/README.cn.md

Lines changed: 2 additions & 23 deletions
@@ -126,27 +126,7 @@ sh ./run.sh
 rvcmd assets/v2 # RVC-Models-Downloader command
 ```

-### 2. Install the ffmpeg tool
-You can skip this step if `ffmpeg` and `ffprobe` are already installed.
-
-#### Ubuntu/Debian users
-```bash
-sudo apt install ffmpeg
-```
-#### MacOS users
-```bash
-brew install ffmpeg
-```
-#### Windows users
-Place them in the root directory after downloading.
-```bash
-rvcmd tools/ffmpeg # RVC-Models-Downloader command
-```
-- Download [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
-
-- Download [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
-
-### 3. Download the files required by the rmvpe vocal pitch extraction algorithm
+### 2. Download the files required by the rmvpe vocal pitch extraction algorithm

 If you want to use the latest RMVPE vocal pitch extraction algorithm, you need to download the pitch extraction model parameters and place them in `assets/rmvpe`.

@@ -162,7 +142,7 @@ rvcmd tools/ffmpeg # RVC-Models-Downloader command
 rvcmd assets/rmvpe # RVC-Models-Downloader command
 ```

-### 4. AMD GPU ROCm (optional, Linux only)
+### 3. AMD GPU ROCm (optional, Linux only)

 If you want to run RVC on a Linux system using AMD's ROCm technology, please first install the required drivers [here](https://rocm.docs.amd.com/en/latest/deploy/linux/os-native/install.html).

@@ -207,7 +187,6 @@ rvcmd packs/general/latest # RVC-Models-Downloader command
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction:RMVPE](https://github.com/Dream-High/RMVPE)

docs/cn/faq.md

Lines changed: 10 additions & 17 deletions
@@ -1,42 +1,35 @@
-## Q1: ffmpeg error/utf8 error.
-
-Most likely it is not an ffmpeg problem but an audio path problem;
-
-ffmpeg may throw an ffmpeg error when reading paths that contain spaces, (), or other special characters; and training-set audio with Chinese paths may cause a utf8 error when written into filelist.txt;
-
-
-## Q2: No index after one-click training finishes
+## Q1: No index after one-click training finishes

 If "Training is done. The program is closed." is displayed, the model was trained successfully and the errors that immediately follow are spurious;


 If there is no index file starting with "added" after one-click training finishes, it may be because the training set is too large and the index-adding step got stuck; the excessive memory demand of adding the index has been solved by adding it in batches. As a temporary workaround, try clicking the "Train index" button again.


-## Q3: The training-set voice does not show up for inference after training
+## Q2: The training-set voice does not show up for inference after training
 Click "Refresh voice list" and check again; if it is still missing, check whether training reported any errors. Screenshots of the console and the webui, and the log under logs/experiment_name, can all be sent to the developers for a look.


-## Q4: How to share a model
+## Q3: How to share a model
 The pth files stored under rvc_root/logs/experiment_name are not meant for sharing or inference; they store the experiment state for reproducibility and for continuing training. The model to share is the 60+ MB pth file in the weights folder;

 In the future, weights/exp_name.pth and logs/exp_name/added_xxx.index will be merged into a single weights/exp_name.zip to skip filling in the index path, so share the zip file, not the pth file, unless you want to continue training on another machine;

 If you copy/share the several-hundred-MB pth files from the logs folder into the weights folder and force them to be used for inference, you may get errors about missing keys such as f0 and tgt_sr. You need to use the bottom of the ckpt tab to manually or automatically (it is automatic if the relevant information can be found under the local logs) choose whether to include pitch and the target audio sampling rate, and then extract the small ckpt model (fill in the path of the file starting with G). After extraction, a 60+ MB pth file appears in the weights folder; refresh the voice list and it can be selected for use.


-## Q5: Connection Error.
+## Q4: Connection Error.
 You may have closed the console (the black window).


-## Q6: The WebUI pops up "Expecting value: line 1 column 1 (char 0)".
+## Q5: The WebUI pops up "Expecting value: line 1 column 1 (char 0)".
 Please disable the system LAN proxy/global proxy.


 This applies not only to the client-side proxy but also to the server-side proxy (for example, if you set http_proxy and https_proxy on autodl for academic acceleration, they also need to be unset while in use).


-## Q7: How to train and infer from the command line without the WebUI
+## Q6: How to train and infer from the command line without the WebUI
 Training script:

 Run the WebUI once first; the message window will show the command lines used for dataset processing and training;
@@ -72,21 +65,21 @@ device=sys.argv[8]
 is_half=bool(sys.argv[9])


-## Q8: Cuda error/Cuda out of memory.
+## Q7: Cuda error/Cuda out of memory.
 There is a small chance it is a CUDA configuration problem or an unsupported device; most likely it is insufficient GPU memory (out of memory);


 For training, reduce the batch size (if reducing it to 1 is still not enough, the only option is a different GPU); for inference, reduce x_pad, x_query, x_center, and x_max at the end of config.py as needed. Cards with less than 4 GB of memory (e.g. 1060 (3G) and various 2 GB cards) can be written off, while 4 GB cards still have a chance.


-## Q9: What is a good value for total_epoch?
+## Q8: What is a good value for total_epoch?

 If the training-set audio quality is poor and the noise floor is high, 20-30 is enough; setting it higher will not let the base model lift your low-quality training set.

 If the training-set quality is high, the noise floor is low, and there is plenty of duration, you can raise it; 200 is OK (training is fast, and since you can prepare a high-quality training set your GPU is presumably decent, so a bit more training time is not a concern).


-## Q10: How much training-set duration is needed?
+## Q9: How much training-set duration is needed?
 10 to 50 minutes is recommended.

 As long as the audio quality is high and the noise floor is low, if the timbre is consistent and distinctive, more is better.
@@ -98,7 +91,7 @@ is_half=bool(sys.argv[9])
 Data shorter than 1 minute has not been seen to be tried (successfully) so far. This kind of stunt is not recommended.


-## Q11: What is the index rate for and how to set it (explainer)
+## Q10: What is the index rate for and how to set it (explainer)
 If the audio quality of the base model and the inference source is higher than that of the training set, they can lift the quality of the inference result, but at the possible cost of the timbre drifting towards that of the base model/inference source; this is called "timbre leakage";

 The index rate is used to reduce/solve the timbre-leakage problem. Set to 1, there is theoretically no timbre leakage from the inference source, but the audio quality leans more towards the training set. If the training-set quality is lower than that of the inference source, raising the index rate may lower the quality. Set to 0, there is no retrieval-blending effect protecting the training-set timbre;

docs/en/faq_en.md

Lines changed: 17 additions & 22 deletions
@@ -1,30 +1,25 @@
-## Q1:ffmpeg error/utf8 error.
-It is most likely not a FFmpeg issue, but rather an audio path issue;
-
-FFmpeg may encounter an error when reading paths containing special characters like spaces and (), which may cause an FFmpeg error; and when the training set's audio contains Chinese paths, writing it into filelist.txt may cause a utf8 error.<br>
-
-## Q2:Cannot find index file after "One-click Training".
+## Q1:Cannot find index file after "One-click Training".
 If it displays "Training is done. The program is closed," then the model has been trained successfully, and the subsequent errors are fake;

 The lack of an 'added' index file after One-click training may be due to the training set being too large, causing the addition of the index to get stuck; this has been resolved by using batch processing to add the index, which solves the problem of memory overload when adding the index. As a temporary solution, try clicking the "Train Index" button again.<br>

-## Q3:Cannot find the model in “Inferencing timbre” after training
+## Q2:Cannot find the model in “Inferencing timbre” after training
 Click “Refresh timbre list” and check again; if still not visible, check if there are any errors during training and send screenshots of the console, web UI, and logs/experiment_name/*.log to the developers for further analysis.<br>

-## Q4:How to share a model/How to use others' models?
+## Q3:How to share a model/How to use others' models?
 The pth files stored in rvc_root/logs/experiment_name are not meant for sharing or inference, but for storing the experiment checkpoits for reproducibility and further training. The model to be shared should be the 60+MB pth file in the weights folder;

 In the future, weights/exp_name.pth and logs/exp_name/added_xxx.index will be merged into a single weights/exp_name.zip file to eliminate the need for manual index input; so share the zip file, not the pth file, unless you want to continue training on a different machine;

 Copying/sharing the several hundred MB pth files from the logs folder to the weights folder for forced inference may result in errors such as missing f0, tgt_sr, or other keys. You need to use the ckpt tab at the bottom to manually or automatically (if the information is found in the logs/exp_name), select whether to include pitch infomation and target audio sampling rate options and then extract the smaller model. After extraction, there will be a 60+ MB pth file in the weights folder, and you can refresh the voices to use it.<br>

-## Q5:Connection Error.
+## Q4:Connection Error.
 You may have closed the console (black command line window).<br>

-## Q6:WebUI popup 'Expecting value: line 1 column 1 (char 0)'.
+## Q5:WebUI popup 'Expecting value: line 1 column 1 (char 0)'.
 Please disable system LAN proxy/global proxy and then refresh.<br>

-## Q7:How to train and infer without the WebUI?
+## Q6:How to train and infer without the WebUI?
 Training script:<br>
 You can run training in WebUI first, and the command-line versions of dataset preprocessing and training will be displayed in the message window.<br>

@@ -47,17 +42,17 @@ index_rate=float(sys.argv[7])<br>
 device=sys.argv[8]<br>
 is_half=bool(sys.argv[9])<br>

-## Q8:Cuda error/Cuda out of memory.
+## Q7:Cuda error/Cuda out of memory.
 There is a small chance that there is a problem with the CUDA configuration or the device is not supported; more likely, there is not enough memory (out of memory).<br>

 For training, reduce the batch size (if reducing to 1 is still not enough, you may need to change the graphics card); for inference, adjust the x_pad, x_query, x_center, and x_max settings in the config.py file as needed. 4G or lower memory cards (e.g. 1060(3G) and various 2G cards) can be abandoned, while 4G memory cards still have a chance.<br>

-## Q9:How many total_epoch are optimal?
+## Q8:How many total_epoch are optimal?
 If the training dataset's audio quality is poor and the noise floor is high, 20-30 epochs are sufficient. Setting it too high won't improve the audio quality of your low-quality training set.<br>

 If the training set audio quality is high, the noise floor is low, and there is sufficient duration, you can increase it. 200 is acceptable (since training is fast, and if you're able to prepare a high-quality training set, your GPU likely can handle a longer training duration without issue).<br>

-## Q10:How much training set duration is needed?
+## Q9:How much training set duration is needed?

 A dataset of around 10min to 50min is recommended.<br>

@@ -69,29 +64,29 @@ There are some people who have trained successfully with 1min to 2min data, but
 Data of less than 1min duration has not been successfully attempted so far. This is not recommended.<br>


-## Q11:What is the index rate for and how to adjust it?
+## Q10:What is the index rate for and how to adjust it?
 If the tone quality of the pre-trained model and inference source is higher than that of the training set, they can bring up the tone quality of the inference result, but at the cost of a possible tone bias towards the tone of the underlying model/inference source rather than the tone of the training set, which is generally referred to as "tone leakage".<br>

 The index rate is used to reduce/resolve the timbre leakage problem. If the index rate is set to 1, theoretically there is no timbre leakage from the inference source and the timbre quality is more biased towards the training set. If the training set has a lower sound quality than the inference source, then a higher index rate may reduce the sound quality. Turning it down to 0 does not have the effect of using retrieval blending to protect the training set tones.<br>

 If the training set has good audio quality and long duration, turn up the total_epoch, when the model itself is less likely to refer to the inferred source and the pretrained underlying model, and there is little "tone leakage", the index_rate is not important and you can even not create/share the index file.<br>

-## Q12:How to choose the gpu when inferring?
+## Q11:How to choose the gpu when inferring?
 In the config.py file, select the card number after "device cuda:".<br>

 The mapping between card number and graphics card can be seen in the graphics card information section of the training tab.<br>

-## Q13:How to use the model saved in the middle of training?
+## Q12:How to use the model saved in the middle of training?
 Save via model extraction at the bottom of the ckpt processing tab.

-## Q14:File/memory error(when training)?
+## Q13:File/memory error(when training)?
 Too many processes and your memory is not enough. You may fix it by:

 1、decrease the input in field "Threads of CPU".

 2、pre-cut trainset to shorter audio files.

-## Q15: How to continue training using more data
+## Q14: How to continue training using more data

 step1: put all wav data to path2.

@@ -101,19 +96,19 @@ step3: copy the latest G and D file of exp_name1 (your previous experiment) into

 step4: click "train the model", and it will continue training from the beginning of your previous exp model epoch.

-## Q16: error about llvmlite.dll
+## Q15: error about llvmlite.dll

 OSError: Could not load shared object file: llvmlite.dll

 FileNotFoundError: Could not find module lib\site-packages\llvmlite\binding\llvmlite.dll (or one of its dependencies). Try using the full path with constructor syntax.

 The issue will happen in windows, install https://aka.ms/vs/17/release/vc_redist.x64.exe and it will be fixed.

-## Q17: RuntimeError: The expanded size of the tensor (17280) must match the existing size (0) at non-singleton dimension 1. Target sizes: [1, 17280]. Tensor sizes: [0]
+## Q16: RuntimeError: The expanded size of the tensor (17280) must match the existing size (0) at non-singleton dimension 1. Target sizes: [1, 17280]. Tensor sizes: [0]

 Delete the wav files whose size is significantly smaller than others, and that won't happen again. Than click "train the model"and "train the index".

-## Q18: RuntimeError: The size of tensor a (24) must match the size of tensor b (16) at non-singleton dimension 2
+## Q17: RuntimeError: The size of tensor a (24) must match the size of tensor b (16) at non-singleton dimension 2

 Do not change the sampling rate and then continue training. If it is necessary to change, the exp name should be changed and the model will be trained from scratch. You can also copy the pitch and features (0/1/2/2b folders) extracted last time to accelerate the training process.

docs/fr/README.fr.md

Lines changed: 0 additions & 11 deletions
@@ -112,16 +112,6 @@ Here is a list of the models and other files required by RVC:

 ./assets/pretrained_v2

-# If you are using Windows, you may need these files for ffmpeg and ffprobe; skip this step if ffmpeg and ffprobe are already installed. Ubuntu/Debian users can install these two libraries with apt install ffmpeg. Mac users can install them with brew install ffmpeg (prerequisite: brew installed).
-
-# ./ffmpeg
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe
-
-# ./ffprobe
-
-https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe
-
 # If you want to use the latest RMVPE vocal pitch algorithm, download the pitch model parameters and place them in the RVC root directory.

 https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt
@@ -167,7 +157,6 @@ python web.py
 + [VITS](https://github.com/jaywalnut310/vits)
 + [HIFIGAN](https://github.com/jik876/hifi-gan)
 + [Gradio](https://github.com/gradio-app/gradio)
-+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 + [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
 + [audio-slicer](https://github.com/openvpi/audio-slicer)
 + [Vocal pitch extraction: RMVPE](https://github.com/Dream-High/RMVPE)
