Finish docs for v2.3.0 release (#177)

yqzhishen · web-flow · commit e17c428d42fd · 2024-03-11T00:52:22.000+08:00
* Complete configuration schemas for tension and voicing

* Add docs for choosing variance parameters
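
Reviewer note: the tension definition and the `tension_logit_min` / `tension_logit_max` normalization documented in this change can be illustrated with a minimal sketch. The function names below are hypothetical and do not come from the repository. The epsilon clamp and the linear, clipped mapping to [-1, 1] are assumptions made for illustration, since the docs only state that the values are "used for normalization to [-1, 1]".

```python
import math


def rms(samples):
    """Root mean square of a non-empty sequence of floats."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def harmonic_ratio(h_full, h_base):
    """r = RMS(H_full - H_base) / RMS(H_full), as written in the docs."""
    residual = [f - b for f, b in zip(h_full, h_base)]
    return rms(residual) / rms(h_full)


def tension_logit(ratio, eps=1e-6):
    """T = ln(r / (1 - r)), the inverse of Sigmoid.

    The clamp to [eps, 1 - eps] is an assumption added here to avoid
    log(0) and division by zero; it is not taken from the docs.
    """
    r = min(max(ratio, eps), 1.0 - eps)
    return math.log(r / (1.0 - r))


def normalize(value, vmin, vmax):
    """Map [vmin, vmax] linearly to [-1, 1], clipping outside values.

    Linear mapping with clipping is an assumed interpretation of
    "normalization to [-1, 1]".
    """
    clipped = min(max(value, vmin), vmax)
    return 2.0 * (clipped - vmin) / (vmax - vmin) - 1.0


# Toy signals: the base harmonic carries half the amplitude of the full harmonics.
h_full = [1.0, -1.0, 1.0, -1.0]
h_base = [0.5, -0.5, 0.5, -0.5]

r = harmonic_ratio(h_full, h_base)
t = tension_logit(r)
t_norm = normalize(t, -10.0, 10.0)  # documented defaults for tension_logit_min/max
```

With the documented defaults, a ratio of 0.5 gives a logit of 0 and therefore sits at the center of the normalized range. The same `normalize` helper would apply to voicing with `voicing_db_min` and `voicing_db_max`, under the same linear-mapping assumption.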
diff --git a/configs/templates/config_acoustic.yaml b/configs/templates/config_acoustic.yaml
@@ -24,10 +24,14 @@ vocoder_ckpt: checkpoints/nsf_hifigan_44.1k_hop512_128bin_2024.02/model.ckpt
 
 use_spk_id: false
 num_spk: 1
+
+# NOTICE: before enabling variance embeddings, please read the docs at
+# https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters
 use_energy_embed: false
 use_breathiness_embed: false
 use_voicing_embed: false
 use_tension_embed: false
+
 use_key_shift_embed: true
 use_speed_embed: true
 
diff --git a/configs/templates/config_variance.yaml b/configs/templates/config_variance.yaml
@@ -23,6 +23,8 @@ use_spk_id: false
 num_spk: 1
 predict_dur: true
 predict_pitch: true
+# NOTICE: before enabling variance predictions, please read the docs at
+# https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters
 predict_energy: false
 predict_breathiness: false
 predict_voicing: false
diff --git a/docs/BestPractices.md b/docs/BestPractices.md
@@ -183,13 +183,57 @@ The DS files should also use the same dictionary as that of your target model. T
 |           `f0_seq`           |                ✓                |              ✓               |                     ✓                      |       WAV       |     DS/WAV     |
 | `energy`, `breathiness`, ... |                                 |                              |                     ✓                      |       WAV       |     DS/WAV     |
 
-This means you only need one column in trancriptions.csv, the `name` column, to declare all DS files included in the dataset. The name pattern can be:
+This means you only need one column in transcriptions.csv, the `name` column, to declare all DS files included in the dataset. The name pattern can be:
 
 - Full name: `some-name` will firstly match the first segment in `some-name.ds`.
 - Name with index: `some-name#0` and `some-name#1` will match segment 0 and segment 1 in `some-name.ds` if there are no match with full name.
 
 Though not recommended, the binarizer will still try to load attributes from transcriptions.csv or extract parameters from recordings if there are no matching DS files. In this case the full name matching logic is applied (the same as the normal binarization process).
 
+## Choosing variance parameters
+
+Variance parameters are parameters that are closely related to singing style and emotion, have no default values, and must be predicted by variance models. Choosing the proper variance parameters gives your singing models more controllability and expressiveness. This section only covers **narrowly defined variance parameters**, that is, all variance parameters except pitch.
+
+### Supported variance parameters
+
+#### Energy
+
+> WARNING
+>
+> This parameter is no longer recommended in favor of the new voicing parameter, which is less coupled with breathiness than energy is.
+
+Energy is defined as the RMS curve of the singing, in dB, which can control the strength of voice to a certain extent.
+
+#### Breathiness
+
+Breathiness is defined as the RMS curve of the aperiodic part of the singing, in dB, which can control the power of the air and unvoiced consonants in the voice.
+
+#### Voicing
+
+Voicing is defined as the RMS curve of the harmonic part of the singing, in dB, which can control the power of the harmonics in vowels and voiced consonants in the voice.
+
+#### Tension
+
+Tension is mostly related to the ratio of the base harmonic to the full harmonics, which can be used to control the strength and timbre of the voice. The ratio is calculated as
+$$
+r = \frac{\text{RMS}(H_{full}-H_{base})}{\text{RMS}(H_{full})}
+$$
+where $H_{full}$ is the full harmonics and $H_{base}$ is the base harmonic. The ratio is then mapped to the final domain via the inverse function of Sigmoid (the logit function):
+$$
+T = \ln{\frac{r}{1-r}}
+$$
+where $T$ is the tension value.
+
+### Principles of choosing multiple parameters
+
+#### Energy, breathiness and voicing
+
+These three parameters should **NOT** be enabled together. Energy is the RMS of the full waveform, which is the composition of the harmonic part and the aperiodic part. Therefore, these three parameters are coupled with each other.
+
+#### Energy, voicing and tension
+
+When voicing (or energy) is enabled, it almost fixes the loudness. However, tension sometimes relies on the implicitly predicted loudness for more expressiveness, because a person singing with higher tension usually also produces a louder voice. For this reason, some people may find their models or datasets _less natural_ with tension control. To be specific, changing tension will change the timbre but keep the loudness, and changing voicing (or energy) will change the loudness but keep the timbre. This behavior suits some datasets and users, but not all. Therefore, it is highly recommended to run some experiments on the actual datasets used to train the model.
+
 ## Mutual influence between variance modules
 
 In some recent experiments and researches, some mutual influence between the modules of variance models has been found. In practice, being aware of the influence and making use of it can improve accuracy and avoid instability of the model.
diff --git a/docs/ConfigurationSchemas.md b/docs/ConfigurationSchemas.md
@@ -1410,6 +1410,30 @@ Whether to enable pitch prediction.
 <tr><td align="center"><b>default</b></td><td>true</td>
 </tbody></table>
 
+### predict_tension
+
+Whether to enable tension prediction.
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>variance</td>
+<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, training, inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>bool</td>
+<tr><td align="center"><b>default</b></td><td>true</td>
+</tbody></table>
+
+### predict_voicing
+
+Whether to enable voicing prediction.
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>variance</td>
+<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, training, inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>bool</td>
+<tr><td align="center"><b>default</b></td><td>true</td>
+</tbody></table>
+
 ### raw_data_dir
 
 Path(s) to the raw dataset including wave files, transcriptions, etc.
@@ -1637,6 +1661,50 @@ Task trainer class name.
 <tr><td align="center"><b>type</b></td><td>str</td>
 </tbody></table>
 
+### tension_logit_max
+
+Maximum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid:
+
+$$
+f(x) = \ln\frac{x}{1-x}
+$$
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>variance</td>
+<tr><td align="center"><b>scope</b></td><td>inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>float</td>
+<tr><td align="center"><b>default</b></td><td>10.0</td>
+</tbody></table>
+
+### tension_logit_min
+
+Minimum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid:
+
+$$
+f(x) = \ln\frac{x}{1-x}
+$$
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>variance</td>
+<tr><td align="center"><b>scope</b></td><td>inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>float</td>
+<tr><td align="center"><b>default</b></td><td>-10.0</td>
+</tbody></table>
+
+### tension_smooth_width
+
+Length (in seconds) of the sinusoidal smoothing convolution kernel applied to the extracted tension curve.
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>acoustic, variance</td>
+<tr><td align="center"><b>scope</b></td><td>preprocessing</td>
+<tr><td align="center"><b>customizability</b></td><td>normal</td>
+<tr><td align="center"><b>type</b></td><td>float</td>
+<tr><td align="center"><b>default</b></td><td>0.12</td>
+</tbody></table>
+
 ### test_prefixes
 
 List of data item names or name prefixes for the validation set. For each string `s` in the list:
@@ -1775,6 +1843,30 @@ Whether embed the speaker id from a multi-speaker dataset.
 <tr><td align="center"><b>default</b></td><td>false</td>
 </tbody></table>
 
+### use_tension_embed
+
+Whether to accept and embed tension values into the model.
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>acoustic</td>
+<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>boolean</td>
+<tr><td align="center"><b>default</b></td><td>false</td>
+</tbody></table>
+
+### use_voicing_embed
+
+Whether to accept and embed voicing values into the model.
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>acoustic</td>
+<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>boolean</td>
+<tr><td align="center"><b>default</b></td><td>false</td>
+</tbody></table>
+
 ### val_check_interval
 
 Interval (in number of training steps) between validation checks.
@@ -1870,6 +1962,42 @@ Path of the vocoder model.
 <tr><td align="center"><b>default</b></td><td>checkpoints/nsf_hifigan/model</td>
 </tbody></table>
 
+### voicing_db_max
+
+Maximum voicing value in dB used for normalization to [-1, 1].
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>variance</td>
+<tr><td align="center"><b>scope</b></td><td>inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>float</td>
+<tr><td align="center"><b>default</b></td><td>-20.0</td>
+</tbody></table>
+
+### voicing_db_min
+
+Minimum voicing value in dB used for normalization to [-1, 1].
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>acoustic, variance</td>
+<tr><td align="center"><b>scope</b></td><td>inference</td>
+<tr><td align="center"><b>customizability</b></td><td>recommended</td>
+<tr><td align="center"><b>type</b></td><td>float</td>
+<tr><td align="center"><b>default</b></td><td>-96.0</td>
+</tbody></table>
+
+### voicing_smooth_width
+
+Length (in seconds) of the sinusoidal smoothing convolution kernel applied to the extracted voicing curve.
+
+<table><tbody>
+<tr><td align="center"><b>visibility</b></td><td>acoustic, variance</td>
+<tr><td align="center"><b>scope</b></td><td>preprocessing</td>
+<tr><td align="center"><b>customizability</b></td><td>normal</td>
+<tr><td align="center"><b>type</b></td><td>float</td>
+<tr><td align="center"><b>default</b></td><td>0.12</td>
+</tbody></table>
+
 ### win_size
 
 Window size for mel or feature extraction.