Commit e17c428

Finish docs for v2.3.0 release (#177)
* Complete configuration schemas for tension and voicing
* Add docs for choosing variance parameters
1 parent cad4830 commit e17c428

4 files changed

+179
-1
lines changed

configs/templates/config_acoustic.yaml

Lines changed: 4 additions & 0 deletions
@@ -24,10 +24,14 @@ vocoder_ckpt: checkpoints/nsf_hifigan_44.1k_hop512_128bin_2024.02/model.ckpt
2424

2525
use_spk_id: false
2626
num_spk: 1
27+
28+
# NOTICE: before enabling variance embeddings, please read the docs at
29+
# https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters
2730
use_energy_embed: false
2831
use_breathiness_embed: false
2932
use_voicing_embed: false
3033
use_tension_embed: false
34+
3135
use_key_shift_embed: true
3236
use_speed_embed: true
3337

configs/templates/config_variance.yaml

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,8 @@ use_spk_id: false
2323
num_spk: 1
2424
predict_dur: true
2525
predict_pitch: true
26+
# NOTICE: before enabling variance predictions, please read the docs at
27+
# https://github.com/openvpi/DiffSinger/tree/main/docs/BestPractices.md#choosing-variance-parameters
2628
predict_energy: false
2729
predict_breathiness: false
2830
predict_voicing: false

docs/BestPractices.md

Lines changed: 45 additions & 1 deletion
@@ -183,13 +183,57 @@ The DS files should also use the same dictionary as that of your target model. T
183183
| `f0_seq` | ✓ | ✓ | ✓ | WAV | DS/WAV |
184184
| `energy`, `breathiness`, ... | | | ✓ | WAV | DS/WAV |
185185

186-
This means you only need one column in trancriptions.csv, the `name` column, to declare all DS files included in the dataset. The name pattern can be:
186+
This means you only need one column in transcriptions.csv, the `name` column, to declare all DS files included in the dataset. The name pattern can be:
187187

188188
- Full name: `some-name` will first match the first segment in `some-name.ds`.
189189
- Name with index: `some-name#0` and `some-name#1` will match segment 0 and segment 1 in `some-name.ds` if there is no match with the full name.
190190

191191
Though not recommended, the binarizer will still try to load attributes from transcriptions.csv or extract parameters from recordings if there are no matching DS files. In this case the full name matching logic is applied (the same as the normal binarization process).
192192

193+
## Choosing variance parameters
194+
195+
Variance parameters are parameters that are closely related to singing style and emotion, have no default values, and need to be predicted by the variance models. Choosing proper variance parameters can give your singing models more controllability and expressiveness. This section only covers **narrowly defined variance parameters**, that is, variance parameters other than pitch.
196+
197+
### Supported variance parameters
198+
199+
#### Energy
200+
201+
> WARNING
202+
>
203+
> This parameter is no longer recommended. Use the new voicing parameter instead, which is less coupled with breathiness than energy is.
204+
205+
Energy is defined as the RMS curve of the singing, in dB, which can control the strength of voice to a certain extent.
206+
207+
#### Breathiness
208+
209+
Breathiness is defined as the RMS curve of the aperiodic part of the singing, in dB, which can control the power of the air and unvoiced consonants in the voice.
210+
211+
#### Voicing
212+
213+
Voicing is defined as the RMS curve of the harmonic part of the singing, in dB, which can control the power of the harmonics in vowels and voiced consonants in the voice.
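
The three dB-valued parameters above share the same kind of computation: a frame-wise RMS curve converted to dB. The following is a minimal sketch of that idea, not the project's actual extractor. The function name, the framing scheme, and the `floor_db` clamp are assumptions made for illustration, and separating the waveform into its harmonic and aperiodic parts is not shown; the caller decides which signal to pass in.

```python
import math


def rms_db_curve(samples, frame_size, hop_size, floor_db=-96.0):
    """Return the frame-wise RMS of `samples` in dB.

    Hypothetical helper: `floor_db` is an assumed lower clamp that
    keeps silent frames from producing -inf.
    """
    curve = []
    # Slide a window of `frame_size` samples forward by `hop_size`.
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        db = 20.0 * math.log10(rms) if rms > 0 else floor_db
        curve.append(max(db, floor_db))
    return curve
```

Passing the full waveform would give an energy-like curve, the aperiodic part a breathiness-like curve, and the harmonic part a voicing-like curve.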
214+
215+
#### Tension
216+
217+
Tension is mostly related to the ratio of the base harmonic to the full harmonics, which can be used to control the strength and timbre of the voice. The ratio is calculated as
218+
$$
219+
r = \frac{\text{RMS}(H_{full}-H_{base})}{\text{RMS}(H_{full})}
220+
$$
221+
where $H_{full}$ is the full harmonics and $H_{base}$ is the base harmonic. The ratio is then mapped to the final domain via the inverse of the sigmoid function:
222+
$$
223+
T = \log{\frac{r}{1-r}}
224+
$$
225+
where $T$ is the tension value.
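
The two formulas above can be sketched in a few lines. This is an illustrative helper rather than the project's implementation: the function name and the `eps` clamp (which keeps the logarithm finite when $r$ reaches 0 or 1) are assumptions.

```python
import math


def tension_from_rms(rms_full, rms_overtones, eps=1e-5):
    """Map r = RMS(H_full - H_base) / RMS(H_full) to T = log(r / (1 - r)).

    `rms_overtones` stands for RMS(H_full - H_base). The `eps` clamp is
    an assumption of this sketch, not something stated in the docs.
    """
    if rms_full <= 0:
        raise ValueError("rms_full must be positive")
    r = rms_overtones / rms_full
    r = min(max(r, eps), 1.0 - eps)  # keep r strictly inside (0, 1)
    return math.log(r / (1.0 - r))
```

With this mapping, $r = 0.5$ gives $T = 0$, and ratios above or below 0.5 give positive or negative tension values respectively.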
226+
227+
### Principles of choosing multiple parameters
228+
229+
#### Energy, breathiness and voicing
230+
231+
These three parameters should **NOT** be enabled together. Energy is the RMS of the full waveform, which is the composition of the harmonic part and the aperiodic part. Therefore, these three parameters are coupled with each other.
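
As a sanity check before training, the rule above can be expressed as a small validation step over a loaded configuration. The helper below is hypothetical and not part of the repository; it only assumes that the configuration is a plain dictionary using the `use_*_embed` and `predict_*` keys from the config templates.

```python
COUPLED = ("energy", "breathiness", "voicing")


def check_variance_config(config):
    """Raise ValueError if all three coupled parameters are enabled.

    Hypothetical helper: checks the acoustic-style keys (use_*_embed)
    and the variance-style keys (predict_*) separately.
    """
    for prefix, suffix in (("use_", "_embed"), ("predict_", "")):
        enabled = [
            name for name in COUPLED
            if config.get(f"{prefix}{name}{suffix}", False)
        ]
        if len(enabled) == len(COUPLED):
            keys = ", ".join(f"{prefix}{n}{suffix}" for n in enabled)
            raise ValueError(f"coupled parameters enabled together: {keys}")


# Two of the three may be enabled; this call passes silently.
check_variance_config({
    "use_breathiness_embed": True,
    "use_voicing_embed": True,
    "use_energy_embed": False,
})
```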
232+
233+
#### Energy, voicing and tension
234+
235+
When voicing (or energy) is enabled, it almost fixes the loudness. However, tension sometimes relies on the implicitly predicted loudness for more expressiveness, because a person who sings with higher tension always produces a louder voice. For this reason, some people may find their models or datasets _less natural_ with tension control. To be specific, changing tension will change the timbre but keep the loudness, and changing voicing (or energy) will change the loudness but keep the timbre. This behavior can be suitable for some, but not all, datasets and users. Therefore, it is highly recommended that everyone run some experiments on the actual datasets used to train the model.
236+
193237
## Mutual influence between variance modules
194238

195239
In recent experiments and research, some mutual influence between the modules of variance models has been found. In practice, being aware of this influence and making use of it can improve accuracy and avoid instability of the model.

docs/ConfigurationSchemas.md

Lines changed: 128 additions & 0 deletions
@@ -1410,6 +1410,30 @@ Whether to enable pitch prediction.
14101410
<tr><td align="center"><b>default</b></td><td>true</td>
14111411
</tbody></table>
14121412

1413+
### predict_tension
1414+
1415+
Whether to enable tension prediction.
1416+
1417+
<table><tbody>
1418+
<tr><td align="center"><b>visibility</b></td><td>variance</td>
1419+
<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, training, inference</td>
1420+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1421+
<tr><td align="center"><b>type</b></td><td>bool</td>
1422+
<tr><td align="center"><b>default</b></td><td>true</td>
1423+
</tbody></table>
1424+
1425+
### predict_voicing
1426+
1427+
Whether to enable voicing prediction.
1428+
1429+
<table><tbody>
1430+
<tr><td align="center"><b>visibility</b></td><td>variance</td>
1431+
<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, training, inference</td>
1432+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1433+
<tr><td align="center"><b>type</b></td><td>bool</td>
1434+
<tr><td align="center"><b>default</b></td><td>true</td>
1435+
</tbody></table>
1436+
14131437
### raw_data_dir
14141438

14151439
Path(s) to the raw dataset including wave files, transcriptions, etc.
@@ -1637,6 +1661,50 @@ Task trainer class name.
16371661
<tr><td align="center"><b>type</b></td><td>str</td>
16381662
</tbody></table>
16391663

1664+
### tension_logit_max
1665+
1666+
Maximum tension logit value used for normalization to [-1, 1]. Logit is the inverse of the sigmoid function:
1667+
1668+
$$
1669+
f(x) = \ln\frac{x}{1-x}
1670+
$$
1671+
1672+
<table><tbody>
1673+
<tr><td align="center"><b>visibility</b></td><td>variance</td>
1674+
<tr><td align="center"><b>scope</b></td><td>inference</td>
1675+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1676+
<tr><td align="center"><b>type</b></td><td>float</td>
1677+
<tr><td align="center"><b>default</b></td><td>10.0</td>
1678+
</tbody></table>
1679+
1680+
### tension_logit_min
1681+
1682+
Minimum tension logit value used for normalization to [-1, 1]. Logit is the inverse of the sigmoid function:
1683+
1684+
$$
1685+
f(x) = \ln\frac{x}{1-x}
1686+
$$
1687+
1688+
<table><tbody>
1689+
<tr><td align="center"><b>visibility</b></td><td>variance</td>
1690+
<tr><td align="center"><b>scope</b></td><td>inference</td>
1691+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1692+
<tr><td align="center"><b>type</b></td><td>float</td>
1693+
<tr><td align="center"><b>default</b></td><td>-10.0</td>
1694+
</tbody></table>
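
To illustrate how `tension_logit_min` and `tension_logit_max` could bound the normalization, here is a minimal sketch. It assumes a clipped linear mapping onto [-1, 1]; the docs only state the target range, so the exact mapping and both function names are assumptions.

```python
def normalize(value, vmin=-10.0, vmax=10.0):
    """Clip `value` to [vmin, vmax] and map it linearly onto [-1, 1].

    The defaults mirror the documented tension_logit_min/max defaults;
    the linear mapping itself is an assumption of this sketch.
    """
    clipped = min(max(value, vmin), vmax)
    return (clipped - vmin) / (vmax - vmin) * 2.0 - 1.0


def denormalize(norm, vmin=-10.0, vmax=10.0):
    """Inverse of `normalize` for values that were not clipped."""
    return (norm + 1.0) / 2.0 * (vmax - vmin) + vmin
```

Under these assumptions, a logit of 0 maps to the middle of the range, and any logit outside the configured bounds saturates at -1 or 1.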
1695+
1696+
### tension_smooth_width
1697+
1698+
Length (in seconds) of the sinusoidal smoothing convolution kernel applied to the extracted tension curve.
1699+
1700+
<table><tbody>
1701+
<tr><td align="center"><b>visibility</b></td><td>acoustic, variance</td>
1702+
<tr><td align="center"><b>scope</b></td><td>preprocessing</td>
1703+
<tr><td align="center"><b>customizability</b></td><td>normal</td>
1704+
<tr><td align="center"><b>type</b></td><td>float</td>
1705+
<tr><td align="center"><b>default</b></td><td>0.12</td>
1706+
</tbody></table>
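
One way to picture a smoothing width given in seconds is to convert it into a number of frames and build a normalized half-sine kernel. The sketch below is an assumption about what "sinusoidal smoothing" could look like, not the project's code: the kernel shape, the edge handling, and the `timestep` argument (frame period in seconds, for example hop size divided by sample rate) are all illustrative.

```python
import math


def sinusoidal_kernel(width_seconds, timestep):
    """Build a half-sine kernel spanning `width_seconds`, summing to 1."""
    n = max(int(round(width_seconds / timestep)), 1)
    # Sample one half-period of a sine wave at the centre of each tap.
    weights = [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(curve, kernel):
    """Convolve `curve` with `kernel`, repeating the edge values."""
    half = len(kernel) // 2
    last = len(curve) - 1
    out = []
    for i in range(len(curve)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), last)  # clamp index at the edges
            acc += w * curve[j]
        out.append(acc)
    return out


# Assumed frame period: 512-sample hop at 44.1 kHz.
kernel = sinusoidal_kernel(0.12, 512 / 44100)
```

Because the kernel is normalized, smoothing leaves a constant curve unchanged and only attenuates fast fluctuations.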
1707+
16401708
### test_prefixes
16411709

16421710
List of data item names or name prefixes for the validation set. For each string `s` in the list:
@@ -1775,6 +1843,30 @@ Whether embed the speaker id from a multi-speaker dataset.
17751843
<tr><td align="center"><b>default</b></td><td>false</td>
17761844
</tbody></table>
17771845

1846+
### use_tension_embed
1847+
1848+
Whether to accept and embed tension values into the model.
1849+
1850+
<table><tbody>
1851+
<tr><td align="center"><b>visibility</b></td><td>acoustic</td>
1852+
<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, inference</td>
1853+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1854+
<tr><td align="center"><b>type</b></td><td>boolean</td>
1855+
<tr><td align="center"><b>default</b></td><td>false</td>
1856+
</tbody></table>
1857+
1858+
### use_voicing_embed
1859+
1860+
Whether to accept and embed voicing values into the model.
1861+
1862+
<table><tbody>
1863+
<tr><td align="center"><b>visibility</b></td><td>acoustic</td>
1864+
<tr><td align="center"><b>scope</b></td><td>nn, preprocessing, inference</td>
1865+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1866+
<tr><td align="center"><b>type</b></td><td>boolean</td>
1867+
<tr><td align="center"><b>default</b></td><td>false</td>
1868+
</tbody></table>
1869+
17781870
### val_check_interval
17791871

17801872
Interval (in number of training steps) between validation checks.
@@ -1870,6 +1962,42 @@ Path of the vocoder model.
18701962
<tr><td align="center"><b>default</b></td><td>checkpoints/nsf_hifigan/model</td>
18711963
</tbody></table>
18721964

1965+
### voicing_db_max
1966+
1967+
Maximum voicing value in dB used for normalization to [-1, 1].
1968+
1969+
<table><tbody>
1970+
<tr><td align="center"><b>visibility</b></td><td>variance</td>
1971+
<tr><td align="center"><b>scope</b></td><td>inference</td>
1972+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1973+
<tr><td align="center"><b>type</b></td><td>float</td>
1974+
<tr><td align="center"><b>default</b></td><td>-20.0</td>
1975+
</tbody></table>
1976+
1977+
### voicing_db_min
1978+
1979+
Minimum voicing value in dB used for normalization to [-1, 1].
1980+
1981+
<table><tbody>
1982+
<tr><td align="center"><b>visibility</b></td><td>acoustic, variance</td>
1983+
<tr><td align="center"><b>scope</b></td><td>inference</td>
1984+
<tr><td align="center"><b>customizability</b></td><td>recommended</td>
1985+
<tr><td align="center"><b>type</b></td><td>float</td>
1986+
<tr><td align="center"><b>default</b></td><td>-96.0</td>
1987+
</tbody></table>
1988+
1989+
### voicing_smooth_width
1990+
1991+
Length (in seconds) of the sinusoidal smoothing convolution kernel applied to the extracted voicing curve.
1992+
1993+
<table><tbody>
1994+
<tr><td align="center"><b>visibility</b></td><td>acoustic, variance</td>
1995+
<tr><td align="center"><b>scope</b></td><td>preprocessing</td>
1996+
<tr><td align="center"><b>customizability</b></td><td>normal</td>
1997+
<tr><td align="center"><b>type</b></td><td>float</td>
1998+
<tr><td align="center"><b>default</b></td><td>0.12</td>
1999+
</tbody></table>
2000+
18732001
### win_size
18742002

18752003
Window size for mel or feature extraction.
