
Voiced/Unvoiced Consonant Issue w/ F0 Curve || CLI Renderer Issue #284

@it-owen

Description


Hello! I hope this finds you well.

I've come across two issues recently and was wondering whether they might be related. I recently trained a model with the variance parameters Tension, Voicing, Energy, and Breathiness. It's a bit unorthodox, as tension/voicing were successors to energy/breathiness; however, I find that it gives me more freedom with models.

I've since come across an issue where deep F0 curves can cause unvoiced consonants to become voiced, which leads to mispronounced words/phrases. Using OpenUTAU's live curve display, I discovered that voicing would go upward rather than staying toward the bottom of the curve. While redrawing the curve fixes the problem, it makes using the model more tedious. I switched between WORLD/VR for my hnsep and RMVPE/Parselmouth for my pe, but no combination solved the issue.

Without F0 modifications, the model retains [en/k] properly, though the [k] is enunciated as an unaspirated sound.
Image

Here is an example of a tuned note where [en/k] retains its proper sound, but is now aspirated:
Image

In this example, however, the dip causes a slight increase in voicing, which transforms the [en/k] sound into an [en/g] sound:
Image

I've checked through the labels and run scripts to confirm there was no voicing within the consonant, yet the issue persists.
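For reference, the check I ran was along these lines. This is only a minimal sketch: the key names (`ph_seq`, `ph_dur`, `voicing`, `voicing_timestep`) are what my own exported .ds files contain and may differ between exporter versions.

```python
import json

def voicing_in_phoneme(ds_path, phoneme):
    """Scan the voicing curve inside every occurrence of `phoneme`.

    Assumes .ds segments carry space-separated ph_seq/ph_dur strings
    and a voicing curve sampled at voicing_timestep intervals.
    """
    with open(ds_path, encoding="utf-8") as f:
        segments = json.load(f)
    if isinstance(segments, dict):  # some exports wrap a single segment
        segments = [segments]
    hits = []
    for seg in segments:
        phonemes = seg["ph_seq"].split()
        durations = [float(d) for d in seg["ph_dur"].split()]
        curve = [float(v) for v in seg["voicing"].split()]
        step = float(seg["voicing_timestep"])
        t = 0.0
        for ph, dur in zip(phonemes, durations):
            if ph == phoneme:
                window = curve[int(t / step):int((t + dur) / step)]
                if window:
                    # (start, end, peak voicing) per occurrence
                    hits.append((round(t, 6), round(t + dur, 6), max(window)))
            t += dur
    return hits
```

This at least confirmed that the consonant windows themselves don't carry unexpected voicing before inference.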

While trying to figure this out, I decided to use the CLI interface to see whether the issue lay with OpenUTAU. I exported a sequence from OpenUTAU with the F0 curve maintained and inferred variance with this command:
python scripts/infer.py variance "B:\Diffsinger\Blue.ds" --exp Singer1 --lang en --spk Spk1 --predict tension --predict energy --predict voicing --predict breathiness

I proceeded to infer the acoustic part of the sequence with:
python scripts/infer.py acoustic "B:\Diffsinger\Blue.ds" --exp Singer1 --lang en --spk Spk1
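Between the two steps, I also sanity-checked that the .ds file handed to the acoustic stage actually carries the four predicted parameter curves (whether the variance step writes them back into the same file or into a separate output file is an assumption I wanted to verify). A quick stdlib check:

```python
import json

# Parameter curves the acoustic step should find in the .ds file;
# this list mirrors the --predict flags I passed to the variance step.
EXPECTED = ("tension", "energy", "voicing", "breathiness")

def missing_params(ds_path):
    """Return the names of expected curves absent from any segment."""
    with open(ds_path, encoding="utf-8") as f:
        segments = json.load(f)
    if isinstance(segments, dict):
        segments = [segments]
    missing = set()
    for seg in segments:
        for key in EXPECTED:
            if key not in seg:
                missing.add(key)
    return sorted(missing)
```

An empty result would rule out the acoustic step silently falling back to defaults for a missing curve.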

I'm not quite sure whether there was an issue with my execution of the CLI commands, so please let me know!

Image

As shown in the image, the CLI inference has more errors, which seem related to voicing and breathiness specifically. Both renders came from the same checkpoint at the same step count, except that the copy used in OpenUTAU was exported to ONNX beforehand, as required. Yet the two yielded different results.
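To compare the two renders numerically rather than by ear, I used a rough, dependency-free voicing proxy: short-time zero-crossing rate over a chosen window (voiced stretches cross zero slowly; unvoiced consonants cross rapidly). This is only a heuristic stand-in for a real pitch tracker, and it assumes 16-bit PCM WAV renders; the paths and windows you'd pass in are illustrative.

```python
import struct
import wave

def zcr_in_window(wav_path, t_start, t_end):
    """Zero crossings per sample in [t_start, t_end] seconds.

    Assumes 16-bit PCM; for multi-channel files only the first
    channel is used. Low values suggest voiced audio, high values
    unvoiced.
    """
    with wave.open(wav_path, "rb") as w:
        sr = w.getframerate()
        n_channels = w.getnchannels()
        w.setpos(int(t_start * sr))
        raw = w.readframes(int((t_end - t_start) * sr))
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)[::n_channels]
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / max(len(samples) - 1, 1)

# e.g. around the [k] consonant in each render (window times are
# placeholders to be read off the piano roll):
# zcr_in_window("openutau_render.wav", 1.20, 1.30)
# zcr_in_window("cli_render.wav", 1.20, 1.30)
```

Comparing the same consonant window in both renders gave a concrete number for how much more voiced the CLI output was, rather than relying on listening alone.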

Would you be able to point me in the right direction on the CLI inference, as well as any config edits that could be made? I'd love to narrow down both the voiced/unvoiced issue and the CLI inference issue, if they're related!
