
Voiced/Unvoiced Consonant Issue w/ F0 Curve || CLI Renderer Issue #284

@it-owen

Description


Hello! I hope this finds you well.

I've come across two issues recently and was wondering whether they might be related. I recently trained a model with the variance parameters Tension, Voicing, Energy, and Breathiness. It's a bit unorthodox, as tension/voicing were successors to energy/breathiness; however, I find that it gives me more freedom with models.

I've since come across an issue where deep F0 curves can cause unvoiced consonants to become voiced, which leads to mispronounced words/phrases. Using OpenUTAU's live curve display, I discovered that voicing would go upward rather than staying toward the bottom of the curve. While redrawing the curve fixes the problem, it makes using the model more tedious. I switched between WORLD/VR for my hnsep and RMVPE/Parselmouth for my pe, but no combination solved the issue.

Without F0 modifications, the model retains [en/k] properly, though the [k] is enunciated as an unaspirated sound.
Image

Here is an example of a tuned note where [en/k] retains its proper sound, but is now aspirated:
Image

In this example, however, the dip causes a slight increase in voicing, which transforms the [en/k] sound into an [en/g] sound:
Image

I've checked through the labels and run scripts to confirm there was no voicing within the consonant, yet the issue persists.
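For reference, the check I ran was along these lines. This is only a minimal sketch: the key names (`ph_seq`, `ph_dur`, `voicing`, `voicing_timestep`) are what my own exported .ds files contain and may differ between exporter versions.

```python
import json

def voicing_in_phoneme(ds_path, phoneme):
    """Scan the voicing curve inside every occurrence of `phoneme`.

    Assumes .ds segments carry space-separated ph_seq/ph_dur strings
    and a voicing curve sampled at voicing_timestep intervals.
    """
    with open(ds_path, encoding="utf-8") as f:
        segments = json.load(f)
    if isinstance(segments, dict):  # some exports wrap a single segment
        segments = [segments]
    hits = []
    for seg in segments:
        phonemes = seg["ph_seq"].split()
        durations = [float(d) for d in seg["ph_dur"].split()]
        curve = [float(v) for v in seg["voicing"].split()]
        step = float(seg["voicing_timestep"])
        t = 0.0
        for ph, dur in zip(phonemes, durations):
            if ph == phoneme:
                window = curve[int(t / step):int((t + dur) / step)]
                if window:
                    # (start, end, peak voicing) per occurrence
                    hits.append((round(t, 6), round(t + dur, 6), max(window)))
            t += dur
    return hits
```

This at least confirmed that the consonant windows themselves don't carry unexpected voicing before inference.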

While trying to figure this out, I decided to use the CLI interface to see whether the issue lay with OpenUTAU. I exported a sequence from OpenUTAU with the F0 curve maintained and inferred variance with this command:
python scripts/infer.py variance "B:\Diffsinger\Blue.ds" --exp Singer1 --lang en --spk Spk1 --predict tension --predict energy --predict voicing --predict breathiness

I proceeded to infer the acoustic part of the sequence with:
python scripts/infer.py acoustic "B:\Diffsinger\Blue.ds" --exp Singer1 --lang en --spk Spk1
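Between the two steps, I also sanity-checked that the .ds file handed to the acoustic stage actually carries the four predicted parameter curves (whether the variance step writes them back into the same file or into a separate output file is an assumption I wanted to verify). A quick stdlib check:

```python
import json

# Parameter curves the acoustic step should find in the .ds file;
# this list mirrors the --predict flags I passed to the variance step.
EXPECTED = ("tension", "energy", "voicing", "breathiness")

def missing_params(ds_path):
    """Return the names of expected curves absent from any segment."""
    with open(ds_path, encoding="utf-8") as f:
        segments = json.load(f)
    if isinstance(segments, dict):
        segments = [segments]
    missing = set()
    for seg in segments:
        for key in EXPECTED:
            if key not in seg:
                missing.add(key)
    return sorted(missing)
```

An empty result would rule out the acoustic step silently falling back to defaults for a missing curve.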

I'm not quite sure whether there was an issue with my execution of the CLI commands, so please let me know!

Image

As shown in the image, the CLI inference has more errors, which seem related to voicing and breathiness specifically. Both renders came from the same checkpoint at the same step count, except that the copy used in OpenUTAU was exported to ONNX beforehand, as required. Yet the two yielded different results.
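To compare the two renders numerically rather than by ear, I used a rough, dependency-free voicing proxy: short-time zero-crossing rate over a chosen window (voiced stretches cross zero slowly; unvoiced consonants cross rapidly). This is only a heuristic stand-in for a real pitch tracker, and it assumes 16-bit PCM WAV renders; the paths and windows you'd pass in are illustrative.

```python
import struct
import wave

def zcr_in_window(wav_path, t_start, t_end):
    """Zero crossings per sample in [t_start, t_end] seconds.

    Assumes 16-bit PCM; for multi-channel files only the first
    channel is used. Low values suggest voiced audio, high values
    unvoiced.
    """
    with wave.open(wav_path, "rb") as w:
        sr = w.getframerate()
        n_channels = w.getnchannels()
        w.setpos(int(t_start * sr))
        raw = w.readframes(int((t_end - t_start) * sr))
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)[::n_channels]
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / max(len(samples) - 1, 1)

# e.g. around the [k] consonant in each render (window times are
# placeholders to be read off the piano roll):
# zcr_in_window("openutau_render.wav", 1.20, 1.30)
# zcr_in_window("cli_render.wav", 1.20, 1.30)
```

Comparing the same consonant window in both renders gave a concrete number for how much more voiced the CLI output was, rather than relying on listening alone.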

Would you be able to point me in the right direction on the CLI inference, as well as any config edits that could be made? I'd love to narrow down both the voiced/unvoiced issue and the CLI inference issue, if they're related!
