Dataset

The dataset comprises 20 audio files from the NURC-SP Minimal Corpus, each with three transcriptions: manual, automatic by WhisperX, and automatic by the ProsSegue model. It covers 6 formal elocutions (EF) totaling 4h28min52s; 5 formal dialogues between speakers, with a documenter present (D2), totaling 5h22min07s; and 9 interviews on different topics, conducted by an interviewer with an interviewee (DID), totaling 6h11min20s. Although the NURC-SP Minimal Corpus has 21 audio-transcription pairs, one of them (SP_D2_062) was removed because its audio quality was not good enough for TTS training.
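As a quick sanity check on the figures above, the per-genre durations and file counts can be summed. This is only an arithmetic sketch using the numbers quoted in the text; the dictionary names are illustrative and not part of the repository.

```python
from datetime import timedelta

# Durations and file counts per genre, copied from the description above.
durations = {
    "EF": timedelta(hours=4, minutes=28, seconds=52),
    "D2": timedelta(hours=5, minutes=22, seconds=7),
    "DID": timedelta(hours=6, minutes=11, seconds=20),
}
file_counts = {"EF": 6, "D2": 5, "DID": 9}

total_duration = sum(durations.values(), timedelta())
total_files = sum(file_counts.values())

print(f"{total_files} files, {total_duration} of audio")  # 20 files, 16:02:19 of audio
```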

For the “prosodic-manual” subset, only terminal intonational units from the NURC-SP Minimal Corpus were used (a terminal boundary marks the conclusion of an utterance). For the “whisper-automatic” subset, the 20 audio files of the NURC-SP Minimal Corpus were segmented and transcribed by WhisperX. Finally, the “ProsSegue-automatic” subset was segmented using the ProsSegue model. The dataset was used to compare TTS training on terminal intonational units with training on automatically generated segments. The prosodic-manual subset has 12h32min25s of audio (8k segments), the whisper-automatic subset has 16h33min54s (10k segments), and the ProsSegue-automatic subset has 14h51min00s (12k segments).
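To give a rough sense of segment length in each subset, the total durations can be divided by the segment counts. Note that the counts in the text are rounded to the nearest thousand, so the values below are approximations only, not figures reported by the authors.

```python
from datetime import timedelta

# (total duration, approximate segment count) as quoted in the text.
# Segment counts are rounded ("8k", "10k", "12k"), so results are approximate.
subsets = {
    "prosodic-manual": (timedelta(hours=12, minutes=32, seconds=25), 8_000),
    "whisper-automatic": (timedelta(hours=16, minutes=33, seconds=54), 10_000),
    "ProsSegue-automatic": (timedelta(hours=14, minutes=51, seconds=0), 12_000),
}

mean_seconds = {
    name: duration.total_seconds() / count
    for name, (duration, count) in subsets.items()
}

for name, seconds in mean_seconds.items():
    print(f"{name}: about {seconds:.1f} s per segment")
```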

Dataset

Hugging Face Dataset

Code

OUR Github

ENTOA-TTS Github

Models

FastSpeech2: FastSpeech2 info

FastSpeech2 Training Setup

FastSpeech2 is a transformer-based model. We used the open-source implementation by ming024 and trained it for 720,000 steps on an RTX 4070 GPU. The ProsSegue-automatic subset was added to the CML-TTS dataset following the same steps as the Entoa-TTS authors: entoa-tts.

Steps:

Prepare align: The audio and transcription of each segment were extracted from the original dataset and saved as .wav and .lab files in a folder named after each speaker code;
MFA Align 1: The Montreal Forced Aligner (MFA) was used to convert graphemes to phonemes with the pretrained model portuguese_brazil_mfa, and the resulting lexicon dictionary was saved as a text file;
MFA Align 2: The MFA align command was run on the raw data to generate the alignments in TextGrid format;
From the alignments, a preprocessing script generated estimated energy and pitch values, a JSON file with the speaker codes, and train/validation text files with phonetic transcriptions. The validation subset was a random selection of 1% of the segments;
Finally, the model was trained on the preprocessed data on an RTX 4070 GPU for 720k steps.
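The "Prepare align" layout and the 1% validation split described in the steps above could be sketched roughly as follows. This is not the repository's script: the function names, the segment dictionary keys (`speaker`, `segment_id`, `text`, `wav_path`), the rounding rule, and the seed are all assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def prepare_align(segments, out_root):
    """Write one .lab/.wav pair per segment inside a folder per speaker.

    `segments` is an iterable of dicts with the hypothetical keys
    "speaker", "segment_id", "text" and "wav_path".
    """
    out_root = Path(out_root)
    for seg in segments:
        speaker_dir = out_root / seg["speaker"]
        speaker_dir.mkdir(parents=True, exist_ok=True)
        stem = seg["segment_id"]
        # The .lab file holds the plain-text transcription of the segment.
        (speaker_dir / f"{stem}.lab").write_text(seg["text"], encoding="utf-8")
        # The .wav file is copied next to it under the same base name.
        shutil.copyfile(seg["wav_path"], speaker_dir / f"{stem}.wav")


def split_train_val(segment_ids, val_fraction=0.01, seed=0):
    """Randomly hold out a fraction of segments (1% here) for validation."""
    ids = list(segment_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]
```

Keeping each .lab file next to its .wav file under the same base name is the pairing a forced aligner typically relies on; the fixed seed is only there to make the illustrative split reproducible.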

Details of prosodic acoustic analysis:

Calculation of deviation from the series mean

In each speech group, the raw values of the 4 points are summed and averaged, and then the difference between each raw value and that mean is calculated.
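A minimal sketch of this calculation, assuming each speech group is represented as a list of 4 raw measurements. The function name and the sample values are hypothetical and not taken from the analysis itself.

```python
def deviation_from_mean(points):
    """Return each raw value minus the mean of the series."""
    mean = sum(points) / len(points)
    return [value - mean for value in points]


# Hypothetical raw values for the 4 points of one speech group.
group = [100, 120, 110, 90]
print(deviation_from_mean(group))  # [-5.0, 15.0, 5.0, -15.0]
```

By construction the deviations of a series sum to zero, so this measure describes the shape of the contour around its own mean rather than its absolute level.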

Calculation of successive difference:

The raw values of consecutive points are used to calculate the difference at each transition between points. Then, each resulting value for each FastSpeech2 speech sample is subtracted from the corresponding value for the reference speech.
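One way to read this calculation is sketched below. It assumes the comparison is reference minus synthesized, taken transition by transition; the function names and sample values are hypothetical.

```python
def successive_differences(points):
    """Difference at each transition between consecutive points."""
    return [nxt - cur for cur, nxt in zip(points, points[1:])]


def difference_from_reference(reference_points, synthesized_points):
    """Reference transition values minus the synthesized ones (assumed direction)."""
    ref = successive_differences(reference_points)
    syn = successive_differences(synthesized_points)
    return [r - s for r, s in zip(ref, syn)]


# Hypothetical raw values for a reference and a synthesized sample.
reference = [100, 120, 110, 90]
synthesized = [100, 110, 105, 95]
print(successive_differences(reference))                   # [20, -10, -20]
print(difference_from_reference(reference, synthesized))   # [10, -5, -10]
```

With 4 points there are 3 transitions, so each comparison yields 3 values per sample.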

Examples - FastSpeech2

ID	Speech	Ground Truth	MANUAL	WHISPER	PROSSEGUE
01	eu quase não vou ao cinema teatro...
02	ah às vezes eu vou...
03	eu tenho ido a teatro.
04	deve ser como na televisão
05	então no teatro eu acho que é bem mais difícil...
06	a televisão é horroroso quando eles estão fazendo programa.
07	eu sei que não há preparação toda.
08	porque o grupo que trabalha em hair é enorme né
09	tenho impressão que ali levou tanto tempo de ensaio
10	me chocou tremendamente
11	eu saber que o filme é bom
12	eu gostei bastante
13	eu me lembro de vários filmes não lembro os nomes
14	por isso é que eu deixo de ir ao cinema
15	hoje tá tudo meio louco né
16	assisti em araraquara.
17	eu num lembro o nome do filme...
18	a molecada adorou.
19	eles adoraram o filme...
20	porque eu saio cansada mesmo
21	eu fico numa tensão nervosa
22	nós saímos pra ir ao teatro.
23	não conseguimos entrar fomos assistir esse filme.
24	eu acho que influi bastante
25	eu acho que teatro tá bem mais caro
26	eu acho que o público pre prefere cinema ainda
27	eu não entendi a pergunta
28	eu acho que o cinema tá perdendo viu
29	o que eu noto é isso
30	principalmente nos fins de semana

nilc-nlp/entoa-tts-prossegue
