Commit 1394f73

Add support for VITS (multilingual TTS) (#466)

* Add custom VITS tokenizer converter
* Do not decode if expected input_ids is empty
* Update vits tokenizer tests
* Implement `VitsTokenizer`
* Add support for VITS model
* Support VITS through pipeline API
* Update JSDoc
* Add TTS unit test
* Add speecht5 unit test
* Fix typo
* Fix typo
* Update speecht5 model id
* Add note about using quantized speecht5 in unit tests
* Monkey-patch `BigInt64Array` and `BigUint64Array`

1 parent f5bc758 commit 1394f73

File tree: 12 files changed, +336 −6 lines changed

README.md
Lines changed: 1 addition & 0 deletions

```diff
@@ -336,6 +336,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
 1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
+1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
```

docs/snippets/6_supported-models.snippet
Lines changed: 1 addition & 0 deletions

```diff
@@ -71,6 +71,7 @@
 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
 1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
+1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever.
```

scripts/convert.py
Lines changed: 8 additions & 0 deletions

```diff
@@ -334,7 +334,15 @@ def main():
 
         with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
             json.dump(tokenizer_json, fp, indent=4)
+
+    elif config.model_type == 'vits':
+        if tokenizer is not None:
+            from .extra.vits import generate_tokenizer_json
+            tokenizer_json = generate_tokenizer_json(tokenizer)
 
+            with open(os.path.join(output_model_folder, 'tokenizer.json'), 'w', encoding='utf-8') as fp:
+                json.dump(tokenizer_json, fp, indent=4)
+
     elif config.model_type == 'speecht5':
         # TODO allow user to specify vocoder path
         export_kwargs["model_kwargs"] = {"vocoder": "microsoft/speecht5_hifigan"}
```

scripts/extra/vits.py
Lines changed: 100 additions & 0 deletions (new file)

```python
def generate_tokenizer_json(tokenizer):
    vocab = tokenizer.get_vocab()

    normalizers = []

    if tokenizer.normalize:
        # Lowercase the input string
        normalizers.append({
            "type": "Lowercase",
        })

    if tokenizer.language == 'ron':
        # Replace diacritics
        normalizers.append({
            "type": "Replace",
            "pattern": {
                "String": "ț",
            },
            "content": "ţ",
        })

    if tokenizer.phonemize:
        raise NotImplementedError("Phonemization is not implemented yet")

    elif tokenizer.normalize:
        # strip any chars outside of the vocab (punctuation)
        chars = ''.join(x for x in vocab if len(x) == 1)
        escaped = chars.replace('-', r'\-').replace(']', r'\]')
        normalizers.append({
            "type": "Replace",
            "pattern": {
                "Regex": f"[^{escaped}]",
            },
            "content": "",
        })
        normalizers.append({
            "type": "Strip",
            "strip_left": True,
            "strip_right": True,
        })

    if tokenizer.add_blank:
        # add pad token between each char
        normalizers.append({
            "type": "Replace",
            "pattern": {
                # Add a blank token between each char, except when blank (then do nothing)
                "Regex": "(?=.)|(?<!^)$",
            },
            "content": tokenizer.pad_token,
        })

    if len(normalizers) == 0:
        normalizer = None
    elif len(normalizers) == 1:
        normalizer = normalizers[0]
    else:
        normalizer = {
            "type": "Sequence",
            "normalizers": normalizers,
        }

    tokenizer_json = {
        "version": "1.0",
        "truncation": None,
        "padding": None,
        "added_tokens": [
            {
                "id": vocab[token],
                "content": token,
                "single_word": False,
                "lstrip": False,
                "rstrip": False,
                "normalized": False,
                "special": True
            }
            for token in vocab

            # `tokenizer.pad_token` should not be considered an added token
            if token in (tokenizer.unk_token, )
        ],
        "normalizer": normalizer,
        "pre_tokenizer": {
            "type": "Split",
            "pattern": {
                "Regex": ""
            },
            "behavior": "Isolated",
            "invert": False
        },
        "post_processor": None,
        "decoder": None,  # Custom decoder implemented in JS
        "model": {
            "vocab": vocab
        },
    }

    return tokenizer_json
```
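To illustrate the `add_blank` rule above: the regex `(?=.)|(?<!^)$` matches the zero-width position before every character, plus the end of a non-empty string, so a global replace interleaves the pad token around each character while leaving empty input untouched. A minimal JavaScript sketch of the same behavior (the `'_'` pad character is a stand-in, not any real checkpoint's pad token):

```js
// Reproduce the generated normalizer's blank-insertion rule in JS.
const addBlank = (text, pad) => text.replace(/(?=.)|(?<!^)$/g, pad);

console.log(addBlank('abc', '_')); // '_a_b_c_'
console.log(addBlank('', '_'));    // ''  (empty input stays empty)
```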

scripts/supported_models.py
Lines changed: 21 additions & 0 deletions

```diff
@@ -858,6 +858,27 @@
             'hustvl/vitmatte-base-composition-1k',
         ],
     },
+    'vits': {
+        # Text-to-audio/Text-to-speech/Text-to-waveform
+        'text-to-waveform': {
+            # NOTE: requires --task text-to-waveform --skip_validation
+            'echarlaix/tiny-random-vits',
+            'facebook/mms-tts-eng',
+            'facebook/mms-tts-rus',
+            'facebook/mms-tts-hin',
+            'facebook/mms-tts-yor',
+            'facebook/mms-tts-spa',
+            'facebook/mms-tts-fra',
+            'facebook/mms-tts-ara',
+            'facebook/mms-tts-ron',
+            'facebook/mms-tts-vie',
+            'facebook/mms-tts-deu',
+            'facebook/mms-tts-kor',
+            'facebook/mms-tts-por',
+            # TODO add more checkpoints from
+            # https://huggingface.co/models?other=vits&sort=trending&search=facebook-tts
+        }
+    },
     'wav2vec2': {
         # Feature extraction # NOTE: requires --task feature-extraction
         'feature-extraction': [
```

src/models.js
Lines changed: 77 additions & 2 deletions

````diff
@@ -4696,6 +4696,47 @@ export class ClapAudioModelWithProjection extends ClapPreTrainedModel {
 //////////////////////////////////////////////////
 
 
+//////////////////////////////////////////////////
+// VITS models
+export class VitsPreTrainedModel extends PreTrainedModel { }
+
+/**
+ * The complete VITS model, for text-to-speech synthesis.
+ *
+ * **Example:** Generate speech from text with `VitsModel`.
+ * ```javascript
+ * import { AutoTokenizer, VitsModel } from '@xenova/transformers';
+ *
+ * // Load the tokenizer and model
+ * const tokenizer = await AutoTokenizer.from_pretrained('Xenova/mms-tts-eng');
+ * const model = await VitsModel.from_pretrained('Xenova/mms-tts-eng');
+ *
+ * // Run tokenization
+ * const inputs = tokenizer('I love transformers');
+ *
+ * // Generate waveform
+ * const { waveform } = await model(inputs);
+ * // Tensor {
+ * //   dims: [ 1, 35328 ],
+ * //   type: 'float32',
+ * //   data: Float32Array(35328) [ ... ],
+ * //   size: 35328,
+ * // }
+ * ```
+ */
+export class VitsModel extends VitsPreTrainedModel {
+    /**
+     * Calls the model on new inputs.
+     * @param {Object} model_inputs The inputs to the model.
+     * @returns {Promise<VitsModelOutput>} The outputs for the VITS model.
+     */
+    async _call(model_inputs) {
+        return new VitsModelOutput(await super._call(model_inputs));
+    }
+}
+//////////////////////////////////////////////////
+
+
 //////////////////////////////////////////////////
 // AutoModels, used to simplify construction of PreTrainedModels
 // (uses config to instantiate correct class)
````
```diff
@@ -4789,6 +4830,7 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
     ['hubert', ['HubertModel', HubertModel]],
     ['wavlm', ['WavLMModel', WavLMModel]],
     ['audio-spectrogram-transformer', ['ASTModel', ASTModel]],
+    ['vits', ['VitsModel', VitsModel]],
 
     ['detr', ['DetrModel', DetrModel]],
     ['table-transformer', ['TableTransformerModel', TableTransformerModel]],
```
```diff
@@ -4846,11 +4888,15 @@
 const MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = new Map([
     ['speecht5', ['SpeechT5ForSpeechToText', SpeechT5ForSpeechToText]],
     ['whisper', ['WhisperForConditionalGeneration', WhisperForConditionalGeneration]],
-])
+]);
 
 const MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES = new Map([
     ['speecht5', ['SpeechT5ForTextToSpeech', SpeechT5ForTextToSpeech]],
-])
+]);
+
+const MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES = new Map([
+    ['vits', ['VitsModel', VitsModel]],
+]);
 
 const MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = new Map([
     ['bert', ['BertForSequenceClassification', BertForSequenceClassification]],
```
```diff
@@ -5044,6 +5090,7 @@ const MODEL_CLASS_TYPE_MAPPING = [
     [MODEL_FOR_CTC_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
     [MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES, MODEL_TYPES.Seq2Seq],
+    [MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
 ];
 
 for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) {
```
```diff
@@ -5136,6 +5183,17 @@ export class AutoModelForTextToSpectrogram extends PretrainedMixin {
     static MODEL_CLASS_MAPPINGS = [MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES];
 }
 
+/**
+ * Helper class which is used to instantiate pretrained text-to-waveform models with the `from_pretrained` function.
+ * The chosen model class is determined by the type specified in the model config.
+ *
+ * @example
+ * let model = await AutoModelForTextToWaveform.from_pretrained('facebook/mms-tts-eng');
+ */
+export class AutoModelForTextToWaveform extends PretrainedMixin {
+    static MODEL_CLASS_MAPPINGS = [MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES];
+}
+
 /**
  * Helper class which is used to instantiate pretrained causal language models with the `from_pretrained` function.
  * The chosen model class is determined by the type specified in the model config.
```
```diff
@@ -5375,3 +5433,20 @@ export class ImageMattingOutput extends ModelOutput {
         this.alphas = alphas;
     }
 }
+
+/**
+ * Describes the outputs for the VITS model.
+ */
+export class VitsModelOutput extends ModelOutput {
+    /**
+     * @param {Object} output The output of the model.
+     * @param {Tensor} output.waveform The final audio waveform predicted by the model, of shape `(batch_size, sequence_length)`.
+     * @param {Tensor} output.spectrogram The log-mel spectrogram predicted at the output of the flow model.
+     * This spectrogram is passed to the Hi-Fi GAN decoder model to obtain the final audio waveform.
+     */
+    constructor({ waveform, spectrogram }) {
+        super();
+        this.waveform = waveform;
+        this.spectrogram = spectrogram;
+    }
+}
```
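As a usage note, the new `AutoModelForTextToWaveform` helper resolves to `VitsModel` through `MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES`. A minimal sketch, assuming the converted `Xenova/mms-tts-eng` checkpoint from the `VitsModel` JSDoc example above is available:

```js
import { AutoTokenizer, AutoModelForTextToWaveform } from '@xenova/transformers';

// The config's model_type ('vits') selects VitsModel
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/mms-tts-eng');
const model = await AutoModelForTextToWaveform.from_pretrained('Xenova/mms-tts-eng');

const inputs = tokenizer('I love transformers');

// VitsModel._call wraps the raw output in a VitsModelOutput
const { waveform } = await model(inputs);
console.log(waveform.dims); // e.g. [ 1, 35328 ]
```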

src/pipelines.js
Lines changed: 47 additions & 3 deletions

```diff
@@ -26,6 +26,7 @@ import {
     AutoModelForMaskedLM,
     AutoModelForSeq2SeqLM,
     AutoModelForSpeechSeq2Seq,
+    AutoModelForTextToWaveform,
     AutoModelForTextToSpectrogram,
     AutoModelForCTC,
     AutoModelForCausalLM,
@@ -37,7 +38,6 @@ import {
     AutoModelForDocumentQuestionAnswering,
     AutoModelForImageToImage,
     AutoModelForDepthEstimation,
-    // AutoModelForTextToWaveform,
     PreTrainedModel,
 } from './models.js';
 import {
```
````diff
@@ -2112,6 +2112,16 @@ export class DocumentQuestionAnsweringPipeline extends Pipeline {
  * wav.fromScratch(1, out.sampling_rate, '32f', out.audio);
  * fs.writeFileSync('out.wav', wav.toBuffer());
  * ```
+ *
+ * **Example:** Multilingual speech generation with `Xenova/mms-tts-fra`. See [here](https://huggingface.co/models?pipeline_tag=text-to-speech&other=vits&sort=trending) for the full list of available languages (1107).
+ * ```js
+ * let synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-fra');
+ * let out = await synthesizer('Bonjour');
+ * // {
+ * //   audio: Float32Array(23808) [ -0.00037693005288019776, 0.0003325853613205254, ... ],
+ * //   sampling_rate: 16000
+ * // }
+ * ```
  */
 export class TextToAudioPipeline extends Pipeline {
     DEFAULT_VOCODER_ID = "Xenova/speecht5_hifigan"
````
```diff
@@ -2143,6 +2153,34 @@
     async _call(text_inputs, {
         speaker_embeddings = null,
     } = {}) {
+        // If this.processor is not set, we are using an `AutoModelForTextToWaveform` model
+        if (this.processor) {
+            return this._call_text_to_spectrogram(text_inputs, { speaker_embeddings });
+        } else {
+            return this._call_text_to_waveform(text_inputs);
+        }
+    }
+
+    async _call_text_to_waveform(text_inputs) {
+
+        // Run tokenization
+        const inputs = this.tokenizer(text_inputs, {
+            padding: true,
+            truncation: true
+        });
+
+        // Generate waveform
+        const { waveform } = await this.model(inputs);
+
+        const sampling_rate = this.model.config.sampling_rate;
+        return {
+            audio: waveform.data,
+            sampling_rate,
+        }
+    }
+
+    async _call_text_to_spectrogram(text_inputs, { speaker_embeddings }) {
+
         // Load vocoder, if not provided
         if (!this.vocoder) {
             console.log('No vocoder specified, using default HifiGan vocoder.');
```
```diff
@@ -2412,8 +2450,8 @@
     "text-to-audio": {
         "tokenizer": AutoTokenizer,
         "pipeline": TextToAudioPipeline,
-        "model": [ /* TODO: AutoModelForTextToWaveform, */ AutoModelForTextToSpectrogram],
-        "processor": AutoProcessor,
+        "model": [AutoModelForTextToWaveform, AutoModelForTextToSpectrogram],
+        "processor": [AutoProcessor, /* Some don't use a processor */ null],
         "default": {
             // TODO: replace with original
             // "model": "microsoft/speecht5_tts",
```
```diff
@@ -2673,6 +2711,12 @@ async function loadItems(mapping, model, pretrainedOptions) {
         promise = new Promise(async (resolve, reject) => {
             let e;
             for (let c of cls) {
+                if (c === null) {
+                    // If null, we resolve it immediately, meaning the relevant
+                    // class was not found, but it is optional.
+                    resolve(null);
+                    return;
+                }
                 try {
                     resolve(await c.from_pretrained(model, pretrainedOptions));
                     return;
```
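Putting the pipeline changes together: a VITS checkpoint ships no processor, so `_call` dispatches to the new `_call_text_to_waveform` path. A sketch of end-to-end synthesis plus saving to disk, reusing the `wavefile` pattern and the `Xenova/mms-tts-fra` id from the JSDoc example above:

```js
import { pipeline } from '@xenova/transformers';
import wavefile from 'wavefile';
import fs from 'fs';

// No processor is resolved for VITS checkpoints, so the pipeline
// takes the text-to-waveform branch added in this commit.
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-fra');
const out = await synthesizer('Bonjour');

// Write the Float32Array waveform as a mono 32-bit float WAV file
const wav = new wavefile.WaveFile();
wav.fromScratch(1, out.sampling_rate, '32f', out.audio);
fs.writeFileSync('out.wav', wav.toBuffer());
```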
