
Commit eb83819

xenova and Th3G33k authored
Add support for StyleTTS 2 (Kokoro) (#1148)
* Add RawAudio class and 'save to wav'
* Apply suggestions from code review
  Co-authored-by: Joshua Lochner <[email protected]>
* RawAudio toBlob() + save() rewrite
* Add saveBlob in utils/core.js
* RawAudio: Add support 2 channels + interleave
* Fix
* Fix
* simplify type check
* RawAudio: improve interleave + change env -> apis
* image.js: change env -> apis
* env.js: remove changes
* Update RawAudio toBlob
* Add support for style_tts
* Update imports
* Improve saveBlob function
* Update style tts model info

---------

Co-authored-by: Th3G33k <[email protected]>
Co-authored-by: Th3G33k <[email protected]>
1 parent 1130961 commit eb83819
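What this commit enables, end to end: text-to-speech pipelines now return a `RawAudio` object that can be written straight to a WAV file. A minimal sketch, reusing the `Xenova/mms-tts-fra` example from the updated doc comments below (the import path assumes the current `@huggingface/transformers` package entry point):

```js
import { pipeline } from '@huggingface/transformers';

// Any text-to-speech checkpoint works; this one is taken from the diff below.
const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-fra');

const out = await synthesizer('Bonjour');
// RawAudio {
//   audio: Float32Array(23808) [-0.00037693005288019776, 0.0003325853613205254, ...],
//   sampling_rate: 16000
// }

// New in this commit: browser download or Node.js file write, depending on environment.
await out.save('bonjour.wav');
```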

File tree

8 files changed (+161 −29 lines changed):

- README.md
- docs/snippets/6_supported-models.snippet
- src/env.js
- src/models.js
- src/pipelines.js
- src/utils/audio.js
- src/utils/core.js
- src/utils/image.js

README.md

Lines changed: 1 addition & 0 deletions
@@ -403,6 +403,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
 1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
+1. StyleTTS 2 (from Columbia University) released with the paper [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691) by Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
@@ -118,6 +118,7 @@
 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
 1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
 1. **[Starcoder2](https://huggingface.co/docs/transformers/main/model_doc/starcoder2)** (from BigCode team) released with the paper [StarCoder 2 and The Stack v2: The Next Generation](https://arxiv.org/abs/2402.19173) by Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries.
+1. StyleTTS 2 (from Columbia University) released with the paper [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691) by Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani.
 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.

src/env.js

Lines changed: 0 additions & 1 deletion
@@ -160,4 +160,3 @@ export const env = {
 function isEmpty(obj) {
     return Object.keys(obj).length === 0;
 }
-

src/models.js

Lines changed: 5 additions & 0 deletions
@@ -6126,6 +6126,9 @@ export class WavLMForAudioFrameClassification extends WavLMPreTrainedModel {
     }
 }
 
+export class StyleTextToSpeech2PreTrainedModel extends PreTrainedModel { }
+export class StyleTextToSpeech2Model extends StyleTextToSpeech2PreTrainedModel { }
+
 //////////////////////////////////////////////////
 // SpeechT5 models
 /**

@@ -7089,6 +7092,8 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
 
     ['maskformer', ['MaskFormerModel', MaskFormerModel]],
     ['mgp-str', ['MgpstrForSceneTextRecognition', MgpstrForSceneTextRecognition]],
+
+    ['style_text_to_speech_2', ['StyleTextToSpeech2Model', StyleTextToSpeech2Model]],
 ]);
 
 const MODEL_MAPPING_NAMES_ENCODER_DECODER = new Map([
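With the `style_text_to_speech_2` entry in `MODEL_MAPPING_NAMES_ENCODER_ONLY`, the `AutoModel` factory can resolve checkpoints of this model type. A hedged sketch (the model ID below is hypothetical, for illustration only):

```js
import { AutoModel } from '@huggingface/transformers';

// A config with `"model_type": "style_text_to_speech_2"` now resolves to
// StyleTextToSpeech2Model via the encoder-only mapping added above.
const model = await AutoModel.from_pretrained('onnx-community/styletts2-example');
console.log(model.constructor.name); // 'StyleTextToSpeech2Model'
```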

src/pipelines.js

Lines changed: 10 additions & 9 deletions
@@ -64,7 +64,8 @@ import {
     round,
 } from './utils/maths.js';
 import {
-    read_audio
+    read_audio,
+    RawAudio
 } from './utils/audio.js';
 import {
     Tensor,

@@ -2678,7 +2679,7 @@ export class DocumentQuestionAnsweringPipeline extends (/** @type {new (options:
  * const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', { quantized: false });
  * const speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin';
  * const out = await synthesizer('Hello, my dog is cute', { speaker_embeddings });
- * // {
+ * // RawAudio {
  * //   audio: Float32Array(26112) [-0.00005657337896991521, 0.00020583874720614403, ...],
  * //   sampling_rate: 16000
  * // }

@@ -2698,7 +2699,7 @@ export class DocumentQuestionAnsweringPipeline extends (/** @type {new (options:
  * ```javascript
  * const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-fra');
  * const out = await synthesizer('Bonjour');
- * // {
+ * // RawAudio {
  * //   audio: Float32Array(23808) [-0.00037693005288019776, 0.0003325853613205254, ...],
  * //   sampling_rate: 16000
  * // }

@@ -2745,10 +2746,10 @@ export class TextToAudioPipeline extends (/** @type {new (options: TextToAudioPi
 
         // @ts-expect-error TS2339
         const sampling_rate = this.model.config.sampling_rate;
-        return {
-            audio: waveform.data,
+        return new RawAudio(
+            waveform.data,
             sampling_rate,
-        }
+        )
     }
 
     async _call_text_to_spectrogram(text_inputs, { speaker_embeddings }) {

@@ -2788,10 +2789,10 @@ export class TextToAudioPipeline extends (/** @type {new (options: TextToAudioPi
         const { waveform } = await this.model.generate_speech(input_ids, speaker_embeddings, { vocoder: this.vocoder });
 
         const sampling_rate = this.processor.feature_extractor.config.sampling_rate;
-        return {
-            audio: waveform.data,
+        return new RawAudio(
+            waveform.data,
             sampling_rate,
-        }
+        )
     }
 }
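The practical upshot of the `TextToAudioPipeline` change: callers that destructured the old plain `{ audio, sampling_rate }` object keep working, and the new `RawAudio` methods become available on the same result. A short sketch using the SpeechT5 example from the doc comment above:

```js
const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', { quantized: false });
const speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin';
const out = await synthesizer('Hello, my dog is cute', { speaker_embeddings });

// Backwards compatible: property access is unchanged...
const { audio, sampling_rate } = out;

// ...while the RawAudio helpers added in this commit are now available too.
const blob = out.toBlob(); // Blob of type 'audio/wav'
```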

src/utils/audio.js

Lines changed: 113 additions & 1 deletion
@@ -12,8 +12,10 @@ import {
 } from './hub.js';
 import { FFT, max } from './maths.js';
 import {
-    calculateReflectOffset,
+    calculateReflectOffset, saveBlob,
 } from './core.js';
+import { apis } from '../env.js';
+import fs from 'fs';
 import { Tensor, matmul } from './tensor.js';
 
 

@@ -702,3 +704,113 @@ export function window_function(window_length, name, {
 
     return window;
 }
+
+/**
+ * Encode audio data to a WAV file.
+ * WAV file specs : https://en.wikipedia.org/wiki/WAV#WAV_File_header
+ *
+ * Adapted from https://www.npmjs.com/package/audiobuffer-to-wav
+ * @param {Float32Array} samples The audio samples.
+ * @param {number} rate The sample rate.
+ * @returns {ArrayBuffer} The WAV audio buffer.
+ */
+function encodeWAV(samples, rate) {
+    let offset = 44;
+    const buffer = new ArrayBuffer(offset + samples.length * 4);
+    const view = new DataView(buffer);
+
+    /* RIFF identifier */
+    writeString(view, 0, "RIFF");
+    /* RIFF chunk length */
+    view.setUint32(4, 36 + samples.length * 4, true);
+    /* RIFF type */
+    writeString(view, 8, "WAVE");
+    /* format chunk identifier */
+    writeString(view, 12, "fmt ");
+    /* format chunk length */
+    view.setUint32(16, 16, true);
+    /* sample format (raw) */
+    view.setUint16(20, 3, true);
+    /* channel count */
+    view.setUint16(22, 1, true);
+    /* sample rate */
+    view.setUint32(24, rate, true);
+    /* byte rate (sample rate * block align) */
+    view.setUint32(28, rate * 4, true);
+    /* block align (channel count * bytes per sample) */
+    view.setUint16(32, 4, true);
+    /* bits per sample */
+    view.setUint16(34, 32, true);
+    /* data chunk identifier */
+    writeString(view, 36, "data");
+    /* data chunk length */
+    view.setUint32(40, samples.length * 4, true);
+
+    for (let i = 0; i < samples.length; ++i, offset += 4) {
+        view.setFloat32(offset, samples[i], true);
+    }
+
+    return buffer;
+}
+
+function writeString(view, offset, string) {
+    for (let i = 0; i < string.length; ++i) {
+        view.setUint8(offset + i, string.charCodeAt(i));
+    }
+}
+
+
+export class RawAudio {
+
+    /**
+     * Create a new `RawAudio` object.
+     * @param {Float32Array} audio Audio data
+     * @param {number} sampling_rate Sampling rate of the audio data
+     */
+    constructor(audio, sampling_rate) {
+        this.audio = audio
+        this.sampling_rate = sampling_rate
+    }
+
+    /**
+     * Convert the audio to a wav file buffer.
+     * @returns {ArrayBuffer} The WAV file.
+     */
+    toWav() {
+        return encodeWAV(this.audio, this.sampling_rate)
+    }
+
+    /**
+     * Convert the audio to a blob.
+     * @returns {Blob}
+     */
+    toBlob() {
+        const wav = this.toWav();
+        const blob = new Blob([wav], { type: 'audio/wav' });
+        return blob;
+    }
+
+    /**
+     * Save the audio to a wav file.
+     * @param {string} path
+     */
+    async save(path) {
+        let fn;
+
+        if (apis.IS_BROWSER_ENV) {
+            if (apis.IS_WEBWORKER_ENV) {
+                throw new Error('Unable to save a file from a Web Worker.')
+            }
+            fn = saveBlob;
+        } else if (apis.IS_FS_AVAILABLE) {
+            fn = async (/** @type {string} */ path, /** @type {Blob} */ blob) => {
+                let buffer = await blob.arrayBuffer();
+                fs.writeFileSync(path, Buffer.from(buffer));
+            }
+        } else {
+            throw new Error('Unable to save because filesystem is disabled in this environment.')
+        }
+
+        await fn(path, this.toBlob())
+    }
+}
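A standalone sketch of the new `RawAudio` class, using a synthetic 440 Hz tone as stand-in data (this assumes `RawAudio` is re-exported from the package root; otherwise import it from `src/utils/audio.js`):

```js
import { RawAudio } from '@huggingface/transformers';

// One second of a 440 Hz sine tone at 16 kHz.
const sampling_rate = 16000;
const samples = new Float32Array(sampling_rate);
for (let i = 0; i < samples.length; ++i) {
    samples[i] = 0.5 * Math.sin(2 * Math.PI * 440 * i / sampling_rate);
}

const audio = new RawAudio(samples, sampling_rate);

// encodeWAV writes a 44-byte header followed by 32-bit float (IEEE format 3,
// mono) samples, so the buffer is 44 + 4 bytes per sample.
const wav = audio.toWav();
console.log(wav.byteLength === 44 + 4 * samples.length); // true

// Browser: triggers a download via saveBlob; Node.js: writes with fs.writeFileSync.
await audio.save('tone.wav');
```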

src/utils/core.js

Lines changed: 26 additions & 0 deletions
@@ -189,6 +189,32 @@ export function calculateReflectOffset(i, w) {
     return Math.abs((i + w) % (2 * w) - w);
 }
 
+/**
+ * Save blob file on the web.
+ * @param {string} path The path to save the blob to
+ * @param {Blob} blob The blob to save
+ */
+export function saveBlob(path, blob){
+    // Convert the canvas content to a data URL
+    const dataURL = URL.createObjectURL(blob);
+
+    // Create an anchor element with the data URL as the href attribute
+    const downloadLink = document.createElement('a');
+    downloadLink.href = dataURL;
+
+    // Set the download attribute to specify the desired filename for the downloaded image
+    downloadLink.download = path;
+
+    // Trigger the download
+    downloadLink.click();
+
+    // Clean up: remove the anchor element from the DOM
+    downloadLink.remove();
+
+    // Revoke the Object URL to free up memory
+    URL.revokeObjectURL(dataURL);
+}
+
 /**
  *
  * @param {Object} o
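`saveBlob` is the browser half of the shared save path: it factors out the object-URL/anchor-click download pattern that previously lived inline in `image.js`. A minimal browser-only sketch (the helper is internal, shown here as if imported from `./utils/core.js`):

```js
import { saveBlob } from './utils/core.js';

// Creates an object URL for the blob, clicks a temporary <a download> link,
// then removes the anchor and revokes the URL to free memory.
const blob = new Blob(['hello'], { type: 'text/plain' });
saveBlob('hello.txt', blob);
```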

src/utils/image.js

Lines changed: 5 additions & 18 deletions
@@ -8,9 +8,9 @@
  * @module utils/image
  */
 
-import { isNullishDimension } from './core.js';
+import { isNullishDimension, saveBlob } from './core.js';
 import { getFile } from './hub.js';
-import { env, apis } from '../env.js';
+import { apis } from '../env.js';
 import { Tensor } from './tensor.js';
 
 // Will be empty (or not used) if running in browser or web-worker

@@ -793,23 +793,9 @@ export class RawImage {
         // Convert image to Blob
         const blob = await this.toBlob(mime);
 
-        // Convert the canvas content to a data URL
-        const dataURL = URL.createObjectURL(blob);
+        saveBlob(path, blob)
 
-        // Create an anchor element with the data URL as the href attribute
-        const downloadLink = document.createElement('a');
-        downloadLink.href = dataURL;
-
-        // Set the download attribute to specify the desired filename for the downloaded image
-        downloadLink.download = path;
-
-        // Trigger the download
-        downloadLink.click();
-
-        // Clean up: remove the anchor element from the DOM
-        downloadLink.remove();
-
-    } else if (!env.useFS) {
+    } else if (!apis.IS_FS_AVAILABLE) {
         throw new Error('Unable to save the image because filesystem is disabled in this environment.')
 
     } else {

@@ -837,3 +823,4 @@ export class RawImage {
  * Helper function to load an image from a URL, path, etc.
  */
 export const load_image = RawImage.read.bind(RawImage);
+
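After this refactor, `RawImage.save` and `RawAudio.save` share the same browser download path through `saveBlob`, and both gate the filesystem fallback on `apis.IS_FS_AVAILABLE` instead of `env.useFS`. Caller-facing behavior is unchanged; a sketch (the image URL is a placeholder):

```js
import { RawImage } from '@huggingface/transformers';

// In the browser this now routes through the shared saveBlob helper;
// in Node.js it takes the filesystem branch (not shown in this diff).
const image = await RawImage.read('https://example.com/photo.png');
await image.save('photo.png');
```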

0 commit comments
