
Commit 4a991bd

Add support for text-to-speech (w/ SpeechT5) (#345)
* Add vocoder to export
* Add tokenizer.json export for speecht5 models
* Update speecht5 supported models
* Create `SpeechT5Tokenizer`
* Add `ones` and `ones_like` tensor functions
* Add support for speecht5 text-to-speech
* Disambiguate `SpeechSeq2Seq` and `Seq2SeqLM`
* Create `TextToAudioPipeline`
* Add listed support for `text-to-audio` / `text-to-speech`
* Use unquantized vocoder by default
* Skip speecht5 unit tests for now, due to a bug in transformers: huggingface/transformers#26547
* Update example pipeline output
* Create simple in-browser TTS demo
* Add template README
* Delete package-lock.json
* Update required transformers.js version
* Add link to Transformers.js
* Double -> Single quotes
* Add link to text-to-speech demo
* Update sample speaker embeddings
1 parent 983cf3a · commit 4a991bd

28 files changed (+988 −16 lines)

README.md (2 additions, 1 deletion)

```diff
@@ -116,6 +116,7 @@ Want to jump straight in? Get started with one of our sample applications/templa
 | Semantic Image Search (server-side) | Search for images with text (Supabase) | [code](./examples/semantic-image-search/), [demo](https://huggingface.co/spaces/Xenova/semantic-image-search) |
 | Vanilla JavaScript | In-browser object detection | [video](https://scrimba.com/scrim/cKm9bDAg), [code](./examples/vanilla-js/), [demo](https://huggingface.co/spaces/Scrimba/vanilla-js-object-detector) |
 | React | Multilingual translation website | [code](./examples/react-translator/), [demo](https://huggingface.co/spaces/Xenova/react-translator) |
+| Text to speech (client-side) | In-browser speech synthesis | [code](./examples/text-to-speech-client/), [demo](https://huggingface.co/spaces/Xenova/text-to-speech-client) |
 | Browser extension | Text classification extension | [code](./examples/extension/) |
 | Electron | Text classification application | [code](./examples/electron/) |
 | Next.js (client-side) | Sentiment analysis (in-browser inference) | [code](./examples/next-client/), [demo](https://huggingface.co/spaces/Xenova/next-example-app) |
@@ -222,7 +223,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 | [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AudioClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=audio-classification&library=transformers.js) |
 | [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | |
 | [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&library=transformers.js) |
-| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | |
+| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | `text-to-speech` or `text-to-audio` | Generating natural-sounding speech given text input. | [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TextToAudioPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=text-to-audio&library=transformers.js) |
 
 
 #### Tabular
```

docs/snippets/3_examples.snippet (1 addition)

```diff
@@ -9,6 +9,7 @@ Want to jump straight in? Get started with one of our sample applications/templa
 | Semantic Image Search (server-side) | Search for images with text (Supabase) | [code](./examples/semantic-image-search/), [demo](https://huggingface.co/spaces/Xenova/semantic-image-search) |
 | Vanilla JavaScript | In-browser object detection | [video](https://scrimba.com/scrim/cKm9bDAg), [code](./examples/vanilla-js/), [demo](https://huggingface.co/spaces/Scrimba/vanilla-js-object-detector) |
 | React | Multilingual translation website | [code](./examples/react-translator/), [demo](https://huggingface.co/spaces/Xenova/react-translator) |
+| Text to speech (client-side) | In-browser speech synthesis | [code](./examples/text-to-speech-client/), [demo](https://huggingface.co/spaces/Xenova/text-to-speech-client) |
 | Browser extension | Text classification extension | [code](./examples/extension/) |
 | Electron | Text classification application | [code](./examples/electron/) |
 | Next.js (client-side) | Sentiment analysis (in-browser inference) | [code](./examples/next-client/), [demo](https://huggingface.co/spaces/Xenova/next-example-app) |
```

docs/snippets/5_supported-tasks.snippet (1 addition, 1 deletion)

```diff
@@ -38,7 +38,7 @@
 | [Audio Classification](https://huggingface.co/tasks/audio-classification) | `audio-classification` | Assigning a label or class to a given audio. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AudioClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=audio-classification&library=transformers.js) |
 | [Audio-to-Audio](https://huggingface.co/tasks/audio-to-audio) | n/a | Generating audio from an input audio source. | ❌ |
 | [Automatic Speech Recognition](https://huggingface.co/tasks/automatic-speech-recognition) | `automatic-speech-recognition` | Transcribing a given audio into text. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&library=transformers.js) |
-| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | n/a | Generating natural-sounding speech given text input. | |
+| [Text-to-Speech](https://huggingface.co/tasks/text-to-speech) | `text-to-speech` or `text-to-audio` | Generating natural-sounding speech given text input. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TextToAudioPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=text-to-audio&library=transformers.js) |
 
 
 #### Tabular
```
.eslintrc.cjs (new file, 20 additions)

```js
module.exports = {
  root: true,
  env: { browser: true, es2020: true },
  extends: [
    'eslint:recommended',
    'plugin:react/recommended',
    'plugin:react/jsx-runtime',
    'plugin:react-hooks/recommended',
  ],
  ignorePatterns: ['dist', '.eslintrc.cjs'],
  parserOptions: { ecmaVersion: 'latest', sourceType: 'module' },
  settings: { react: { version: '18.2' } },
  plugins: ['react-refresh'],
  rules: {
    'react-refresh/only-export-components': [
      'warn',
      { allowConstantExport: true },
    ],
  },
}
```
.gitignore (new file, 24 additions)

```
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
```
README.md (new file, 8 additions — the example's template README)

```md
# React + Vite

This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.

Currently, two official plugins are available:

- [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react/README.md) uses [Babel](https://babeljs.io/) for Fast Refresh
- [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react-swc) uses [SWC](https://swc.rs/) for Fast Refresh
```
index.html (new file, 12 additions)

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Transformers.js - Text-to-speech demo</title>
  </head>
  <body>
    <div id="root"></div>
    <script type="module" src="/src/main.jsx"></script>
  </body>
</html>
```
package.json (new file, 30 additions)

```json
{
  "name": "text-to-speech-client",
  "private": true,
  "version": "0.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "lint": "eslint . --ext js,jsx --report-unused-disable-directives --max-warnings 0",
    "preview": "vite preview"
  },
  "dependencies": {
    "@xenova/transformers": "^2.7.0",
    "react": "^18.2.0",
    "react-dom": "^18.2.0"
  },
  "devDependencies": {
    "@types/react": "^18.2.15",
    "@types/react-dom": "^18.2.7",
    "@vitejs/plugin-react": "^4.0.3",
    "autoprefixer": "^10.4.16",
    "eslint": "^8.45.0",
    "eslint-plugin-react": "^7.32.2",
    "eslint-plugin-react-hooks": "^4.6.0",
    "eslint-plugin-react-refresh": "^0.4.3",
    "postcss": "^8.4.31",
    "tailwindcss": "^3.3.3",
    "vite": "^4.4.5"
  }
}
```
postcss.config.js (new file, 6 additions)

```js
export default {
  plugins: {
    tailwindcss: {},
    autoprefixer: {},
  },
}
```
src/App.jsx (new file, 162 additions)

```jsx
import React, { useState, useEffect, useRef } from 'react';

import AudioPlayer from './components/AudioPlayer';
import Progress from './components/Progress';
import { SPEAKERS, DEFAULT_SPEAKER } from './constants';

const App = () => {

  // Model loading
  const [ready, setReady] = useState(null);
  const [disabled, setDisabled] = useState(false);
  const [progressItems, setProgressItems] = useState([]);

  // Inputs and outputs
  const [text, setText] = useState('I love Hugging Face!');
  const [selectedSpeaker, setSelectedSpeaker] = useState(DEFAULT_SPEAKER);
  const [output, setOutput] = useState(null);

  // Create a reference to the worker object.
  const worker = useRef(null);

  // We use the `useEffect` hook to set up the worker as soon as the `App` component is mounted.
  useEffect(() => {
    if (!worker.current) {
      // Create the worker if it does not yet exist.
      worker.current = new Worker(new URL('./worker.js', import.meta.url), {
        type: 'module'
      });
    }

    // Create a callback function for messages from the worker thread.
    const onMessageReceived = (e) => {
      switch (e.data.status) {
        case 'initiate':
          // Model file start load: add a new progress item to the list.
          setReady(false);
          setProgressItems(prev => [...prev, e.data]);
          break;

        case 'progress':
          // Model file progress: update one of the progress items.
          setProgressItems(
            prev => prev.map(item => {
              if (item.file === e.data.file) {
                return { ...item, progress: e.data.progress }
              }
              return item;
            })
          );
          break;

        case 'done':
          // Model file loaded: remove the progress item from the list.
          setProgressItems(
            prev => prev.filter(item => item.file !== e.data.file)
          );
          break;

        case 'ready':
          // Pipeline ready: the worker is ready to accept messages.
          setReady(true);
          break;

        case 'complete': {
          // Generation complete: re-enable the "Generate" button.
          setDisabled(false);

          const blobUrl = URL.createObjectURL(e.data.output);
          setOutput(blobUrl);
          break;
        }
      }
    };

    // Attach the callback function as an event listener.
    worker.current.addEventListener('message', onMessageReceived);

    // Define a cleanup function for when the component is unmounted.
    return () => worker.current.removeEventListener('message', onMessageReceived);
  });


  const handleGenerateSpeech = () => {
    setDisabled(true);
    worker.current.postMessage({
      text,
      speaker_id: selectedSpeaker,
    });
  };

  const isLoading = ready === false;
  return (
    <div className='min-h-screen flex items-center justify-center bg-gray-100'>
      <div className='absolute gap-1 z-50 top-0 left-0 w-full h-full transition-all px-8 flex flex-col justify-center text-center' style={{
        opacity: isLoading ? 1 : 0,
        pointerEvents: isLoading ? 'all' : 'none',
        background: 'rgba(0, 0, 0, 0.9)',
        backdropFilter: 'blur(8px)',
      }}>
        {isLoading && (
          <label className='text-white text-xl p-3'>Loading models... (only run once)</label>
        )}
        {progressItems.map(data => (
          <div key={`${data.name}/${data.file}`}>
            <Progress text={`${data.name}/${data.file}`} percentage={data.progress} />
          </div>
        ))}
      </div>
      <div className='bg-white p-8 rounded-lg shadow-lg w-full max-w-xl m-2'>
        <h1 className='text-3xl font-semibold text-gray-800 mb-1 text-center'>In-browser Text to Speech</h1>
        <h2 className='text-base font-medium text-gray-700 mb-2 text-center'>Made with <a href='https://huggingface.co/docs/transformers.js'>🤗 Transformers.js</a></h2>
        <div className='mb-4'>
          <label htmlFor='text' className='block text-sm font-medium text-gray-600'>
            Text
          </label>
          <textarea
            id='text'
            className='border border-gray-300 rounded-md p-2 w-full'
            rows='4'
            placeholder='Enter text here'
            value={text}
            onChange={(e) => setText(e.target.value)}
          ></textarea>
        </div>
        <div className='mb-4'>
          <label htmlFor='speaker' className='block text-sm font-medium text-gray-600'>
            Speaker
          </label>
          <select
            id='speaker'
            className='border border-gray-300 rounded-md p-2 w-full'
            value={selectedSpeaker}
            onChange={(e) => setSelectedSpeaker(e.target.value)}
          >
            {Object.entries(SPEAKERS).map(([key, value]) => (
              <option key={key} value={value}>
                {key}
              </option>
            ))}
          </select>
        </div>
        <div className='flex justify-center'>
          <button
            className={`${disabled
              ? 'bg-gray-400 cursor-not-allowed'
              : 'bg-blue-500 hover:bg-blue-600'
              } text-white rounded-md py-2 px-4`}
            onClick={handleGenerateSpeech}
            disabled={disabled}
          >
            {disabled ? 'Generating...' : 'Generate'}
          </button>
        </div>
        {output && <AudioPlayer
          audioUrl={output}
          mimeType={'audio/wav'}
        />}
      </div>
    </div>
  );
};

export default App;
```
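The `complete` handler in App.jsx turns `e.data.output` into an object URL and hands it to `AudioPlayer` as `audio/wav`, but the `worker.js` that produces that `Blob` is not part of this excerpt. Since the pipeline emits raw `Float32Array` samples, the worker has to wrap them in a WAV container first. A minimal sketch of such an encoder (a hypothetical helper, not code from this commit):

```javascript
// Minimal 16-bit PCM WAV encoder: wraps raw Float32 samples in a RIFF header.
// Illustrative only — the commit's actual worker.js is not shown in this diff.
function encodeWAV(samples, sampleRate = 16000) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset, str) => {
    for (let i = 0; i < str.length; ++i) view.setUint8(offset + i, str.charCodeAt(i));
  };
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // RIFF chunk size
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);     // sample rate
  view.setUint32(28, sampleRate * 2, true); // byte rate (mono, 16-bit)
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true); // data chunk size
  // Clamp floats to [-1, 1] and convert to signed 16-bit integers.
  for (let i = 0; i < samples.length; ++i) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return buffer;
}
```

The worker could then post `new Blob([encodeWAV(output.audio, output.sampling_rate)], { type: 'audio/wav' })` back to the main thread, which is exactly the shape the `complete` handler above expects.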
