
Commit 528bb24

Kostis-S-Z and daavoo authored
TTS re-write (#72)
* [WIP] Add bark and parler multi support
* Add config files for other models to easily test across models
* Use model loading wrapper function for download_models.py
* Make sure transformers>4.31.0 (required for bark model)
* Add parler dependency
* Use TTSModelWrapper for demo code
* Use TTSModelWrapper for cli
* Add outetts_language attribute
* Add TTSModelWrapper
* Update text_to_speech.py
* Pass model-specific variables as **kwargs
* Rename TTSModelWrapper to TTSInterface
* Update language argument to kwargs
* Remove parler from dependencies
* Separate inference from TTSModel
* Make sure config model is properly registered
* Decouple loading & inference of TTS model
* Enable user to exit podcast generation gracefully
* Add Q2 Oute version to TTS_LOADERS
* Add comment for support in TTS_INFERENCE
* Update test_model_loaders.py
* Update test_text_to_speech.py
* Remove extra "use case" examples
* Add bark to readme & note about multilingual support
* Reference a repo that showcases multilingual use cases
* Change default model to 500M
* Remove support for bark and parler models
* Update docs
* Remove unused code
* Remove parler dep from tests
* Lint
* Remove transformers dependency
* Remove parler reference from docs

Co-authored-by: David de la Iglesia Castro <daviddelaiglesiacastro@gmail.com>
1 parent 8809feb commit 528bb24
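The commit message mentions a `TTS_LOADERS` registry and a `load_tts_model` wrapper that replaced per-backend loaders. A minimal sketch of how such a loader registry can work — the registry contents, the `TTSModel` fields, and the stub loader body below are hypothetical stand-ins, not the project's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TTSModel:
    """Uniform wrapper so callers never touch backend-specific attributes."""

    model: object
    model_id: str
    sample_rate: int


def load_oute_stub(model_id: str, **kwargs) -> TTSModel:
    # Hypothetical stand-in for real outetts loading code; model-specific
    # options (e.g. a language kwarg) ride in through **kwargs.
    return TTSModel(model=object(), model_id=model_id, sample_rate=24_000)


# Map supported model IDs to their loader functions.
TTS_LOADERS: dict[str, Callable[..., TTSModel]] = {
    "OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf": load_oute_stub,
}


def load_tts_model(model_id: str, **kwargs) -> TTSModel:
    """Dispatch to the registered loader for this model ID."""
    if model_id not in TTS_LOADERS:
        raise NotImplementedError(f"No loader registered for {model_id}")
    return TTS_LOADERS[model_id](model_id, **kwargs)
```

Adding support for a new backend then only means registering one more entry in `TTS_LOADERS`, which is the extensibility point the commit's "Add Q2 Oute version to TTS_LOADERS" bullet refers to.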

File tree

15 files changed: +152 −184 lines changed


.github/workflows/tests.yaml

Lines changed: 0 additions & 3 deletions
```diff
@@ -30,9 +30,6 @@ jobs:
       - name: Install test dependencies
         run: pip install -e '.[tests]'

-      - name: Install parler dependency
-        run: pip install git+https://github.com/huggingface/parler-tts.git
-
       - name: Run Unit Tests
         run: pytest -v tests/unit
```

README.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -15,7 +15,7 @@ It is designed to work on most local setups or with [GitHub Codespaces](https://
 ### Built with
 - Python 3.10+ (use Python 3.12 for Apple M1/2/3 chips)
 - [Llama-cpp](https://github.com/abetlen/llama-cpp-python) (text-to-text, i.e script generation)
-- [OuteAI](https://github.com/edwko/OuteTTS) / [Parler_tts](https://github.com/huggingface/parler-tts) (text-to-speech, i.e audio generation)
+- [OuteAI](https://github.com/edwko/OuteTTS) (text-to-speech, i.e audio generation)
 - [Streamlit](https://streamlit.io/) (UI demo)

@@ -106,10 +106,10 @@ For the complete list of models supported out-of-the-box, visit this [link](http
 ### text-to-speech
-We support models from the [OuteAI](https://github.com/edwko/OuteTTS) and [Parler_tts](https://github.com/huggingface/parler-tts) packages. The default text-to-speech model in this repo is [OuteTTS-0.2-500M](https://huggingface.co/OuteAI/OuteTTS-0.2-500M). Note that the `0.1-350M` version has a `CC-By-4.0` (permissive) license, whereas the newer / better `0.2-500M` version has a `CC-By-NC-4.0` (non-commercial) license.
-For a complete list of models visit [Oute HF](https://huggingface.co/collections/OuteAI) (only the GGUF versions) and [Parler HF](https://huggingface.co/collections/parler-tts).
+We support models from the [OuteAI](https://github.com/edwko/OuteTTS) package. The default text-to-speech model in this repo is [OuteTTS-0.2-500M](https://huggingface.co/OuteAI/OuteTTS-0.2-500M). Note that the `0.1-350M` version has a `CC-By-4.0` (permissive) license, whereas the newer / better `0.2-500M` version has a `CC-By-NC-4.0` (non-commercial) license.
+For a complete list of models visit [Oute HF](https://huggingface.co/collections/OuteAI) (only the GGUF versions).

-**Important note:** In order to keep the package dependencies as lightweight as possible, only the Oute interface is installed by default. If you want to use the parler models, please also follow the instructions at https://github.com/huggingface/parler-tts.
+In this [repo](https://github.com/Kostis-S-Z/document-to-podcast) you can see examples of using different TTS models with minimal code changes.

 ## Pre-requisites
```

demo/app.py

Lines changed: 6 additions & 5 deletions
```diff
@@ -6,13 +6,13 @@
 import soundfile as sf
 import streamlit as st

+from document_to_podcast.inference.text_to_speech import text_to_speech
 from document_to_podcast.preprocessing import DATA_LOADERS, DATA_CLEANERS
 from document_to_podcast.inference.model_loaders import (
     load_llama_cpp_model,
-    load_outetts_model,
+    load_tts_model,
 )
 from document_to_podcast.config import DEFAULT_PROMPT, DEFAULT_SPEAKERS, Speaker
-from document_to_podcast.inference.text_to_speech import text_to_speech
 from document_to_podcast.inference.text_to_text import text_to_text_stream
 from document_to_podcast.utils import stack_audio_segments

@@ -26,7 +26,7 @@ def load_text_to_text_model():
 @st.cache_resource
 def load_text_to_speech_model():
-    return load_outetts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")
+    return load_tts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")


 script = "script"

@@ -167,7 +167,8 @@ def gen_button_clicked():
                     speech_model,
                     voice_profile,
                 )
-                st.audio(speech, sample_rate=speech_model.audio_codec.sr)
+                st.audio(speech, sample_rate=speech_model.sample_rate)
+
                 st.session_state.audio.append(speech)
                 text = ""

@@ -179,7 +180,7 @@ def gen_button_clicked():
         sf.write(
             "podcast.wav",
             st.session_state.audio,
-            samplerate=speech_model.audio_codec.sr,
+            samplerate=speech_model.sample_rate,
         )
         st.markdown("Podcast saved to disk!")
```
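The `speech_model.audio_codec.sr` → `speech_model.sample_rate` change in this file is the visible payoff of the re-write: UI code reads one stable attribute instead of reaching into backend internals. A sketch of how a wrapper can delegate that attribute — the `audio_codec.sr` path mimics the old Oute-specific access, and all class names here are illustrative stand-ins:

```python
class _FakeAudioCodec:
    """Stand-in for a backend object that exposes its rate as `.sr`."""

    sr = 44_100


class _FakeBackend:
    audio_codec = _FakeAudioCodec()


class TTSModel:
    """Wrapper that normalizes backend-specific fields behind one attribute."""

    def __init__(self, backend):
        self._backend = backend

    @property
    def sample_rate(self) -> int:
        # Each backend adapter knows where its own sample rate lives;
        # callers never need to.
        return self._backend.audio_codec.sr


speech_model = TTSModel(_FakeBackend())
# UI / CLI code now only ever sees speech_model.sample_rate.
```

If a future backend stores its rate elsewhere (say `config.sampling_rate`, as the old parler branch did), only the adapter's property changes, not the callers.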

demo/download_models.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -4,10 +4,10 @@
 from document_to_podcast.inference.model_loaders import (
     load_llama_cpp_model,
-    load_outetts_model,
+    load_tts_model,
 )

 load_llama_cpp_model(
     "allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf"
 )
-load_outetts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")
+load_tts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")
```

docs/customization.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -74,6 +74,7 @@ Looking for inspiration? Check out these examples of how others have customized
 - **[Radio Drama Generator](https://github.com/stefanfrench/radio-drama-generator)**: A creative adaptation that generates radio dramas by customizing the Blueprint parameters.
 - **[Readme-to-Podcast](https://github.com/alexmeckes/readme-to-podcast)**: This project transforms GitHub README files into podcast-style audio, showcasing the Blueprint’s ability to handle diverse text inputs.
+- **[Multilingual Podcast](https://github.com/Kostis-S-Z/document-to-podcast/)**: A repo that showcases how to use this package in other languages, like Hindi, Polish, Korean and many more.

 ## 🤝 **Contributing to the Blueprint**
```

docs/future-features-contributions.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -15,7 +15,6 @@ The Document-to-Podcast Blueprint is an evolving project designed to grow with t
 This Blueprint is designed to be a foundation you can build upon. By extending its capabilities, you can open the door to new applications, improve user experience, and adapt the Blueprint to address other use cases. Here are a few ideas for how you can expand its potential:

-- **Multi-language podcast generation:** Add support for multi-language podcast generation to expand the reach of this Blueprint.
 - **New modalities input:** Add support to the Blueprint to be able to handle different input modalities, like audio or images, enabling more flexibility in podcast generation.
 - **Improved audio quality:** Explore and integrate more advanced open-source TTS frameworks to enhance the quality of generated audio, making podcasts sound more natural.
```

docs/getting-started.md

Lines changed: 0 additions & 8 deletions
````diff
@@ -36,11 +36,3 @@ pip install -e .
 ```bash
 python -m streamlit run demo/app.py
 ```
-
-### [Optional]: Use Parler models for text-to-speech
-
-If you want to use the [parler tts](https://github.com/huggingface/parler-tts) models, you will need to **additionally** install an optional dependency by running:
-```bash
-pip install -e '.[parler]'
-```
````

docs/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -11,7 +11,7 @@ These docs are your companion to mastering the **Document-to-Podcast Blueprint**
 ### Built with
 - Python 3.10+
 - [Llama-cpp](https://github.com/abetlen/llama-cpp-python) (text-to-text, i.e script generation)
-- [OuteAI](https://github.com/edwko/OuteTTS) / [Parler_tts](https://github.com/huggingface/parler-tts) (text-to-speech, i.e audio generation)
+- [OuteAI](https://github.com/edwko/OuteTTS) (text-to-speech, i.e audio generation)
 - [Streamlit](https://streamlit.io/) (UI demo)
```

docs/step-by-step-guide.md

Lines changed: 2 additions & 3 deletions
```diff
@@ -160,17 +160,16 @@ In this final step, the generated podcast transcript is brought to life as an au
 **1 - Model Loading**

-- The [`model_loader.py`](api.md/#document_to_podcast.inference.model_loaders) module is responsible for loading the `text-to-speech` models using the `outetts` and `parler_tts` libraries.
+- The [`model_loader.py`](api.md/#document_to_podcast.inference.model_loaders) module is responsible for loading the `text-to-text` and `text-to-speech` models.

 - The function `load_outetts_model` takes a model ID in the format `{org}/{repo}/(unknown)` and loads the specified model, either on CPU or GPU, based on the `device` parameter. The parameter `language` also enables to swap between the languages the Oute package supports (as of Dec 2024: `en, zh, ja, ko`)

-- The function `load_parler_tts_model_and_tokenizer` takes a model ID in the format `{repo}/(unknown)` and loads the specified model and tokenizer, either on CPU or GPU, based on the `device` parameter.

 **2 - Text-to-Speech Audio Generation**

 - The [`text_to_speech.py`](api.md/#document_to_podcast.inference.text_to_speech) script converts text into audio using a specified TTS model.

-- A **speaker profile** defines the voice characteristics (e.g., tone, speed, clarity) for each speaker. This is specific to each TTS package. Oute models require one of the IDs specified [here](https://github.com/edwko/OuteTTS/tree/main/outetts/version/v1/default_speakers). Parler requires natural language description of the speaker's voice and you have to use a pre-defined name (see [here](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md#speaker-consistency))
+- A **speaker profile** defines the voice characteristics (e.g., tone, speed, clarity) for each speaker. This is specific to each TTS package. Oute models require one of the IDs specified [here](https://github.com/edwko/OuteTTS/tree/main/outetts/version/v1/default_speakers).

 - The function `text_to_speech` takes the input text (e.g. podcast script) and speaker profile, generating a waveform (audio data in a numpy array) that represents the spoken version of the text.
```
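The contract the guide describes — script text plus a speaker profile in, a numpy waveform out — can be illustrated with a self-contained stand-in. The tone-generating "model" below is purely hypothetical (a real TTS model synthesizes speech, not a sine wave); only the function signature and return type mirror what the guide documents:

```python
import numpy as np

SAMPLE_RATE = 24_000  # assumed output rate for this sketch


def text_to_speech(input_text: str, model, voice_profile: str) -> np.ndarray:
    """Return a 1-D float32 waveform for the text.

    Stand-in behavior: clip length scales with text length, and the
    voice profile is faked by picking a pitch.
    """
    duration_s = 0.05 * len(input_text)  # crude pacing heuristic
    t = np.linspace(0.0, duration_s, int(SAMPLE_RATE * duration_s), endpoint=False)
    pitch = 220.0 if voice_profile == "male_1" else 330.0
    return np.sin(2 * np.pi * pitch * t).astype(np.float32)


wave = text_to_speech("Welcome to our podcast!", model=None, voice_profile="female_1")
```

Downstream code only depends on this shape (a 1-D numpy array at a known sample rate), which is why segments from any backend can be stacked and written to a WAV file the same way.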

src/document_to_podcast/cli.py

Lines changed: 15 additions & 19 deletions
```diff
@@ -11,15 +11,14 @@
     Speaker,
     DEFAULT_PROMPT,
     DEFAULT_SPEAKERS,
-    SUPPORTED_TTS_MODELS,
+    TTS_LOADERS,
 )
 from document_to_podcast.inference.model_loaders import (
     load_llama_cpp_model,
-    load_outetts_model,
-    load_parler_tts_model_and_tokenizer,
+    load_tts_model,
 )
-from document_to_podcast.inference.text_to_text import text_to_text_stream
 from document_to_podcast.inference.text_to_speech import text_to_speech
+from document_to_podcast.inference.text_to_text import text_to_text_stream
 from document_to_podcast.preprocessing import DATA_CLEANERS, DATA_LOADERS
 from document_to_podcast.utils import stack_audio_segments

@@ -30,8 +29,9 @@ def document_to_podcast(
     output_folder: str | None = None,
     text_to_text_model: str = "allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf",
     text_to_text_prompt: str = DEFAULT_PROMPT,
-    text_to_speech_model: SUPPORTED_TTS_MODELS = "OuteAI/OuteTTS-0.1-350M-GGUF/OuteTTS-0.1-350M-FP16.gguf",
+    text_to_speech_model: TTS_LOADERS = "OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf",
     speakers: list[Speaker] | None = None,
+    outetts_language: str = "en",  # Only applicable to OuteTTS models
     from_config: str | None = None,
 ):
     """
@@ -70,8 +70,10 @@
         speakers (list[Speaker] | None, optional): The speakers for the podcast.
             Defaults to DEFAULT_SPEAKERS.

-        from_config (str, optional): The path to the config file. Defaults to None.
+        outetts_language (str): For OuteTTS models we need to specify which language to use.
+            Supported languages in 0.2-500M: en, zh, ja, ko. More info: https://github.com/edwko/OuteTTS
+
+        from_config (str, optional): The path to the config file. Defaults to None.

             If provided, all other arguments will be ignored.
     """
@@ -86,6 +88,7 @@
             text_to_text_prompt=text_to_text_prompt,
             text_to_speech_model=text_to_speech_model,
             speakers=[Speaker.model_validate(speaker) for speaker in speakers],
+            outetts_language=outetts_language,
         )

     output_folder = Path(config.output_folder)
@@ -106,15 +109,9 @@
     text_model = load_llama_cpp_model(model_id=config.text_to_text_model)

     logger.info(f"Loading {config.text_to_speech_model}")
-    if "oute" in config.text_to_speech_model.lower():
-        speech_model = load_outetts_model(model_id=config.text_to_speech_model)
-        speech_tokenizer = None
-        sample_rate = speech_model.audio_codec.sr
-    else:
-        speech_model, speech_tokenizer = load_parler_tts_model_and_tokenizer(
-            model_id=config.text_to_speech_model
-        )
-        sample_rate = speech_model.config.sampling_rate
+    speech_model = load_tts_model(
+        model_id=config.text_to_speech_model, outetts_language=outetts_language
+    )

     # ~4 characters per token is considered a reasonable default.
     max_characters = text_model.n_ctx() * 4
@@ -151,22 +148,21 @@
                     text.split(f'"Speaker {speaker_id}":')[-1],
                     speech_model,
                     voice_profile,
-                    tokenizer=speech_tokenizer,  # Applicable only for parler models
                 )
                 podcast_audio.append(speech)
                 text = ""
+
     except KeyboardInterrupt:
         logger.warning("Podcast generation stopped by user.")
-
     logger.info("Saving Podcast...")
     complete_audio = stack_audio_segments(
-        podcast_audio, sample_rate=sample_rate, silence_pad=1.0
+        podcast_audio, sample_rate=speech_model.sample_rate, silence_pad=1.0
    )

     sf.write(
         str(output_folder / "podcast.wav"),
         complete_audio,
-        samplerate=sample_rate,
+        samplerate=speech_model.sample_rate,
     )
     (output_folder / "podcast.txt").write_text(podcast_script)
     logger.success("Done!")
```
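The CLI stitches per-line clips together with `stack_audio_segments(..., silence_pad=1.0)`. A plausible implementation of that behavior, sketched under the assumption (consistent with how it is called above, but not confirmed by this diff) that `silence_pad` is seconds of silence inserted *between* segments:

```python
import numpy as np


def stack_audio_segments(
    segments: list[np.ndarray], sample_rate: int, silence_pad: float = 1.0
) -> np.ndarray:
    """Concatenate 1-D audio segments with `silence_pad` seconds between them."""
    silence = np.zeros(int(sample_rate * silence_pad), dtype=np.float32)
    stacked = []
    for i, segment in enumerate(segments):
        if i > 0:
            stacked.append(silence)  # pad between segments, not before/after
        stacked.append(segment.astype(np.float32))
    # Empty input yields an empty array rather than raising.
    return np.concatenate(stacked) if stacked else silence[:0]
```

For example, two 1-second clips at a 1000 Hz sample rate with `silence_pad=1.0` come out as one 3000-sample array. Because every clip is already resampled by the single `speech_model`, one `sample_rate` value suffices for both stacking and `sf.write`.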
