[TUTORIAL DEPENDENCY REMOVAL] Use char-based pipeline instead of phoneme one for tacotron tutorial #4028


Merged 2 commits on Aug 12, 2025
61 changes: 4 additions & 57 deletions examples/tutorials/tacotron2_pipeline_tutorial.py
@@ -19,7 +19,7 @@
# 1. Text preprocessing
#
# First, the input text is encoded into a list of symbols. In this
-# tutorial, we will use English characters and phonemes as the symbols.
+# tutorial, we will use English characters as the symbols.
#
# 2. Spectrogram generation
#
@@ -47,16 +47,6 @@
# Preparation
# -----------
#
-# First, we install the necessary dependencies. In addition to
-# ``torchaudio``, ``DeepPhonemizer`` is required to perform phoneme-based
-# encoding.
-#
-
-# %%
-# .. code-block:: bash
-#
-#    %%bash
-#    pip3 install deep_phonemizer

import torch
import torchaudio
@@ -140,49 +130,6 @@ def text_to_sequence(text):
print([processor.tokens[i] for i in processed[0, : lengths[0]]])
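For reference, the character-based encoding that this PR keeps as the only path can be sketched without torchaudio at all: each character is looked up in a fixed symbol table and out-of-table characters are dropped. The table below is illustrative only — the authoritative table ships inside the pretrained bundle's text processor.

```python
# Minimal sketch of character-based encoding (illustrative symbol table,
# not the exact one used by the pretrained Tacotron2 character pipeline).
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}


def text_to_sequence(text):
    # Lower-case the input and drop any character not in the table.
    text = text.lower()
    return [look_up[s] for s in text if s in look_up]


print(text_to_sequence("Hello world! Text to speech!"))
```

This mirrors the manual encoding the tutorial walks through before introducing the bundle's `get_text_processor()`, which performs the same mapping consistently with what the pretrained model was trained on.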


-######################################################################
-# Phoneme-based encoding
-# ~~~~~~~~~~~~~~~~~~~~~~
-#
-# Phoneme-based encoding is similar to character-based encoding, but it
-# uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
-# model.
-#
-# The detail of the G2P model is out of the scope of this tutorial, we will
-# just look at what the conversion looks like.
-#
-# Similar to the case of character-based encoding, the encoding process is
-# expected to match what a pretrained Tacotron2 model is trained on.
-# ``torchaudio`` has an interface to create the process.
-#
-# The following code illustrates how to make and use the process. Behind
-# the scene, a G2P model is created using ``DeepPhonemizer`` package, and
-# the pretrained weights published by the author of ``DeepPhonemizer`` is
-# fetched.
-#
-
-bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
-
-processor = bundle.get_text_processor()
-
-text = "Hello world! Text to speech!"
-with torch.inference_mode():
-    processed, lengths = processor(text)
-
-print(processed)
-print(lengths)
-
-
-######################################################################
-# Notice that the encoded values are different from the example of
-# character-based encoding.
-#
-# The intermediate representation looks like the following.
-#
-
-print([processor.tokens[i] for i in processed[0, : lengths[0]]])
-

######################################################################
# Spectrogram Generation
# ----------------------
@@ -202,7 +149,7 @@ def text_to_sequence(text):
# :py:class:`~torchaudio.pipelines.Tacotron2TTSBundle`.
#

-bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
+bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
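Behind `Tacotron2.infer`, the decoder produces the spectrogram autoregressively: one frame at a time, feeding each generated frame back in until a stop token fires or a step limit is hit. A toy sketch of that loop — the `decode` and `toy_step` names are hypothetical, not torchaudio API:

```python
import numpy as np


def decode(step_fn, n_mels=80, max_steps=200):
    """Run a step function autoregressively until it signals stop."""
    frame = np.zeros(n_mels)
    frames = []
    for _ in range(max_steps):
        frame, stop = step_fn(frame)
        frames.append(frame)
        if stop:
            break
    return np.stack(frames, axis=1)  # shape: (n_mels, n_frames)


# Hypothetical step function: emits increasing frames, stops after 5 steps.
state = {"t": 0}


def toy_step(prev_frame):
    state["t"] += 1
    return prev_frame + 1.0, state["t"] >= 5


spec = decode(toy_step, n_mels=3)
```

The real decoder conditions each step on the encoded text via attention; this sketch only shows the feedback-and-stop control flow that explains why generated spectrogram lengths vary run to run.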

@@ -256,7 +203,7 @@ def plot():
# WaveRNN model from the same bundle.
#

-bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
+bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
@@ -299,7 +246,7 @@ def plot(waveforms, spec, sample_rate):
# method and pass the spectrogram.
#

-bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
+bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
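Griffin-Lim recovers a waveform from a magnitude spectrogram by alternating between time and frequency domains: keep the target magnitudes, re-estimate the phase from the resynthesized signal, and repeat. A minimal NumPy sketch of the algorithm — not torchaudio's implementation, and the window/hop values are arbitrary:

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    # Windowed frames -> one-sided FFT per frame.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=1)


def istft(X, n_fft=256, hop=64):
    # Weighted overlap-add inverse with window-power normalization.
    win = np.hanning(n_fft)
    out = np.zeros(hop * (X.shape[0] - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, spec in enumerate(X):
        frame = np.fft.irfft(spec, n=n_fft)
        out[i * hop:i * hop + n_fft] += frame * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)


def griffin_lim(mag, n_iter=32, n_fft=256, hop=64):
    # Start from random phase, then iteratively project onto the set of
    # signals whose STFT magnitude matches the target.
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles, n_fft, hop)
        angles = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * angles, n_fft, hop)


# Demo: magnitude of a pure tone, reconstructed without its original phase.
t = np.arange(2048)
signal = np.sin(2 * np.pi * 440 * t / 8000)
mag = np.abs(stft(signal))
recon = griffin_lim(mag)
```

Because the phase is estimated rather than modeled, Griffin-Lim output is typically noisier than the WaveRNN vocoder above — which is the trade-off this section of the tutorial demonstrates.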