Commit 17a7081

Update tacotron2_pipeline_tutorial.py (#3759)
* Update tacotron2_pipeline_tutorial.py - Fixed typo - Clarified what was being done in different sections
1 parent 1bc1479 commit 17a7081

File tree

1 file changed: +21 −25 lines


examples/tutorials/tacotron2_pipeline_tutorial.py

Lines changed: 21 additions & 25 deletions
@@ -23,13 +23,13 @@
 #
 # 2. Spectrogram generation
 #
-# From the encoded text, a spectrogram is generated. We use ``Tacotron2``
+# From the encoded text, a spectrogram is generated. We use the ``Tacotron2``
 # model for this.
 #
 # 3. Time-domain conversion
 #
 # The last step is converting the spectrogram into the waveform. The
-# process to generate speech from spectrogram is also called Vocoder.
+# process to generate speech from spectrogram is also called a Vocoder.
 # In this tutorial, three different vocoders are used,
 # :py:class:`~torchaudio.models.WaveRNN`,
 # :py:class:`~torchaudio.transforms.GriffinLim`, and
@@ -90,17 +90,13 @@
 # works.
 #
 # Since the pre-trained Tacotron2 model expects specific set of symbol
-# tables, the same functionalities available in ``torchaudio``. This
-# section is more for the explanation of the basis of encoding.
+# tables, the same functionalities is available in ``torchaudio``. However,
+# we will first manually implement the encoding to aid in understanding.
 #
-# Firstly, we define the set of symbols. For example, we can use
+# First, we define the set of symbols
 # ``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we will map the
 # each character of the input text into the index of the corresponding
-# symbol in the table.
-#
-# The following is an example of such processing. In the example, symbols
-# that are not in the table are ignored.
-#
+# symbol in the table. Symbols that are not in the table are ignored.
 
 symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
 look_up = {s: i for i, s in enumerate(symbols)}
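The character-based encoding this hunk describes can be run as a small, self-contained sketch; the table and lookup below come straight from the tutorial code, while `text_to_sequence` and the sample input are illustrative:

```python
# Symbol table and index lookup, as in the tutorial.
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}

def text_to_sequence(text):
    # Lower-case the input and map each character to its table index;
    # characters not in the table are silently ignored.
    return [look_up[s] for s in text.lower() if s in look_up]

print(text_to_sequence("Hello, world!"))
# → [19, 16, 23, 23, 26, 6, 11, 34, 26, 29, 23, 15, 2]
```

Note that the '@' in an input like "user@host" would simply be dropped, since it does not appear in the table.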
@@ -118,8 +114,8 @@ def text_to_sequence(text):
 
 ######################################################################
 # As mentioned in the above, the symbol table and indices must match
-# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the
-# transform along with the pretrained model. For example, you can
+# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the same
+# transform along with the pretrained model. You can
 # instantiate and use such transform as follow.
 #
 
@@ -133,12 +129,12 @@ def text_to_sequence(text):
 
 
 ######################################################################
-# The ``processor`` object takes either a text or list of texts as inputs.
+# Note: The output of our manual encoding and the ``torchaudio`` ``text_processor`` output matches (meaning we correctly re-implemented what the library does internally). It takes either a text or list of texts as inputs.
 # When a list of texts are provided, the returned ``lengths`` variable
 # represents the valid length of each processed tokens in the output
 # batch.
 #
-# The intermediate representation can be retrieved as follow.
+# The intermediate representation can be retrieved as follows:
 #
 
 print([processor.tokens[i] for i in processed[0, : lengths[0]]])
@@ -152,7 +148,7 @@ def text_to_sequence(text):
 # uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
 # model.
 #
-# The detail of the G2P model is out of scope of this tutorial, we will
+# The detail of the G2P model is out of the scope of this tutorial, we will
 # just look at what the conversion looks like.
 #
 # Similar to the case of character-based encoding, the encoding process is
@@ -195,7 +191,7 @@ def text_to_sequence(text):
 # encoded text. For the detail of the model, please refer to `the
 # paper <https://arxiv.org/abs/1712.05884>`__.
 #
-# It is easy to instantiate a Tacotron2 model with pretrained weight,
+# It is easy to instantiate a Tacotron2 model with pretrained weights,
 # however, note that the input to Tacotron2 models need to be processed
 # by the matching text processor.
 #
@@ -224,7 +220,7 @@ def text_to_sequence(text):
 
 ######################################################################
 # Note that ``Tacotron2.infer`` method perfoms multinomial sampling,
-# therefor, the process of generating the spectrogram incurs randomness.
+# therefore, the process of generating the spectrogram incurs randomness.
 #
 
 
@@ -245,16 +241,16 @@ def plot():
 # -------------------
 #
 # Once the spectrogram is generated, the last process is to recover the
-# waveform from the spectrogram.
+# waveform from the spectrogram using a vocoder.
 #
 # ``torchaudio`` provides vocoders based on ``GriffinLim`` and
 # ``WaveRNN``.
 #
 
 
 ######################################################################
-# WaveRNN
-# ~~~~~~~
+# WaveRNN Vocoder
+# ~~~~~~~~~~~~~~~
 #
 # Continuing from the previous section, we can instantiate the matching
 # WaveRNN model from the same bundle.
@@ -294,11 +290,11 @@ def plot(waveforms, spec, sample_rate):
 
 
 ######################################################################
-# Griffin-Lim
-# ~~~~~~~~~~~
+# Griffin-Lim Vocoder
+# ~~~~~~~~~~~~~~~~~~~
 #
 # Using the Griffin-Lim vocoder is same as WaveRNN. You can instantiate
-# the vocode object with
+# the vocoder object with
 # :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
 # method and pass the spectrogram.
 #
@@ -323,8 +319,8 @@ def plot(waveforms, spec, sample_rate):
 
 
 ######################################################################
-# Waveglow
-# ~~~~~~~~
+# Waveglow Vocoder
+# ~~~~~~~~~~~~~~~~
 #
 # Waveglow is a vocoder published by Nvidia. The pretrained weights are
 # published on Torch Hub. One can instantiate the model using ``torch.hub``
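A sketch of the Torch Hub path for Waveglow; the repo tag and `nvidia_waveglow` entry point follow Nvidia's DeepLearningExamples hub page, and the download requires network access plus that repo's dependencies:

```python
import torch

# Pretrained Waveglow published by Nvidia on Torch Hub.
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
)
# Remove weight normalization for inference, then generate audio from a
# mel spectrogram produced by Tacotron2 (e.g. ``spec`` from above).
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow.eval()

with torch.inference_mode():
    waveforms = waveglow.infer(spec)
```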
