 #
 # 2. Spectrogram generation
 #
-# From the encoded text, a spectrogram is generated. We use ``Tacotron2``
+# From the encoded text, a spectrogram is generated. We use the ``Tacotron2``
 # model for this.
 #
 # 3. Time-domain conversion
 #
 # The last step is converting the spectrogram into the waveform. The
-# process to generate speech from spectrogram is also called Vocoder.
+# process to generate speech from a spectrogram is also called a vocoder.
 # In this tutorial, three different vocoders are used,
 # :py:class:`~torchaudio.models.WaveRNN`,
 # :py:class:`~torchaudio.transforms.GriffinLim`, and
 # works.
 #
 # Since the pre-trained Tacotron2 model expects a specific set of symbol
-# tables, the same functionalities available in ``torchaudio``. This
-# section is more for the explanation of the basis of encoding.
+# tables, the same functionality is available in ``torchaudio``. However,
+# we will first manually implement the encoding to aid understanding.
 #
-# Firstly, we define the set of symbols. For example, we can use
+# First, we define the set of symbols
 # ``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we will map
 # each character of the input text into the index of the corresponding
-# symbol in the table.
-#
-# The following is an example of such processing. In the example, symbols
-# that are not in the table are ignored.
-#
+# symbol in the table. Symbols that are not in the table are ignored.
 
 symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
 look_up = {s: i for i, s in enumerate(symbols)}
@@ -118,8 +114,8 @@ def text_to_sequence(text):
 
 ######################################################################
 # As mentioned above, the symbol table and indices must match
-# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the
-# transform along with the pretrained model. For example, you can
+# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the same
+# transform along with the pretrained model. You can
 # instantiate and use such a transform as follows.
 #
 
@@ -133,12 +129,12 @@ def text_to_sequence(text):
 
 
 ######################################################################
-# The ``processor`` object takes either a text or list of texts as inputs.
+# Note: the output of our manual encoding matches the output of the
+# ``torchaudio`` ``text_processor``, confirming that we correctly
+# re-implemented what the library does internally. The processor takes
+# either a text or a list of texts as inputs.
 # When a list of texts is provided, the returned ``lengths`` variable
 # represents the valid length of each processed token sequence in the
 # output batch.
 #
-# The intermediate representation can be retrieved as follow.
+# The intermediate representation can be retrieved as follows:
 #
 
 print([processor.tokens[i] for i in processed[0, :lengths[0]]])
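The batch semantics described above (right-padded sequences plus a ``lengths`` vector marking the valid part of each row) can be sketched in plain Python using the manual character table, without the pretrained processor; this is an illustration of the idea, not the ``torchaudio`` implementation:

```python
# Encode a batch of texts, pad to a rectangular batch, and report per-row
# valid lengths, mirroring the behavior described for the text processor.
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}

def process_batch(texts, pad=0):
    seqs = [[look_up[c] for c in t.lower() if c in look_up] for t in texts]
    lengths = [len(s) for s in seqs]
    width = max(lengths)
    # Right-pad every sequence to the longest one in the batch.
    batch = [s + [pad] * (width - len(s)) for s in seqs]
    return batch, lengths

batch, lengths = process_batch(["Hello!", "Hi"])
# Only the first ``lengths[0]`` entries of row 0 are valid tokens.
print("".join(symbols[i] for i in batch[0][: lengths[0]]))  # → hello!
```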
@@ -152,7 +148,7 @@ def text_to_sequence(text):
 # uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
 # model.
 #
-# The detail of the G2P model is out of scope of this tutorial, we will
+# The details of the G2P model are out of the scope of this tutorial; we will
 # just look at what the conversion looks like.
 #
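To give a feel for what a grapheme-to-phoneme conversion produces, here is a toy lookup. It is entirely hypothetical: the real bundle uses a trained G2P model, not a word dictionary, and the phoneme symbols below are just illustrative ARPABET-style labels.

```python
# Hypothetical word-to-phoneme table (NOT the real G2P model).
phoneme_table = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def toy_g2p(text):
    # Look up each word; fall back to spelling out unknown words.
    return [p for w in text.lower().split() for p in phoneme_table.get(w, list(w))]

print(toy_g2p("Hello world"))
# → ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```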
 # Similar to the case of character-based encoding, the encoding process is
@@ -195,7 +191,7 @@ def text_to_sequence(text):
 # encoded text. For the details of the model, please refer to `the
 # paper <https://arxiv.org/abs/1712.05884>`__.
 #
-# It is easy to instantiate a Tacotron2 model with pretrained weight,
+# It is easy to instantiate a Tacotron2 model with pretrained weights,
 # however, note that the input to Tacotron2 models needs to be processed
 # by the matching text processor.
 #
@@ -224,7 +220,7 @@ def text_to_sequence(text):
 
 ######################################################################
 # Note that the ``Tacotron2.infer`` method performs multinomial sampling,
-# therefor, the process of generating the spectrogram incurs randomness.
+# therefore, the process of generating the spectrogram incurs randomness.
 #
 
 
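The randomness noted above can be illustrated in plain Python. The sketch below is not the Tacotron2 code (which samples with ``torch``'s RNG); it only shows why multinomial sampling makes the output stochastic, and why fixing the seed makes runs reproducible:

```python
import random

def multinomial_sample(weights, rng):
    # Draw one index with probability proportional to its weight.
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

# Two generators with the same seed produce identical draws; an unseeded
# run would generally differ from one invocation to the next.
rng_a = random.Random(0)
rng_b = random.Random(0)
draws_a = [multinomial_sample([0.1, 0.6, 0.3], rng_a) for _ in range(5)]
draws_b = [multinomial_sample([0.1, 0.6, 0.3], rng_b) for _ in range(5)]
print(draws_a == draws_b)  # → True
```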
@@ -245,16 +241,16 @@ def plot():
 # -------------------
 #
 # Once the spectrogram is generated, the last process is to recover the
-# waveform from the spectrogram.
+# waveform from the spectrogram using a vocoder.
 #
 # ``torchaudio`` provides vocoders based on ``GriffinLim`` and
 # ``WaveRNN``.
 #
 
 
 ######################################################################
-# WaveRNN
-# ~~~~~~~
+# WaveRNN Vocoder
+# ~~~~~~~~~~~~~~~
 #
 # Continuing from the previous section, we can instantiate the matching
 # WaveRNN model from the same bundle.
@@ -294,11 +290,11 @@ def plot(waveforms, spec, sample_rate):
 
 
 ######################################################################
-# Griffin-Lim
-# ~~~~~~~~~~~
+# Griffin-Lim Vocoder
+# ~~~~~~~~~~~~~~~~~~~
 #
 # Using the Griffin-Lim vocoder is the same as WaveRNN. You can instantiate
-# the vocode object with
+# the vocoder object with the
 # :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
 # method and pass the spectrogram.
 #
@@ -323,8 +319,8 @@ def plot(waveforms, spec, sample_rate):
 
 
 ######################################################################
-# Waveglow
-# ~~~~~~~~
+# Waveglow Vocoder
+# ~~~~~~~~~~~~~~~~
 #
 # Waveglow is a vocoder published by Nvidia. The pretrained weights are
 # published on Torch Hub. One can instantiate the model using ``torch.hub``