Commit cadefa1

1 parent 6e8d922 commit cadefa1

File tree

5 files changed (+31, -29 lines)


index.html

Lines changed: 31 additions & 29 deletions
@@ -35,7 +35,7 @@ <h1 id="">
 DECOUPLING</center>
 </h1>

-<center> Xinfa Zhu, Yi Lei, kun Song, Yongmao Zhang, Tao Li, Lei Xie </center>
+<center> Xinfa Zhu, Yi Lei, Kun Song, Yongmao Zhang, Tao Li, Lei Xie </center>
 <center> Northwestern Polytechnical University </center>
 <!-- <center> Tencent AI Lab </center> -->

@@ -45,32 +45,35 @@ <h2>0. Contents</h2>
 <!-- <li><a href="#transfer">Examples of information perturbation</a></li> -->
 <li><a href="#prediction">Demos -- expressive speech with a specific style and emotion for target
 speakers</a></li>
-<!-- <li><a href="#control">Demos -- expressive speech with a specific style and emotion for unseen target speakers</a> -->
+<li><a href="#control">Demos -- long expressive speech with a specific style and various emotions for target
+speakers</a>
 </li>

 </ol>

 <br><br>
 <h2 id="abstract">1. Abstract<a name="abstract"></a></h2>
-<p> Expressive speech synthesis, aiming at generating stylistic and emotional speech for target speakers, is widely
-applied in human-computer interactions. Since, in many scenarios, only expressive data of some specific speakers
-are available for learning expressiveness, cross-speaker style, or emotion transfer are critical to the
-multi-speaker expressive speech synthesis. This paper proposes a novel framework for multi-speaker expressive
-speech synthesis via decoupling multiple factors (style, emotion, and speaker timbre). Specifically, we leverage a
-two-stage system using speaker-independent bottleneck (BN) features as the intermediate representations. The first
-stage tries to produce the speaker-independent BN features in the desired style and emotion. The second stage is
-to generate the stylistic and emotional waveform in the target speaker's timbre.Experimental results show that the
-proposed system outperforms the compared methods, which indicates the proposed
-system can decouple the multiple factors and flexibly recompose them for generating expressive speech of multiple
-speakers.
+<p> This paper aims to synthesize a target speaker's speech with a desired speaking style and emotion by transferring
+the style and emotion from reference speech recorded by other speakers. Specifically, we address this challenging
+problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a
+style-and-emotion-to-wave (SE2Wave) module, bridged by neural
+bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style, and emotion)
+decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to
+respectively discretize the extracted
+embeddings and disentangle these highly entangled factors in both the Text2SE and SE2Wave modules. Moreover, we
+introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labelled
+data, style-labelled data, and unlabeled data. To better transfer fine-grained expressiveness from references
+to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an
+attention-based reference selection approach.
+Extensive experiments demonstrate the effectiveness of our design.
 </p>
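The multi-label binary vector (MBV) discretization mentioned in the abstract can be illustrated with a minimal sketch. This is a hypothetical NumPy forward pass under assumed shapes; the paper's actual MBV formulation and its gradient handling (typically a straight-through estimator) are not shown in this diff.

```python
import numpy as np

def multi_label_binary_vector(logits, threshold=0.0):
    """Discretize a continuous embedding into a multi-label binary vector.

    Illustrative assumption: each dimension is independently thresholded to
    0 or 1, so a style/emotion embedding becomes a compact binary code.
    Training would normally pass gradients straight through the hard
    threshold; only the forward computation is sketched here.
    """
    return (np.asarray(logits, dtype=np.float32) > threshold).astype(np.float32)

# Hypothetical 4-dimensional style embedding
style_logits = np.array([1.3, -0.7, 0.2, -2.1])
print(multi_label_binary_vector(style_logits))  # -> [1. 0. 1. 0.]
```

Because every code is forced onto the binary hypercube, embeddings extracted from different utterances of the same style tend to collapse to the same pattern, which is one way such a bottleneck aids disentanglement.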
 <center><img src='fig/architecture.png'></center>
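The reference-candidate pool with attention-based reference selection could work roughly as follows. This is an illustrative sketch, not the paper's implementation: the dot-product scoring, the embedding shapes, and the soft (rather than hard) selection are all assumptions.

```python
import numpy as np

def select_reference(query, candidate_pool, temperature=1.0):
    """Attention-style soft selection over a pool of reference embeddings.

    query:          (d,)   embedding of the current target text/utterance
    candidate_pool: (n, d) embeddings of candidate reference utterances
    Returns the attention-weighted reference embedding and the weights.
    Dot-product scoring and shapes are illustrative assumptions.
    """
    scores = candidate_pool @ query / temperature   # (n,) similarity scores
    weights = np.exp(scores - scores.max())         # numerically stable softmax
    weights /= weights.sum()
    return weights @ candidate_pool, weights

# Hypothetical 2-dimensional embeddings, pool of three candidates
pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
ref, w = select_reference(np.array([1.0, 0.1]), pool)
```

A soft, differentiable selection like this lets gradients reach every candidate during training; at inference one could equally take the arg-max candidate for a hard choice.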
 <br><br>


 <h2>2. Demos -- expressive speech with a specific style and emotion for target speakers<a name="prediction"></a>
 </h2>
-<h3>Convert the emotion and style expresssions from different source speakers to the neutral target speakers without
+<h3>Convert the emotion and style expressions from different source speakers to the target speakers without
 emotional and stylistic training data.</h3>

 <p><b>Target speaker: M1 </b></p>
@@ -681,12 +684,11 @@ <h3>Convert the emotion and style expresssions from different source speakers to
 <!-- <p><b>Short summary:</b> The results indicate the effectiveness of our proposed method can successfully transfer the
 source emotion to the target speaker while maintaining the target speaker's timbre.</p> -->

-<!-- <br><br>
-<h2>3. Demos -- expressive speech with a specific style and emotion for low-quality speech in few-shot speaker
-adaptation<a name="control"></a></h2>
-<h3>Apply the proposed method to the low-quality speech in few-shot speaker adaptation. We synthesize long segments
-of speech to feel style attributes and emotional changes.</h3> -->
-<!-- <table>
+<br><br>
+<h2>3. Demos -- long expressive speech with a specific style and various emotions for target
+speakers<a name="control"></a></h2>
+<h3>We synthesize long speech segments to showcase the style attributes and emotional variation.</h3>
+<table>
 <thead>
 <tr>
 <th style="text-align: center"><strong>Style</strong></th>
@@ -717,8 +719,8 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
 with surprise and delight, Oh, it's a cabin.)<br />
 (surprise)进入小木屋后,里面竟然整齐排列着七张小小的床。(English: After entering the cabin, there were seven small beds neatly
 arranged in it.)</td>
-<td style="text-align: left"><audio src="subjective/few-shot/male/33.wav" controls="" preload=""></audio></td>
-<td style="text-align: left"><audio src="subjective/few-shot/male/style+emotion2.wav" controls=""
+<td style="text-align: left"><audio src="subjective/ref_speaker/90_03_16.wav" controls="" preload=""></audio></td>
+<td style="text-align: left"><audio src="subjective/paragraphs/91.wav" controls=""
 preload=""></audio></td>
 </tr>
 <tr>
@@ -742,9 +744,9 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
 with surprise and delight, Oh, it's a cabin.)<br />
 (surprise)进入小木屋后,里面竟然整齐排列着七张小小的床。(English: After entering the cabin, there were seven small beds neatly
 arranged in it.)</td>
-<td style="text-align: left"><audio src="subjective/few-shot/female/46.wav" controls="" preload=""></audio>
+<td style="text-align: left"><audio src="subjective/ref_speaker/270_03_16.wav" controls="" preload=""></audio>
 </td>
-<td style="text-align: left"><audio src="subjective/few-shot/female/style+emotion2.wav" controls=""
+<td style="text-align: left"><audio src="subjective/paragraphs/271.wav" controls=""
 preload=""></audio></td>
 </tr>
 <tr>
@@ -773,8 +775,8 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
 voice, "this. What on earth is going on? why are the officers and soldiers here?")<br />
 (sad)那少妇摇了摇头,道:"昨日前线传来消息,说这次御驾亲征已然惨败。"(English: The young woman shook her head and said, "there was news
 from the front yesterday that the expedition had failed miserably.")</td>
-<td style="text-align: left"><audio src="subjective/few-shot/male/33.wav" controls="" preload=""></audio></td>
-<td style="text-align: left"><audio src="subjective/few-shot/male/style+emotion1.wav" controls=""
+<td style="text-align: left"><audio src="subjective/ref_speaker/90_03_16.wav" controls="" preload=""></audio></td>
+<td style="text-align: left"><audio src="subjective/paragraphs/92.wav" controls=""
 preload=""></audio></td>
 </tr>
 <tr>
@@ -803,15 +805,15 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
 voice, "this. What on earth is going on? why are the officers and soldiers here?")<br />
 (sad)那少妇摇了摇头,道:"昨日前线传来消息,说这次御驾亲征已然惨败。"(English: The young woman shook her head and said, "there was news
 from the front yesterday that the expedition had failed miserably.")</td>
-<td style="text-align: left"><audio src="subjective/few-shot/female/46.wav" controls="" preload=""></audio>
+<td style="text-align: left"><audio src="subjective/ref_speaker/270_03_16.wav" controls="" preload=""></audio>
 </td>
-<td style="text-align: left"><audio src="subjective/few-shot/female/style+emotion1.wav" controls=""
+<td style="text-align: left"><audio src="subjective/paragraphs/272.wav" controls=""
 preload=""></audio></td>
 </tr>


 </tbody>
-</table> -->
+</table>

 <!-- <p><b>Short summary:</b> Given different speaker and emotion embedding during inference, the Speaker-mel gererator
 could provide emotionless speech with specific timbre, while the output of the Emotion-mel gererator contains the

subjective/paragraphs/271.wav (2.59 MB, binary file not shown)

subjective/paragraphs/272.wav (3.96 MB, binary file not shown)

subjective/paragraphs/91.wav (2.59 MB, binary file not shown)

subjective/paragraphs/92.wav (3.96 MB, binary file not shown)
