@@ -35,7 +35,7 @@ <h1 id="">
     DECOUPLING</center>
 </h1>
 
-<center> Xinfa Zhu, Yi Lei, kun Song, Yongmao Zhang, Tao Li, Lei Xie </center>
+<center> Xinfa Zhu, Yi Lei, Kun Song, Yongmao Zhang, Tao Li, Lei Xie </center>
 <center> Northwestern Polytechnical University </center>
 <!-- <center> Tencent AI Lab </center> -->
 
@@ -45,32 +45,35 @@ <h2>0. Contents</h2>
     <!-- <li><a href="#transfer">Examples of information perturbation</a></li> -->
     <li><a href="#prediction">Demos -- expressive speech with a specific style and emotion for target
         speakers</a></li>
-    <!-- <li><a href="#control">Demos -- expressive speech with a specific style and emotion for unseen target speakers</a> -->
+    <li><a href="#control">Demos -- long expressive speech with a specific style and various emotions for target
+        speakers</a>
     </li>
 
 </ol>
 
 <br><br>
 <h2 id="abstract">1. Abstract<a name="abstract"></a></h2>
-<p> Expressive speech synthesis, aiming at generating stylistic and emotional speech for target speakers, is widely
-    applied in human-computer interactions. Since, in many scenarios, only expressive data of some specific speakers
-    are available for learning expressiveness, cross-speaker style, or emotion transfer are critical to the
-    multi-speaker expressive speech synthesis. This paper proposes a novel framework for multi-speaker expressive
-    speech synthesis via decoupling multiple factors (style, emotion, and speaker timbre). Specifically, we leverage a
-    two-stage system using speaker-independent bottleneck (BN) features as the intermediate representations. The first
-    stage tries to produce the speaker-independent BN features in the desired style and emotion. The second stage is
-    to generate the stylistic and emotional waveform in the target speaker's timbre.Experimental results show that the
-    proposed system outperforms the compared methods, which indicates the proposed
-    system can decouple the multiple factors and flexibly recompose them for generating expressive speech of multiple
-    speakers.
+<p> This paper aims to synthesize the target speaker's speech with a desired speaking style and emotion by
+    transferring the style and emotion from reference speech recorded by other speakers. Specifically, we
+    address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion
+    (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridged by neural bottleneck (BN)
+    features. To further solve the multi-factor (speaker timbre, speaking style, and emotion) decoupling
+    problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to
+    respectively discretize the extracted embeddings and disentangle these highly entangled factors in
+    both the Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to
+    leverage data from multiple speakers, including emotion-labelled data, style-labelled data, and
+    unlabeled data. To better transfer fine-grained expressiveness from references to the target speaker
+    in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based
+    reference selection approach. Extensive experiments demonstrate the effectiveness of our design.
 </p>
 <center><img src='fig/architecture.png'></center>
 <br><br>
 
 
 <h2>2. Demos -- expressive speech with a specific style and emotion for target speakers<a name="prediction"></a>
 </h2>
-<h3>Convert the emotion and style expresssions from different source speakers to the neutral target speakers without
+<h3>Convert the emotion and style expressions from different source speakers to the target speakers without
     emotional and stylistic training data.</h3>
 
 <p><b>Target speaker: M1 </b></p>
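The abstract's multi-label binary vector (MBV) step can be illustrated with a minimal sketch: each dimension of a continuous style or emotion embedding is squashed through a sigmoid and thresholded to a 0/1 attribute flag, discarding the continuous magnitudes that can leak speaker timbre. The function name and the fixed 0.5 threshold are illustrative assumptions; the trained model uses a differentiable (straight-through) binarization rather than the hard threshold shown here.

```python
import math

def mbv_discretize(embedding, threshold=0.5):
    """Map a continuous embedding to a multi-label binary vector (MBV).

    Each dimension becomes a 0/1 presence flag for one latent attribute.
    Illustrative sketch only: training-time MBV binarization must stay
    differentiable, which a hard threshold is not.
    """
    return [1.0 if 1.0 / (1.0 + math.exp(-v)) > threshold else 0.0
            for v in embedding]

# Example: a 4-dimensional style embedding reduced to binary attribute flags
print(mbv_discretize([2.1, -0.7, 0.3, -3.2]))  # [1.0, 0.0, 1.0, 0.0]
```

Because the resulting vector carries only attribute presence, downstream MI minimization between such codes and the speaker embedding has far fewer channels through which timbre can leak.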
@@ -681,12 +684,11 @@ <h3>Convert the emotion and style expresssions from different source speakers to
     <!-- <p><b>Short summary:</b> The results indicate the effectiveness of our proposed method can successfully transfer the
     source emotion to the target speaker while maintaining the target speaker's timbre.</p> -->
 
-    <!-- <br><br>
-    <h2>3. Demos -- expressive speech with a specific style and emotion for low-quality speech in few-shot speaker
-        adaptation<a name="control"></a></h2>
-    <h3>Apply the proposed method to the low-quality speech in few-shot speaker adaptation. We synthesize long segments
-        of speech to feel style attributes and emotional changes.</h3> -->
-    <!-- <table>
+    <br><br>
+    <h2>3. Demos -- long expressive speech with a specific style and various emotions for target
+        speakers<a name="control"></a></h2>
+    <h3>We synthesize long segments of speech so listeners can perceive the style attributes and emotional variation.</h3>
+    <table>
     <thead>
         <tr>
             <th style="text-align: center"><strong>Style</strong></th>
@@ -717,8 +719,8 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
         with surprise and delight, Oh, it's a cabin.)<br/>
         (surprise)进入小木屋后,里面竟然整齐排列着七张小小的床。(English: After entering the cabin, there were seven small beds neatly
         arranged in it.)</td>
-    <td style="text-align: left"><audio src="subjective/few-shot/male/33.wav" controls="" preload=""></audio></td>
-    <td style="text-align: left"><audio src="subjective/few-shot/male/style+emotion2.wav" controls=""
+    <td style="text-align: left"><audio src="subjective/ref_speaker/90_03_16.wav" controls="" preload=""></audio></td>
+    <td style="text-align: left"><audio src="subjective/paragraphs/91.wav" controls=""
         preload=""></audio></td>
     </tr>
     <tr>
@@ -742,9 +744,9 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
         with surprise and delight, Oh, it's a cabin.)<br/>
         (surprise)进入小木屋后,里面竟然整齐排列着七张小小的床。(English: After entering the cabin, there were seven small beds neatly
         arranged in it.)</td>
-    <td style="text-align: left"><audio src="subjective/few-shot/female/46.wav" controls="" preload=""></audio>
+    <td style="text-align: left"><audio src="subjective/ref_speaker/270_03_16.wav" controls="" preload=""></audio>
     </td>
-    <td style="text-align: left"><audio src="subjective/few-shot/female/style+emotion2.wav" controls=""
+    <td style="text-align: left"><audio src="subjective/paragraphs/271.wav" controls=""
         preload=""></audio></td>
     </tr>
     <tr>
@@ -773,8 +775,8 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
         voice, "this. What on earth is going on? why are the officers and soldiers here?")<br/>
         (sad)那少妇摇了摇头,道:"昨日前线传来消息,说这次御驾亲征已然惨败。"(English: The young woman shook her head and said, "there was news
         from the front yesterday that the expedition had failed miserably.")</td>
-    <td style="text-align: left"><audio src="subjective/few-shot/male/33.wav" controls="" preload=""></audio></td>
-    <td style="text-align: left"><audio src="subjective/few-shot/male/style+emotion1.wav" controls=""
+    <td style="text-align: left"><audio src="subjective/ref_speaker/90_03_16.wav" controls="" preload=""></audio></td>
+    <td style="text-align: left"><audio src="subjective/paragraphs/92.wav" controls=""
         preload=""></audio></td>
     </tr>
     <tr>
@@ -803,15 +805,15 @@ <h3>Apply the proposed method to the low-quality speech in few-shot speaker adap
         voice, "this. What on earth is going on? why are the officers and soldiers here?")<br/>
         (sad)那少妇摇了摇头,道:"昨日前线传来消息,说这次御驾亲征已然惨败。"(English: The young woman shook her head and said, "there was news
         from the front yesterday that the expedition had failed miserably.")</td>
-    <td style="text-align: left"><audio src="subjective/few-shot/female/46.wav" controls="" preload=""></audio>
+    <td style="text-align: left"><audio src="subjective/ref_speaker/270_03_16.wav" controls="" preload=""></audio>
     </td>
-    <td style="text-align: left"><audio src="subjective/few-shot/female/style+emotion1.wav" controls=""
+    <td style="text-align: left"><audio src="subjective/paragraphs/272.wav" controls=""
         preload=""></audio></td>
     </tr>
 
 
     </tbody>
-    </table> -->
+    </table>
 
     <!-- <p><b>Short summary:</b> Given different speaker and emotion embedding during inference, the Speaker-mel gererator
     could provide emotionless speech with specific timbre, while the output of the Emotion-mel gererator contains the