Commit 1bf1a87

Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSpeech into add_new_tacotron2, test=tts
2 parents 3fd7a77 + f22ac5a commit 1bf1a87

File tree

76 files changed: 1947 additions, 412 deletions


.mergify.yml

Lines changed: 30 additions & 6 deletions

@@ -32,6 +32,12 @@ pull_request_rules:
     actions:
       label:
         remove: ["conflicts"]
+  - name: "auto add label=Dataset"
+    conditions:
+      - files~=^dataset/
+    actions:
+      label:
+        add: ["Dataset"]
   - name: "auto add label=S2T"
     conditions:
       - files~=^paddlespeech/s2t/
@@ -50,18 +56,30 @@ pull_request_rules:
     actions:
       label:
         add: ["Audio"]
-  - name: "auto add label=TextProcess"
+  - name: "auto add label=Vector"
+    conditions:
+      - files~=^paddlespeech/vector/
+    actions:
+      label:
+        add: ["Vector"]
+  - name: "auto add label=Text"
     conditions:
       - files~=^paddlespeech/text/
     actions:
       label:
-        add: ["TextProcess"]
+        add: ["Text"]
   - name: "auto add label=Example"
     conditions:
       - files~=^examples/
     actions:
       label:
         add: ["Example"]
+  - name: "auto add label=CLI"
+    conditions:
+      - files~=^paddlespeech/cli
+    actions:
+      label:
+        add: ["CLI"]
   - name: "auto add label=Demo"
     conditions:
       - files~=^demos/
@@ -70,13 +88,13 @@ pull_request_rules:
         add: ["Demo"]
   - name: "auto add label=README"
     conditions:
-      - files~=README.md
+      - files~=(README.md|READEME_cn.md)
     actions:
       label:
         add: ["README"]
   - name: "auto add label=Documentation"
     conditions:
-      - files~=^docs/
+      - files~=^(docs/|CHANGELOG.md|paddleaudio/CHANGELOG.md)
     actions:
       label:
         add: ["Documentation"]
@@ -88,10 +106,16 @@ pull_request_rules:
         add: ["CI"]
   - name: "auto add label=Installation"
     conditions:
-      - files~=^(tools/|setup.py|setup.sh)
+      - files~=^(tools/|setup.py|setup.cfg|setup_audio.py)
     actions:
       label:
         add: ["Installation"]
+  - name: "auto add label=Test"
+    conditions:
+      - files~=^(tests/)
+    actions:
+      label:
+        add: ["Test"]
   - name: "auto add label=mergify"
     conditions:
       - files~=^.mergify.yml
@@ -106,7 +130,7 @@ pull_request_rules:
         add: ["Docker"]
   - name: "auto add label=Deployment"
     conditions:
-      - files~=^speechnn/
+      - files~=^speechx/
     actions:
       label:
         add: ["Deployment"]
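Each `files~=` condition above is a regex matched against the paths changed in a pull request; any matching rule attaches its label. A minimal sketch of that matching logic (the `labels_for` helper is hypothetical, not part of the repo or of mergify itself), using only the patterns this commit adds:

```python
import re

# (pattern, label) pairs copied from the rules added in this commit.
NEW_RULES = [
    (r"^dataset/", "Dataset"),
    (r"^paddlespeech/vector/", "Vector"),
    (r"^paddlespeech/cli", "CLI"),
    (r"^(tests/)", "Test"),
    (r"^speechx/", "Deployment"),
]

def labels_for(changed_paths):
    """Return the set of labels whose pattern matches any changed path."""
    return {label
            for pattern, label in NEW_RULES
            for path in changed_paths
            if re.search(pattern, path)}

print(sorted(labels_for(["paddlespeech/cli/asr/infer.py",
                         "tests/unit/test_cli.py"])))
# → ['CLI', 'Test']
```

Note the patterns anchor themselves with `^` where a prefix match is intended, which is why `re.search` suffices here.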

CHANGELOG.md

Lines changed: 9 additions & 0 deletions

@@ -1,2 +1,11 @@
 # Changelog
 
+
+Date: 2022-1-10, Author: Jackwaterveg.
+Add features to: CLI:
+ - Support English (librispeech/asr1/transformer).
+ - Support choosing `decode_method` for conformer and transformer models.
+ - Refactor the config, using the unified config.
+ - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1297
+
+***
demos/speech_recognition/README.md

Lines changed: 7 additions & 0 deletions

@@ -23,7 +23,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ### 3. Usage
 - Command Line(Recommended)
 ```bash
+# Chinese
 paddlespeech asr --input ./zh.wav
+# English
+paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
 ```
 (It doesn't matter if package `paddlespeech-ctcdecoders` is not found, this package is optional.)
 
@@ -43,7 +46,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 
 Output:
 ```bash
+# Chinese
 [2021-12-08 13:12:34,063] [ INFO] [utils.py] [L225] - ASR Result: 我认为跑步最重要的就是给我带来了身体健康
+# English
+[2022-01-12 11:51:10,815] [ INFO] - ASR Result: i knocked at the door on the ancient side of the building
 ```
 
 - Python API
@@ -77,3 +83,4 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by
 | Model | Language | Sample Rate
 | :--- | :---: | :---: |
 | conformer_wenetspeech| zh| 16000
+| transformer_librispeech| en| 16000
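Per the README changes above, Chinese ASR uses the default `conformer_wenetspeech` model while English requires passing `--model transformer_librispeech --lang en`. A toy sketch of that invocation logic (the `asr_command` helper and its behavior are assumptions for illustration, not PaddleSpeech code):

```python
# Pretrained models from the table above, both sampled at 16000 Hz.
PRETRAINED = {
    "zh": "conformer_wenetspeech",      # the CLI default
    "en": "transformer_librispeech",
}

def asr_command(lang, wav):
    """Build the `paddlespeech asr` command line for a given language."""
    if lang == "zh":
        # Chinese is the default model, so only --input is needed.
        return f"paddlespeech asr --input {wav}"
    return (f"paddlespeech asr --model {PRETRAINED[lang]} "
            f"--lang {lang} --input {wav}")

print(asr_command("en", "./en.wav"))
# → paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
```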

demos/speech_recognition/README_cn.md

Lines changed: 8 additions & 1 deletion

@@ -2,7 +2,7 @@
 
 # 语音识别
 ## 介绍
-语音识别解决让计算机程序自动转录语音的问题
+语音识别是一项用计算机程序自动转录语音的技术
 
 这个 demo 是一个从给定音频文件识别文本的实现,它可以通过使用 `PaddleSpeech` 的单个命令或 python 中的几行代码来实现。
 ## 使用方法
@@ -21,7 +21,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ### 3. 使用方法
 - 命令行 (推荐使用)
 ```bash
+# 中文
 paddlespeech asr --input ./zh.wav
+# 英文
+paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
 ```
 (如果显示 `paddlespeech-ctcdecoders` 这个 python 包没有找到的 Error,没有关系,这个包是非必须的。)
 
@@ -41,7 +44,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 
 输出:
 ```bash
+# 中文
 [2021-12-08 13:12:34,063] [ INFO] [utils.py] [L225] - ASR Result: 我认为跑步最重要的就是给我带来了身体健康
+# 英文
+[2022-01-12 11:51:10,815] [ INFO] - ASR Result: i knocked at the door on the ancient side of the building
 ```
 
 - Python API
@@ -74,3 +80,4 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 | 模型 | 语言 | 采样率
 | :--- | :---: | :---: |
 | conformer_wenetspeech| zh| 16000
+| transformer_librispeech| en| 16000

docs/source/asr/augmentation.md

Lines changed: 0 additions & 40 deletions
This file was deleted.

docs/source/index.rst

Lines changed: 0 additions & 1 deletion

@@ -27,7 +27,6 @@ Contents
 
    asr/models_introduction
    asr/data_preparation
-   asr/augmentation
    asr/feature_list
    asr/ngram_lm
 

docs/source/released_model.md

Lines changed: 8 additions & 9 deletions

@@ -5,14 +5,13 @@
 ### Speech Recognition Model
 Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
 :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
-[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/aishell_ds2_online_cer8.00_release.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
-[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
-[Conformer Online Aishell ASR1 Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz) | Aishell Dataset | Char-based | 283 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0594 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
-[Conformer Offline Aishell ASR1 Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0547 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
-[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
-[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../example/librispeech/asr1)
-[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/transformer.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../example/librispeech/asr1)
-[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../example/librispeech/asr2)
+[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
+[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
+[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.056 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
+[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
+[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../example/librispeech/asr1)
+[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../example/librispeech/asr1)
+[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../example/librispeech/asr2)
 
 ### Language Model based on NGram
 Language Model | Training Data | Token-based | Size | Descriptions
@@ -25,7 +24,7 @@ Language Model | Training Data | Token-based | Size | Descriptions
 
 | Model | Training Data | Token-based | Size | Descriptions | BLEU | Example Link |
 | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
-| [Transformer FAT-ST MTL En-Zh](https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/fat_st_ted-en-zh.tar.gz) | Ted-En-Zh| Spm| | Encoder:Transformer, Decoder:Transformer, <br />Decoding method: Attention | 20.80 | [Transformer Ted-En-Zh ST1](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/ted_en_zh/st1) |
+| (only for CLI)[Transformer FAT-ST MTL En-Zh](https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz) | Ted-En-Zh| Spm| | Encoder:Transformer, Decoder:Transformer, <br />Decoding method: Attention | 20.80 | [Transformer Ted-En-Zh ST1](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/ted_en_zh/st1) |
 
 ## Text-to-Speech Models
 
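The rows edited above are pipe-delimited markdown table cells (model link, training data, token type, size, description, CER/WER, hours, example link). A throwaway sketch of splitting one such row into its cells; `parse_model_row` is a hypothetical helper for illustration, not repo code:

```python
def parse_model_row(row):
    """Split one pipe-delimited row of released_model.md into cells."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]

row = ("[Transformer Aishell ASR1 Model]"
       "(https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/"
       "asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz)"
       " | Aishell Dataset | Char-based | 128 MB")
print(parse_model_row(row)[1])
# → Aishell Dataset
```

This naive split works here because none of the cell contents in the table contain a literal `|`; a general markdown parser would need to handle escaped pipes.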
docs/source/tts/tts_papers.md

Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
+# TTS Papers
+## Text Frontend
+### Polyphone
+- [【g2pM】g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset](https://arxiv.org/abs/2004.03136)
+- [Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT](https://www1.se.cuhk.edu.hk/~hccl/publications/pub/201909_INTERSPEECH_DongyangDAI.pdf)
+### Text Normalization
+#### English
+- [applenob/text_normalization](https://github.com/applenob/text_normalization)
+### G2P
+#### English
+- [cmusphinx/g2p-seq2seq](https://github.com/cmusphinx/g2p-seq2seq)
+
+## Acoustic Models
+- [【AdaSpeech3】AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style](https://arxiv.org/abs/2107.02530)
+- [【AdaSpeech2】AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data](https://arxiv.org/abs/2104.09715)
+- [【AdaSpeech】AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/abs/2103.00993)
+- [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
+- [【FastPitch】FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)
+- [【SpeedySpeech】SpeedySpeech: Efficient Neural Speech Synthesis](https://arxiv.org/abs/2008.03802)
+- [【FastSpeech】FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
+- [【Transformer TTS】Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
+- [【Tacotron2】Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
+
+## Vocoders
+- [【RefineGAN】RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses](https://arxiv.org/abs/2111.00962)
+- [【Fre-GAN】Fre-GAN: Adversarial Frequency-consistent Audio Synthesis](https://arxiv.org/abs/2106.02297)
+- [【StyleMelGAN】StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization](https://arxiv.org/abs/2011.01557)
+- [【Multi-band MelGAN】Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106)
+- [【HiFi-GAN】HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
+- [【VocGAN】VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network](https://arxiv.org/abs/2007.15256)
+- [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
+- [【MelGAN】MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis](https://arxiv.org/abs/1910.06711)
+- [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
+- [【LPCNet】LPCNet: Improving Neural Speech Synthesis Through Linear Prediction](https://arxiv.org/abs/1810.11846)
+- [【WaveRNN】Efficient Neural Audio Synthesis](https://arxiv.org/abs/1802.08435)
+## GAN TTS
+
+- [【GAN TTS】High Fidelity Speech Synthesis with Adversarial Networks](https://arxiv.org/abs/1909.11646)
+
+## Voice Cloning
+- [【SV2TTS】Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis](https://arxiv.org/abs/1806.04558)
+- [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)