Commit 1bf1a87

Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSpeech into add_new_tacotron2, test=tts
2 parents 3fd7a77 + f22ac5a commit 1bf1a87

File tree

76 files changed: 1947 additions, 412 deletions


.mergify.yml

Lines changed: 30 additions & 6 deletions

@@ -32,6 +32,12 @@ pull_request_rules:
     actions:
       label:
         remove: ["conflicts"]
+  - name: "auto add label=Dataset"
+    conditions:
+      - files~=^dataset/
+    actions:
+      label:
+        add: ["Dataset"]
   - name: "auto add label=S2T"
     conditions:
       - files~=^paddlespeech/s2t/
@@ -50,18 +56,30 @@ pull_request_rules:
     actions:
       label:
         add: ["Audio"]
-  - name: "auto add label=TextProcess"
+  - name: "auto add label=Vector"
+    conditions:
+      - files~=^paddlespeech/vector/
+    actions:
+      label:
+        add: ["Vector"]
+  - name: "auto add label=Text"
     conditions:
       - files~=^paddlespeech/text/
     actions:
       label:
-        add: ["TextProcess"]
+        add: ["Text"]
   - name: "auto add label=Example"
     conditions:
       - files~=^examples/
     actions:
       label:
         add: ["Example"]
+  - name: "auto add label=CLI"
+    conditions:
+      - files~=^paddlespeech/cli
+    actions:
+      label:
+        add: ["CLI"]
   - name: "auto add label=Demo"
     conditions:
       - files~=^demos/
@@ -70,13 +88,13 @@ pull_request_rules:
         add: ["Demo"]
   - name: "auto add label=README"
     conditions:
-      - files~=README.md
+      - files~=(README.md|READEME_cn.md)
     actions:
       label:
         add: ["README"]
   - name: "auto add label=Documentation"
     conditions:
-      - files~=^docs/
+      - files~=^(docs/|CHANGELOG.md|paddleaudio/CHANGELOG.md)
     actions:
       label:
         add: ["Documentation"]
@@ -88,10 +106,16 @@ pull_request_rules:
         add: ["CI"]
   - name: "auto add label=Installation"
     conditions:
-      - files~=^(tools/|setup.py|setup.sh)
+      - files~=^(tools/|setup.py|setup.cfg|setup_audio.py)
     actions:
       label:
         add: ["Installation"]
+  - name: "auto add label=Test"
+    conditions:
+      - files~=^(tests/)
+    actions:
+      label:
+        add: ["Test"]
   - name: "auto add label=mergify"
     conditions:
       - files~=^.mergify.yml
@@ -106,7 +130,7 @@ pull_request_rules:
         add: ["Docker"]
   - name: "auto add label=Deployment"
     conditions:
-      - files~=^speechnn/
+      - files~=^speechx/
     actions:
       label:
         add: ["Deployment"]
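Each `files~=` condition above is a regex matched against the paths changed in a pull request; any matching rule attaches its label. A minimal sketch of that matching logic (the `labels_for` helper is hypothetical, not part of the repo or of mergify itself), using only the patterns this commit adds:

```python
import re

# (pattern, label) pairs copied from the rules added in this commit.
NEW_RULES = [
    (r"^dataset/", "Dataset"),
    (r"^paddlespeech/vector/", "Vector"),
    (r"^paddlespeech/cli", "CLI"),
    (r"^(tests/)", "Test"),
    (r"^speechx/", "Deployment"),
]

def labels_for(changed_paths):
    """Return the set of labels whose pattern matches any changed path."""
    return {label
            for pattern, label in NEW_RULES
            for path in changed_paths
            if re.search(pattern, path)}

print(sorted(labels_for(["paddlespeech/cli/asr/infer.py",
                         "tests/unit/test_cli.py"])))
# → ['CLI', 'Test']
```

Note the patterns anchor themselves with `^` where a prefix match is intended, which is why `re.search` suffices here.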

CHANGELOG.md

Lines changed: 9 additions & 0 deletions

@@ -1,2 +1,11 @@
 # Changelog
 
+
+Date: 2022-1-10, Author: Jackwaterveg.
+Add features to: CLI:
+ - Support English (librispeech/asr1/transformer).
+ - Support choosing `decode_method` for conformer and transformer models.
+ - Refactor the config, using the unified config.
+ - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1297
+
+***
demos/speech_recognition/README.md

Lines changed: 7 additions & 0 deletions

@@ -23,7 +23,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ### 3. Usage
 - Command Line(Recommended)
 ```bash
+# Chinese
 paddlespeech asr --input ./zh.wav
+# English
+paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
 ```
 (It doesn't matter if package `paddlespeech-ctcdecoders` is not found, this package is optional.)
 
@@ -43,7 +46,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 
 Output:
 ```bash
+# Chinese
 [2021-12-08 13:12:34,063] [ INFO] [utils.py] [L225] - ASR Result: 我认为跑步最重要的就是给我带来了身体健康
+# English
+[2022-01-12 11:51:10,815] [ INFO] - ASR Result: i knocked at the door on the ancient side of the building
 ```
 
 - Python API
@@ -77,3 +83,4 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by
 | Model | Language | Sample Rate
 | :--- | :---: | :---: |
 | conformer_wenetspeech| zh| 16000
+| transformer_librispeech| en| 16000
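Per the README changes above, Chinese ASR uses the default `conformer_wenetspeech` model while English requires passing `--model transformer_librispeech --lang en`. A toy sketch of that invocation logic (the `asr_command` helper and its behavior are assumptions for illustration, not PaddleSpeech code):

```python
# Pretrained models from the table above, both sampled at 16000 Hz.
PRETRAINED = {
    "zh": "conformer_wenetspeech",      # the CLI default
    "en": "transformer_librispeech",
}

def asr_command(lang, wav):
    """Build the `paddlespeech asr` command line for a given language."""
    if lang == "zh":
        # Chinese is the default model, so only --input is needed.
        return f"paddlespeech asr --input {wav}"
    return (f"paddlespeech asr --model {PRETRAINED[lang]} "
            f"--lang {lang} --input {wav}")

print(asr_command("en", "./en.wav"))
# → paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
```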

demos/speech_recognition/README_cn.md

Lines changed: 8 additions & 1 deletion

@@ -2,7 +2,7 @@
 
 # 语音识别
 ## 介绍
-语音识别解决让计算机程序自动转录语音的问题
+语音识别是一项用计算机程序自动转录语音的技术
 
 这个 demo 是一个从给定音频文件识别文本的实现,它可以通过使用 `PaddleSpeech` 的单个命令或 python 中的几行代码来实现。
 ## 使用方法
@@ -21,7 +21,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ### 3. 使用方法
 - 命令行 (推荐使用)
 ```bash
+# 中文
 paddlespeech asr --input ./zh.wav
+# 英文
+paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
 ```
 (如果显示 `paddlespeech-ctcdecoders` 这个 python 包没有找到的 Error,没有关系,这个包是非必须的。)
 
@@ -41,7 +44,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 
 输出:
 ```bash
+# 中文
 [2021-12-08 13:12:34,063] [ INFO] [utils.py] [L225] - ASR Result: 我认为跑步最重要的就是给我带来了身体健康
+# 英文
+[2022-01-12 11:51:10,815] [ INFO] - ASR Result: i knocked at the door on the ancient side of the building
 ```
 
 - Python API
@@ -74,3 +80,4 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 | 模型 | 语言 | 采样率
 | :--- | :---: | :---: |
 | conformer_wenetspeech| zh| 16000
+| transformer_librispeech| en| 16000

docs/source/asr/augmentation.md

Lines changed: 0 additions & 40 deletions
This file was deleted.

docs/source/index.rst

Lines changed: 0 additions & 1 deletion

@@ -27,7 +27,6 @@ Contents
 
    asr/models_introduction
    asr/data_preparation
-   asr/augmentation
    asr/feature_list
    asr/ngram_lm
 

docs/source/released_model.md

Lines changed: 8 additions & 9 deletions

@@ -5,14 +5,13 @@
 ### Speech Recognition Model
 Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
 :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
-[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/aishell_ds2_online_cer8.00_release.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
-[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
-[Conformer Online Aishell ASR1 Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz) | Aishell Dataset | Char-based | 283 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0594 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
-[Conformer Offline Aishell ASR1 Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0547 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
-[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
-[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../example/librispeech/asr1)
-[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/transformer.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../example/librispeech/asr1)
-[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../example/librispeech/asr2)
+[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
+[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
+[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.056 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
+[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
+[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../example/librispeech/asr1)
+[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../example/librispeech/asr1)
+[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../example/librispeech/asr2)
 
 ### Language Model based on NGram
 Language Model | Training Data | Token-based | Size | Descriptions
@@ -25,7 +24,7 @@ Language Model | Training Data | Token-based | Size | Descriptions
 
 | Model | Training Data | Token-based | Size | Descriptions | BLEU | Example Link |
 | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
-| [Transformer FAT-ST MTL En-Zh](https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/fat_st_ted-en-zh.tar.gz) | Ted-En-Zh| Spm| | Encoder:Transformer, Decoder:Transformer, <br />Decoding method: Attention | 20.80 | [Transformer Ted-En-Zh ST1](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/ted_en_zh/st1) |
+| (only for CLI)[Transformer FAT-ST MTL En-Zh](https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz) | Ted-En-Zh| Spm| | Encoder:Transformer, Decoder:Transformer, <br />Decoding method: Attention | 20.80 | [Transformer Ted-En-Zh ST1](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/ted_en_zh/st1) |
 
 ## Text-to-Speech Models
 
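The rows edited above are pipe-delimited markdown table cells (model link, training data, token type, size, description, CER/WER, hours, example link). A throwaway sketch of splitting one such row into its cells; `parse_model_row` is a hypothetical helper for illustration, not repo code:

```python
def parse_model_row(row):
    """Split one pipe-delimited row of released_model.md into cells."""
    return [cell.strip() for cell in row.strip().strip("|").split("|")]

row = ("[Transformer Aishell ASR1 Model]"
       "(https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/"
       "asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz)"
       " | Aishell Dataset | Char-based | 128 MB")
print(parse_model_row(row)[1])
# → Aishell Dataset
```

This naive split works here because none of the cell contents in the table contain a literal `|`; a general markdown parser would need to handle escaped pipes.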
docs/source/tts/tts_papers.md

Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@
+# TTS Papers
+## Text Frontend
+### Polyphone
+- [【g2pM】g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset](https://arxiv.org/abs/2004.03136)
+- [Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT](https://www1.se.cuhk.edu.hk/~hccl/publications/pub/201909_INTERSPEECH_DongyangDAI.pdf)
+### Text Normalization
+#### English
+- [applenob/text_normalization](https://github.com/applenob/text_normalization)
+### G2P
+#### English
+- [cmusphinx/g2p-seq2seq](https://github.com/cmusphinx/g2p-seq2seq)
+
+## Acoustic Models
+- [【AdaSpeech3】AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style](https://arxiv.org/abs/2107.02530)
+- [【AdaSpeech2】AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data](https://arxiv.org/abs/2104.09715)
+- [【AdaSpeech】AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/abs/2103.00993)
+- [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
+- [【FastPitch】FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)
+- [【SpeedySpeech】SpeedySpeech: Efficient Neural Speech Synthesis](https://arxiv.org/abs/2008.03802)
+- [【FastSpeech】FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
+- [【Transformer TTS】Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
+- [【Tacotron2】Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
+
+## Vocoders
+- [【RefineGAN】RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses](https://arxiv.org/abs/2111.00962)
+- [【Fre-GAN】Fre-GAN: Adversarial Frequency-consistent Audio Synthesis](https://arxiv.org/abs/2106.02297)
+- [【StyleMelGAN】StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization](https://arxiv.org/abs/2011.01557)
+- [【Multi-band MelGAN】Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106)
+- [【HiFi-GAN】HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
+- [【VocGAN】VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network](https://arxiv.org/abs/2007.15256)
+- [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
+- [【MelGAN】MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis](https://arxiv.org/abs/1910.06711)
+- [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
+- [【LPCNet】LPCNet: Improving Neural Speech Synthesis Through Linear Prediction](https://arxiv.org/abs/1810.11846)
+- [【WaveRNN】Efficient Neural Audio Synthesis](https://arxiv.org/abs/1802.08435)
+## GAN TTS
+
+- [【GAN TTS】High Fidelity Speech Synthesis with Adversarial Networks](https://arxiv.org/abs/1909.11646)
+
+## Voice Cloning
+- [【SV2TTS】Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis](https://arxiv.org/abs/1806.04558)
+- [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)