```bash
pip3 install "TensorFlowASR" # or "TensorFlowASR[cuda]" if using GPU
```
- For training, please read the [training tutorial](./docs/tutorials/training.md)
- For TFLite conversion, see the [tflite conversion tutorial](./docs/tutorials/tflite.md)
## Pretrained Models
See the results in each example folder, e.g. [./examples/models/transducer/conformer/results/sentencepiece/README.md](./examples/models/transducer/conformer/results/sentencepiece/README.md)
## Corpus Sources
## 1. Character Tokenizer
See the [librispeech config](../examples/datasets/librispeech/characters/char.yml.j2)
This tokenizer splits the text into characters and then maps each character to an index. Indices start from 1; index 0 is reserved for the blank token. It is only used for languages that have a small number of characters, where a character is not a combination of other characters, for example English, Vietnamese, etc.
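
For illustration, here is a minimal Python sketch of that index scheme; the toy character set below is an assumption for the example, not the vocabulary generated from the config above.

```python
# Toy character tokenizer sketch: index 0 is reserved for the blank token,
# so characters are numbered starting from 1. The character set here is a
# made-up example, not the repo's generated vocabulary.
vocab = ["a", "b", "c", " "]
char_to_index = {c: i + 1 for i, c in enumerate(vocab)}  # 0 = blank

def encode(text: str) -> list[int]:
    """Map each character to its index; characters outside the set are dropped."""
    return [char_to_index[c] for c in text if c in char_to_index]

print(encode("abc ba"))  # [1, 2, 3, 4, 2, 1]
```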
## 2. Wordpiece Tokenizer
See the [librispeech config](../examples/datasets/librispeech/wordpiece/wp.yml.j2) for wordpiece split by whitespace
See the [librispeech config](../examples/datasets/librispeech/wordpiece/wp_whitespace.yml.j2) for wordpiece where whitespace is a separate token
This tokenizer splits the text into words and then splits each word into subwords, which are mapped to indices. The blank token can be set to `<unk>` at index 0. It is designed for languages that have a large number of words, where a word can be a combination of other words, so it can be applied to any language.
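
As a rough illustration of subword splitting, here is a toy greedy longest-match sketch; the tiny vocabulary and the `##` continuation prefix are assumptions for the example, not the repo's actual wordpiece implementation.

```python
# Greedy longest-match wordpiece split over a toy vocabulary. `##` marks a
# subword that continues a word; unknown words fall back to <unk> (index 0).
subword_vocab = {"play": 1, "##ing": 2, "##ed": 3, "the": 4}

def wordpiece_split(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in subword_vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["<unk>"]  # no matching subword: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece_split("playing"))  # ['play', '##ing']
```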
## 3. Sentencepiece Tokenizer
See the [librispeech config](../examples/datasets/librispeech/sentencepiece/sp.yml.j2)
This tokenizer splits the whole sentence into subwords and then maps each subword to an index. The blank token can be set to `<unk>` at index 0. It is designed for languages that have a large number of words, where a word can be a combination of other words, so it can be applied to any language.
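
For a feel of how such a model behaves, here is a sketch using the standalone `sentencepiece` Python package; the corpus path, model prefix and vocabulary size are placeholders, and the project itself builds its tokenizer from the config referenced above.

```python
# Sketch with the standalone `sentencepiece` package (placeholder paths/sizes).
# unk_id=0 mirrors the convention above where <unk> doubles as the blank token.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="transcripts.txt",   # one sentence per line (placeholder corpus)
    model_prefix="sp_demo",
    vocab_size=1000,
    model_type="unigram",
    unk_id=0,
)

sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("hello world", out_type=str))  # subword pieces
print(sp.encode("hello world", out_type=int))  # corresponding indices
```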
## 2. Prepare transcript files
For other datasets, please write your own script to prepare the transcript files; take a look at [`prepare_transcript.py`](../../examples/datasets/librispeech/prepare_transcript.py) for reference.
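
A hedged sketch of such a script is shown below; it assumes the tab-separated `PATH` / `DURATION` / `TRANSCRIPT` layout written by the LibriSpeech example (check that script for the exact format), and `load_my_annotations` is a hypothetical helper for your own dataset.

```python
# Hedged sketch of a custom transcript-preparation script. The output layout
# (PATH \t DURATION \t TRANSCRIPT, lower-cased text) follows the LibriSpeech
# example script; verify against it before using this for real data.
import os
import soundfile as sf

def load_my_annotations(data_dir: str):
    """Hypothetical helper: yield (audio_path, transcript) pairs for your dataset."""
    raise NotImplementedError

def write_transcript(data_dir: str, output_path: str) -> None:
    with open(output_path, "w", encoding="utf-8") as fout:
        fout.write("PATH\tDURATION\tTRANSCRIPT\n")
        for audio_path, transcript in load_my_annotations(data_dir):
            duration = sf.info(audio_path).duration  # length in seconds
            fout.write(f"{os.path.abspath(audio_path)}\t{duration:.2f}\t{transcript.lower()}\n")
```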
## 3. Prepare config file
Please take a look at some example config files in `examples/*/*.yml.j2`
The config file is the same as the one used for training
The inputs, outputs and other options of the vocabulary are defined in the config file
For example:
```jinja2
{% import "examples/datasets/librispeech/sentencepiece/sp.yml.j2" as decoder_config with context %}
{{decoder_config}}
{% import "examples/models/transducer/conformer/small.yml.j2" as config with context %}
{{config}}
```