
Commit 70aafd8

blisc authored and chtruong814 committed
Remove HeteronymClassificationModel (#14980)
* remove HeteronymClassificationModel

  Signed-off-by: Jason <jasoli@nvidia.com>

* pylint

  Signed-off-by: Jason <jasoli@nvidia.com>

---------

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
1 parent 932e4b1 commit 70aafd8

13 files changed: +1 −1,372 lines

.github/workflows/cicd-main-speech.yml

Lines changed: 0 additions & 2 deletions
@@ -174,8 +174,6 @@ jobs:
       script: L2_TTS_Fast_dev_runs_1_Hifigan
     - runner: self-hosted-azure
       script: L2_G2P_Models_G2P_Conformer_training_evaluation_and_inference
-    - runner: self-hosted-azure
-      script: L2_G2P_Models_HeteronymClassificationModel_training_evaluation_and_inference
     - runner: self-hosted-azure
       script: SPEECHLM_HF_Training_DuplexS2S
     - runner: self-hosted-azure

docs/source/tts/g2p.rst

Lines changed: 1 addition & 109 deletions
@@ -24,8 +24,6 @@ The models can be trained using words or sentences as input.
 If trained with sentence-level input, the models can handle out-of-vocabulary (OOV) words and heteronyms along with unambiguous words in a single pass.
 See :ref:`Sentence-level Dataset Preparation Pipeline <sentence_level_dataset_pipeline>` on how to label data for G2P model training.
 
-Additionally, we support a purpose-built BERT-based classification model for heteronym disambiguation; see :ref:`this section <bert_heteronym_cl>` for details.
-
 Model Training, Evaluation and Inference
 ----------------------------------------
 

@@ -125,116 +123,10 @@ Finally, we mask-out OOV words with a special masking token, “<unk>” in the
 Using this unknown token forces a G2P model to produce the same masking token as a phonetic representation during training. During inference, the model generates phoneme predictions for OOV words without emitting the masking token as long as this token is not included in the grapheme input.
 
 
-
-.. _bert_heteronym_cl:
-
-Purpose-built BERT-based classification model for heteronym disambiguation
---------------------------------------------------------------------------
-
-HeteronymClassificationModel is a BERT-based :cite:`g2p--devlin2018bert` token classification model that can handle multiple heteronyms at once. The model takes a sentence as input and, for every word, selects a heteronym option out of the available forms.
-We mask irrelevant forms to disregard the model’s predictions for non-ambiguous words. E.g., given the input “The Poems are simple to read and easy to comprehend.”, the model scores the possible {READ_PRESENT, READ_PAST} options for the word “read”.
-Possible heteronym forms are extracted from the WikipediaHomographData :cite:`g2p--gorman2018improving`.
-
-The model expects input in `.json` manifest format, where each line contains at least the following fields:
-
-.. code::
-
-   {"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
-
-Manifest fields:
-
-* `text_graphemes` - input sentence
-
-* `start_end` - beginning and end of the heteronym span in the input sentence
-
-* `homograph_span` - heteronym word in the sentence
-
-* `word_id` - heteronym label, e.g., the word `diffuse` has the following possible labels: `diffuse_vrb` and `diffuse_adj`. See `https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv <https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv>`__ for more details.
-
-To convert the WikipediaHomographData to `.json` format suitable for HeteronymClassificationModel training, run:
-
-.. code-block::
-
-   # WikipediaHomographData can be downloaded from `https://github.com/google-research-datasets/WikipediaHomographData <https://github.com/google-research-datasets/WikipediaHomographData>`__.
-
-   python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
-       --data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/eval/ \
-       --output=eval.json
-   python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
-       --data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/train/ \
-       --output=train.json
-
-To train the model, run:
-
-.. code-block::
-
-   python g2p_heteronym_classification_train_and_evaluate.py \
-       train_manifest="<Path to train manifest file>" \
-       validation_manifest="<Path to validation manifest file>" \
-       model.wordids=<Path to wordids.tsv file, similar to https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv> \
-       do_training=True \
-       do_testing=False
-
-To train the model and evaluate it when training is complete, run:
-
-.. code-block::
-
-   python g2p_heteronym_classification_train_and_evaluate.py \
-       train_manifest="<Path to train manifest file>" \
-       validation_manifest="<Path to validation manifest file>" \
-       model.test_ds.dataset.manifest="<Path to test manifest file>" \
-       model.wordids="<Path to wordids.tsv file>" \
-       do_training=True \
-       do_testing=True
-
-To evaluate a pretrained model, run:
-
-.. code-block::
-
-   python g2p_heteronym_classification_train_and_evaluate.py \
-       do_training=False \
-       do_testing=True \
-       model.test_ds.dataset.manifest="<Path to test manifest file>" \
-       pretrained_model=<Path to pretrained .nemo model or a name from list_available_models()>
-
-To run inference with a pretrained HeteronymClassificationModel, run:
-
-.. code-block::
-
-   python g2p_heteronym_classification_inference.py \
-       manifest="<Path to .json manifest>" \
-       pretrained_model="<Path to .nemo file or pretrained model name from list_available_models()>" \
-       output_file="<Path to .json manifest to save predictions>"
-
-Note that if the input manifest contains the target "word_id", evaluation is also performed. During inference, the model predicts the heteronym `word_id` and saves predictions in the `"pred_text"` field of the `output_file`:
-
-.. code::
-
-   {"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "pred_text": "diffuse_vrb", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
-
-To train a model with the `Chinese Polyphones with Pinyin (CPP) <https://github.com/kakaobrain/g2pM/tree/master/data>`__ dataset, run:
-
-.. code-block::
-
-   # prepare CPP manifest
-   mkdir -p ./cpp_manifest
-   git clone https://github.com/kakaobrain/g2pM.git
-   python3 export_zh_cpp_data_to_manifest.py --data_folder g2pM/data/ --output_folder ./cpp_manifest
-
-   # model training and evaluation
-   python3 heteronym_classification_train_and_evaluate.py \
-       --config-name "heteronym_classification_zh.yaml" \
-       train_manifest="./cpp_manifest/train.json" \
-       validation_manifest="./cpp_manifest/dev.json" \
-       model.test_ds.dataset.manifest="./cpp_manifest/test.json" \
-       model.wordids="./cpp_manifest/wordid.tsv" \
-       do_training=False \
-       do_testing=True
-
 Requirements
 ------------
 
-G2P requires NeMo NLP and ASR collections installed. See `Installation instructions <https://docs.nvidia.com/nemo-framework/user-guide/latest/installation.html>`__ for more details.
+G2P requires the NeMo ASR collection to be installed. See `Installation instructions <https://docs.nvidia.com/nemo-framework/user-guide/latest/installation.html>`__ for more details.
 
 
 References
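For reference, the removed documentation describes a line-delimited JSON manifest whose `start_end` indices point at the `homograph_span` inside `text_graphemes`. A minimal sketch of reading and sanity-checking one such line (this snippet is illustrative only and is not part of the NeMo codebase):

```python
import json

# Sample entry in the manifest format documented above, copied from the
# example line in the removed g2p.rst section.
line = (
    '{"text_graphemes": "Oxygen is less able to diffuse into the blood, '
    'leading to hypoxia.", "start_end": [23, 30], '
    '"homograph_span": "diffuse", "word_id": "diffuse_vrb"}'
)

entry = json.loads(line)
start, end = entry["start_end"]

# Sanity check: the half-open [start, end) span must slice out the
# heteronym word itself.
assert entry["text_graphemes"][start:end] == entry["homograph_span"]
print(entry["word_id"])  # → diffuse_vrb
```

A check like this is useful before training, since misaligned `start_end` indices would silently corrupt the token-classification labels.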

examples/tts/g2p/conf/g2p_heteronym_classification.yaml

Lines changed: 0 additions & 104 deletions
This file was deleted.

examples/tts/g2p/conf/heteronym_classification_zh.yaml

Lines changed: 0 additions & 106 deletions
This file was deleted.
