
Commit 6b59df1

chtruong814, dimapihtar, and blisc authored
cp: Remove nlp module (#15258)
* remove nlp.parts collection (#14617)
  * remove nlp.parts collection Signed-off-by: dimapihtar <[email protected]>
  * remove nemo legacy import Signed-off-by: dimapihtar <[email protected]>
  * remove nlp.parts collection Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * move nlp_overrides Signed-off-by: dimapihtar <[email protected]>
  * fix styl Signed-off-by: dimapihtar <[email protected]>
  * remove extra scripts Signed-off-by: dimapihtar <[email protected]>
  * remove extra test Signed-off-by: dimapihtar <[email protected]>
  ---------
  Signed-off-by: dimapihtar <[email protected]>
  Signed-off-by: dimapihtar <[email protected]>
  Co-authored-by: dimapihtar <[email protected]>
  Signed-off-by: Charlie Truong <[email protected]>

* revert ckpt scripts removal from #14617 (#15048)
  Signed-off-by: Charlie Truong <[email protected]>

* remove nlp/modules (#14934)
  * move tokenizer_utils Signed-off-by: dimapihtar <[email protected]>
  * remove tokenizer_utils Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * remove extra import Signed-off-by: dimapihtar <[email protected]>
  * move vocab file name Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * remove HF imports Signed-off-by: dimapihtar <[email protected]>
  * remove hyena submodule Signed-off-by: dimapihtar <[email protected]>
  * remove transformer submodule Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * remove nlp.modules.megatron Signed-off-by: dimapihtar <[email protected]>
  * remove nlp collection Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * remove unused function Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix unit tests Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  ---------
  Signed-off-by: dimapihtar <[email protected]>
  Signed-off-by: dimapihtar <[email protected]>
  Co-authored-by: dimapihtar <[email protected]>
  Signed-off-by: Charlie Truong <[email protected]>

* Remove HeteronymClassificationModel (#14980)
  * remove HeteronymClassificationModel Signed-off-by: Jason <[email protected]>
  * pylint Signed-off-by: Jason <[email protected]>
  ---------
  Signed-off-by: Jason <[email protected]>
  Signed-off-by: Charlie Truong <[email protected]>

---------
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Jason <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Jason <[email protected]>
1 parent d608ba7 commit 6b59df1
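In practical terms, the diffs below relocate nlp_overrides from the removed nemo.collections.nlp.parts package to nemo.collections.common.parts, and replace the LengthParam/SamplingParam imports with TypedDict definitions written directly into each affected script. A minimal sketch of the corresponding change for code that used the old import paths (the typed-dict stub below is an illustrative abbreviation; the full field lists appear in the diffs further down this page):

    import sys

    # Old imports (removed by this commit):
    #   from nemo.collections.nlp.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
    #   from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam

    # nlp_overrides now lives in the common collection:
    from nemo.collections.common.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy

    # LengthParam and SamplingParam are no longer importable from NeMo; the updated
    # example scripts define them locally as TypedDicts, for example:
    if sys.version_info >= (3, 8):
        from typing import TypedDict
    else:
        from typing_extensions import TypedDict

    class LengthParam(TypedDict):
        max_length: int  # maximum number of tokens to generate
        min_length: int  # minimum number of tokens to generate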

File tree

237 files changed (+749, -24939 lines)


.github/workflows/cicd-main-speech.yml

Lines changed: 0 additions & 2 deletions
@@ -178,8 +178,6 @@ jobs:
           script: L2_TTS_Fast_dev_runs_1_Hifigan
         - runner: self-hosted-azure
           script: L2_G2P_Models_G2P_Conformer_training_evaluation_and_inference
-        - runner: self-hosted-azure
-          script: L2_G2P_Models_HeteronymClassificationModel_training_evaluation_and_inference
         - runner: self-hosted-azure
           script: SPEECHLM_HF_Training_DuplexS2S
         - runner: self-hosted-azure

docs/source/tts/g2p.rst

Lines changed: 1 addition & 109 deletions
@@ -24,8 +24,6 @@ The models can be trained using words or sentences as input.
 If trained with sentence-level input, the models can handle out-of-vocabulary (OOV) and heteronyms along with unambiguous words in a single pass.
 See :ref:`Sentence-level Dataset Preparation Pipeline <sentence_level_dataset_pipeline>` on how to label data for G2P model training.

-Additionally, we support a purpose-built BERT-based classification model for heteronym disambiguation, see :ref:`this <bert_heteronym_cl>` for details.
-
 Model Training, Evaluation and Inference
 ----------------------------------------

@@ -125,116 +123,10 @@ Finally, we mask-out OOV words with a special masking token, “<unk>” in the
 Using this unknown token forces a G2P model to produce the same masking token as a phonetic representation during training. During inference, the model generates phoneme predictions for OOV words without emitting the masking token as long as this token is not included in the grapheme input.


-
-.. _bert_heteronym_cl:
-
-Purpose-built BERT-based classification model for heteronym disambiguation
---------------------------------------------------------------------------
-
-HeteronymClassificationModel is a BERT-based :cite:`g2p--devlin2018bert` model represents a token classification model and can handle multiple heteronyms at once. The model takes a sentence as an input, and then for every word, it selects a heteronym option out of the available forms.
-We mask irrelevant forms to disregard the model’s predictions for non-ambiguous words. E.g., given the input “The Poems are simple to read and easy to comprehend.” the model scores possible {READ_PRESENT and READ_PAST} options for the word “read”.
-Possible heteronym forms are extracted from the WikipediaHomographData :cite:`g2p--gorman2018improving`.
-
-The model expects input to be in `.json` manifest format, where is line contains at least the following fields:
-
-.. code::
-
-    {"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
-
-Manifest fields:
-
-* `text_graphemes` - input sentence
-
-* `start_end` - beginning and end of the heteronym span in the input sentence
-
-* `homograph_span` - heteronym word in the sentence
-
-* `word_id` - heteronym label, e.g., word `diffuse` has the following possible labels: `diffuse_vrb` and `diffuse_adj`. See `https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv <https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv>`__ for more details.
-
-To convert the WikipediaHomographData to `.json` format suitable for the HeteronymClassificationModel training, run:
-
-.. code-block::
-
-    # WikipediaHomographData could be downloaded from `https://github.com/google-research-datasets/WikipediaHomographData <https://github.com/google-research-datasets/WikipediaHomographData>`__.
-
-    python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
-        --data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/eval/
-        --output=eval.json
-    python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
-        --data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/train/
-        --output=train.json
-
-To train the model, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_train_and_evaluate.py \
-        train_manifest=<Path to train manifest file>" \
-        validation_manifest=<Path to validation manifest file>" \
-        model.wordids=<Path to wordids.tsv file, similar to https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv> \
-        do_training=True \
-        do_testing=False
-
-To train the model and evaluate it when the training is complete, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_train_and_evaluate.py \
-        train_manifest=<Path to train manifest file>" \
-        validation_manifest=<Path to validation manifest file>" \
-        model.test_ds.dataset.manifest=<Path to test manifest file>" \
-        model.wordids="<Path to wordids.tsv file>" \
-        do_training=True \
-        do_testing=True
-
-To evaluate pretrained model, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_train_and_evaluate.py \
-        do_training=False \
-        do_testing=True \
-        model.test_ds.dataset.manifest=<Path to test manifest file>" \
-        pretrained_model=<Path to pretrained .nemo model or from list_available_models()>
-
-To run inference with a pretrained HeteronymClassificationModel, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_inference.py \
-        manifest="<Path to .json manifest>" \
-        pretrained_model="<Path to .nemo file or pretrained model name from list_available_models()>" \
-        output_file="<Path to .json manifest to save prediction>"
-
-Note, if the input manifest contains target "word_id", evaluation will be also performed. During inference, the model predicts heteronym `word_id` and saves predictions in `"pred_text"` field of the `output_file`:
-
-.. code::
-
-    {"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "pred_text": "diffuse_vrb", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
-
-To train a model with `Chinese Polyphones with Pinyin (CPP) <https://github.com/kakaobrain/g2pM/tree/master/data>`__ dataset, run:
-
-.. code-block::
-
-    # prepare CPP manifest
-    mkdir -p ./cpp_manifest
-    git clone https://github.com/kakaobrain/g2pM.git
-    python3 export_zh_cpp_data_to_manifest.py --data_folder g2pM/data/ --output_folder ./cpp_manifest
-
-    # model training and evaluation
-    python3 heteronym_classification_train_and_evaluate.py \
-        --config-name "heteronym_classification_zh.yaml" \
-        train_manifest="./cpp_manifest/train.json" \
-        validation_manifest="./cpp_manifest/dev.json" \
-        model.test_ds.dataset.manifest="./cpp_manifest/test.json" \
-        model.wordids="./cpp_manifest/wordid.tsv" \
-        do_training=False \
-        do_testing=True
-
 Requirements
 ------------

-G2P requires NeMo NLP and ASR collections installed. See `Installation instructions <https://docs.nvidia.com/nemo-framework/user-guide/latest/installation.html>`__ for more details.
+G2P requires the NeMo ASR collection to be installed. See `Installation instructions <https://docs.nvidia.com/nemo-framework/user-guide/latest/installation.html>`__ for more details.


 References
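The removed section above documented the `.json` (JSON-lines) manifest consumed by the now-deleted HeteronymClassificationModel. For readers who still need to work with such data, a minimal sketch of writing one manifest line with Python's json module (the output file name and example values are illustrative only):

    import json

    # One manifest entry per line; fields as described in the removed docs above.
    entry = {
        "text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.",
        "start_end": [23, 30],        # character span of the heteronym in the sentence
        "homograph_span": "diffuse",  # the heteronym word itself
        "word_id": "diffuse_vrb",     # label taken from the WikipediaHomographData wordids.tsv
    }

    with open("eval.json", "a", encoding="utf-8") as f:  # illustrative file name
        f.write(json.dumps(entry) + "\n")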

examples/multimodal_autoregressive/megatron_mm_autoregressive_eval_image_generation.py

Lines changed: 25 additions & 2 deletions
@@ -16,6 +16,7 @@
 import math
 import os
 import re
+import sys

 import torch
 import torchvision
@@ -26,12 +27,17 @@
 )
 from pytorch_lightning.trainer.trainer import Trainer

+from nemo.collections.common.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
+
 # pylint: disable=line-too-long
 from nemo.collections.common.video_tokenizers.cosmos_tokenizer import CausalVideoTokenizer
-from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam
-from nemo.collections.nlp.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
 from nemo.core.config import hydra_runner

+if sys.version_info >= (3, 8):
+    from typing import TypedDict
+else:
+    from typing_extensions import TypedDict
+
 """
 This is the script to run multimodal autoregresssive text generation.
@@ -89,6 +95,23 @@
 """


+class LengthParam(TypedDict):
+    max_length: int  # The maximum length of the sequence to be generated.
+    min_length: int  # The minimum length of the sequence to be generated.
+
+
+class SamplingParam(TypedDict):
+    use_greedy: bool  # Whether or not to use sampling ; use greedy decoding otherwise
+    temperature: float  # sampling temperature
+    top_k: int  # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+    top_p: float  # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+    repetition_penalty: float  # The parameter for repetition penalty. 1.0 means no penalty.
+    add_BOS: bool  # add the bos token at the begining of the prompt
+    all_probs: bool  # whether return the log prob for all the tokens in vocab
+    compute_logprob: bool  # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+    end_strings: List[str]  # generation will stop when one of these tokens is generated
+
+
 def to_img(tokens_string, image_tokenizer):
     """Converts visual tokens to images
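With LengthParam and SamplingParam now defined as plain TypedDicts inside this script, callers build them as ordinary dictionaries. A short usage sketch with illustrative values (not the script's defaults):

    length_params: LengthParam = {"max_length": 30, "min_length": 0}

    sampling_params: SamplingParam = {
        "use_greedy": True,
        "temperature": 1.0,
        "top_k": 0,
        "top_p": 1.0,
        "repetition_penalty": 1.0,
        "add_BOS": False,
        "all_probs": False,
        "compute_logprob": False,
        "end_strings": ["<|endoftext|>"],  # illustrative stop string
    }

    # At runtime these behave like normal dicts; the TypedDict annotations only
    # give static type checkers the expected keys and value types.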

examples/multimodal_autoregressive/megatron_mm_autoregressive_eval_vision_understanding.py

Lines changed: 25 additions & 2 deletions
@@ -14,6 +14,7 @@


 import datetime
+import sys

 import torch
 import torchvision
@@ -29,11 +30,16 @@
 from torch.utils.data import DataLoader
 from transformers import AutoModel, AutoTokenizer

+from nemo.collections.common.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
+
 # pylint: disable=line-too-long
-from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam
-from nemo.collections.nlp.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
 from nemo.core.config import hydra_runner

+if sys.version_info >= (3, 8):
+    from typing import TypedDict
+else:
+    from typing_extensions import TypedDict
+
 """
 This is the script to run multimodal autoregresssive text generation.
@@ -94,6 +100,23 @@
 VQ_HUB = "BAAI/Emu3-VisionTokenizer"


+class LengthParam(TypedDict):
+    max_length: int  # The maximum length of the sequence to be generated.
+    min_length: int  # The minimum length of the sequence to be generated.
+
+
+class SamplingParam(TypedDict):
+    use_greedy: bool  # Whether or not to use sampling ; use greedy decoding otherwise
+    temperature: float  # sampling temperature
+    top_k: int  # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+    top_p: float  # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+    repetition_penalty: float  # The parameter for repetition penalty. 1.0 means no penalty.
+    add_BOS: bool  # add the bos token at the begining of the prompt
+    all_probs: bool  # whether return the log prob for all the tokens in vocab
+    compute_logprob: bool  # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+    end_strings: List[str]  # generation will stop when one of these tokens is generated
+
+
 def to_imgstr(image_tokens, tokenizer):
     """Convert integer image tokens to visual tokens string"""
     image_tokens = image_tokens.cpu().numpy().tolist()
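The version-gated import added in both scripts exists because typing.TypedDict first shipped in Python 3.8; older interpreters fall back to the typing_extensions backport. A self-contained sketch of the pattern (the Point class is illustrative, not part of the script):

    import sys

    if sys.version_info >= (3, 8):
        from typing import TypedDict
    else:
        from typing_extensions import TypedDict  # backport package for older Pythons

    class Point(TypedDict):
        x: int
        y: int

    p: Point = {"x": 1, "y": 2}  # checked statically, a plain dict at runtime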

examples/tts/g2p/conf/g2p_heteronym_classification.yaml

Lines changed: 0 additions & 104 deletions
This file was deleted.
