
Commit 6b59df1

chtruong814, dimapihtar, and blisc authored
cp: Remove nlp module (#15258)
* remove nlp.parts collection (#14617)
  * remove nlp.parts collection Signed-off-by: dimapihtar <[email protected]>
  * remove nemo legacy import Signed-off-by: dimapihtar <[email protected]>
  * remove nlp.parts collection Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * move nlp_overrides Signed-off-by: dimapihtar <[email protected]>
  * fix styl Signed-off-by: dimapihtar <[email protected]>
  * remove extra scripts Signed-off-by: dimapihtar <[email protected]>
  * remove extra test Signed-off-by: dimapihtar <[email protected]>
  ---------
  Signed-off-by: dimapihtar <[email protected]>
  Signed-off-by: dimapihtar <[email protected]>
  Co-authored-by: dimapihtar <[email protected]>
  Signed-off-by: Charlie Truong <[email protected]>

* revert ckpt scripts removal from #14617 (#15048)
  Signed-off-by: Charlie Truong <[email protected]>

* remove nlp/modules (#14934)
  * move tokenizer_utils Signed-off-by: dimapihtar <[email protected]>
  * remove tokenizer_utils Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * remove extra import Signed-off-by: dimapihtar <[email protected]>
  * move vocab file name Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * remove HF imports Signed-off-by: dimapihtar <[email protected]>
  * remove hyena submodule Signed-off-by: dimapihtar <[email protected]>
  * remove transformer submodule Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * remove .py files Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * remove nlp.modules.megatron Signed-off-by: dimapihtar <[email protected]>
  * remove nlp collection Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix style Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * remove unused function Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix unit tests Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix imports Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * fix import Signed-off-by: dimapihtar <[email protected]>
  * fix code style Signed-off-by: dimapihtar <[email protected]>
  ---------
  Signed-off-by: dimapihtar <[email protected]>
  Signed-off-by: dimapihtar <[email protected]>
  Co-authored-by: dimapihtar <[email protected]>
  Signed-off-by: Charlie Truong <[email protected]>

* Remove HeteronymClassificationModel (#14980)
  * remove HeteronymClassificationModel Signed-off-by: Jason <[email protected]>
  * pylint Signed-off-by: Jason <[email protected]>
  ---------
  Signed-off-by: Jason <[email protected]>
  Signed-off-by: Charlie Truong <[email protected]>

---------
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Jason <[email protected]>
Co-authored-by: Dmytro Pykhtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>
Co-authored-by: Jason <[email protected]>
1 parent d608ba7 commit 6b59df1
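In practical terms, the diffs below relocate nlp_overrides from the removed nemo.collections.nlp.parts package to nemo.collections.common.parts, and replace the LengthParam/SamplingParam imports with TypedDict definitions written directly into each affected script. A minimal sketch of the corresponding change for code that used the old import paths (the typed-dict stub below is an illustrative abbreviation; the full field lists appear in the diffs further down this page):

    import sys

    # Old imports (removed by this commit):
    #   from nemo.collections.nlp.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
    #   from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam

    # nlp_overrides now lives in the common collection:
    from nemo.collections.common.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy

    # LengthParam and SamplingParam are no longer importable from NeMo; the updated
    # example scripts define them locally as TypedDicts, for example:
    if sys.version_info >= (3, 8):
        from typing import TypedDict
    else:
        from typing_extensions import TypedDict

    class LengthParam(TypedDict):
        max_length: int  # maximum number of tokens to generate
        min_length: int  # minimum number of tokens to generate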

File tree

237 files changed (+749, -24939 lines)


.github/workflows/cicd-main-speech.yml

Lines changed: 0 additions & 2 deletions
@@ -178,8 +178,6 @@ jobs:
           script: L2_TTS_Fast_dev_runs_1_Hifigan
         - runner: self-hosted-azure
           script: L2_G2P_Models_G2P_Conformer_training_evaluation_and_inference
-        - runner: self-hosted-azure
-          script: L2_G2P_Models_HeteronymClassificationModel_training_evaluation_and_inference
         - runner: self-hosted-azure
           script: SPEECHLM_HF_Training_DuplexS2S
         - runner: self-hosted-azure

docs/source/tts/g2p.rst

Lines changed: 1 addition & 109 deletions
@@ -24,8 +24,6 @@ The models can be trained using words or sentences as input.
 If trained with sentence-level input, the models can handle out-of-vocabulary (OOV) and heteronyms along with unambiguous words in a single pass.
 See :ref:`Sentence-level Dataset Preparation Pipeline <sentence_level_dataset_pipeline>` on how to label data for G2P model training.

-Additionally, we support a purpose-built BERT-based classification model for heteronym disambiguation, see :ref:`this <bert_heteronym_cl>` for details.
-
 Model Training, Evaluation and Inference
 ----------------------------------------

@@ -125,116 +123,10 @@ Finally, we mask-out OOV words with a special masking token, “<unk>” in the
 Using this unknown token forces a G2P model to produce the same masking token as a phonetic representation during training. During inference, the model generates phoneme predictions for OOV words without emitting the masking token as long as this token is not included in the grapheme input.


-
-.. _bert_heteronym_cl:
-
-Purpose-built BERT-based classification model for heteronym disambiguation
---------------------------------------------------------------------------
-
-HeteronymClassificationModel is a BERT-based :cite:`g2p--devlin2018bert` model represents a token classification model and can handle multiple heteronyms at once. The model takes a sentence as an input, and then for every word, it selects a heteronym option out of the available forms.
-We mask irrelevant forms to disregard the model’s predictions for non-ambiguous words. E.g., given the input “The Poems are simple to read and easy to comprehend.” the model scores possible {READ_PRESENT and READ_PAST} options for the word “read”.
-Possible heteronym forms are extracted from the WikipediaHomographData :cite:`g2p--gorman2018improving`.
-
-The model expects input to be in `.json` manifest format, where is line contains at least the following fields:
-
-.. code::
-
-    {"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
-
-Manifest fields:
-
-* `text_graphemes` - input sentence
-
-* `start_end` - beginning and end of the heteronym span in the input sentence
-
-* `homograph_span` - heteronym word in the sentence
-
-* `word_id` - heteronym label, e.g., word `diffuse` has the following possible labels: `diffuse_vrb` and `diffuse_adj`. See `https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv <https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv>`__ for more details.
-
-To convert the WikipediaHomographData to `.json` format suitable for the HeteronymClassificationModel training, run:
-
-.. code-block::
-
-    # WikipediaHomographData could be downloaded from `https://github.com/google-research-datasets/WikipediaHomographData <https://github.com/google-research-datasets/WikipediaHomographData>`__.
-
-    python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
-        --data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/eval/
-        --output=eval.json
-    python NeMo/scripts/dataset_processing/g2p/export_wikihomograph_data_to_manifest.py \
-        --data_folder=<Path to WikipediaHomographData>/WikipediaHomographData-master/data/train/
-        --output=train.json
-
-To train the model, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_train_and_evaluate.py \
-        train_manifest=<Path to train manifest file>" \
-        validation_manifest=<Path to validation manifest file>" \
-        model.wordids=<Path to wordids.tsv file, similar to https://github.com/google-research-datasets/WikipediaHomographData/blob/master/data/wordids.tsv> \
-        do_training=True \
-        do_testing=False
-
-To train the model and evaluate it when the training is complete, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_train_and_evaluate.py \
-        train_manifest=<Path to train manifest file>" \
-        validation_manifest=<Path to validation manifest file>" \
-        model.test_ds.dataset.manifest=<Path to test manifest file>" \
-        model.wordids="<Path to wordids.tsv file>" \
-        do_training=True \
-        do_testing=True
-
-To evaluate pretrained model, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_train_and_evaluate.py \
-        do_training=False \
-        do_testing=True \
-        model.test_ds.dataset.manifest=<Path to test manifest file>" \
-        pretrained_model=<Path to pretrained .nemo model or from list_available_models()>
-
-To run inference with a pretrained HeteronymClassificationModel, run:
-
-.. code-block::
-
-    python g2p_heteronym_classification_inference.py \
-        manifest="<Path to .json manifest>" \
-        pretrained_model="<Path to .nemo file or pretrained model name from list_available_models()>" \
-        output_file="<Path to .json manifest to save prediction>"
-
-Note, if the input manifest contains target "word_id", evaluation will be also performed. During inference, the model predicts heteronym `word_id` and saves predictions in `"pred_text"` field of the `output_file`:
-
-.. code::
-
-    {"text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.", "pred_text": "diffuse_vrb", "start_end": [23, 30], "homograph_span": "diffuse", "word_id": "diffuse_vrb"}
-
-To train a model with `Chinese Polyphones with Pinyin (CPP) <https://github.com/kakaobrain/g2pM/tree/master/data>`__ dataset, run:
-
-.. code-block::
-
-    # prepare CPP manifest
-    mkdir -p ./cpp_manifest
-    git clone https://github.com/kakaobrain/g2pM.git
-    python3 export_zh_cpp_data_to_manifest.py --data_folder g2pM/data/ --output_folder ./cpp_manifest
-
-    # model training and evaluation
-    python3 heteronym_classification_train_and_evaluate.py \
-        --config-name "heteronym_classification_zh.yaml" \
-        train_manifest="./cpp_manifest/train.json" \
-        validation_manifest="./cpp_manifest/dev.json" \
-        model.test_ds.dataset.manifest="./cpp_manifest/test.json" \
-        model.wordids="./cpp_manifest/wordid.tsv" \
-        do_training=False \
-        do_testing=True
-
 Requirements
 ------------

-G2P requires NeMo NLP and ASR collections installed. See `Installation instructions <https://docs.nvidia.com/nemo-framework/user-guide/latest/installation.html>`__ for more details.
+G2P requires the NeMo ASR collection to be installed. See `Installation instructions <https://docs.nvidia.com/nemo-framework/user-guide/latest/installation.html>`__ for more details.


 References
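The removed section above documented the `.json` (JSON-lines) manifest consumed by the now-deleted HeteronymClassificationModel. For readers who still need to work with such data, a minimal sketch of writing one manifest line with Python's json module (the output file name and example values are illustrative only):

    import json

    # One manifest entry per line; fields as described in the removed docs above.
    entry = {
        "text_graphemes": "Oxygen is less able to diffuse into the blood, leading to hypoxia.",
        "start_end": [23, 30],        # character span of the heteronym in the sentence
        "homograph_span": "diffuse",  # the heteronym word itself
        "word_id": "diffuse_vrb",     # label taken from the WikipediaHomographData wordids.tsv
    }

    with open("eval.json", "a", encoding="utf-8") as f:  # illustrative file name
        f.write(json.dumps(entry) + "\n")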

examples/multimodal_autoregressive/megatron_mm_autoregressive_eval_image_generation.py

Lines changed: 25 additions & 2 deletions
@@ -16,6 +16,7 @@
 import math
 import os
 import re
+import sys

 import torch
 import torchvision
@@ -26,12 +27,17 @@
 )
 from pytorch_lightning.trainer.trainer import Trainer

+from nemo.collections.common.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
+
 # pylint: disable=line-too-long
 from nemo.collections.common.video_tokenizers.cosmos_tokenizer import CausalVideoTokenizer
-from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam
-from nemo.collections.nlp.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
 from nemo.core.config import hydra_runner

+if sys.version_info >= (3, 8):
+    from typing import TypedDict
+else:
+    from typing_extensions import TypedDict
+
 """
 This is the script to run multimodal autoregresssive text generation.
@@ -89,6 +95,23 @@
 """


+class LengthParam(TypedDict):
+    max_length: int  # The maximum length of the sequence to be generated.
+    min_length: int  # The minimum length of the sequence to be generated.
+
+
+class SamplingParam(TypedDict):
+    use_greedy: bool  # Whether or not to use sampling ; use greedy decoding otherwise
+    temperature: float  # sampling temperature
+    top_k: int  # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+    top_p: float  # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+    repetition_penalty: float  # The parameter for repetition penalty. 1.0 means no penalty.
+    add_BOS: bool  # add the bos token at the begining of the prompt
+    all_probs: bool  # whether return the log prob for all the tokens in vocab
+    compute_logprob: bool  # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+    end_strings: List[str]  # generation will stop when one of these tokens is generated
+
+
 def to_img(tokens_string, image_tokenizer):
     """Converts visual tokens to images
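With LengthParam and SamplingParam now defined as plain TypedDicts inside this script, callers build them as ordinary dictionaries. A short usage sketch with illustrative values (not the script's defaults):

    length_params: LengthParam = {"max_length": 30, "min_length": 0}

    sampling_params: SamplingParam = {
        "use_greedy": True,
        "temperature": 1.0,
        "top_k": 0,
        "top_p": 1.0,
        "repetition_penalty": 1.0,
        "add_BOS": False,
        "all_probs": False,
        "compute_logprob": False,
        "end_strings": ["<|endoftext|>"],  # illustrative stop string
    }

    # At runtime these behave like normal dicts; the TypedDict annotations only
    # give static type checkers the expected keys and value types.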

examples/multimodal_autoregressive/megatron_mm_autoregressive_eval_vision_understanding.py

Lines changed: 25 additions & 2 deletions
@@ -14,6 +14,7 @@


 import datetime
+import sys

 import torch
 import torchvision
@@ -29,11 +30,16 @@
 from torch.utils.data import DataLoader
 from transformers import AutoModel, AutoTokenizer

+from nemo.collections.common.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
+
 # pylint: disable=line-too-long
-from nemo.collections.nlp.modules.common.transformer.text_generation import LengthParam, SamplingParam
-from nemo.collections.nlp.parts.nlp_overrides import CustomProgressBar, NLPDDPStrategy
 from nemo.core.config import hydra_runner

+if sys.version_info >= (3, 8):
+    from typing import TypedDict
+else:
+    from typing_extensions import TypedDict
+
 """
 This is the script to run multimodal autoregresssive text generation.
@@ -94,6 +100,23 @@
 VQ_HUB = "BAAI/Emu3-VisionTokenizer"


+class LengthParam(TypedDict):
+    max_length: int  # The maximum length of the sequence to be generated.
+    min_length: int  # The minimum length of the sequence to be generated.
+
+
+class SamplingParam(TypedDict):
+    use_greedy: bool  # Whether or not to use sampling ; use greedy decoding otherwise
+    temperature: float  # sampling temperature
+    top_k: int  # The number of highest probability vocabulary tokens to keep for top-k-filtering.
+    top_p: float  # If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
+    repetition_penalty: float  # The parameter for repetition penalty. 1.0 means no penalty.
+    add_BOS: bool  # add the bos token at the begining of the prompt
+    all_probs: bool  # whether return the log prob for all the tokens in vocab
+    compute_logprob: bool  # a flag used to compute logprob of all the input text, a very special case of running inference, default False
+    end_strings: List[str]  # generation will stop when one of these tokens is generated
+
+
 def to_imgstr(image_tokens, tokenizer):
     """Convert integer image tokens to visual tokens string"""
     image_tokens = image_tokens.cpu().numpy().tolist()
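The version-gated import added in both scripts exists because typing.TypedDict first shipped in Python 3.8; older interpreters fall back to the typing_extensions backport. A self-contained sketch of the pattern (the Point class is illustrative, not part of the script):

    import sys

    if sys.version_info >= (3, 8):
        from typing import TypedDict
    else:
        from typing_extensions import TypedDict  # backport package for older Pythons

    class Point(TypedDict):
        x: int
        y: int

    p: Point = {"x": 1, "y": 2}  # checked statically, a plain dict at runtime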

examples/tts/g2p/conf/g2p_heteronym_classification.yaml

Lines changed: 0 additions & 104 deletions
This file was deleted.
