
Commit 89e596d

Nemo-run Processor for TopIPL training (#121)

* Ipl processors Signed-off-by: Nune <[email protected]>
* remove Signed-off-by: Nune <[email protected]>
* some commits Signed-off-by: Nune <[email protected]>
* IPL Processors Signed-off-by: Nune <[email protected]>
* IPL Processors Signed-off-by: Nune <[email protected]>
* IPL Processors Signed-off-by: Nune <[email protected]>
* Remove unnecessary files Signed-off-by: Nune <[email protected]>
* IPL dependencies Signed-off-by: Nune <[email protected]>
* Small changes Signed-off-by: Nune <[email protected]>
* Small changes Signed-off-by: Nune <[email protected]>
* Small changes Signed-off-by: Nune <[email protected]>
* Config changes Signed-off-by: Nune <[email protected]>
* Config place change Signed-off-by: Nune <[email protected]>
* Moving configs Signed-off-by: Nune <[email protected]>
* Readme file Signed-off-by: Nune <[email protected]>
* Fix test Signed-off-by: Nune <[email protected]>
* Update nemo_run_config.yaml Signed-off-by: Nune <[email protected]>
* Update nemo_run_config.yaml Signed-off-by: Nune <[email protected]>
* Adding copyrights Signed-off-by: Nune <[email protected]>
* Adding imports from main Signed-off-by: Nune <[email protected]>
* Adding copyrights Signed-off-by: Nune <[email protected]>
* Doc update Signed-off-by: Nune <[email protected]>
* Doc update Signed-off-by: Nune <[email protected]>
* Doc update Signed-off-by: Nune <[email protected]>
* Update config Signed-off-by: Nune <[email protected]>
* Update nemo_run_config.yaml Signed-off-by: Nune <[email protected]>
* Update ipl.txt Signed-off-by: Nune <[email protected]>
* update Signed-off-by: Nune <[email protected]>
* Small change Signed-off-by: Nune <[email protected]>
* small update Signed-off-by: Nune <[email protected]>
* force jiwer Signed-off-by: George Zelenfroind <[email protected]>
* attempt 1 to fix certificates Signed-off-by: George Zelenfroind <[email protected]>
* attempt 2 to fix cert Signed-off-by: George Zelenfroind <[email protected]>
* small change Signed-off-by: Nune <[email protected]>
* Doc changes Signed-off-by: Nune <[email protected]>
* Doc changes Signed-off-by: Nune <[email protected]>

---------

Signed-off-by: Nune <[email protected]>
Signed-off-by: George Zelenfroind <[email protected]>
Co-authored-by: George Zelenfroind <[email protected]>
Co-authored-by: George <[email protected]>
1 parent c53be5e commit 89e596d

File tree

14 files changed: +2856, -1 lines changed

.github/workflows/tests.yml

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ jobs:
       AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
       CLEAN_UP_TMP_PATH: 1
     run: |
+
       wget https://uit.stanford.edu/sites/default/files/2023/10/11/incommon-rsa-ca2.pem # downloading cert manually [for CORAL]
       sudo cp incommon-rsa-ca2.pem /usr/local/share/ca-certificates/incommon-rsa-server-ca-2.crt # [cert for CORAL]
       sudo update-ca-certificates # [cert for CORAL]

dataset_configs/ipl/config.yaml

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
documentation: |
  TopIPL
  ######

  This config is used to run the `TopIPL: Iterative Pseudo-Labeling for ASR <https://arxiv.org/abs/2506.07659>`_ training algorithm using NeMo-Run.

  TopIPL is a **semi-supervised training method** for automatic speech recognition (ASR) that iteratively alternates between model training and pseudo-label generation for unlabeled data. It uses a **top-N checkpoint averaging strategy** to create a strong teacher model and maintains a **dynamic cache** of pseudo-labels throughout the process.

  The pipeline is implemented as a processor compatible with the `nemo_run` framework. It generates an output manifest containing updated labels based on pseudo-labeling iterations.

  This config performs the following steps:

  1. Runs training and inference commands using NeMo-Run.
  2. Periodically stops training to generate pseudo-labels with a top-N checkpoint ensemble.
  3. Maintains a dynamic cache of pseudo-labels for unlabeled data.
  4. Produces a new output manifest after each iteration.

  **Required arguments**

  - **output_manifest_file**: path where the final manifest with pseudo-labels will be saved.
  - **nemo_run_config**: YAML config file specifying the training, inference, and IPL parameters.

  **Training config requirements**

  Your training config must include the following setting to enable IPL:

  .. code-block:: yaml

    exp_manager:
      create_ipl_epoch_stopper_callback: True

  If you're not using Lhotse, also include:

  .. code-block:: yaml

    ipl_epoch_stopper_callback_params:
      stop_every_n_epochs: 2

  **Prerequisites**

  - nemo_run
  - ``pip install -r requirements/ipl.txt``

processors_to_run: all

processors:
  - _target_: sdp.processors.IPL.nemo_run_processor.NemoRunIPLProcessor
    config_path: ./nemo_run_config.yaml
    output_manifest_file: ???
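
As background for the "top-N checkpoint ensemble" step above: averaging the weights of the N best checkpoints yields the teacher model that transcribes the unlabeled data. A minimal sketch of what that averaging amounts to, assuming plain PyTorch checkpoints whose `state_dict` entry holds the weights (an illustration, not the code added by this commit):

```python
# Sketch: average the weights of the top-N checkpoints into a teacher model.
# Assumes each checkpoint is a torch-saved dict with a "state_dict" entry.
import torch

def average_checkpoints(ckpt_paths: list[str]) -> dict:
    avg_state = None
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")["state_dict"]
        if avg_state is None:
            avg_state = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    # Divide by the number of checkpoints to get the averaged weights.
    return {k: v / len(ckpt_paths) for k, v in avg_state.items()}
```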
dataset_configs/ipl/nemo_run_config.yaml

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The script to be run.
script: # Script path to run, relative to the NeMo directory
script_config: # Training config file for the script; ipl_epoch_stopper_callback should be provided in the config
inference_config: # Inference config file of unlabeled data for transcribe_speech_parallel

exp_name: null # populated by exp_manager.name if not provided
results_dir: # Where to store the results of the run

# Path to the local NeMo repository. This is used to locate scripts and configs from NeMo.
# To set this up:
# 1. Clone the NeMo repository:
#    git clone https://github.com/NVIDIA/NeMo.git /your/desired/path/to/nemo
# 2. Set the path here.
# Make sure this path is valid and NeMo is up to date if you're using its scripts.
nemo_directory: # NeMo directory path
do_average: # Whether to average the top checkpoints for pseudo-label generation
p_cache: # Probability with which to update the pseudo-labeled set
num_ipl_epochs: # Number of epochs to run pseudo-labeling for

# Optional arguments
num_runs:
num_gpus:
num_tasks_per_node:
max_runtime: # Specify for clusters

########################################################################################################################

executor: slurm # or local

USER:

# Fields for cluster runs
ssh_tunnel:
  host:
  # ------------------------------- Fill this up! -------------------------------
  user: "${USER}" # your username; may be null, in which case it is resolved from the ${USER} environment variable
  job_dir: "" # Job directory to keep created files
  identity: ""
  # -----------------------------------------------------------------------------

account:
partition:
job_name_prefix:

containers:
  asr: # Container image

env_vars:
  - 'TOKENIZERS_PARALLELISM='
  - 'AIS_ENDPOINT='
  - 'LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE='
  - 'TORCH_CUDNN_V8_API_ENABLED='
  - 'PYTORCH_CUDA_ALLOC_CONF='
  - 'HYDRA_FULL_ERROR=1'

required_env_vars:
  - 'HF_TOKEN='
  - 'WANDB_KEY='

mounts:
  # Replace with your own paths in your cluster config
  - /path/to/mount:/where/to/mount/

timeouts:
  partition_name: # Specify time
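
To make `p_cache` concrete: on each pseudo-labeling round, a cached pseudo-label is only replaced by a freshly generated transcript with probability `p_cache`, so the cache evolves gradually instead of being rewritten wholesale. A hypothetical sketch of that update rule (names and structure are illustrative, not taken from this commit):

```python
# Hypothetical sketch of the dynamic pseudo-label cache update driven by p_cache.
import random

def update_cache(cache: dict, new_labels: dict, p_cache: float) -> dict:
    """cache and new_labels map utterance id -> pseudo-label transcript."""
    for utt_id, label in new_labels.items():
        # Unseen utterances always enter the cache; known ones are
        # refreshed only with probability p_cache.
        if utt_id not in cache or random.random() < p_cache:
            cache[utt_id] = label
    return cache
```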

docs/src/sdp/api.rst

Lines changed: 8 additions & 0 deletions
@@ -379,6 +379,14 @@ Miscellaneous
 .. autodata:: sdp.processors.tts.prepare_tts_segments.PrepareTTSSegmentsProcessor
    :annotation:

+.. autodata:: sdp.processors.ipl.nemo_run_processor.NemoRunIPLProcessor
+   :annotation:
+
+.. autodata:: sdp.processors.ipl.ipl_processors.TrainingCommandGenerator
+   :annotation:
+
+.. autodata:: sdp.processors.ipl.ipl_processors.InferenceCommandGenerator
+   :annotation:

 .. _sdp-base-classes:

docs/src/sdp/existing_configs.rst

Lines changed: 18 additions & 0 deletions
@@ -407,3 +407,21 @@ HiFiTTS-2
    config-docs/english/hifitts2/config_22khz
    config-docs/english/hifitts2/config_44khz
    config-docs/english/hifitts2/config_bandwidth
+
+NemoRunIPL
+~~~~~~~~~~
+
+**Supported configs**.
+
+* **IPL**:
+  `config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/ipl/config.yaml>`__ |
+  :doc:`documentation <config-docs/ipl/config>`
+* **NeMoRun**:
+  `config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/ipl/nemo_run_config.yaml>`__ |
+  :doc:`documentation <config-docs/ipl/nemo_run_config>`
+
+.. toctree::
+   :hidden:
+
+   config-docs/ipl/config
+   config-docs/ipl/nemo_run_config

requirements/ipl.txt

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
nemo_run

# The NeMo repository path is also required; it is used to locate scripts and configs from NeMo.
#
# To set this up:
# 1. Clone the NeMo repository:
#    git clone https://github.com/NVIDIA/NeMo.git /your/desired/path/to/nemo
# 2. Set the path in nemo_run_config.yaml:
#    nemo_directory: /your/desired/path/to/nemo
#
# Make sure this path is valid and NeMo is up to date if you're using its scripts.

sdp/processors/ipl/README.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# 🧠 TopIPL: Iterative Pseudo-Labeling for ASR

TopIPL is an **iterative pseudo-labeling algorithm** designed for training ASR models using both labeled and unlabeled data. It maintains a **dynamic pseudo-label cache** and leverages **top-N averaged checkpoints** as a teacher model to generate high-quality pseudo-labels across training iterations.

## 📦 Contents

- `NemoRunIPLProcessor` — Command generator and job submitter for IPL runs, compatible with local and cluster environments.
- `nemo_run_config.yaml` — Main configuration file. Users should define all required paths and parameters here.

## 🚀 Getting Started

TopIPL runs like any other processor in the `nemo_run` framework. To use it, you must pass:

- `output_manifest_file`: Path where the resulting manifest will be saved.
- `nemo_run_config`: YAML file containing the IPL setup, training/inference configs, and NeMo-Run settings.

### 🔧 Training Config Requirements

Your training config must include the following setting to enable IPL:

```yaml
exp_manager:
  create_ipl_epoch_stopper_callback: True
```

If you're not using Lhotse, also include:

```yaml
ipl_epoch_stopper_callback_params:
  stop_every_n_epochs: 2
```
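
For intuition, the callback enabled by `create_ipl_epoch_stopper_callback` is what periodically hands control back to the IPL loop so pseudo-labels can be regenerated. A rough sketch of such a callback under standard PyTorch Lightning semantics (illustrative only, not NeMo's implementation):

```python
# Illustrative epoch-stopper callback: halt training every N epochs so the
# IPL loop can regenerate pseudo-labels, then resume from the checkpoint.
import lightning.pytorch as pl

class EpochStopper(pl.Callback):
    def __init__(self, stop_every_n_epochs: int = 2):
        self.stop_every_n_epochs = stop_every_n_epochs

    def on_train_epoch_end(self, trainer, pl_module):
        if (trainer.current_epoch + 1) % self.stop_every_n_epochs == 0:
            trainer.should_stop = True  # stop gracefully at the epoch boundary
```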
### Prerequisites

Before using TopIPL, make sure the following are set up:

- Clone the NeMo repository:

  ```bash
  git clone https://github.com/NVIDIA/NeMo.git /your/desired/path/to/nemo
  ```

- Set the path to NeMo in your `nemo_run_config.yaml`: `nemo_directory: /your/desired/path/to/nemo`
- `pip install -r requirements/ipl.txt`

### Running the Code

```bash
python main.py --config-path=/path/to/directory/config --config-name=config.yaml
```
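
The processor can also be driven from Python rather than through `main.py`; a sketch, assuming the class follows SDP's usual `BaseProcessor` interface with a `process()` entry point (paths below are placeholders):

```python
# Sketch: invoking the IPL processor directly (paths are placeholders).
from sdp.processors.IPL.nemo_run_processor import NemoRunIPLProcessor

processor = NemoRunIPLProcessor(
    config_path="dataset_configs/ipl/nemo_run_config.yaml",
    output_manifest_file="/path/to/output_manifest.json",
)
processor.process()  # assumed SDP BaseProcessor entry point
```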

sdp/processors/ipl/__init__.py

Whitespace-only changes.
