Commit 74f3210

Merge branch 'main' into portuguese
2 parents 6cd918f + 89e596d commit 74f3210

22 files changed

+2916
-30
lines changed

.github/workflows/tests.yml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -75,14 +75,21 @@ jobs:
7575
pip install nemo-toolkit[asr,nlp]==1.23.0
7676
pip install nemo_text_processing
7777
pip install -r requirements/huggingface.txt
78+
pip install certifi # needed to avoid certificate problems [CORAAL]
79+
export SSL_CERT_FILE=$(python -m certifi)
7880
python -m pip cache purge
81+
7982
8083
- name: Run all tests
8184
env:
8285
AWS_SECRET_KEY: ${{ secrets.AWS_SECRET_KEY }}
8386
AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
8487
CLEAN_UP_TMP_PATH: 1
8588
run: |
89+
90+
wget https://uit.stanford.edu/sites/default/files/2023/10/11/incommon-rsa-ca2.pem # download the certificate manually [for CORAAL]
91+
sudo cp incommon-rsa-ca2.pem /usr/local/share/ca-certificates/incommon-rsa-server-ca-2.crt # [cert for CORAAL]
92+
sudo update-ca-certificates # [cert for CORAAL]
8693
set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
8794
python -m pytest tests/ --junitxml=pytest.xml --ignore=tests/test_tts_sdp_end_to_end.py --cov-report=term-missing:skip-covered --cov=sdp --durations=30 -rs | tee pytest-coverage.txt
8895
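
Note (illustrative, not part of the commit): the workflow change above exports `SSL_CERT_FILE` so that Python's TLS stack picks up certifi's CA bundle. The sketch below uses only the standard library to show which CA file would be preferred under that setting. `resolve_ca_bundle` is a hypothetical helper name, not something defined in this repository.

```python
import os
import ssl


def resolve_ca_bundle(env=None):
    """Return the CA bundle path that would be preferred, or None.

    An explicit SSL_CERT_FILE value wins; otherwise fall back to the
    default CA file reported by the ssl module (which may be None).
    """
    env = os.environ if env is None else env
    override = env.get("SSL_CERT_FILE")
    if override:
        return override
    return ssl.get_default_verify_paths().cafile


if __name__ == "__main__":
    print("CA bundle in use:", resolve_ca_bundle())
```

Running this inside the CI job after the `export` line should print the path returned by `python -m certifi`.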

dataset_configs/english/coraal/config.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ documentation: |
1818
This config performs the following data processing.
1919
2020
1. Downloads CORAAL data based on the
21-
`official file list <http://lingtools.uoregon.edu/coraal/coraal_download_list.txt>`_.
21+
`official file list <https://lingtools.uoregon.edu/coraal/coraal_download_list.txt>`_.
2222
There are a couple of errors in the links there, which are fixed in our code.
2323
2. Drops all utterances which contain only pauses. Set ``drop_pauses=False`` to undo.
2424
3. Groups all consecutive segments from the same speaker until 20 seconds duration

dataset_configs/ipl/config.yaml

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
documentation: |
2+
TopIPL
3+
######
4+
5+
This config is used to run the `TopIPL: Iterative Pseudo-Labeling for ASR <https://arxiv.org/abs/2506.07659>`_ training algorithm using NeMo-Run.
6+
7+
TopIPL is a **semi-supervised training method** for automatic speech recognition (ASR) that iteratively alternates between model training and pseudo-label generation for unlabeled data. It uses a **top-N checkpoint averaging strategy** to create a strong teacher model and maintains a **dynamic cache** of pseudo-labels throughout the process.
8+
9+
The pipeline is implemented as a processor compatible with the `nemo_run` framework. It generates an output manifest containing updated labels based on pseudo-labeling iterations.
10+
11+
This config performs the following steps:
12+
13+
1. Runs training and inference commands using NeMo-Run.
14+
2. Periodically stops training to generate pseudo-labels with a top-N checkpoint ensemble.
15+
3. Maintains a dynamic cache of pseudo-labels for unlabeled data.
16+
4. Produces a new output manifest after each iteration.
17+
18+
**Required arguments**
19+
20+
- **output_manifest_file**: path where the final manifest with pseudo-labels will be saved.
21+
- **nemo_run_config**: YAML config file specifying the training, inference, and IPL parameters.
22+
23+
**Training config requirements**
24+
25+
Your training config must include the following setting to enable IPL:
26+
27+
.. code-block:: yaml
28+
29+
exp_manager:
30+
create_ipl_epoch_stopper_callback: True
31+
32+
If you're not using Lhotse, also include:
33+
34+
.. code-block:: yaml
35+
36+
ipl_epoch_stopper_callback_params:
37+
stop_every_n_epochs: 2
38+
39+
**Prerequisites**
40+
41+
- nemo_run
42+
- ``pip install -r ipl.txt``
43+
44+
processors_to_run: all
45+
46+
processors:
47+
- _target_: sdp.processors.ipl.nemo_run_processor.NemoRunIPLProcessor
48+
config_path: ./nemo_run_config.yaml
49+
output_manifest_file: ???
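
Note (illustrative, not part of the commit): the documentation above describes a top-N checkpoint averaging strategy for building the teacher model. As a rough sketch of the idea only, the code below averages parameters element-wise. It assumes each checkpoint is a plain mapping from parameter name to a list of floats, whereas real NeMo checkpoints hold tensors. `average_checkpoints` is a hypothetical name and is not the processor's API.

```python
def average_checkpoints(state_dicts):
    """Element-wise average of several checkpoints.

    Each checkpoint is assumed to be a dict mapping a parameter name
    to a list of floats, with identical keys and lengths across
    checkpoints.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint to average")
    count = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged


if __name__ == "__main__":
    top_n = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
    print(average_checkpoints(top_n))
```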

dataset_configs/ipl/nemo_run_config.yaml

Lines changed: 80 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,80 @@
1+
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
# The script to be run.
16+
script: # Script path to run relative to directory
17+
script_config: # Training config file for the script. ipl_epoch_stopper_callback should be provided in the config
18+
inference_config: # Inference config file of unlabeled data for transcribe_speech_parallel
19+
20+
exp_name: null # populated by exp_manager.name if not provided
21+
results_dir: # Where to store the results of the run
22+
23+
# Path to the local NeMo repository. This is used to locate scripts and configs from NeMo.
24+
# To set this up:
25+
# 1. Clone the NeMo repository:
26+
# git clone https://github.com/NVIDIA/NeMo.git /your/desired/path/to/nemo
27+
# 2. Set the path here:
28+
# Make sure this path is valid and NeMo is up to date if you're using its scripts.
29+
nemo_directory: # NeMo directory path
30+
do_average: # Boolean: whether to average checkpoints for pseudo-label generation
31+
p_cache: # Probability with which the pseudo-labeled set is updated
32+
num_ipl_epochs: # Number of epochs to run pseudo-labeling for
33+
34+
# Optional arguments
35+
num_runs:
36+
num_gpus:
37+
num_tasks_per_node:
38+
max_runtime: # Specify for clusters
39+
40+
########################################################################################################################
41+
42+
executor: slurm # or local
43+
44+
USER:
45+
46+
# Fields for cluster run
47+
ssh_tunnel:
48+
host:
49+
# ------------------------------- Fill this in! -------------------------------
50+
user: "${USER}" # your username; "${USER}" or null resolves it from the USER environment variable
51+
job_dir: "" # Job directory to keep created files
52+
identity: ""
53+
# -----------------------------------------------------------------------------
54+
55+
account:
56+
partition:
57+
job_name_prefix:
58+
59+
containers:
60+
asr: # Container image
61+
62+
63+
env_vars:
64+
- 'TOKENIZERS_PARALLELISM='
65+
- 'AIS_ENDPOINT='
66+
- 'LHOTSE_AUDIO_DURATION_MISMATCH_TOLERANCE='
67+
- 'TORCH_CUDNN_V8_API_ENABLED='
68+
- 'PYTORCH_CUDA_ALLOC_CONF='
69+
- 'HYDRA_FULL_ERROR=1'
70+
71+
required_env_vars:
72+
- 'HF_TOKEN='
73+
- 'WANDB_KEY='
74+
75+
mounts:
76+
# Replace with your own paths in your cluster config
77+
- /path/to/mount:/where/to/mount/
78+
79+
timeouts:
80+
partition_name: # Specify time
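
Note (illustrative, not part of the commit): `p_cache` above is described only as the probability with which the pseudo-labeled set is updated. One plausible reading, assumed here and not confirmed by the source, is that each cached pseudo-label is independently replaced by a fresh hypothesis with probability `p_cache`. `update_cache` and its arguments are hypothetical names.

```python
import random


def update_cache(cache, new_hypotheses, p_cache, rng=None):
    """Return a new pseudo-label cache after one update pass.

    Utterances not yet cached are always added. Already-cached
    utterances are replaced with probability p_cache (assumed
    per-utterance semantics).
    """
    rng = rng or random.Random()
    updated = dict(cache)
    for utt_id, text in new_hypotheses.items():
        if utt_id not in updated or rng.random() < p_cache:
            updated[utt_id] = text
    return updated


if __name__ == "__main__":
    cache = {"utt1": "old text"}
    fresh = {"utt1": "new text", "utt2": "hello"}
    print(update_cache(cache, fresh, p_cache=0.2, rng=random.Random(0)))
```

Returning a new dict instead of mutating `cache` keeps the previous iteration's labels available for comparison.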

docker/Dockerfile

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,9 @@ RUN apt-get update \
2121
# Update pip
2222
RUN pip install --upgrade pip
2323

24+
# Install typing-extensions manually
25+
RUN pip install typing-extensions
26+
2427
# Clone the NeMo SDP repository
2528
COPY . /src/NeMo-speech-data-processor
2629
RUN rm -rf /src/NeMo-speech-data-processor/.git
@@ -34,4 +37,4 @@ RUN find requirements/ -name "*.txt" -exec pip install -r {} \;
3437
WORKDIR /src/NeMo-speech-data-processor
3538

3639
# Set up entrypoint
37-
CMD ["bash"]
40+
CMD ["bash"]

docs/src/conf.py

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -45,7 +45,6 @@
4545
"numpy",
4646
"tqdm",
4747
"soundfile",
48-
"ndjson",
4948
"boto3",
5049
"webvtt_py",
5150
"python_docx",
@@ -189,3 +188,8 @@ def setup(app):
189188
]
190189
# nitpick_ignore_regex = [('py:class', '*')]
191190

191+
# Temporarily skip link checking for the CORAAL file list
192+
linkcheck_ignore = [
193+
r'https://lingtools\.uoregon\.edu/coraal/coraal_download_list\.txt',
194+
]
195+
# https://lingtools.uoregon.edu/coraal/coraal_download_list.txt
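
Note (illustrative, not part of the commit): entries in `linkcheck_ignore` are regular expressions, which is why the dots in the pattern above are escaped. The sketch below assumes each pattern is matched from the start of the URL, in the manner of `re.match`, and shows what the pattern does and does not cover. `is_ignored` is a hypothetical helper, not a Sphinx function.

```python
import re

# Same pattern as in the conf.py change above.
LINKCHECK_IGNORE = [
    r'https://lingtools\.uoregon\.edu/coraal/coraal_download_list\.txt',
]


def is_ignored(url, patterns=LINKCHECK_IGNORE):
    """Return True if any pattern matches the URL from its start."""
    return any(re.match(pattern, url) for pattern in patterns)


if __name__ == "__main__":
    print(is_ignored("https://lingtools.uoregon.edu/coraal/coraal_download_list.txt"))
    print(is_ignored("http://lingtools.uoregon.edu/coraal/coraal_download_list.txt"))
```

Under this assumption the pattern covers only the `https` form of the link, so the older `http` URL would still be checked.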

docs/src/sdp/api.rst

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -379,6 +379,14 @@ Miscellaneous
379379
.. autodata:: sdp.processors.tts.prepare_tts_segments.PrepareTTSSegmentsProcessor
380380
:annotation:
381381

382+
.. autodata:: sdp.processors.ipl.nemo_run_processor.NemoRunIPLProcessor
383+
:annotation:
384+
385+
.. autodata:: sdp.processors.ipl.ipl_processors.TrainingCommandGenerator
386+
:annotation:
387+
388+
.. autodata:: sdp.processors.ipl.ipl_processors.InferenceCommandGenerator
389+
:annotation:
382390

383391
.. _sdp-base-classes:
384392

docs/src/sdp/existing_configs.rst

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -408,6 +408,7 @@ HiFiTTS-2
408408
config-docs/english/hifitts2/config_44khz
409409
config-docs/english/hifitts2/config_bandwidth
410410

411+
411412
Unlabeled Portuguese Data
412413
~~~~~~~~~~~~~~~~~~~~~~~~~
413414

@@ -418,3 +419,20 @@ Unlabeled Portuguese Data
418419
:hidden:
419420

420421
config-docs/portuguese/unlabeled/config
422+
423+
NemoRunIPL
424+
~~~~~~~~~~
425+
**Supported configs**.
426+
427+
* **IPL**:
428+
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/ipl/config.yaml>`__ |
429+
:doc:`documentation <config-docs/ipl/config>`
430+
* **NeMoRun**:
431+
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/ipl/nemo_run_config.yaml>`__ |
432+
:doc:`documentation <config-docs/ipl/nemo_run_config>`
433+
434+
.. toctree::
435+
:hidden:
436+
437+
config-docs/ipl/config
438+
config-docs/ipl/nemo_run_config

requirements/ipl.txt

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
nemo_run
2+
3+
# The NeMo repository path is also required; it is used to locate scripts and configs from NeMo.
4+
#
5+
# To set this up:
6+
# 1. Clone the NeMo repository:
7+
# git clone https://github.com/NVIDIA/NeMo.git /your/desired/path/to/nemo
8+
# 2. Set the path in nemo_run_config.yaml:
9+
# nemo_directory: /your/desired/path/to/nemo
10+
#
11+
# Make sure this path is valid and NeMo is up to date if you're using its scripts.

requirements/main.txt

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ ffmpeg
44
hydra-core
55
joblib
66
librosa>=0.10.0 # specify >=0.10.0 so that librosa.get_duration(path=...) will work
7-
numpy==1.26
7+
numpy>=1.26, <2.0 # the code was written against NumPy 1.x and may crash on 2.x
88
omegaconf
99
pandas
1010
rarfile
@@ -18,7 +18,7 @@ python-docx
1818
pydub
1919
dask
2020
distributed
21-
21+
jiwer>=3.1.0,<4.0.0
2222
# toloka-kit # Temporarily disabled due to Toloka's technical pause; keep as reference for past and future API support
2323
# for some processors, https://github.com/NVIDIA/NeMo is additionally required
2424
# for some processors, nemo_text_processing is additionally required
