
Commit 9081c95

Portuguese (#77)
* segment_duration Signed-off-by: Nikolay Karpov <[email protected]>
* nemo_path Signed-off-by: Nikolay Karpov <[email protected]>
* mv to unlabeled Signed-off-by: Nikolay Karpov <[email protected]>
* add docs Signed-off-by: Nikolay Karpov <[email protected]>
* fix from main Signed-off-by: Nikolay Karpov <[email protected]>
* Test check (#128)
  * Test check Signed-off-by: Nune <[email protected]>
  * Update Signed-off-by: Nune <[email protected]>
  * Small changes Signed-off-by: Nune <[email protected]>
  * Small changes Signed-off-by: Nune <[email protected]>
  * Small changes Signed-off-by: Nune <[email protected]>
  * Small changes Signed-off-by: Nune <[email protected]>
  * Small changes Signed-off-by: Nune <[email protected]>
  * Small changes Signed-off-by: Nune <[email protected]>
  * Ignore duration as it differs based on setup Signed-off-by: Nune <[email protected]>
  * Bringing back tests Signed-off-by: Nune <[email protected]>
  * remove prints Signed-off-by: Nune <[email protected]>
  * remove prints Signed-off-by: Nune <[email protected]>
  * Remove Signed-off-by: Nune <[email protected]>
  --------- Signed-off-by: Nune <[email protected]>
* Doc update Signed-off-by: Nune <[email protected]>
1 parent e60baa5 commit 9081c95

File tree

12 files changed

+1291
-147
lines changed

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+documentation: |
+  Unlabeled Data Processing Pipeline
+  ##################################
+
+  This pipeline processes unlabeled data for iterative pseudo-labeling training.
+
+  The pipeline performs the following steps:
+  1. Creates an initial manifest by searching for all WAV files in the `raw_data_dir` folder.
+  2. Counts the duration of each WAV file.
+  3. Identifies the language using the `langid_ambernet` NeMo model.
+  4. Filters out audios that are tagged with a different language.
+  5. Filters out audios that are too long to be processed.
+  6. Applies the VAD algorithm from the NeMo repository.
+  7. Forms segments by joining adjacent segments up to a duration threshold.
+  8. Splits long audios into shorter segments.
+  9. Removes empty files and extra fields from the manifest.
+
+  **Required inputs**:
+  - `workspace_dir`: Directory for intermediate files, containing the following subfolders:
+    - `${workspace_dir}/wavs/` - Folder with source long files.
+    - `${workspace_dir}/sdp/` - Folder to store manifests.
+    - `${workspace_dir}/sdp/vad/` - Folder to store temporary files from the VAD algorithm.
+    - `${workspace_dir}/splited_wavs/` - Folder to store split short files.
+
+  - `language_short`: Two-letter language code.
+  - `nemo_path`: Path to NeMo installation.
+  - `final_manifest`: Path to the final output manifest.
+
+processors_to_run: "0:"
+workspace_dir: ???
+manifest_dir: ${workspace_dir}/sdp
+language_short: pt
+nemo_path: ???
+final_manifest: ${manifest_dir}/final_manifest.json
+
+processors:
+  - _target_: sdp.processors.CreateInitialManifestByExt
+    raw_data_dir: ${workspace_dir}/wavs
+    extension: wav
+    output_file_key: audio_filepath
+    output_manifest_file: ${manifest_dir}/manifest0.json
+
+  - _target_: sdp.processors.GetAudioDuration
+    audio_filepath_key: audio_filepath
+    duration_key: duration
+    output_manifest_file: ${manifest_dir}/manifest1.json
+
+  - _target_: sdp.processors.AudioLid
+    output_manifest_file: ${manifest_dir}/manifest2.json
+    input_audio_key: audio_filepath
+    output_lang_key: audio_lang
+    should_run: False
+    device: cuda
+    pretrained_model: "langid_ambernet"
+    segment_duration: 20
+    num_segments: 3
+
+  - _target_: sdp.processors.PreserveByValue
+    output_manifest_file: ${manifest_dir}/manifest3.json
+    input_value_key: audio_lang
+    should_run: False
+    target_value: ${language_short}
+
+  - _target_: sdp.processors.PreserveByValue
+    output_manifest_file: ${manifest_dir}/manifest4.json
+    input_value_key: duration
+    operator: le
+    target_value: 20000.0
+
+  - _target_: sdp.processors.Subprocess
+    cmd: 'rm -rf ${manifest_dir}/vad/*'
+
+  - _target_: sdp.processors.Subprocess
+    input_manifest_file: ${manifest_dir}/manifest4.json
+    output_manifest_file: ${manifest_dir}/vad
+    input_manifest_arg: "manifest_filepath"
+    output_manifest_arg: "output_dir"
+    cmd: 'python sdp/processors/nemo/speech_to_text_with_vad.py audio_type=wav vad_model=vad_multilingual_frame_marblenet vad_config=sdp/processors/nemo/frame_vad_infer_postprocess.yaml'
+
+  - _target_: sdp.processors.RenameFields
+    input_manifest_file: ${manifest_dir}/vad/temp_manifest_vad_rttm-onset0.3-offset0.3-pad_onset0.2-pad_offset0.2-min_duration_on0.2-min_duration_off0.2-filter_speech_firstTrue.json
+    output_manifest_file: ${manifest_dir}/manifest7.json
+    rename_fields: {"audio_filepath":"source_filepath"}
+
+  - _target_: sdp.processors.nemo.rttm.GetRttmSegments
+    output_manifest_file: ${manifest_dir}/manifest8.json
+    rttm_key: rttm_file
+    output_file_key: audio_segments
+    duration_key: duration
+    duration_threshold: 20.0
+
+  - _target_: sdp.processors.nemo.rttm.SplitAudioFile
+    output_manifest_file: ${manifest_dir}/manifest9.json
+    splited_audio_dir: ${workspace_dir}/splited_wavs/
+    segments_key: audio_segments
+    duration_key: duration
+    input_file_key: source_filepath
+    output_file_key: audio_filepath
+
+  - _target_: sdp.processors.PreserveByValue
+    output_manifest_file: ${manifest_dir}/manifest10.json
+    input_value_key: duration
+    operator: gt
+    target_value: 0.0
+
+  - _target_: sdp.processors.KeepOnlySpecifiedFields
+    output_manifest_file: ${final_manifest}
+    fields_to_keep: ["audio_filepath", "duration"]
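
Step 7 of the pipeline above ("Forms segments by joining adjacent segments up to a duration threshold") can be illustrated with a small standalone sketch. This is not the code of `GetRttmSegments`, which is not shown in this commit view. It is one plausible greedy reading of the step, and the function name `join_adjacent_segments` and the `(start, end)` tuple layout are assumptions made for illustration only. The default of 20.0 seconds mirrors the `duration_threshold` value in the config.

```python
from typing import List, Tuple

# Hypothetical representation of a VAD speech segment: (start_seconds, end_seconds).
Segment = Tuple[float, float]


def join_adjacent_segments(segments: List[Segment], duration_threshold: float = 20.0) -> List[Segment]:
    """Greedily merge consecutive segments while the merged span fits the threshold.

    Illustrative only: the real processor may treat gaps and overlong segments differently.
    """
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and end - merged[-1][0] <= duration_threshold:
            # The current group can absorb this segment without exceeding the threshold.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Start a new group; a single segment longer than the threshold is kept as-is.
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    vad_segments = [(0.0, 5.0), (6.0, 12.0), (13.0, 19.5), (21.0, 30.0), (50.0, 55.0)]
    for start, end in join_adjacent_segments(vad_segments, duration_threshold=20.0):
        print(f"{start:.1f}-{end:.1f} ({end - start:.1f}s)")
```

In this sketch the first three segments collapse into one 19.5-second span, while the later ones stay separate because absorbing them would push the group past the threshold.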

docs/src/sdp/existing_configs.rst

Lines changed: 15 additions & 4 deletions
@@ -408,8 +408,19 @@ HiFiTTS-2
    config-docs/english/hifitts2/config_44khz
    config-docs/english/hifitts2/config_bandwidth
 
+
+Unlabeled Portuguese Data
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/portuguese/unlabeled/config.yaml>`__ |
+:doc:`documentation <config-docs/portuguese/unlabeled/config>`
+
+.. toctree::
+   :hidden:
+
+   config-docs/portuguese/unlabeled/config
+
 NemoRunIPL
-~~~~~~~~~~
 
 **Supported configs**.
 
@@ -419,13 +430,13 @@ NemoRunIPL
 * **NeMoRun**:
   `config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/ipl/nemo_run_config.yaml>`__ |
   :doc:`documentation <config-docs/ipl/nemo_run_config>`
-
+
 .. toctree::
    :hidden:
 
    config-docs/ipl/config
    config-docs/ipl/nemo_run_config
-
+
 Earnings21/22
 ~~~~~~~~~~~~~
 
@@ -438,4 +449,4 @@ Earnings21/22
 .. toctree::
    :hidden:
 
-   config-docs/english/earnings/config
+   config-docs/english/earnings/config

sdp/processors/__init__.py

Lines changed: 18 additions & 12 deletions
@@ -32,9 +32,8 @@
     CreateInitialManifestFleurs,
 )
 from sdp.processors.datasets.hifitts2.download_dataset import DownloadHiFiTTS2
-from sdp.processors.datasets.hifitts2.remove_failed_chapters import RemovedFailedChapters
-from sdp.processors.datasets.uzbekvoice.create_initial_manifest import (
-    CreateInitialManifestUzbekvoice,
+from sdp.processors.datasets.hifitts2.remove_failed_chapters import (
+    RemovedFailedChapters,
 )
 from sdp.processors.datasets.ksc2.create_initial_manifest import (
     CreateInitialManifestKSC2,
@@ -44,13 +43,15 @@
     CreateInitialManifestLibrispeech,
 )
 from sdp.processors.datasets.masc import (
-    CreateInitialManifestMASC,
     AggregateSegments,
+    CreateInitialManifestMASC,
+    GetCaptionFileSegments,
     RegExpVttEntries,
-    GetCaptionFileSegments
 )
-from sdp.processors.datasets.mediaspeech.create_initial_manifest import CreateInitialManifestMediaSpeech
 from sdp.processors.datasets.mcv.create_initial_manifest import CreateInitialManifestMCV
+from sdp.processors.datasets.mediaspeech.create_initial_manifest import (
+    CreateInitialManifestMediaSpeech,
+)
 from sdp.processors.datasets.mls.create_initial_manifest import CreateInitialManifestMLS
 from sdp.processors.datasets.mls.restore_pc import RestorePCForMLS
 from sdp.processors.datasets.mtedx.create_initial_manifest import (
@@ -67,18 +68,20 @@
     CreateInitialManifestSLR140,
     CustomDataSplitSLR140,
 )
+from sdp.processors.datasets.uzbekvoice.create_initial_manifest import (
+    CreateInitialManifestUzbekvoice,
+)
 from sdp.processors.datasets.voxpopuli.create_initial_manifest import (
     CreateInitialManifestVoxpopuli,
 )
 from sdp.processors.datasets.voxpopuli.normalize_from_non_pc_text import (
     NormalizeFromNonPCTextVoxpopuli,
 )
-from sdp.processors.datasets.ytc.create_initial_manifest import (
-    CreateInitialManifestYTC,
+from sdp.processors.datasets.ytc.create_initial_manifest import CreateInitialManifestYTC
+from sdp.processors.huggingface.create_initial_manifest import (
+    CreateInitialManifestHuggingFace,
 )
 from sdp.processors.huggingface.speech_recognition import ASRTransformers
-from sdp.processors.huggingface.create_initial_manifest import CreateInitialManifestHuggingFace
-
 from sdp.processors.modify_manifest.common import (
     AddConstantFields,
     ApplyInnerJoin,
@@ -89,7 +92,9 @@
     RenameFields,
     SortManifest,
     SplitOnFixedDuration,
+    Subprocess,
     DropSpecifiedFields,
+
 )
 from sdp.processors.modify_manifest.create_manifest import (
     CreateCombinedManifests,
@@ -104,8 +109,8 @@
     GetWER,
     InsIfASRInsertion,
     InverseNormalizeText,
-    NormalizeText,
     MakeSentence,
+    NormalizeText,
     ReadDocxLines,
     ReadTxtLines,
     SplitLineBySentence,
@@ -130,8 +135,8 @@
     DropLowWordMatchRate,
     DropNonAlphabet,
     DropOnAttribute,
-    PreserveByValue,
     DropRepeatedFields,
+    PreserveByValue,
 )
 from sdp.processors.modify_manifest.make_letters_uppercase_after_period import (
     MakeLettersUppercaseAfterPeriod,
@@ -148,6 +153,7 @@
 )
 from sdp.processors.nemo.asr_inference import ASRInference
 from sdp.processors.nemo.estimate_bandwidth import EstimateBandwidth
+from sdp.processors.nemo.lid_inference import AudioLid
 from sdp.processors.nemo.pc_inference import PCInference
 from sdp.processors.toloka.accept_if import AcceptIfWERLess
 from sdp.processors.toloka.create_pool import CreateTolokaPool

sdp/processors/modify_manifest/common.py

Lines changed: 77 additions & 10 deletions
@@ -14,12 +14,14 @@
 
 import json
 import os
+import subprocess
 from pathlib import Path
-from typing import Dict, List, Union, Optional
+from typing import Dict, List, Optional, Union
 
 import pandas as pd
 from tqdm import tqdm
 
+from sdp.logging import logger
 from sdp.processors.base_processor import (
     BaseParallelProcessor,
     BaseProcessor,
@@ -28,6 +30,71 @@
 )
 from sdp.utils.common import load_manifest
 
+
+class Subprocess(BaseProcessor):
+    """
+    Processor for handling subprocess execution with additional features for managing input and output manifests.
+
+    Args:
+        cmd (str): The command to be executed as a subprocess.
+        input_manifest_arg (str, optional): The argument specifying the input manifest. Defaults to an empty string.
+        output_manifest_arg (str, optional): The argument specifying the output manifest. Defaults to an empty string.
+        arg_separator (str, optional): The separator used between argument and value. Defaults to "=".
+        **kwargs: Additional keyword arguments to be passed to the base class.
+
+    Example:
+
+        _target_: sdp.processors.datasets.commoncrawl.Subprocess
+        output_manifest_file: /workspace/manifest.json
+        input_manifest_arg: "--manifest"
+        output_manifest_arg: "--output_filename"
+        arg_separator: "="
+        cmd: "python /workspace/NeMo-text-processing/nemo_text_processing/text_normalization/normalize_with_audio.py \
+            --language=en --n_jobs=-1 --batch_size=600 --manifest_text_field=text --cache_dir=${workspace_dir}/cache --overwrite_cache \
+            --whitelist=/workspace/NeMo-text-processing/nemo_text_processing/text_normalization/en/data/whitelist/asr_with_pc.tsv"
+
+    """
+
+    def __init__(
+        self,
+        cmd: str,
+        input_manifest_arg: str = "",
+        output_manifest_arg: str = "",
+        arg_separator: str = "=",
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.input_manifest_arg = input_manifest_arg
+        self.output_manifest_arg = output_manifest_arg
+        self.arg_separator = arg_separator
+        self.cmd = cmd
+
+    def process(self):
+        os.makedirs(os.path.dirname(self.output_manifest_file), exist_ok=True)
+        if self.cmd.find(self.input_manifest_file) != -1 or self.cmd.find(self.output_manifest_file) != -1:
+            logger.error(
+                "input_manifest_file "
+                + self.input_manifest_file
+                + " and output_manifest_file "
+                + self.output_manifest_file
+                + " should be excluded from cmd line!"
+            )
+            raise ValueError
+        process_args = [x for x in self.cmd.split(" ") if x]
+        if self.arg_separator == " ":
+            if self.input_manifest_arg:
+                process_args.extend([self.input_manifest_arg, self.input_manifest_file])
+            if self.output_manifest_arg:
+                process_args.extend([self.output_manifest_arg, self.output_manifest_file])
+        else:
+            if self.input_manifest_arg:
+                process_args.extend([self.input_manifest_arg + self.arg_separator + self.input_manifest_file])
+            if self.output_manifest_arg:
+                process_args.extend([self.output_manifest_arg + self.arg_separator + self.output_manifest_file])
+        subprocess.run(" ".join(process_args), shell=True)
+
+
+
 class CombineSources(BaseParallelProcessor):
     """Can be used to create a single field from two alternative sources.
 
@@ -104,24 +171,24 @@ class AddConstantFields(BaseParallelProcessor):
     This processor adds constant fields to all manifest entries using Dask BaseParallelProcessor.
     It is useful when you want to attach fixed information (e.g., a language label or metadata)
     to each entry for downstream tasks such as language identification model training.
-
+
     Args:
         fields (dict): A dictionary containing key-value pairs of fields to add to each manifest entry.
             For example::
-
+
                 {
                     "label": "en",
                     "metadata": "mcv-11.0-2022-09-21"
                 }
-
+
     Returns:
         dict: The same data as in the input manifest with the added constant fields as specified in
         the ``fields`` dictionary.
-
+
     Example:
-
+
         .. code-block:: yaml
-
+
             - _target_: sdp.processors.modify_manifest.common.AddConstantFields
               input_manifest_file: ${workspace_dir}/input_manifest.json
               output_manifest_file: ${workspace_dir}/output_manifest.json
@@ -139,7 +206,6 @@ def process_dataset_entry(self, data_entry: Dict):
         return [DataEntry(data=data_entry)]
 
 
-
 class DuplicateFields(BaseParallelProcessor):
     """This processor duplicates fields in all manifest entries.
 
@@ -154,8 +220,8 @@ class DuplicateFields(BaseParallelProcessor):
 
     Returns:
         The same data as in the input manifest with duplicated fields
-        as specified in the ``duplicate_fields`` input dictionary.
-
+        as specified in the ``duplicate_fields`` input dictionary.
+
     Example:
         .. code-block:: yaml
 
@@ -165,6 +231,7 @@ class DuplicateFields(BaseParallelProcessor):
                 duplicate_fields: {"text":"answer"}
 
     """
+
     def __init__(
         self,
         duplicate_fields: Dict,
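
The way the new `Subprocess.process` method in this file assembles its command line can be shown in isolation. The sketch below is a hypothetical standalone helper (the name `build_command` is not part of the repository) that mirrors the argument handling from the diff: the manifest paths must not already appear in `cmd`, and they are appended either as separate tokens (when the separator is a space) or as `arg<sep>value` pairs. It only builds the string and does not execute anything. The empty-path guard is an addition of this sketch, not something the diff contains.

```python
def build_command(
    cmd: str,
    input_manifest_file: str,
    output_manifest_file: str,
    input_manifest_arg: str = "",
    output_manifest_arg: str = "",
    arg_separator: str = "=",
) -> str:
    """Return the shell command string, following the logic shown in the diff above."""
    # Manifest paths are injected by the helper, so they must not be hard-coded in cmd.
    # Assumption of this sketch: empty paths are skipped in the check.
    for path in (input_manifest_file, output_manifest_file):
        if path and path in cmd:
            raise ValueError(f"{path} should be excluded from cmd")

    # Split on single spaces and drop empty tokens, as the original method does.
    args = [token for token in cmd.split(" ") if token]

    pairs = [
        (input_manifest_arg, input_manifest_file),
        (output_manifest_arg, output_manifest_file),
    ]
    for name, value in pairs:
        if not name:
            continue  # an empty argument name means "do not pass this manifest"
        if arg_separator == " ":
            args.extend([name, value])
        else:
            args.append(name + arg_separator + value)
    return " ".join(args)


if __name__ == "__main__":
    print(
        build_command(
            "python vad.py audio_type=wav",
            "/data/sdp/manifest4.json",
            "/data/sdp/vad",
            input_manifest_arg="manifest_filepath",
            output_manifest_arg="output_dir",
        )
    )
```

With the Hydra-style `key=value` arguments used by the VAD step in the Portuguese config, the default `=` separator applies. A script that expects `--flag value` pairs would pass `arg_separator=" "` instead.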
