Commit accfa81

arhamm1, lbliii, and greptile-apps[bot] authored
Update curate-audio/process-data/text-integration/index.md (#1247)
* Update curate-audio/process-data/text-integration/index.md

  Signed-off-by: Arham Mehta <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

* Update docs/curate-audio/process-data/text-integration/index.md

  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: L.B. <[email protected]>

---------

Signed-off-by: Arham Mehta <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: L.B. <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
1 parent a74cdcc commit accfa81

1 file changed: +72 -12 lines changed
  • docs/curate-audio/process-data/text-integration

docs/curate-audio/process-data/text-integration/index.md

Lines changed: 72 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -15,16 +15,23 @@ Convert processed audio data from `AudioBatch` to `DocumentBatch` format using t
1515

1616
## How it Works
1717

18-
The `AudioToDocumentStage` provides basic format conversion:
18+
The `AudioToDocumentStage` provides straightforward format conversion between NeMo Curator's audio and text data structures:
1919

2020
1. **Format Conversion**: Transform `AudioBatch` objects to `DocumentBatch` format
2121
2. **Metadata Preservation**: All fields from the audio data are preserved in the conversion
2222
3. **Export Ready**: Convert audio processing results to pandas DataFrame format for analysis or export
2323

24+
**Common use cases:**
25+
- Export ASR results and quality metrics for analysis
26+
- Save filtered audio datasets with transcriptions
27+
- Integrate audio processing outputs with downstream text workflows
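Conceptually, the conversion turns a batch of per-sample records into one table. The following is a minimal sketch of that idea using plain pandas only; it is not the `AudioToDocumentStage` implementation, and the sample records are hypothetical.

```python
import pandas as pd

# Hypothetical audio records, shaped like manifest entries (one dict per sample).
audio_records = [
    {"audio_filepath": "/data/audio/sample1.wav", "text": "hello world", "duration": 1.5},
    {"audio_filepath": "/data/audio/sample2.wav", "text": "test audio", "duration": 2.1},
]

# Each record becomes one row; each key becomes one column, with names unchanged.
df = pd.DataFrame(audio_records)

print(list(df.columns))
print(len(df))
```

The resulting DataFrame has one row per audio sample, which is the shape that export and analysis tools generally expect.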
28+
2429
## Basic Conversion
2530

2631
### AudioBatch to DocumentBatch
2732

33+
Use `AudioToDocumentStage` to convert audio processing results to document format:
34+
2835
```python
2936
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
3037
from nemo_curator.tasks import AudioBatch
@@ -51,6 +58,12 @@ document_batch = document_batches[0]
5158
print(f"Converted {len(document_batch.data)} audio records to DocumentBatch")
5259
```
5360

61+
**Parameters:**
62+
- `AudioToDocumentStage()` has no configuration parameters; it performs direct format conversion
63+
64+
**Returns:**
65+
- List of `DocumentBatch` objects containing a pandas DataFrame with all original audio fields
66+
5467
### What Gets Preserved
5568

5669
The conversion preserves all fields from your audio processing pipeline:
@@ -65,10 +78,16 @@ The conversion preserves all fields from your audio processing pipeline:
6578
# - Any other metadata fields you've added
6679
```
6780

81+
:::{note}
82+
Field names and values are preserved exactly as they appear in the `AudioBatch`. No data transformation or cleaning is performed during conversion.
83+
:::
84+
6885
## Integration in Pipelines
6986

7087
### Complete Audio Processing with Export
7188

89+
The most common use case is adding `AudioToDocumentStage` at the end of your audio pipeline to enable result export:
90+
7291
```python
7392
from nemo_curator.pipeline import Pipeline
7493
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
@@ -80,24 +99,58 @@ from nemo_curator.stages.text.io.writer import JsonlWriter
8099
# Create pipeline that processes audio and exports results
81100
pipeline = Pipeline(name="audio_processing_with_export")
82101

83-
# Audio processing stages
84-
pipeline.add_stage(InferenceAsrNemoStage(model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"))
85-
pipeline.add_stage(GetPairwiseWerStage(text_key="text", pred_text_key="pred_text"))
86-
pipeline.add_stage(GetAudioDurationStage(audio_filepath_key="audio_filepath", duration_key="duration"))
87-
88-
# Convert to DocumentBatch for export
102+
# 1. Load audio data
103+
pipeline.add_stage(CreateInitialManifestFleursStage(
104+
lang="en_us",
105+
split="test",
106+
raw_data_dir="./audio_data"
107+
).with_(batch_size=8))
108+
109+
# 2. Run ASR inference
110+
pipeline.add_stage(InferenceAsrNemoStage(
112+
model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
113+
pred_text_key="pred_text"
114+
).with_(resources=Resources(gpus=1.0)))
115+
116+
# 3. Calculate quality metrics
117+
pipeline.add_stage(GetPairwiseWerStage(
119+
text_key="text",
120+
pred_text_key="pred_text",
121+
wer_key="wer"
122+
))
123+
pipeline.add_stage(GetAudioDurationStage(
124+
audio_filepath_key="audio_filepath",
125+
duration_key="duration"
126+
))
127+
128+
# 4. Convert to DocumentBatch for export
89130
pipeline.add_stage(AudioToDocumentStage())
90131

91-
# Export results
132+
# 5. Export to JSONL format
92133
pipeline.add_stage(JsonlWriter(path="/output/processed_audio_results"))
134+
135+
# Execute pipeline
136+
executor = XennaExecutor()
137+
pipeline.run(executor)
138+
```
139+
140+
**Output format:** The `JsonlWriter` creates a JSONL file where each line contains one audio sample with all fields:
141+
142+
```json
143+
{"audio_filepath": "/data/audio/sample1.wav", "text": "hello world", "pred_text": "hello world", "wer": 0.0, "duration": 1.5}
144+
{"audio_filepath": "/data/audio/sample2.wav", "text": "test audio", "pred_text": "test odio", "wer": 50.0, "duration": 2.1}
93145
```
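Once the results are in JSONL form, they can be inspected with the Python standard library alone. The sketch below parses the two sample lines shown above from an in-memory string and computes simple summary statistics; the `read_jsonl` helper is a hypothetical name, and a real run would open the file produced by the pipeline instead.

```python
import io
import json

# The two sample lines from the output format above, held in memory for illustration.
jsonl_text = (
    '{"audio_filepath": "/data/audio/sample1.wav", "text": "hello world", '
    '"pred_text": "hello world", "wer": 0.0, "duration": 1.5}\n'
    '{"audio_filepath": "/data/audio/sample2.wav", "text": "test audio", '
    '"pred_text": "test odio", "wer": 50.0, "duration": 2.1}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(io.StringIO(jsonl_text))

# Summary statistics over the exported fields.
mean_wer = sum(r["wer"] for r in records) / len(records)
total_duration = sum(r["duration"] for r in records)

print(f"{len(records)} samples, mean WER {mean_wer:.1f}, total duration {total_duration:.1f}s")
```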
94146

95147
## Custom Integration
96148

97-
If you need to apply text processing to your ASR transcriptions, you will need to implement custom stages. The `AudioToDocumentStage` provides the foundation for this by converting to the standard `DocumentBatch` format.
149+
While `AudioToDocumentStage` converts audio data to `DocumentBatch` format, NeMo Curator's built-in text processing stages (filters, classifiers, etc.) are designed for text documents, not audio transcriptions. For audio-specific text processing, implement custom stages that operate on the converted `DocumentBatch` data.
98150

99151
### Example: Custom Text Processing
100152

153+
101154
```python
102155
from nemo_curator.stages.function_decorators import processing_stage
103156
from nemo_curator.tasks import DocumentBatch
@@ -141,12 +194,19 @@ document_batch.data # pandas DataFrame with columns:
141194

142195
## Limitations
143196

144-
:::{note}
145-
**Text Processing Integration**: NeMo Curator's text processing stages are designed for `DocumentBatch` inputs, but they may not be optimized for audio-derived transcriptions. You may need to implement custom processing for audio-specific workflows.
197+
:::{important}
198+
**Text Processing Integration**: NeMo Curator's text processing stages accept `DocumentBatch` inputs, but they are designed for text documents such as articles and web pages, not for audio-derived transcriptions. Implement custom processing stages for audio-specific workflows.
199+
200+
**Reasons for incompatibility:**
201+
- Text filters assume document-level content (e.g., paragraph structure, word count thresholds designed for articles)
202+
- ASR transcriptions have different characteristics (shorter, may contain recognition errors, conversational language)
203+
- Audio-specific metrics (WER, duration, speech rate) require custom filtering logic
204+
205+
**Recommendation:** Use `PreserveByValueStage` for audio quality filtering, or create custom stages for transcription-specific processing.
146206
:::
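As an illustration of transcription-specific filtering, the sketch below applies WER and duration thresholds to a pandas DataFrame of the kind held in `DocumentBatch.data`. It uses plain pandas only and does not call `PreserveByValueStage`; the function name `filter_transcriptions`, the threshold values, and the sample rows are assumptions made for this example.

```python
import pandas as pd


def filter_transcriptions(
    df: pd.DataFrame, max_wer: float = 30.0, min_duration: float = 1.0
) -> pd.DataFrame:
    """Keep rows with WER at most max_wer and duration of at least min_duration seconds."""
    keep = (df["wer"] <= max_wer) & (df["duration"] >= min_duration)
    return df[keep].reset_index(drop=True)


# Hypothetical converted records with the fields produced earlier in the pipeline.
df = pd.DataFrame(
    [
        {"audio_filepath": "/data/audio/sample1.wav", "wer": 0.0, "duration": 1.5},
        {"audio_filepath": "/data/audio/sample2.wav", "wer": 50.0, "duration": 2.1},
        {"audio_filepath": "/data/audio/sample3.wav", "wer": 10.0, "duration": 0.4},
    ]
)

filtered = filter_transcriptions(df)
print(filtered)
```

The same boolean-mask logic could be placed inside a custom stage that receives a `DocumentBatch`; suitable thresholds depend on the dataset and are not prescribed here.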
147207

148208
## Related Topics
149209

150210
- **[Audio Processing Overview](../index.md)** - Complete audio processing workflow
151211
- **[Quality Assessment](../quality-assessment/index.md)** - Audio quality metrics and filtering
152-
- **[ASR Inference](../asr-inference/index.md)** - Speech recognition processing
212+
- **[ASR Inference](../asr-inference/index.md)** - Speech recognition processing
