Commit 52b4fce (1 parent: 57f9666)

Update wer-filtering.md (#1246)

* Update wer-filtering.md — Signed-off-by: Arham Mehta <[email protected]>
* style guide minor edits — Signed-off-by: Lawrence Lane <[email protected]>

Co-authored-by: L.B. <[email protected]>

1 file changed: docs/curate-audio/process-data/quality-assessment/wer-filtering.md (+135, −6 lines)

### What is Word Error Rate?

Word Error Rate (WER) measures transcription accuracy by calculating the percentage of words that differ between ground truth and ASR predictions:

```text
WER = (Substitutions + Deletions + Insertions) / Total_Reference_Words × 100
```

**Components:**

- **Substitutions**: Words incorrectly replaced (for example, "cat" → "hat")
- **Deletions**: Words omitted from the prediction
- **Insertions**: Extra words added to the prediction
- **Total_Reference_Words**: Total word count in the ground truth transcription

A lower WER indicates higher transcription accuracy.

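The formula above can be sketched in plain Python using word-level edit distance (a minimal illustration; in a NeMo Curator pipeline this computation is handled by `GetPairwiseWerStage`):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER (%) via word-level edit distance. Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[len(ref)][len(hyp)] / len(ref) * 100

# One substitution ("hat" for "cat") in a 4-word reference -> 25.0
print(word_error_rate("the cat sat down", "the hat sat down"))
```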
### WER Quality Levels

The following table provides general guidelines for interpreting WER values. Adjust thresholds based on your specific domain requirements and use case:

| WER Range | Quality Level | Recommended Use |
|-----------|---------------|-----------------|
| 0-10% | Excellent | Production ASR training, high-quality datasets |
| 10-25% | Good | General ASR training, most applications |
| 25-50% | Moderate | Supplementary training data, domain adaptation |
| 50-75% | Poor | Review required, potential filtering |
| 75%+ | Very Poor | Strong candidate for removal |

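The table can be folded into a small helper for labeling samples (an illustrative snippet, not part of NeMo Curator; boundary values are assigned to the higher-quality band):

```python
def wer_quality_level(wer: float) -> str:
    """Map a WER percentage to the quality levels in the table above."""
    if wer < 10:
        return "Excellent"
    if wer < 25:
        return "Good"
    if wer < 50:
        return "Moderate"
    if wer < 75:
        return "Poor"
    return "Very Poor"

print(wer_quality_level(18.2))  # -> Good
```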
## Basic WER Filtering

Follow these steps to calculate WER values and apply threshold-based filtering to your audio dataset.

### Step 1: Calculate WER

Use `GetPairwiseWerStage` to compute WER between ground truth transcriptions and ASR model predictions:

```python
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# Compute WER between ground truth ("text") and ASR predictions ("pred_text")
wer_stage = GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer",
)

pipeline.add_stage(wer_stage)
```


**Parameters:**

- `text_key`: Field name containing ground truth transcriptions in your manifest
- `pred_text_key`: Field name containing ASR predictions (from `InferenceAsrNemoStage` or similar)
- `wer_key`: Field name to store calculated WER values (default: `"wer"`)

**Prerequisites:** Your audio manifest must contain both ground truth transcriptions and ASR predictions before calculating WER.

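For instance, a manifest entry that satisfies these prerequisites might look like the following (all field values here are hypothetical):

```python
import json

# Hypothetical manifest entry: ground truth in "text", ASR output in "pred_text"
entry = {
    "audio_filepath": "clips/sample_0001.wav",
    "duration": 3.2,
    "text": "turn the lights off",
    "pred_text": "turn the light off",
}

# Manifests are stored as one JSON object per line (JSONL)
print(json.dumps(entry))
```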
### Step 2: Apply WER Threshold

Use `PreserveByValueStage` to filter audio samples based on the calculated WER values:

```python
from nemo_curator.stages.audio.common import PreserveByValueStage

# Keep only samples with WER at or below 30%
wer_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le",
)

pipeline.add_stage(wer_filter)
```


**Parameters:**

- `input_value_key`: Field containing WER values (matches `wer_key` from the previous stage)
- `target_value`: WER threshold (a percentage as a float, for example `30.0` for 30%)
- `operator`: Comparison operator (`"le"` for ≤, `"lt"` for <, `"ge"` for ≥, `"gt"` for >)

The stage preserves samples that meet the threshold criterion and filters out the rest.

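The operator names mirror Python's `operator` module, so the preserve-or-filter decision can be sketched as follows (a simplified illustration of the behavior, not the stage's actual implementation):

```python
import operator

# "le"/"lt"/"ge"/"gt" map to the standard comparison functions
OPS = {"le": operator.le, "lt": operator.lt, "ge": operator.ge, "gt": operator.gt}

def preserve(wer: float, target: float, op: str = "le") -> bool:
    """Return True if a sample with this WER should be kept."""
    return OPS[op](wer, target)

print(preserve(25.0, 30.0, "le"))  # True: 25 <= 30, sample kept
print(preserve(45.0, 30.0, "le"))  # False: sample filtered out
```

Using `"ge"` instead inverts the selection, which is handy for pulling out high-WER samples for manual review rather than discarding them.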
## Advanced WER Filtering

### Statistical WER Filtering

Rather than using fixed thresholds, you can analyze your dataset's WER distribution to determine an optimal filtering threshold. This approach is useful when working with domain-specific data or evaluating overall data quality.

**Workflow:**

1. Calculate WER for all samples using `GetPairwiseWerStage`.
2. Export the results and analyze the WER distribution (mean, median, percentiles).
3. Determine a threshold based on your quality requirements (for example, keep samples below the 75th percentile).
4. Apply the calculated threshold using `PreserveByValueStage`.

**Example:**

```python
# Apply the calculated statistical threshold
statistical_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=calculated_threshold,  # From your statistical analysis
    operator="le",
)

pipeline.add_stage(statistical_filter)
```

:::{tip}
Use `AudioToDocumentStage` and `JsonlWriter` to export WER values for analysis in tools like pandas, NumPy, or visualization libraries.
:::

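Assuming the WER values were exported to a JSONL manifest, the percentile-based threshold from the workflow above can be computed with NumPy (the helper name and file path below are hypothetical):

```python
import json

import numpy as np

def percentile_threshold(jsonl_path: str, pct: float = 75.0, wer_key: str = "wer") -> float:
    """Compute a WER threshold at the given percentile of an exported JSONL manifest."""
    with open(jsonl_path) as f:
        wers = [json.loads(line)[wer_key] for line in f if line.strip()]
    return float(np.percentile(np.asarray(wers), pct))

# calculated_threshold = percentile_threshold("filtered_audio/manifest.jsonl")
```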
## Domain-Specific WER Filtering

Different speech domains have varying acoustic characteristics and transcription complexity. Adjust WER thresholds based on your specific domain.

### Conversational Speech

Conversational speech typically has higher WER due to informal language, disfluencies, overlapping speech, and background noise. Use more lenient thresholds:

```python
# More lenient thresholds for conversational speech
# (threshold value is illustrative; tune it for your dataset)
conversational_wer_config = {
    "good_threshold": 40.0,
}

conversational_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=conversational_wer_config["good_threshold"],
    operator="le",
)

pipeline.add_stage(conversational_filter)
```

**Use cases:** Call center recordings, meeting transcriptions, casual interviews, social media audio

### Broadcast and News

Broadcast speech features professional speakers, controlled environments, and clear articulation, enabling stricter quality standards:

```python
# Stricter thresholds for high-quality broadcast speech
# (threshold value is illustrative; tune it for your dataset)
broadcast_wer_config = {
    "good_threshold": 15.0,
}

broadcast_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=broadcast_wer_config["good_threshold"],
    operator="le",
)

pipeline.add_stage(broadcast_filter)
```

**Use cases:** News broadcasts, audiobooks, podcasts, prepared presentations, voiceovers

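If you curate several domains in one codebase, the per-domain thresholds can live in a single lookup (a hypothetical convenience pattern, not a NeMo Curator API; the values are illustrative):

```python
# Hypothetical per-domain WER thresholds (values are illustrative)
DOMAIN_THRESHOLDS = {
    "conversational": 40.0,
    "broadcast": 15.0,
}

def wer_threshold_for(domain: str, default: float = 30.0) -> float:
    """Look up a filtering threshold for a speech domain, with a fallback."""
    return DOMAIN_THRESHOLDS.get(domain, default)

print(wer_threshold_for("broadcast"))     # 15.0
print(wer_threshold_for("read_speech"))   # unknown domain -> 30.0 fallback
```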
## Complete WER Filtering Example

Here's a complete pipeline demonstrating WER calculation and filtering:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources

# Create WER filtering pipeline
pipeline = Pipeline(name="wer_filtering")

# 1. Load audio data with ground truth transcriptions
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="en_us",
    split="validation",
    raw_data_dir="./audio_data"
).with_(batch_size=8))

# 2. Run ASR inference to generate predictions
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    pred_text_key="pred_text"
).with_(resources=Resources(gpus=1.0)))

# 3. Calculate WER
pipeline.add_stage(GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
))

# 4. Filter by WER threshold (keep WER ≤ 30%)
pipeline.add_stage(PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le"
))

# 5. Export filtered results
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./filtered_audio"))

# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)
```

## Best Practices

- **Start with lenient thresholds**: Begin with higher WER thresholds (for example, 50%) and progressively tighten them based on dataset size and quality requirements.
- **Consider domain characteristics**: Adjust thresholds based on speech type (conversational versus broadcast versus read speech).
- **Analyze before filtering**: Export WER distributions to understand your data before applying aggressive filters.
- **Balance quality and quantity**: Stricter thresholds improve data quality but reduce dataset size; find the right balance for your use case.
- **Check your ASR model**: Ensure the ASR model is appropriate for the language and domain before using WER for filtering.

## Related Topics

- **[Quality Assessment Overview](index.md)** - Complete guide to audio quality assessment
- **[Duration Filtering](duration-filtering.md)** - Filter by audio length and speech rate
- **[ASR Inference](../asr-inference/index.md)** - Generate ASR predictions for WER calculation
