- **Substitutions**: Words incorrectly replaced (for example, "cat" → "hat")
- **Deletions**: Words omitted from the prediction
- **Insertions**: Extra words added to the prediction
- **Total_Reference_Words**: Total word count in ground truth transcription

A lower WER indicates higher transcription accuracy.
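
To make the error counts concrete, here is a minimal sketch of a WER calculation. It assumes the standard definition, (substitutions + deletions + insertions) / total reference words, expressed as a percentage, and uses whitespace tokenization. The function name `word_error_rate` is hypothetical, and this is not NeMo Curator's implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level edit distance.

    Illustrative sketch only: assumes whitespace tokenization and no
    text normalization (casing, punctuation) before comparison.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum edits between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from prediction
            insertion = curr[j - 1] + 1  # extra word in prediction
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("cat" -> "hat") and one deletion ("the") over 6 reference words
print(word_error_rate("the cat sat on the mat", "the hat sat on mat"))
```

Because insertions count as errors but do not increase the reference word count, WER can exceed 100% for very poor predictions.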

### WER Quality Levels

The following table provides general guidelines for interpreting WER values. Adjust thresholds based on your specific domain requirements and use case:

| WER Range | Quality Level | Recommended Use |
|-----------|---------------|-----------------|
| 0-10% | Excellent | Production ASR training, high-quality datasets |
| 10-25% | Good | General ASR training, most applications |

- `input_value_key`: Field containing WER values (matches `wer_key` from previous stage)
- `target_value`: WER threshold (percentage as float, e.g., `30.0` for 30%)
- `operator`: Comparison operator (`"le"` for ≤, `"lt"` for <, `"ge"` for ≥, `"gt"` for >)

The stage preserves samples meeting the threshold criteria and filters out others.
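
The comparison semantics can be sketched in plain Python. This is an illustration of the behavior described above, not NeMo Curator's code: the helper `preserve_by_value` and the `COMPARATORS` table are hypothetical names, and samples are assumed to be plain dictionaries.

```python
import operator

# Assumed mapping from the operator strings to Python comparisons
COMPARATORS = {
    "le": operator.le,  # value <= target
    "lt": operator.lt,  # value <  target
    "ge": operator.ge,  # value >= target
    "gt": operator.gt,  # value >  target
}


def preserve_by_value(samples, input_value_key, target_value, op="le"):
    """Keep samples whose value at input_value_key satisfies the comparison."""
    compare = COMPARATORS[op]
    return [s for s in samples if compare(s[input_value_key], target_value)]


samples = [{"wer": 10.0}, {"wer": 30.0}, {"wer": 55.0}]

# "le" keeps the sample sitting exactly on the threshold; "lt" drops it
kept_le = preserve_by_value(samples, "wer", 30.0, "le")
kept_lt = preserve_by_value(samples, "wer", 30.0, "lt")
print(len(kept_le), len(kept_lt))
```

For WER filtering, `"le"` or `"lt"` is the usual choice, since lower WER means better agreement between the prediction and the reference.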

## Advanced WER Filtering

### Statistical WER Filtering

Rather than using fixed thresholds, you can analyze your dataset's WER distribution to determine optimal filtering thresholds. This approach is useful when working with domain-specific data or evaluating data quality.

**Workflow:**

1. Calculate WER for all samples using `GetPairwiseWerStage`
2. Export results and analyze WER distribution (mean, median, percentiles)
3. Determine threshold based on your quality requirements (for example, keep samples below 75th percentile)
4. Apply the calculated threshold using `PreserveByValueStage`
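
Steps 2 and 3 can be sketched with the Python standard library. This is a minimal example under stated assumptions: the exported records are JSONL with the WER stored under a `"wer"` field, and the helper names `load_wer_values` and `wer_threshold` are hypothetical. The resulting value plays the role of `calculated_threshold` in step 4.

```python
import json
import statistics


def load_wer_values(jsonl_lines, wer_key="wer"):
    """Parse WER values from JSONL records (one JSON object per line)."""
    return [float(json.loads(line)[wer_key]) for line in jsonl_lines if line.strip()]


def wer_threshold(wer_values, percentile=75):
    """Return the given percentile (1-99) of the WER distribution."""
    # quantiles(n=100) returns the 99 cut points between percentiles
    cut_points = statistics.quantiles(wer_values, n=100, method="inclusive")
    return cut_points[percentile - 1]


# In practice, iterate over the lines of your exported JSONL file instead
lines = ['{"wer": 10.0}', '{"wer": 20.0}', '{"wer": 30.0}', '{"wer": 40.0}', '{"wer": 50.0}']
wer_values = load_wer_values(lines)

print("mean:", statistics.mean(wer_values))
print("median:", statistics.median(wer_values))

calculated_threshold = wer_threshold(wer_values, percentile=75)
print("75th percentile:", calculated_threshold)
```

A percentile-based threshold adapts to the dataset: it always keeps roughly the same share of samples, whatever the absolute WER values are.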

```python
statistical_filter = PreserveByValueStage(
    input_value_key="wer",  # Assumed field name; must match `wer_key` from the WER stage
    target_value=calculated_threshold,  # From your statistical analysis
    operator="le"
)

pipeline.add_stage(statistical_filter)
```

:::{tip}
Use `AudioToDocumentStage` and `JsonlWriter` to export WER values for analysis in tools like pandas, numpy, or visualization libraries.
:::

## Domain-Specific WER Filtering

Different speech domains have varying acoustic characteristics and transcription complexity. Adjust WER thresholds based on your specific domain:

### Conversational Speech

Conversational speech typically has higher WER due to informal language, disfluencies, overlapping speech, and background noise. Use more lenient thresholds:

```python
# More lenient thresholds for conversational speech
```

- **Start with lenient thresholds**: Begin with higher WER thresholds (for example, 50%) and progressively tighten based on dataset size and quality requirements.
- **Consider domain characteristics**: Adjust thresholds based on speech type (conversational, broadcast, or read speech).
- **Analyze before filtering**: Export WER distributions to understand your data before applying aggressive filters.
- **Balance quality and quantity**: Stricter thresholds improve data quality but reduce dataset size; find the right balance for your use case.
- **Check ASR model**: Ensure your ASR model is appropriate for the language and domain before using WER for filtering.

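
The quality and quantity trade-off can be checked before committing to a threshold. The sketch below reports the share of samples each candidate threshold would keep, assuming you have already exported WER values as a list of floats. The helper `retention_by_threshold` is a hypothetical name and the sample values are made up for illustration.

```python
def retention_by_threshold(wer_values, thresholds):
    """Map each WER threshold to the fraction of samples with WER <= threshold."""
    if not wer_values:
        raise ValueError("wer_values must not be empty")
    total = len(wer_values)
    return {t: sum(w <= t for w in wer_values) / total for t in thresholds}


# Made-up WER percentages, for illustration only
wers = [5.0, 12.0, 18.0, 30.0, 45.0, 60.0, 80.0, 95.0]

# Start lenient (50%) and tighten step by step
retention = retention_by_threshold(wers, [50.0, 30.0, 15.0])
for threshold, kept in retention.items():
    print(f"WER <= {threshold}%: keeps {kept:.1%} of samples")
```

If a tighter threshold removes far more data than expected, that is a cue to revisit the ASR model or the domain assumptions before filtering further.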
## Related Topics

- **[Quality Assessment Overview](index.md)** - Complete guide to audio quality assessment
- **[Duration Filtering](duration-filtering.md)** - Filter by audio length and speech rate
- **[ASR Inference](../asr-inference/index.md)** - Generate ASR predictions for WER calculation