Commit 9cdf9ef (parent: 2d014a3)

add more pages

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

3 files changed: +198 −119 lines

**docs/about/release-notes/migration-guide.md** (1 addition, 1 deletion)
````diff
@@ -15,7 +15,7 @@ modality: "universal"
 This guide explains how to transition existing Dask-based NeMo Curator workflows to the new Ray-based pipeline architecture.
 
 ```{seealso}
-For broader NeMo Framework migration topics, refer to the [NeMo Framework 2.0 Migration Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/migration/index.html).
+For broader NeMo Framework migration topics, refer to the [NeMo Framework 2.0 Migration Guide](https://docs.nvidia.com/nemo-framework/user-guide/25.11/nemo-2.0/migration/index.html).
 ```
 
 ## Overview
````

**docs/curate-text/process-data/quality-assessment/heuristic.md** (161 additions, 76 deletions)
````diff
@@ -147,6 +147,8 @@ NeMo Curator includes more than 30 heuristic filters for assessing document qual
 |--------|-------------|----------------|---------------|
 | **RepeatedLinesFilter** | Detects repeated lines | `max_repeated_line_fraction` | 0.7 |
 | **RepeatedParagraphsFilter** | Detects repeated paragraphs | `max_repeated_paragraphs_ratio` | 0.7 |
+| **RepeatedLinesByCharFilter** | Detects repeated lines by character count | `max_repeated_lines_char_ratio` | 0.8 |
+| **RepeatedParagraphsByCharFilter** | Detects repeated paragraphs by character count | `max_repeated_paragraphs_char_ratio` | 0.8 |
 | **RepeatingTopNGramsFilter** | Detects excessive repetition of n-grams | `n`, `max_repeating_ngram_ratio` | n=2, ratio=0.2 |
 | **RepeatingDuplicateNGramsFilter** | Detects duplicate n-grams | `n`, `max_repeating_duplicate_ngram_ratio` | n=2, ratio=0.2 |
````
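The repetition filters added in the hunk above all quantify how much of a document is taken up by duplicated content. As a rough, dependency-free illustration of the idea behind `RepeatingTopNGramsFilter` (a simplified sketch, not the library's actual scoring), the share of word n-grams claimed by the single most frequent n-gram can be computed like this:

```python
from collections import Counter

def top_ngram_repetition_ratio(text: str, n: int = 2) -> float:
    """Simplified sketch: fraction of all word n-grams accounted for by the
    single most frequent n-gram. Not NeMo Curator's exact definition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    _, top_count = Counter(ngrams).most_common(1)[0]
    return top_count / len(ngrams)

# A document dominated by one repeated phrase scores high;
# ordinary prose stays low.
print(top_ngram_repetition_ratio("buy now " * 20, n=2))  # ≈ 0.51
print(top_ngram_repetition_ratio("the quick brown fox jumps over the lazy dog", n=2))  # 0.125
```

Under this simplified measure, spammy text lands well above the default `max_repeating_ngram_ratio` of 0.2, while normal prose stays below it.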

````diff
@@ -178,39 +180,46 @@ NeMo Curator includes more than 30 heuristic filters for assessing document qual
 | **PornographicUrlsFilter** | Detects URLs containing "porn" substring | None | N/A |
 | **EllipsisFilter** | Limits excessive ellipses | `max_num_lines_ending_with_ellipsis_ratio` | 0.3 |
 | **HistogramFilter** | Filters based on character distribution | `threshold` | 0.8 |
-| **SubstringFilter** | Filters based on presence of specific substring in a position | `substring`, `position` | "", "any" |
+| **SubstringFilter** | Filters based on presence of specific substring in a position | `substring`, `position` | N/A (required) |
 
 ## Configuration
 
+NeMo Curator pipelines can be configured using YAML files with [Hydra](https://hydra.cc/). The configuration uses `_target_` to specify class paths:
+
 ::::{tab-set}
 
-:::{tab-item} Example Configuration
+:::{tab-item} Hydra Configuration
 ```yaml
-# Sample filter configuration (simplified)
-filters:
-  - name: ScoreFilter
-    filter:
-      name: WordCountFilter
+# Hydra-based pipeline configuration
+input_path: /path/to/input
+output_path: /path/to/output
+text_field: text
+
+stages:
+  - _target_: nemo_curator.stages.text.io.reader.JsonlReader
+    file_paths: ${input_path}
+    fields: null
+
+  - _target_: nemo_curator.stages.text.modules.score_filter.ScoreFilter
+    filter_obj:
+      _target_: nemo_curator.stages.text.filters.heuristic_filter.WordCountFilter
       min_words: 50
       max_words: 100000
-    text_field: text
+    text_field: ${text_field}
     score_field: word_count
 
-  - name: ScoreFilter
-    filter:
-      name: PunctuationFilter
+  - _target_: nemo_curator.stages.text.modules.score_filter.ScoreFilter
+    filter_obj:
+      _target_: nemo_curator.stages.text.filters.heuristic_filter.PunctuationFilter
       max_num_sentences_without_endmark_ratio: 0.85
-    text_field: text
-    score_field: punctuation_ratio
-
-  - name: ScoreFilter
-    filter:
-      name: RepeatingTopNGramsFilter
-      n: 2
-      max_repeating_ngram_ratio: 0.18
-    text_field: text
-    score_field: ngram_repetition
+    text_field: ${text_field}
+    score_field: null
+
+  - _target_: nemo_curator.stages.text.io.writer.JsonlWriter
+    path: ${output_path}
 ```
+
+See `nemo_curator/config/text/` for complete pipeline examples.
 :::
 
 ::::
````
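The `_target_` keys in the new configuration are resolved into class instances by Hydra at run time. As a rough sketch of how that resolution works — using only stdlib `importlib` and a stdlib class (`fractions.Fraction`) as a stand-in, not the NeMo Curator stages or Hydra's real `hydra.utils.instantiate`, which additionally handles interpolation, partials, and more:

```python
import importlib

def instantiate(config: dict):
    """Minimal sketch of Hydra-style `_target_` resolution: import the dotted
    path, then call the target with the remaining keys as keyword arguments."""
    cfg = dict(config)
    module_path, _, class_name = cfg.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    # Recursively instantiate nested configs (e.g. `filter_obj` blocks).
    kwargs = {
        key: instantiate(value) if isinstance(value, dict) and "_target_" in value else value
        for key, value in cfg.items()
    }
    return cls(**kwargs)

# Hypothetical example with a stdlib class standing in for a pipeline stage:
obj = instantiate({"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4})
print(obj)  # 3/4
```

This is the same mechanism that turns each `- _target_: ...` entry in the YAML above into a configured stage object.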
````diff
@@ -241,23 +250,7 @@ pipeline.add_stage(ScoreFilter(filter_obj=RepeatingTopNGramsFilter(), text_field
 :::
 
 :::{tab-item} Performance Tuning
-```python
-# Optimize filter performance with proper configuration
-
-# Configure executor for better performance
-executor_config = {
-    "execution_mode": "streaming",
-    "cpu_allocation_percentage": 0.95,
-    "logging_interval": 30
-}
-
-# Use custom executor configuration when needed
-executor = XennaExecutor(config=executor_config)
-results = pipeline.run(executor)
-
-# Or use default configuration
-# results = pipeline.run()
-```
+See the [Performance Tuning](#performance-tuning) section below for executor configuration examples using Xenna or Ray backends.
 :::
 
 :::{tab-item} Precision vs. Recall
````
````diff
@@ -321,26 +314,126 @@ quality_pipeline.add_stage(ScoreFilter(
 
 ## Analyzing Filter Results
 
-When working with non-English data or tuning your filtering pipeline, it's valuable to examine which filters are removing documents:
+When tuning filter thresholds, analyze score distributions before applying filters. NeMo Curator provides two modules for this workflow:
+
+- **`Score`**: Computes scores and adds them as columns without removing documents
+- **`ScoreFilter`**: Computes scores, filters based on thresholds, and optionally retains scores in output
+
+Use `Score` first to understand your data distribution, then apply `ScoreFilter` with tuned thresholds.
 
 ::::{tab-set}
 
-:::{tab-item} Filter Analysis
+:::{tab-item} Score Without Filtering
+Use `Score` to add score columns to your data without removing any documents:
+
 ```python
-import pandas as pd
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.text.io.reader import JsonlReader
+from nemo_curator.stages.text.io.writer import JsonlWriter
+from nemo_curator.stages.text.modules import Score
+from nemo_curator.stages.text.filters import WordCountFilter, RepeatingTopNGramsFilter
 
-# Load scores from filter run
-scores = pd.read_json("output/scores/scores.jsonl", lines=True)
+# Create scoring pipeline (no filtering)
+pipeline = Pipeline(name="score_analysis")
 
-# Analyze rejection reasons
-rejection_counts = scores[scores["rejected"] == True].groupby("rejected_by").size()
-print(f"Documents rejected by filter:\n{rejection_counts}")
+# Load data
+pipeline.add_stage(JsonlReader(file_paths="input_data/", fields=["text", "id"]))
 
-# Analyze score distributions
+# Add scores without filtering
+pipeline.add_stage(Score(
+    score_fn=WordCountFilter(min_words=80),
+    text_field="text",
+    score_field="word_count"
+))
+pipeline.add_stage(Score(
+    score_fn=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
+    text_field="text",
+    score_field="ngram_ratio"
+))
+
+# Write scored data (all documents preserved)
+pipeline.add_stage(JsonlWriter(path="scored_output/"))
+
+pipeline.run()
+```
+
+Output files are written to the `scored_output/` directory with one file per input partition.
+:::
+
+:::{tab-item} Analyze Score Distribution
+Load the scored output and analyze distributions to tune filter thresholds:
+
+```python
+import glob
+import pandas as pd
 import matplotlib.pyplot as plt
-scores.hist(column="word_count", bins=50)
-plt.title("Word Count Distribution")
-plt.savefig("word_count_hist.png")
+
+# Load all scored output files
+files = glob.glob("scored_output/*.jsonl")
+scored_data = pd.concat([pd.read_json(f, lines=True) for f in files], ignore_index=True)
+
+# Analyze score distributions
+fig, axes = plt.subplots(1, 2, figsize=(12, 4))
+
+# Word count distribution
+axes[0].hist(scored_data["word_count"], bins=50, edgecolor="black")
+axes[0].axvline(x=80, color="red", linestyle="--", label="Threshold (80)")
+axes[0].set_title("Word Count Distribution")
+axes[0].set_xlabel("Word Count")
+axes[0].legend()
+
+# N-gram ratio distribution
+axes[1].hist(scored_data["ngram_ratio"], bins=50, edgecolor="black")
+axes[1].axvline(x=0.18, color="red", linestyle="--", label="Threshold (0.18)")
+axes[1].set_title("3-gram Repetition Ratio")
+axes[1].set_xlabel("Ratio")
+axes[1].legend()
+
+plt.tight_layout()
+plt.savefig("score_distributions.png")
+
+# Print statistics
+print(f"Total documents: {len(scored_data)}")
+print(f"Documents below word count threshold: {(scored_data['word_count'] < 80).sum()}")
+print(f"Documents above ngram threshold: {(scored_data['ngram_ratio'] > 0.18).sum()}")
+```
+
+For large datasets, consider sampling or using Ray, Dask, or Polars for memory-efficient analysis.
+:::
+
+:::{tab-item} Apply Tuned Filters
+After analyzing distributions, apply filters with your chosen thresholds:
+
+```python
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.text.io.reader import JsonlReader
+from nemo_curator.stages.text.io.writer import JsonlWriter
+from nemo_curator.stages.text.modules import ScoreFilter
+from nemo_curator.stages.text.filters import WordCountFilter, RepeatingTopNGramsFilter
+
+pipeline = Pipeline(name="filtering_pipeline")
+pipeline.add_stage(JsonlReader(file_paths="input_data/", fields=["text", "id"]))
+
+# Filter with tuned thresholds (scores retained in output)
+pipeline.add_stage(ScoreFilter(
+    filter_obj=WordCountFilter(min_words=80),
+    text_field="text",
+    score_field="word_count"
+))
+pipeline.add_stage(ScoreFilter(
+    filter_obj=RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
+    text_field="text",
+    score_field="ngram_ratio"
+))
+
+pipeline.add_stage(JsonlWriter(path="filtered_output/"))
+
+# Run with default XennaExecutor
+pipeline.run()
+
+# Or use Ray for distributed processing (see Performance Tuning section)
+# from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+# pipeline.run(RayDataExecutor(ignore_head_node=True))
 ```
 :::
````
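The analysis examples above load all scored output into pandas; for outputs too large for that, the docs suggest sampling or a more memory-efficient engine. As a stdlib-only sketch of a streaming alternative (field names, thresholds, and the `scored_output/` path follow the examples above; this is not a NeMo Curator API), threshold counts can be tallied in a single pass without loading everything into memory:

```python
import glob
import json

def count_threshold_violations(pattern: str, min_words: int = 80, max_ngram_ratio: float = 0.18):
    """Stream JSONL score files line by line and count documents that would
    fall outside the thresholds, without materializing a DataFrame."""
    total = below_words = above_ngram = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                record = json.loads(line)
                total += 1
                if record.get("word_count", 0) < min_words:
                    below_words += 1
                if record.get("ngram_ratio", 0.0) > max_ngram_ratio:
                    above_ngram += 1
    return total, below_words, above_ngram

total, below, above = count_threshold_violations("scored_output/*.jsonl")
print(f"{total} documents, {below} below word-count threshold, {above} above n-gram threshold")
```

The same pass could accumulate running means or reservoir samples if a fuller picture of the distribution is needed.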

````diff
@@ -352,43 +445,35 @@ For large datasets, consider these performance optimizations:
 
 ::::{tab-set}
 
-:::{tab-item} Memory Efficient Processing
+:::{tab-item} XennaExecutor (Default)
+`XennaExecutor` is the default executor, optimized for streaming workloads. You can customize its configuration or use the defaults:
+
 ```python
-# Process large datasets efficiently using pipeline streaming
+from nemo_curator.backends.xenna import XennaExecutor
 
-# Configure for streaming processing
-executor_config = {
+# Custom configuration for streaming processing
+executor = XennaExecutor(config={
     "execution_mode": "streaming",
-    "cpu_allocation_percentage": 0.8,
+    "cpu_allocation_percentage": 0.95,
     "logging_interval": 60
-}
-
-# Use custom configuration for large datasets
-executor = XennaExecutor(config=executor_config)
+})
 results = pipeline.run(executor)
-
-# Default configuration works for most cases
-# results = pipeline.run()
 ```
+
+If no executor is specified, `pipeline.run()` uses `XennaExecutor` with default settings.
 :::
 
-:::{tab-item} Distributed Processing
-```python
-# Scale processing across multiple workers
+:::{tab-item} RayDataExecutor (Experimental)
+`RayDataExecutor` provides distributed processing using Ray Data. It has shown performance improvements for filtering workloads compared to the default executor.
 
-# Configure for distributed processing
-executor_config = {
-    "execution_mode": "streaming",
-    "cpu_allocation_percentage": 0.95,
-    "max_workers_per_stage": 8
-}
+```python
+from nemo_curator.backends.experimental.ray_data import RayDataExecutor
 
-# Use custom configuration for distributed processing
-executor = XennaExecutor(config=executor_config)
+executor = RayDataExecutor(
+    config={"ignore_failures": False},
+    ignore_head_node=True  # Exclude head node from computation
+)
 results = pipeline.run(executor)
-
-# Default configuration uses single worker
-# results = pipeline.run()
 ```
 :::
````