
Commit dc93a5a

lbliii, praateekmahajan, sarahyurick, suiyoubi, and thomasdhc authored
text curation refactor updates (#1082)
* text curation updates
* concepts
* remove synthetic docs not for this release
* updates
* text concepts and getting started changes
* links, concepts
* crosslinks
* quality assessment updates
* more cleanup
* semdedup
* example import cleanup
* concepts
* Update docs/about/concepts/text/data-acquisition-concepts.md (Co-authored-by: Praateek Mahajan <[email protected]>)
* feedback batch 1
* feedback batch 2
* file_paths="/path/to/jsonl_directory",
* revert removal of xenna for common crawl executors
* quickstart installation steps
* Update docs/about/concepts/text/data-acquisition-concepts.md (Co-authored-by: Sarah Yurick <[email protected]>)
* data loading concepts updates / simplification
* data processing feedback
* read-existing pg updates
* add-id updates
* dedup updates
* feedback
* Skip IV2 Unit Test if package not installed (#1111)
  * Skip IV2 Unit Test if package not installed
  * syntax
  * Signed-off-by: Ao Tang <[email protected]>; Co-authored-by: Dong Hyuk Chang <[email protected]>
* Llane/docs audio modality staging (#1028)
  * docs: audio modality staging
  * updates
  * concept updates
  * Update docs/reference/infrastructure/resumable-processing.md (Co-authored-by: Copilot <[email protected]>)
  * Update docs/reference/infrastructure/distributed-computing.md (Co-authored-by: Copilot <[email protected]>)
  * Update docs/curate-audio/process-data/audio-analysis/format-validation.md (Co-authored-by: Copilot <[email protected]>)
  * Update docs/curate-audio/process-data/audio-analysis/duration-calculation.md (Co-authored-by: Copilot <[email protected]>)
  * updates
  * remove
  * updates
  * updates
  * updates
  * updates
  * updates
  * updates
  * mermaid fixes
  * minor things lost from cutting out other commits
  * removed any suggestive metric language
  * updates
  * remove
  * updates
  * pip and uv install stuff
  * simplifying some pages
  * link fixes
  * update
  * missed feedback from sarah
  * missed feedback continued
  * updates
  * updates
  * nemo models page feedback
  * installation update
  * feedback follow up
  * duration filtering, wer filtering example removals

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Co-authored-by: Ao Tang <[email protected]>
Co-authored-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 62bc37f commit dc93a5a

69 files changed (+3002, -6838 lines changed)

CHANGELOG.md

Lines changed: 0 additions & 4 deletions
````diff
@@ -57,10 +57,6 @@
 - Easiness
 - Answerability
 - Q&A Retrieval Generation Pipeline
-- Parallel Dataset Curation for Machine Translation
-- Load/Write Bitext Files
-- Heuristic filtering (Histogram, Length Ratio)
-- Classifier filtering (Comet, Cometoid)
 
 ## NVIDIA NeMo Curator 0.5.0
 
````

docs/about/concepts/text/data-acquisition-concepts.md

Lines changed: 140 additions & 55 deletions
````diff
@@ -9,23 +9,68 @@ modality: "text-only"
 ---
 
 (about-concepts-text-data-acquisition)=
+
 # Data Acquisition Concepts
 
 This guide covers the core concepts for acquiring and processing text data from remote sources in NeMo Curator. Data acquisition focuses on downloading, extracting, and converting remote data sources into {ref}`DocumentDataset <documentdataset>` format for further processing.
 
 ## Overview
 
-Data acquisition in NeMo Curator follows a three-stage architecture:
+Data acquisition in NeMo Curator follows a four-stage architecture:
 
-1. **Download**: Retrieve raw data files from remote sources
-2. **Iterate**: Extract individual records from downloaded containers
-3. **Extract**: Convert raw content to clean, structured text
+1. **Generate URLs**: Discover and generate download URLs from minimal input
+2. **Download**: Retrieve raw data files from remote sources
+3. **Iterate**: Extract individual records from downloaded containers
+4. **Extract**: Convert raw content to clean, structured text
 
 This process transforms diverse remote data sources into a standardized `DocumentDataset` that can be used throughout the text curation pipeline.
 
 ## Core Components
 
-The data acquisition framework consists of three abstract base classes that define the acquisition workflow:
+The data acquisition framework consists of four abstract base classes that define the acquisition workflow:
+
+### URLGenerator
+
+Generates URLs for downloading from minimal input configuration.
+
+```{list-table}
+:header-rows: 1
+
+* - Feature
+  - Description
+* - URL Generation
+  - - Generates download URLs from configuration parameters
+    - Supports date ranges and filtering criteria
+    - Handles API calls to discover available resources
+    - Manages URL limits and pagination
+* - Configuration
+  - - Accepts minimal input parameters
+    - Supports various data source patterns
+    - Handles authentication requirements
+    - Provides URL validation and filtering
+* - Extensibility
+  - - Abstract base class for custom implementations
+    - Plugin architecture for new data sources
+    - Configurable generation parameters
+    - Integration with existing discovery systems
+```
+
+**Example Implementation**:
+
+```python
+class CustomURLGenerator(URLGenerator):
+    def __init__(self, base_urls):
+        super().__init__()
+        self._base_urls = base_urls
+
+    def generate_urls(self):
+        # Custom URL generation logic
+        urls = []
+        for base_url in self._base_urls:
+            # Discover or construct actual download URLs
+            urls.extend(discover_files_at_url(base_url))
+        return urls
+```
 
 ### DocumentDownloader
 
````
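The `CustomURLGenerator` in the hunk above leans on an undefined `discover_files_at_url` helper. As a self-contained point of comparison, here is a minimal sketch of the same `generate_urls` hook that builds URLs from a date range instead of a discovery call. Only the `URLGenerator` base class, its import path, and the `generate_urls()` method come from the updated docs; the class name, constructor parameters, and snapshot URL pattern are illustrative assumptions.

```python
from datetime import date, timedelta

from nemo_curator.stages.text.download import URLGenerator  # import path as shown in the updated docs


class DateRangeURLGenerator(URLGenerator):
    """Hypothetical generator: constructs one URL per day in a date range."""

    def __init__(self, start: date, end: date, url_template: str):
        super().__init__()
        self._start = start
        self._end = end
        # Example template: "https://example.com/dumps/{d}.warc.gz" (made up for illustration)
        self._url_template = url_template

    def generate_urls(self) -> list[str]:
        urls = []
        current = self._start
        while current <= self._end:
            urls.append(self._url_template.format(d=current.isoformat()))
            current += timedelta(days=1)
        return urls


# Usage sketch: one URL per day for the first week of January 2024
generator = DateRangeURLGenerator(
    date(2024, 1, 1), date(2024, 1, 7), "https://example.com/dumps/{d}.warc.gz"
)
print(generator.generate_urls()[:3])
```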
````diff
@@ -54,17 +99,25 @@ Connects to and downloads data from remote repositories.
 ```
 
 **Example Implementation**:
+
 ```python
 class CustomDownloader(DocumentDownloader):
     def __init__(self, download_dir):
         super().__init__()
         self._download_dir = download_dir
 
-    def download(self, url):
+    def _get_output_filename(self, url):
+        # Extract filename from URL
+        return url.split('/')[-1]
+
+    def _download_to_path(self, url, path):
         # Custom download logic
-        output_file = os.path.join(self._download_dir, filename)
-        # ... download implementation ...
-        return output_file
+        # Return (success_bool, error_message)
+        try:
+            # ... download implementation ...
+            return True, None
+        except Exception as e:
+            return False, str(e)
 ```
 
 ### DocumentIterator
````
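To make the `# ... download implementation ...` placeholder concrete, here is a minimal sketch of the new two-method contract (`_get_output_filename` plus `_download_to_path` returning a `(success, error)` pair) using the `requests` library. The dependency on `requests`, the import path for `DocumentDownloader`, and the streaming chunk size are assumptions, not something this commit prescribes.

```python
import requests

from nemo_curator.stages.text.download import DocumentDownloader  # path assumed from the updated docs


class HTTPDownloader(DocumentDownloader):
    """Hypothetical downloader that streams a URL to a local file."""

    def __init__(self, download_dir):
        super().__init__()
        self._download_dir = download_dir

    def _get_output_filename(self, url):
        # Keep only the final path component as the local filename
        return url.split("/")[-1]

    def _download_to_path(self, url, path):
        # Return (success_bool, error_message) per the contract shown above
        try:
            with requests.get(url, stream=True, timeout=60) as response:
                response.raise_for_status()
                with open(path, "wb") as f:
                    for chunk in response.iter_content(chunk_size=1 << 20):
                        f.write(chunk)
            return True, None
        except Exception as e:
            return False, str(e)
```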
````diff
@@ -94,6 +147,7 @@ Extracts individual records from downloaded containers.
 ```
 
 **Example Implementation**:
+
 ```python
 class CustomIterator(DocumentIterator):
     def __init__(self, log_frequency=1000):
@@ -103,10 +157,13 @@ class CustomIterator(DocumentIterator):
     def iterate(self, file_path):
         # Custom iteration logic
         for record in parse_container(file_path):
-            yield record_metadata, record_content
+            yield {"content": record_content, "metadata": record_metadata}
+
+    def output_columns(self):
+        return ["content", "metadata"]
 ```
 
-### DocumentExtractor
+### DocumentExtractor (Optional)
 
 Converts raw content formats to clean, structured text.
 
````
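The `CustomIterator` above relies on an undefined `parse_container` helper. As a self-contained illustration of the new dict-yielding contract and `output_columns` hook, here is a sketch that iterates a JSON Lines file. The `JsonlIterator` name, the record field layout, and the import path are illustrative assumptions only.

```python
import json

from nemo_curator.stages.text.download import DocumentIterator  # path assumed from the updated docs


class JsonlIterator(DocumentIterator):
    """Hypothetical iterator: one JSON object per line becomes one record."""

    def __init__(self, log_frequency=1000):
        super().__init__()
        self._log_frequency = log_frequency

    def iterate(self, file_path):
        with open(file_path, encoding="utf-8") as f:
            for line_number, line in enumerate(f):
                record = json.loads(line)
                # Yield a dict keyed by the columns declared below
                yield {
                    "content": record.get("raw_content", ""),
                    "metadata": {"source_file": file_path, "line": line_number},
                }

    def output_columns(self):
        return ["content", "metadata"]
```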
````diff
@@ -133,15 +190,23 @@ Converts raw content formats to clean, structured text.
 ```
 
 **Example Implementation**:
+
 ```python
 class CustomExtractor(DocumentExtractor):
     def __init__(self):
         super().__init__()
 
-    def extract(self, content):
+    def extract(self, record):
         # Custom extraction logic
-        cleaned_text = clean_content(content)
+        cleaned_text = clean_content(record["content"])
+        detected_lang = detect_language(cleaned_text)
         return {"text": cleaned_text, "language": detected_lang}
+
+    def input_columns(self):
+        return ["content", "metadata"]
+
+    def output_columns(self):
+        return ["text", "language"]
 ```
 
 ## Supported Data Sources
````
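`clean_content` and `detect_language` are left undefined in the example above. A self-contained variant of the same record-in, dict-out contract might look like the sketch below, where whitespace normalization stands in for real cleaning and the language tag is hard-coded rather than detected; treat every name outside the `DocumentExtractor` hooks as illustrative.

```python
import re

from nemo_curator.stages.text.download import DocumentExtractor  # path assumed from the updated docs


class WhitespaceNormalizingExtractor(DocumentExtractor):
    """Hypothetical extractor: collapses whitespace and tags a fixed language."""

    def extract(self, record):
        # Collapse runs of whitespace into single spaces as a stand-in for real cleaning
        cleaned_text = re.sub(r"\s+", " ", record["content"]).strip()
        # A real implementation would detect the language; "en" is a placeholder
        return {"text": cleaned_text, "language": "en"}

    def input_columns(self):
        return ["content", "metadata"]

    def output_columns(self):
        return ["text", "language"]
```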
````diff
@@ -189,60 +254,67 @@ Extensible framework for implementing custom data loaders through abstract base
 
 ::::
 
-## Integration with DocumentDataset
+## Integration with Pipeline Architecture
 
-The data acquisition process seamlessly integrates with NeMo Curator's core data structures:
+The data acquisition process seamlessly integrates with NeMo Curator's pipeline-based architecture:
 
 ### Acquisition Workflow
 
 ```python
-from nemo_curator.download import download_and_extract
-
-# Define acquisition components
-downloader = CustomDownloader(download_dir)
-iterator = CustomIterator()
-extractor = CustomExtractor()
-
-# Acquire data and create DocumentDataset
-dataset = download_and_extract(
-    urls=data_urls,
-    output_paths=output_paths,
-    downloader=downloader,
-    iterator=iterator,
-    extractor=extractor,
-    output_format={"text": str, "language": str, "url": str}
+from nemo_curator.pipeline import Pipeline
+from nemo_curator.stages.text.download import DocumentDownloadExtractStage, URLGenerator
+
+# Define acquisition pipeline
+pipeline = Pipeline(name="data_acquisition")
+
+# Create download and extract stage with custom components
+download_extract_stage = DocumentDownloadExtractStage(
+    url_generator=CustomURLGenerator(data_urls),
+    downloader=CustomDownloader(download_dir),
+    iterator=CustomIterator(),
+    extractor=CustomExtractor()
 )
+pipeline.add_stage(download_extract_stage)
+
+# Execute acquisition pipeline
+results = pipeline.run()
 
-# Result is a standard DocumentDataset ready for processing
-print(f"Acquired {len(dataset)} documents")
 ```
 
-### Batch Processing
+### Scaling and Parallel Processing
 
-For large-scale data acquisition, use the batch processing capabilities:
+The `DocumentDownloadExtractStage` automatically handles parallel processing through the distributed computing framework:
 
 ```python
-from nemo_curator.download import batch_download
-
-# Download multiple sources in parallel
-downloaded_files = batch_download(urls, downloader)
+# Simple pipeline - parallelism handled automatically
+pipeline = Pipeline(name="scalable_acquisition")
+
+# The composite stage handles URL generation and parallel downloads internally
+download_extract_stage = DocumentDownloadExtractStage(
+    url_generator=CustomURLGenerator(base_urls),  # Generates URLs from config
+    downloader=CustomDownloader(download_dir),
+    iterator=CustomIterator(),
+    extractor=CustomExtractor(),
+    url_limit=1000  # Limit number of URLs to process
+)
+pipeline.add_stage(download_extract_stage)
 
-# Process downloaded files through iteration and extraction
-# ... (custom processing logic)
+# Execute with distributed executor for automatic scaling
+results = pipeline.run(executor)
 ```
 
 ## Configuration and Customization
 
 ### Configuration Files
 
-Data acquisition components can be configured through YAML files:
+You can configure data acquisition components through YAML files:
 
 ```yaml
 # downloader_config.yaml
 download_module: "my_package.CustomDownloader"
 download_params:
   download_dir: "/data/downloads"
-  parallel_downloads: 4
+  verbose: true
 
 iterator_module: "my_package.CustomIterator"
 iterator_params:
````
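The YAML above maps dotted module paths to constructor parameters, and the doc's own loader (a `build_downloader` helper appears in the next hunk's context) consumes it. As a rough illustration of the underlying mechanic only, the sketch below resolves such a config by hand with `importlib`; the `load_component` function, the config filename, and the use of PyYAML are illustrative assumptions, not the library's API.

```python
import importlib

import yaml  # PyYAML assumed available


def load_component(module_path: str, params: dict):
    """Resolve a dotted path like "my_package.CustomDownloader" and instantiate it."""
    module_name, class_name = module_path.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**params)


with open("downloader_config.yaml") as f:
    config = yaml.safe_load(f)

downloader = load_component(config["download_module"], config.get("download_params", {}))
iterator = load_component(config["iterator_module"], config.get("iterator_params", {}))
```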
````diff
@@ -274,26 +346,29 @@ downloader, iterator, extractor, format = build_downloader(
 
 ### Parallel Processing
 
-Data acquisition leverages `Dask` for distributed processing:
+Data acquisition leverages distributed computing frameworks for scalable processing:
 
-- **Parallel Downloads**: Multiple URLs downloaded simultaneously
-- **Concurrent Extraction**: Multiple files processed in parallel
+- **Parallel Downloads**: Each URL in the generated list downloads through separate workers
+- **Concurrent Extraction**: Files process in parallel across workers
 - **Memory Management**: Streaming processing for large files
 - **Fault Tolerance**: Automatic retry and recovery mechanisms
 
 ### Scaling Strategies
 
-**Single Node**:
-- Use multiple worker processes for CPU-bound extraction
+**Single Node**:
+
+- Use worker processes for CPU-bound extraction
 - Optimize I/O operations for local storage
 - Balance download and processing throughput
 
 **Multi-Node**:
+
 - Distribute download tasks across cluster nodes
 - Use shared storage for intermediate files
-- Coordinate processing through `Dask` distributed scheduler
+- Coordinate processing through distributed scheduler
 
 **Cloud Deployment**:
+
 - Leverage cloud storage for source data
 - Use auto-scaling for variable workloads
 - Optimize network bandwidth and storage costs
@@ -307,15 +382,25 @@ Data acquisition includes basic content-level deduplication during extraction (e
 ```
 
 ```python
-# Acquisition produces DocumentDataset
-acquired_dataset = download_and_extract(...)
+from nemo_curator.stages.text.io.writer import ParquetWriter
+
+# Create acquisition pipeline with all stages including writer
+acquisition_pipeline = Pipeline(name="data_acquisition")
+# ... add acquisition stages ...
+
+# Add writer to save results directly
+writer = ParquetWriter(path="acquired_data/")
+acquisition_pipeline.add_stage(writer)
+
+# Run pipeline to acquire and save data in one execution
+results = acquisition_pipeline.run(executor)
 
-# Save in standard format for loading
-acquired_dataset.to_parquet("acquired_data/")
+# Later: Load using pipeline-based data loading
+from nemo_curator.stages.text.io.reader import ParquetReader
 
-# Later: Load using standard data loading
-from nemo_curator.datasets import DocumentDataset
-dataset = DocumentDataset.read_parquet("acquired_data/")
+load_pipeline = Pipeline(name="load_acquired_data")
+reader = ParquetReader(file_paths="acquired_data/")
+load_pipeline.add_stage(reader)
 ```
 
 This enables you to:
````
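Pulling the pieces above together, a minimal end-to-end acquisition script might look like the following sketch. It composes only classes and parameters that appear in the updated docs (`Pipeline`, `DocumentDownloadExtractStage`, `ParquetWriter`) plus the hypothetical custom components sketched earlier in these notes; executor construction is elided because this commit does not pin down a specific executor API.

```python
from datetime import date

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import DocumentDownloadExtractStage
from nemo_curator.stages.text.io.writer import ParquetWriter

# Assemble the four acquisition components sketched earlier (all hypothetical)
stage = DocumentDownloadExtractStage(
    url_generator=DateRangeURLGenerator(
        date(2024, 1, 1), date(2024, 1, 7), "https://example.com/dumps/{d}.warc.gz"
    ),
    downloader=HTTPDownloader("/data/downloads"),
    iterator=JsonlIterator(),
    extractor=WhitespaceNormalizingExtractor(),
    url_limit=100,  # cap the number of generated URLs, as in the scaling example
)

pipeline = Pipeline(name="end_to_end_acquisition")
pipeline.add_stage(stage)
pipeline.add_stage(ParquetWriter(path="acquired_data/"))

# Run locally; pass a distributed executor to pipeline.run(...) to scale out
results = pipeline.run()
```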

docs/about/concepts/text/data-curation-pipeline.md

Lines changed: 6 additions & 19 deletions
````diff
@@ -1,5 +1,5 @@
 ---
-description: "Comprehensive overview of NeMo Curator's text curation pipeline architecture including data acquisition, processing, and synthetic data generation"
+description: "Comprehensive overview of NeMo Curator's text curation pipeline architecture including data acquisition and processing"
 categories: ["concepts-architecture"]
 tags: ["pipeline", "architecture", "text-curation", "distributed", "gpu-accelerated", "overview"]
 personas: ["data-scientist-focused", "mle-focused"]
@@ -35,27 +35,21 @@ Multiple input sources provide the foundation for text curation:
 Raw data is downloaded, extracted, and converted into standardized formats:
 - **Download & Extraction**: Retrieve and process remote data sources
 - **Cleaning & Pre-processing**: Convert formats and normalize text
-- **DocumentDataset Creation**: Standardize data into NeMo Curator's core data structure
+- **DocumentBatch Creation**: Standardize data into NeMo Curator's core data structure
 
 ### 3. Quality Assessment & Filtering
 Multiple filtering stages ensure data quality:
 - **Heuristic Quality Filtering**: Rule-based filters for basic quality checks
 - **Model-based Quality Filtering**: AI-powered content assessment
 - **PII Removal**: Privacy-preserving data cleaning
-- **Task Decontamination**: Remove potential test set contamination
 
 ### 4. Deduplication
 Remove duplicate and near-duplicate content:
-- **Exact Deduplication**: Remove identical documents
-- **Fuzzy Deduplication**: Remove near-duplicates using similarity
+- **Exact Deduplication**: Remove identical documents using MD5 hashing
+- **Fuzzy Deduplication**: Remove near-duplicates using MinHash and LSH similarity
 - **Semantic Deduplication**: Remove semantically similar content using embeddings
 
-### 5. Synthetic Data Generation
-Create high-quality synthetic content using LLMs:
-- **LLM-based Generation**: Use large language models to create new content
-- **Quality Control**: Ensure synthetic data meets quality standards
-
-### 6. Final Preparation
+### 5. Final Preparation
 Prepare the curated dataset for training:
 - **Blending/Shuffling**: Combine and randomize data sources
 - **Format Standardization**: Ensure consistent output format
````
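The deduplication bullets now name the concrete techniques the updated docs standardize on (MD5 hashing for exact matches, MinHash/LSH for near-duplicates). As a rough illustration of the exact-match case only, the following standard-library sketch hashes document text with MD5 and keeps the first occurrence of each hash; it is not NeMo Curator's deduplication stage, just the underlying idea.

```python
import hashlib


def md5_exact_dedup(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document text (by MD5 hash)."""
    seen_hashes = set()
    kept = []
    for text in documents:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            kept.append(text)
    return kept


docs = ["hello world", "another document", "hello world"]
print(md5_exact_dedup(docs))  # ['hello world', 'another document']
```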
````diff
@@ -95,13 +89,6 @@ Components for downloading and extracting data from remote sources
 Concepts for filtering, deduplication, and classification
 :::
 
-:::{grid-item-card} {octicon}`beaker;1.5em;sd-mr-1` Data Generation
-:link: about-concepts-text-data-gen
-:link-type: ref
-
-Concepts for generating high-quality synthetic text
-:::
-
 ::::
 
 ## Processing Modes
@@ -131,4 +118,4 @@ The architecture scales from single machines to large clusters:
 
 ---
 
-For hands-on experience, see the {ref}`Text Curation Getting Started Guide <gs-text>`.
+For hands-on experience, see the {ref}`Text Curation Getting Started Guide <gs-text>`.
````
