Commit 7c66fc9

lbliii, sarahyurick, and praateekmahajan authored
text curation review feedback (#1129)
* text curation feedback Signed-off-by: Lawrence Lane <[email protected]>
* quickstart tiles Signed-off-by: Lawrence Lane <[email protected]>
* more changes Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* basic pass updates Signed-off-by: Lawrence Lane <[email protected]>
* dask to ray Signed-off-by: Lawrence Lane <[email protected]>
* Update docs/about/concepts/text/data-loading-concepts.md Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: L.B. <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* remove Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* dedupe page refinement Signed-off-by: Lawrence Lane <[email protected]>
* more admin page updates Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* remove kubernetes Signed-off-by: Lawrence Lane <[email protected]>
* changelog Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* Apply suggestion from @sarahyurick Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: L.B. <[email protected]>
* remove pages Signed-off-by: Lawrence Lane <[email protected]>
* Apply suggestion from @praateekmahajan Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: L.B. <[email protected]>
* updates; remove custom quality assessment Signed-off-by: Lawrence Lane <[email protected]>
* fix gpu processing eg Signed-off-by: Lawrence Lane <[email protected]>
* python support standardization Signed-off-by: Lawrence Lane <[email protected]>
* remove singularity Signed-off-by: Lawrence Lane <[email protected]>
* modality install updates Signed-off-by: Lawrence Lane <[email protected]>
* updates Signed-off-by: Lawrence Lane <[email protected]>
* feedback Signed-off-by: Lawrence Lane <[email protected]>
* remove common installation issues Signed-off-by: Lawrence Lane <[email protected]>
* remove module import verification Signed-off-by: Lawrence Lane <[email protected]>
* prereq Signed-off-by: Lawrence Lane <[email protected]>
* remove --extra-index-url Signed-off-by: Lawrence Lane <[email protected]>
* audio install Signed-off-by: Lawrence Lane <[email protected]>
* new container version var (not 1.0.0 but 25.09) Signed-off-by: Lawrence Lane <[email protected]>
* build issues Signed-off-by: Lawrence Lane <[email protected]>
* Update docs/get-started/audio.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: L.B. <[email protected]>
* Update docs/get-started/audio.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: L.B. <[email protected]>
* Update docs/get-started/video.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: L.B. <[email protected]>
* Update docs/get-started/video.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: L.B. <[email protected]>
* Update docs/get-started/text.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: L.B. <[email protected]>
* Update docs/get-started/video.md Co-authored-by: Praateek Mahajan <[email protected]> Signed-off-by: L.B. <[email protected]>
* changelog version fix Signed-off-by: Lawrence Lane <[email protected]>
* remove slurm stuff; hide deployment directory for now Signed-off-by: Lawrence Lane <[email protected]>

---------
Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
Co-authored-by: Praateek Mahajan <[email protected]>
1 parent 81786a9 commit 7c66fc9

65 files changed, +1091 -6125 lines changed

docs/_extensions/myst_codeblock_substitutions.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
1212
```
1313
1414
Or in inline code:
15-
Use `nvcr.io/nvidia/nemo-curator:{{ current_release }}` for the Docker image.
15+
Use `nvcr.io/nvidia/nemo-curator:{{ container_version }}` for the Docker image.
1616
1717
The substitutions will be replaced with their values from myst_substitutions in conf.py.
1818
"""
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
1+
---
2+
description: "Comprehensive overview of deduplication techniques across text, image, and video modalities including exact, fuzzy, and semantic approaches"
3+
categories: ["concepts-architecture"]
4+
tags: ["deduplication", "exact-dedup", "fuzzy-dedup", "semantic-dedup", "multimodal", "gpu-accelerated"]
5+
personas: ["data-scientist-focused", "mle-focused"]
6+
difficulty: "intermediate"
7+
content_type: "concept"
8+
modality: "multimodal"
9+
---
10+
11+
(about-concepts-deduplication)=
12+
13+
# Deduplication Concepts
14+
15+
This guide covers the deduplication techniques available in NeMo Curator for text, image, and video data, from exact hash-based matching to semantic similarity detection using embeddings.
16+
17+
## Overview
18+
19+
Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides exact, fuzzy, and semantic deduplication; which methods are available depends on the modality (text, image, or video).
20+
21+
Removing duplicates offers several benefits:
22+
23+
- **Improved Training Efficiency**: Prevents overrepresentation of repeated content
24+
- **Reduced Dataset Size**: Significantly reduces storage and processing requirements
25+
- **Better Model Performance**: Eliminates redundant examples that can bias training
26+
27+
## Deduplication Approaches
28+
29+
NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:
30+
31+
### Exact Deduplication
32+
33+
- **Method**: Hash-based matching (MD5)
34+
- **Best For**: Identical copies and character-for-character matches
35+
- **Speed**: Very fast
36+
- **GPU Required**: Yes (for distributed processing)
37+
38+
Exact deduplication identifies documents or media files that are completely identical by computing cryptographic hashes of their content.
39+
40+
**Modalities Supported**: Text, Image, Video
41+
42+
### Fuzzy Deduplication
43+
44+
- **Method**: MinHash and Locality-Sensitive Hashing (LSH)
45+
- **Best For**: Near-duplicates with minor changes (reformatting, small edits)
46+
- **Speed**: Fast
47+
- **GPU Required**: Yes
48+
49+
Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.
50+
51+
**Modalities Supported**: Text
52+
53+
### Semantic Deduplication
54+
55+
- **Method**: Embedding-based similarity using neural networks
56+
- **Best For**: Content with similar meaning but different expression
57+
- **Speed**: Moderate (requires embedding generation)
58+
- **GPU Required**: Yes
59+
60+
Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.
61+
62+
**Modalities Supported**: Text, Image, Video
63+
64+
## Multimodal Applications
65+
66+
### Text Deduplication
67+
68+
Text deduplication is the most mature implementation, offering all three approaches:
69+
70+
- **Exact**: Remove identical documents using MD5 hashing
71+
- **Fuzzy**: Remove near-duplicates using MinHash and LSH similarity
72+
- **Semantic**: Remove semantically similar content using embeddings
73+
74+
Text deduplication can handle web-scale datasets and is commonly used for:
75+
76+
- Web crawl data (Common Crawl)
77+
- Academic papers (ArXiv)
78+
- Code repositories
79+
- General text corpora
80+
81+
### Video Deduplication
82+
83+
Video deduplication uses the semantic deduplication workflow with video embeddings:
84+
85+
- **Semantic Clustering**: Uses the general K-means clustering workflow on video embeddings
86+
- **Pairwise Similarity**: Computes within-cluster similarity using the semantic deduplication pipeline
87+
- **Representative Selection**: Leverages the semantic workflow to identify and remove redundant content
88+
89+
Video deduplication is particularly effective for:
90+
91+
- Educational content with similar presentations
92+
- News clips covering the same events
93+
- Entertainment content with repeated segments
94+
95+
### Image Deduplication
96+
97+
Image deduplication capabilities focus on removing duplicate images from datasets:
98+
99+
- **Duplicate Removal**: Filters out images identified as duplicates from previous deduplication stages
100+
- **Integration Support**: Works with image processing pipelines through `ImageBatch` tasks
101+
102+
## Architecture and Performance
103+
104+
### Distributed Processing
105+
106+
All deduplication workflows leverage distributed computing frameworks:
107+
108+
- **Ray Backend**: Provides scalable distributed processing
109+
- **GPU Acceleration**: Essential for embedding generation and similarity computation
110+
- **Memory Optimization**: Streaming processing for large datasets
111+
112+
### Scalability Characteristics
113+
114+
```{list-table} Deduplication Scalability
115+
:header-rows: 1
116+
:widths: 20 25 25 30
117+
118+
* - Method
119+
- Dataset Size
120+
- Memory Requirements
121+
- Processing Time
122+
* - Exact
123+
- Unlimited
124+
- Low (hash storage)
125+
- Linear with data size
126+
* - Fuzzy
127+
- Petabyte-scale
128+
- Moderate (LSH tables)
129+
- Sub-linear with LSH
130+
* - Semantic
131+
- Terabyte-scale
132+
- High (embeddings)
133+
- Depends on model inference
134+
```
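The "LSH tables" and "Sub-linear with LSH" entries in the Fuzzy row can be illustrated with a small, library-free sketch. This is not NeMo Curator's implementation: the helper names (`shingles`, `minhash_signature`, `lsh_candidates`) and the parameters (`k`, `num_hashes`, `bands`) are hypothetical and chosen only for illustration. The point is that documents are compared only when they land in the same LSH bucket, which avoids scoring every pair.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the character k-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed: int, token: str) -> int:
    """Seeded 64-bit hash; each seed plays the role of one MinHash permutation."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: set[str], num_hashes: int = 64) -> tuple[int, ...]:
    """For each seeded hash function, keep the minimum hash over all tokens."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_candidates(signatures: dict[str, tuple[int, ...]], bands: int = 16) -> set[tuple[str, str]]:
    """Split signatures into bands; documents sharing any band become candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank!",
    "c": "Completely unrelated text about GPU clusters and data loading.",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
candidates = lsh_candidates(signatures)
```

Only candidate pairs would then need a full similarity check. More bands with fewer rows each makes the scheme more permissive; fewer bands with more rows makes it stricter.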
135+
136+
## Implementation Patterns
137+
138+
### Workflow-Based Processing
139+
140+
NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:
141+
142+
```python
143+
# Text exact deduplication
144+
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
145+
146+
# Text fuzzy deduplication
147+
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
148+
149+
# Text semantic deduplication
150+
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow
151+
```
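The constructor arguments of these workflow classes are not shown on this page, so the sketch below does not call them. Instead it illustrates, in plain Python, the idea behind exact deduplication as described above: hash each document with MD5, group by digest, and keep one document per group. The function `find_exact_duplicates` and the sample IDs are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs: dict[str, str]) -> list[str]:
    """Return IDs of documents whose text is byte-identical to an earlier document."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Keep the first ID in each group; every later ID is a duplicate.
    return [doc_id for ids in groups.values() for doc_id in ids[1:]]


docs = {
    "doc-1": "NeMo Curator removes duplicates.",
    "doc-2": "NeMo Curator removes duplicates.",
    "doc-3": "NeMo Curator removes duplicates. ",  # trailing space: not an exact match
}
duplicates = find_exact_duplicates(docs)
```

Note that a single trailing space is enough to change the digest, which is why near-duplicates need the fuzzy or semantic approaches.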
152+
153+
### Stage-Based Processing
154+
155+
For fine-grained control, individual stages can be composed into custom pipelines:
156+
157+
```python
158+
# Video semantic deduplication stages
159+
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
160+
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
161+
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
162+
```
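To make the three stage names concrete, here is a conceptual, library-free sketch of the same sequence: cluster embeddings with k-means, compare pairs within each cluster, then flag duplicates. It does not use the NeMo Curator stages imported above. The functions, the toy 2-D embeddings, the `threshold` value, and the greedy keep-first rule are all assumptions made for illustration.

```python
import math
from collections import defaultdict

Vector = tuple[float, ...]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def kmeans(points: list[Vector], k: int, iterations: int = 10) -> list[int]:
    """Tiny Lloyd's k-means; the first k points seed the centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def identify_duplicates(ids: list[str], points: list[Vector], labels: list[int],
                        threshold: float = 0.95) -> list[str]:
    """Within each cluster, flag items too similar to an item already kept."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    duplicates = []
    for members in clusters.values():
        kept = []
        for idx in members:
            if any(cosine(points[idx], points[j]) >= threshold for j in kept):
                duplicates.append(ids[idx])
            else:
                kept.append(idx)
    return duplicates


ids = ["clip-a", "clip-b", "clip-c", "clip-d"]
embeddings = [(1.0, 0.0), (0.0, 1.0), (0.99, 0.05), (0.5, 0.8)]
labels = kmeans(embeddings, k=2)
duplicates = identify_duplicates(ids, embeddings, labels)
```

Clustering first means the pairwise comparison runs only inside each cluster rather than over the whole dataset, which is what keeps the semantic approach tractable on large embedding sets.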
163+
164+
## Integration with Pipeline Architecture
165+
166+
Deduplication integrates with NeMo Curator's pipeline-based architecture:
167+
168+
1. **Input Compatibility**: Works with `DocumentBatch` tasks from any data loading stage
169+
2. **Output Integration**: Produces standardized outputs for downstream processing
170+
3. **Chaining Support**: Can be combined with filtering and cleaning stages
171+
4. **Executor Support**: Compatible with all distributed execution backends

docs/about/concepts/index.md

Lines changed: 24 additions & 8 deletions
@@ -11,7 +11,7 @@ modality: "universal"
1111
(about-concepts)=
1212
# Concepts
1313

14-
Learn about the core components and concepts introduced by NeMo Curator. The following concepts are organized by each major modality.
14+
Learn about the core components and concepts introduced by NeMo Curator.
1515

1616
## Modality Concepts
1717

@@ -20,18 +20,18 @@ Learn about working with specific modalities using NeMo Curator.
2020
::::{grid} 1 1 1 2
2121
:gutter: 1 1 1 2
2222

23-
:::{grid-item-card} {octicon}`image;1.5em;sd-mr-1` Image Curation Concepts
24-
:link: about-concepts-image
23+
:::{grid-item-card} {octicon}`typography;1.5em;sd-mr-1` Text Curation Concepts
24+
:link: about-concepts-text
2525
:link-type: ref
2626

27-
Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering, deduplication), and dataset export.
27+
Learn about text data curation, covering data loading and processing (filtering, classification, deduplication).
2828
:::
2929

30-
:::{grid-item-card} {octicon}`typography;1.5em;sd-mr-1` Text Curation Concepts
31-
:link: about-concepts-text
30+
:::{grid-item-card} {octicon}`image;1.5em;sd-mr-1` Image Curation Concepts
31+
:link: about-concepts-image
3232
:link-type: ref
3333

34-
Learn about text data curation, covering data loading, processing (filtering, deduplication, classification), and synthetic data generation.
34+
Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering), and dataset export.
3535
:::
3636

3737
:::{grid-item-card} {octicon}`video;1.5em;sd-mr-1` Video Curation Concepts
@@ -49,12 +49,28 @@ Learn about speech data curation, ASR inference, quality assessment, and audio-t
4949
:::
5050
::::
5151

52+
## Universal Concepts
53+
54+
Core concepts that apply across all modalities in NeMo Curator.
55+
56+
::::{grid} 1 1 1 1
57+
:gutter: 1 1 1 1
58+
59+
:::{grid-item-card} {octicon}`duplicate;1.5em;sd-mr-1` Deduplication Concepts
60+
:link: about-concepts-deduplication
61+
:link-type: ref
62+
63+
Comprehensive overview of deduplication techniques across text, image, and video modalities including exact, fuzzy, and semantic approaches.
64+
:::
65+
::::
66+
5267
```{toctree}
5368
:hidden:
5469
:maxdepth: 2
5570
56-
Image Concepts <image/index.md>
5771
Text Concepts <text/index.md>
72+
Image Concepts <image/index.md>
5873
Video Concepts <video/index.md>
5974
Audio Concepts <audio/index.md>
75+
Deduplication Concepts <deduplication.md>
6076
```
