* docs: vdr feedback
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/admin/installation.md
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/admin/installation.md
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/admin/installation.md
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/curate-text/process-data/quality-assessment/distributed-classifier.md
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* feedback
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* feedback
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* re order sidebar
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* release notes draft
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* feedback
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* remove more internvid content
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* release note fix
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* Update docs/curate-video/tutorials/split-dedup.md
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
---------
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
docs/admin/installation.md: 22 additions, 5 deletions
@@ -18,14 +18,15 @@ This guide covers installing NeMo Curator with support for **all modalities** an
 ### System Requirements
 
-For comprehensive system requirements and production deployment specifications, see [Production Deployment Requirements](deployment/requirements.md).
+For comprehensive system requirements and production deployment specifications, refer to [Production Deployment Requirements](deployment/requirements.md).
 
 **Quick Start Requirements:**
 
 - **OS**: Ubuntu 24.04/22.04/20.04 (recommended)
 - **Python**: 3.10, 3.11, or 3.12
 - **Memory**: 16GB+ RAM for basic text processing
 - **GPU** (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
+- **CUDA 12** (required for `audio_cuda12`, `video_cuda12`, `image_cuda12`, and `text_cuda12` extras)
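The Python requirement in the list above can be checked up front; the following is a stdlib-only pre-flight sketch, not part of NeMo Curator:

```python
import sys

def supported_python(version=sys.version_info):
    """True if the interpreter falls in the documented 3.10-3.12 range."""
    return (3, 10) <= (version[0], version[1]) <= (3, 12)

print("Python version OK:", supported_python())
```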
 ### Development vs Production
@@ -41,9 +42,13 @@ For comprehensive system requirements and production deployment specifications,
 Choose one of the following installation methods based on your needs:
 
+:::{tip}
+**Docker is the recommended installation method** for video and audio workflows. The NeMo Curator container includes FFmpeg (with NVENC support) pre-configured, avoiding manual dependency setup. Refer to the [Container Installation](#container-installation) tab below.
+:::
+
 ::::{tab-set}
 
-:::{tab-item} PyPI Installation (Recommended)
+:::{tab-item} PyPI Installation
 
 Install NeMo Curator from the Python Package Index using `uv` for proper dependency resolution.
 :::{tab-item} Container Installation (Recommended for Video/Audio)
 
-NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed. You can run it with:
+NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed, including FFmpeg with NVENC support.
 
 docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:{{ container_version }}
 ```
 
+```{important}
+After entering the container, activate the virtual environment before running any NeMo Curator commands:
+
+source /opt/venv/env.sh
+
+The container uses a virtual environment at `/opt/venv`. If you see `No module named nemo_curator`, the environment has not been activated.
+```
+
 Alternatively, you can build the NeMo Curator container locally using the provided Dockerfile:
 
 ```bash
@@ -115,7 +128,7 @@ docker run --gpus all -it --rm nemo-curator:latest
 **Benefits:**
 
-- Pre-configured environment with all dependencies
+- Pre-configured environment with all dependencies (FFmpeg, CUDA libraries)
 - Consistent runtime across different systems
 - Ideal for production deployments
@@ -157,6 +170,10 @@ If encoders are missing, reinstall `FFmpeg` with the required options or use the
 :::
 ::::
 
+```{note}
+**FFmpeg build requires CUDA toolkit (nvcc):** If you encounter `ERROR: failed checking for nvcc` during FFmpeg installation, ensure that the CUDA toolkit is installed and `nvcc` is available on your `PATH`. You can verify with `nvcc --version`. If using the NeMo Curator container, FFmpeg is pre-installed with NVENC support.
+```
docs/curate-text/process-data/deduplication/fuzzy.md: 4 additions, 0 deletions
@@ -34,6 +34,10 @@ Ideal for detecting documents with minor differences such as formatting changes,
 - Ray cluster with GPU support (required for distributed processing)
 - Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)
 
+```{note}
+**Running in Docker**: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that Ray workers can access the GPU. Without GPU access, you may see `CUDARuntimeError` or `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
+```
+
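Fuzzy deduplication targets documents that differ only slightly. The underlying idea, that two documents whose character-shingle sets overlap heavily are near-duplicates, can be sketched with the standard library. This is illustrative only; NeMo Curator's pipeline approximates shingle overlap at scale with MinHash and LSH on GPUs rather than computing exact Jaccard similarity:

```python
def shingles(text, k=5):
    """Set of character k-grams of a whitespace-normalized, lowercased string."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a, b, k=5):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

doc1 = "The quick brown fox jumps over the lazy dog."
doc2 = "The  quick brown fox jumped over the lazy dog."
print(f"similarity: {jaccard(doc1, doc2):.2f}")  # high, but below 1.0
```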
 ## Quick Start
 
 Get started with fuzzy deduplication using the following example of identifying duplicates, then remove them:
docs/curate-text/process-data/deduplication/semdedup.md: 4 additions, 0 deletions
@@ -42,6 +42,10 @@ Based on [SemDeDup: Data-efficient learning at web-scale through semantic dedupl
 - GPU acceleration (required for embedding generation and clustering)
 - Stable document identifiers for removal (either existing IDs or IDs managed by the workflow and removal stages)
 
+```{note}
+**Running in Docker**: When running semantic deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that CUDA GPUs are available. Without this flag, you will see `RuntimeError: No CUDA GPUs are available`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
+```
+
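The core signal behind semantic deduplication is embedding similarity: documents whose embedding vectors are nearly parallel are treated as semantic duplicates. Below is a toy, stdlib-only sketch of that thresholding idea; it is not the actual SemDeDup algorithm, which clusters embeddings first and prunes within clusters, and the `eps` threshold is an illustrative value:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def semantic_filter(embeddings, eps=0.95):
    """Greedily keep ids whose embedding is not within eps of an already-kept one."""
    kept = []
    for doc_id, emb in embeddings:
        if all(cosine(emb, kept_emb) < eps for _, kept_emb in kept):
            kept.append((doc_id, emb))
    return [doc_id for doc_id, _ in kept]

docs = [("a", (1.0, 0.0)), ("b", (0.999, 0.02)), ("c", (0.0, 1.0))]
print(semantic_filter(docs))  # "b" is dropped as a near-duplicate of "a"
```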
 ## Quick Start
 
 Get started with semantic deduplication using the following example of identifying duplicates, then remove them in one step:
docs/curate-text/process-data/quality-assessment/distributed-classifier.md: 25 additions, 11 deletions
@@ -20,7 +20,7 @@ The distributed data classification in NeMo Curator works by:
 1. **Parallel Processing**: Chunking datasets across multiple computing nodes and GPUs to accelerate classification
 2. **Pre-trained Models**: Using specialized models for different classification tasks
-3. **Batched Inference**: Optimizing throughput with intelligent batching via CrossFit integration
+3. **Batched Inference**: Optimizing throughput with intelligent batching
 4. **Consistent API**: Providing a unified interface through the `DistributedDataClassifier` base class
 
 The `DistributedDataClassifier` is designed to run on GPU clusters with minimal code changes regardless of which specific classifier you're using. All classifiers support filtering based on classification results and storing prediction scores as metadata.
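The shared `filter_by` behavior described above amounts to keeping only rows whose predicted label is in an allow-list. A stdlib sketch of that semantics over a hypothetical record layout (`domain_pred` is an assumed field name for illustration, not NeMo Curator's API):

```python
def filter_by_label(rows, label_field, allowed):
    """Keep only rows whose predicted label is in the allow-list."""
    allowed = set(allowed)
    return [row for row in rows if row.get(label_field) in allowed]

rows = [
    {"text": "match recap", "domain_pred": "Sports"},            # hypothetical predictions
    {"text": "quantum dots explained", "domain_pred": "Science"},
    {"text": "celebrity gossip", "domain_pred": "Entertainment"},
]
kept = filter_by_label(rows, "domain_pred", ["Sports", "Science"])
print([row["domain_pred"] for row in kept])  # ['Sports', 'Science']
```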
@@ -29,6 +29,16 @@ The `DistributedDataClassifier` is designed to run on GPU clusters with minimal
 Distributed classification requires GPU acceleration and is not supported for CPU-only processing. As long as GPU resources are available and NeMo Curator is correctly installed, GPU acceleration is handled automatically.
 :::
 
+```{tip}
+**Running the tutorial notebooks**: The classification tutorial notebooks require the `text_cuda12` or `all` installation extra to include all relevant dependencies. If you encounter `ModuleNotFoundError`, reinstall with the appropriate extra:
+
+uv pip install "nemo-curator[text_cuda12]"
+
+When using classifiers that download from Hugging Face (such as Aegis and InstructionDataGuard), set your `HF_TOKEN` environment variable to avoid rate limiting:
+
+export HF_TOKEN="your_token_here"
+```
+
 ---
 
 ## Usage
@@ -39,16 +49,16 @@ NVIDIA NeMo Curator provides a base class `DistributedDataClassifier` that can b
-| DomainClassifier | Categorize English text by domain | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) | `filter_by`, `text_field` | None |
-| MultilingualDomainClassifier | Categorize text in 52 languages by domain | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) | `filter_by`, `text_field` | None |
-| ContentTypeClassifier | Categorize by speech type | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | `filter_by`, `text_field` | None |
-| PromptTaskComplexityClassifier | Classify prompt tasks and complexity | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) | `text_field` | None |
+| DomainClassifier | Assigns one of 26 domain labels (such as "Sports," "Science," "News") to English text | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) | `filter_by`, `text_field` | None |
+| MultilingualDomainClassifier | Assigns domain labels to text in 52 languages; same labels as DomainClassifier | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) | `filter_by`, `text_field` | None |
+| QualityClassifier | Rates document quality as "Low," "Medium," or "High" using a DeBERTa model | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) | `filter_by`, `text_field` | None |
+| AegisClassifier | Detects unsafe content across 13 risk categories (violence, hate speech, and others) using LlamaGuard | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) | `aegis_variant`, `filter_by` | HuggingFace token |
+| FineWebEduClassifier | Scores educational value from 0 to 5 (0=spam, 5=scholarly) for training data selection | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) | `label_field`, `int_field` | None |
+| FineWebMixtralEduClassifier | Scores educational value from 0 to 5 using Mixtral 8x22B annotation data | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
+| FineWebNemotronEduClassifier | Scores educational value from 0 to 5 using Nemotron-4-340B annotation data | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
+| ContentTypeClassifier | Categorizes text into 11 speech types (such as "Blogs," "News," "Academic") | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | `filter_by`, `text_field` | None |
+| PromptTaskComplexityClassifier | Labels prompts by task type (such as QA and summarization) and complexity dimensions | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) | `text_field` | None |
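For the FineWeb-style classifiers, `label_field` holds the float score and `int_field` an integer 0 to 5 bucket. The rounding-and-clamping convention can be sketched as follows; the field names used here are hypothetical placeholders, not the classifiers' actual defaults:

```python
def attach_int_score(row, label_field="edu_score", int_field="edu_score_int"):
    """Round the float score into a clamped integer 0-5 bucket stored on the row."""
    score = float(row[label_field])
    row[int_field] = max(0, min(5, round(score)))
    return row

print(attach_int_score({"edu_score": 3.6})["edu_score_int"])  # 4
```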
 ### Domain Classifier
@@ -365,6 +375,10 @@ pipeline.add_stage(writer)
 results = pipeline.run()  # Uses XennaExecutor by default
 ```
 
+## Custom Model Integration
+
+You can integrate your own classification models by extending `DistributedDataClassifier`. Refer to the [Text Classifiers README](https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/stages/text/classifiers#text-classifiers) for implementation details and examples.
+
 ## Performance Optimization
 
 NVIDIA NeMo Curator's distributed classifiers are optimized for high-throughput processing through several key features:
docs/curate-video/process-data/dedup.md: 4 additions, 4 deletions
@@ -15,7 +15,7 @@ modality: "video-only"
 Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.
 
 ## Before You Start
-- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `iv2_embd_parquet/` or `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
+- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
 
 - Verify local paths or configure S3-compatible credentials. Provide `storage_options` in read/write keyword arguments when reading or writing cloud paths.
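The parquet contract above (rows with an `id` and an `embedding` vector, producing a list of duplicate clip ids to drop) can be illustrated with a toy in-memory version. This is a stdlib sketch with an illustrative threshold; the real workflow runs similarity search on GPUs over parquet batches:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def duplicate_ids(records, eps=0.98):
    """Ids of clips whose embedding nearly matches an earlier kept clip."""
    kept, dupes = [], []
    for rec in records:
        if any(cosine(rec["embedding"], k["embedding"]) >= eps for k in kept):
            dupes.append(rec["id"])
        else:
            kept.append(rec)
    return dupes

clips = [
    {"id": "clip-0", "embedding": (1.0, 0.0)},
    {"id": "clip-1", "embedding": (0.999, 0.01)},  # near-duplicate of clip-0
    {"id": "clip-2", "embedding": (0.0, 1.0)},
]
print(duplicate_ids(clips))  # ['clip-1']
```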
@@ -24,7 +24,7 @@ Use clip-level embeddings to identify near-duplicate video clips so your dataset
 Duplicate identification operates on clip-level embeddings produced during processing:
 
 1. **Inputs**
-   - Parquet batches from `ClipWriterStage` under `iv2_embd_parquet/` or `ce1_embd_parquet/`
+   - Parquet batches from `ClipWriterStage` under `ce1_embd_parquet/`
    - Columns: `id`, `embedding`
 
 2. **Outputs**
@@ -50,13 +50,13 @@ from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
 from nemo_curator.backends.xenna import XennaExecutor
 
 workflow = SemanticDeduplicationWorkflow(
-    input_path="/path/to/embeddings/",  # e.g., iv2_embd_parquet/ or ce1_embd_parquet/