
Commit 242c9aa

lbliii and sarahyurick authored

docs: vdr feedback (#1477)

* docs: vdr feedback
* Update docs/admin/installation.md
* Update docs/admin/installation.md
* Update docs/admin/installation.md
* Update docs/curate-text/process-data/quality-assessment/distributed-classifier.md
* feedback
* feedback
* re order sidebar
* release notes draft
* feedback
* remove more internvid content
* release note fix
* Update docs/curate-video/tutorials/split-dedup.md

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

1 parent f1eccef · commit 242c9aa

File tree

25 files changed: +285 −263 lines


docs/about/release-notes/index.md

Lines changed: 122 additions & 185 deletions
Large diffs are not rendered by default.

docs/admin/installation.md

Lines changed: 22 additions & 5 deletions
@@ -18,14 +18,15 @@ This guide covers installing NeMo Curator with support for **all modalities** an

 ### System Requirements

-For comprehensive system requirements and production deployment specifications, see [Production Deployment Requirements](deployment/requirements.md).
+For comprehensive system requirements and production deployment specifications, refer to [Production Deployment Requirements](deployment/requirements.md).

 **Quick Start Requirements:**

 - **OS**: Ubuntu 24.04/22.04/20.04 (recommended)
 - **Python**: 3.10, 3.11, or 3.12
 - **Memory**: 16GB+ RAM for basic text processing
 - **GPU** (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
+- **CUDA 12** (required for `audio_cuda12`, `video_cuda12`, `image_cuda12`, and `text_cuda12` extras)

 ### Development vs Production

@@ -41,9 +42,13 @@ For comprehensive system requirements and production deployment specifications,

 Choose one of the following installation methods based on your needs:

+:::{tip}
+**Docker is the recommended installation method** for video and audio workflows. The NeMo Curator container includes FFmpeg (with NVENC support) pre-configured, avoiding manual dependency setup. Refer to the [Container Installation](#container-installation) tab below.
+:::
+
 ::::{tab-set}

-:::{tab-item} PyPI Installation (Recommended)
+:::{tab-item} PyPI Installation

 Install NeMo Curator from the Python Package Index using `uv` for proper dependency resolution.

@@ -89,9 +94,9 @@ uv sync --all-extras --all-groups

 :::

-:::{tab-item} Container Installation
+:::{tab-item} Container Installation (Recommended for Video/Audio)

-NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed. You can run it with:
+NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed, including FFmpeg with NVENC support.

 ```bash
 # Pull the container from NGC

@@ -101,6 +106,14 @@ docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }}

 docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:{{ container_version }}
 ```

+```{important}
+After entering the container, activate the virtual environment before running any NeMo Curator commands:
+
+    source /opt/venv/env.sh
+
+The container uses a virtual environment at `/opt/venv`. If you see `No module named nemo_curator`, the environment has not been activated.
+```
+
 Alternatively, you can build the NeMo Curator container locally using the provided Dockerfile:

 ```bash

@@ -115,7 +128,7 @@ docker run --gpus all -it --rm nemo-curator:latest

 **Benefits:**

-- Pre-configured environment with all dependencies
+- Pre-configured environment with all dependencies (FFmpeg, CUDA libraries)
 - Consistent runtime across different systems
 - Ideal for production deployments

@@ -157,6 +170,10 @@ If encoders are missing, reinstall `FFmpeg` with the required options or use the

 :::
 ::::

+```{note}
+**FFmpeg build requires CUDA toolkit (nvcc):** If you encounter `ERROR: failed checking for nvcc` during FFmpeg installation, ensure that the CUDA toolkit is installed and `nvcc` is available on your `PATH`. You can verify with `nvcc --version`. If using the NeMo Curator container, FFmpeg is pre-installed with NVENC support.
+```
+
 ---

 ## Package Extras
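The container steps added in this diff can be sketched as a single dry-run script. Everything here is illustrative: `CONTAINER_TAG` is a placeholder for `{{ container_version }}`, and the `run` helper prints commands instead of executing them.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the documented container workflow (illustrative only).
set -eu

CONTAINER_TAG="${CONTAINER_TAG:-latest}"   # stand-in for {{ container_version }}
IMAGE="nvcr.io/nvidia/nemo-curator:${CONTAINER_TAG}"

run() { echo "+ $*"; }   # print instead of execute

run docker pull "$IMAGE"
# --gpus all is required so GPU stages (dedup, classifiers, video) see the GPU
run docker run --gpus all -it --rm "$IMAGE"
# Inside the container, activate the venv before any NeMo Curator command
run source /opt/venv/env.sh
```

Replace the `run` helper with direct execution once the image tag matches your release.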
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+{"filename": "get-started/text.md", "lineno": 119, "status": "broken", "code": 0, "uri": "https://huggingface.co/settings/tokens", "info": "unauthorized"}

docs/curate-text/process-data/deduplication/fuzzy.md

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,10 @@ Ideal for detecting documents with minor differences such as formatting changes,

 - Ray cluster with GPU support (required for distributed processing)
 - Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

+```{note}
+**Running in Docker**: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that Ray workers can access the GPU. Without GPU access, you may see `CUDARuntimeError` or `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
+```
+
 ## Quick Start

 Get started with fuzzy deduplication using the following example of identifying duplicates, then remove them:
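The Docker note above can be folded into a small preflight check run before launching the workflow. Both helpers here are hypothetical, not part of NeMo Curator: one verifies that `nvidia-smi` can see a GPU (which fails when the container was started without `--gpus all`), the other that `nemo_curator` is importable (which fails when the venv was not activated).

```shell
#!/usr/bin/env bash
# Hypothetical preflight checks before running fuzzy deduplication
# inside the NeMo Curator container (illustrative only).

require_gpu() {
  # Fails when the container lacks --gpus all or no driver is present
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo "GPU visible: ok"
    return 0
  fi
  echo "No GPU visible. Restart the container with: docker run --gpus all ..." >&2
  return 1
}

require_venv() {
  # The container venv must be activated first: source /opt/venv/env.sh
  if python3 -c "import nemo_curator" >/dev/null 2>&1; then
    echo "nemo_curator importable: ok"
    return 0
  fi
  echo "nemo_curator not importable; run: source /opt/venv/env.sh" >&2
  return 1
}

require_gpu || true
require_venv || true
```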

docs/curate-text/process-data/deduplication/semdedup.md

Lines changed: 4 additions & 0 deletions
@@ -42,6 +42,10 @@ Based on [SemDeDup: Data-efficient learning at web-scale through semantic dedupl

 - GPU acceleration (required for embedding generation and clustering)
 - Stable document identifiers for removal (either existing IDs or IDs managed by the workflow and removal stages)

+```{note}
+**Running in Docker**: When running semantic deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that CUDA GPUs are available. Without this flag, you will see `RuntimeError: No CUDA GPUs are available`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
+```
+
 ## Quick Start

 Get started with semantic deduplication using the following example of identifying duplicates, then remove them in one step:

docs/curate-text/process-data/quality-assessment/distributed-classifier.md

Lines changed: 25 additions & 11 deletions
@@ -20,7 +20,7 @@ The distributed data classification in NeMo Curator works by:

 1. **Parallel Processing**: Chunking datasets across multiple computing nodes and GPUs to accelerate classification
 2. **Pre-trained Models**: Using specialized models for different classification tasks
-3. **Batched Inference**: Optimizing throughput with intelligent batching via CrossFit integration
+3. **Batched Inference**: Optimizing throughput with intelligent batching
 4. **Consistent API**: Providing a unified interface through the `DistributedDataClassifier` base class

 The `DistributedDataClassifier` is designed to run on GPU clusters with minimal code changes regardless of which specific classifier you're using. All classifiers support filtering based on classification results and storing prediction scores as metadata.

@@ -29,6 +29,16 @@ The `DistributedDataClassifier` is designed to run on GPU clusters with minimal

 Distributed classification requires GPU acceleration and is not supported for CPU-only processing. As long as GPU resources are available and NeMo Curator is correctly installed, GPU acceleration is handled automatically.
 :::

+```{tip}
+**Running the tutorial notebooks**: The classification tutorial notebooks require the `text_cuda12` or `all` installation extra to include all relevant dependencies. If you encounter `ModuleNotFoundError`, reinstall with the appropriate extra:
+
+    uv pip install "nemo-curator[text_cuda12]"
+
+When using classifiers that download from Hugging Face (such as Aegis and InstructionDataGuard), set your `HF_TOKEN` environment variable to avoid rate limiting:
+
+    export HF_TOKEN="your_token_here"
+```
+
 ---

 ## Usage

@@ -39,16 +49,16 @@ NVIDIA NeMo Curator provides a base class `DistributedDataClassifier` that can b

 | Classifier | Purpose | Model Location | Key Parameters | Requirements |
 |---|---|---|---|---|
-| DomainClassifier | Categorize English text by domain | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) | `filter_by`, `text_field` | None |
-| MultilingualDomainClassifier | Categorize text in 52 languages by domain | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) | `filter_by`, `text_field` | None |
-| QualityClassifier | Assess document quality | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) | `filter_by`, `text_field` | None |
-| AegisClassifier | Detect unsafe content | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) | `aegis_variant`, `filter_by` | HuggingFace token |
-| InstructionDataGuardClassifier | Detect poisoning attacks | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) | `text_field`, `label_field` | HuggingFace token |
-| FineWebEduClassifier | Score educational value | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) | `label_field`, `int_field` | None |
-| FineWebMixtralEduClassifier | Score educational value (Mixtral annotations) | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
-| FineWebNemotronEduClassifier | Score educational value (Nemotron annotations) | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
-| ContentTypeClassifier | Categorize by speech type | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | `filter_by`, `text_field` | None |
-| PromptTaskComplexityClassifier | Classify prompt tasks and complexity | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) | `text_field` | None |
+| DomainClassifier | Assigns one of 26 domain labels (such as "Sports," "Science," "News") to English text | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) | `filter_by`, `text_field` | None |
+| MultilingualDomainClassifier | Assigns domain labels to text in 52 languages; same labels as DomainClassifier | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) | `filter_by`, `text_field` | None |
+| QualityClassifier | Rates document quality as "Low," "Medium," or "High" using a DeBERTa model | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) | `filter_by`, `text_field` | None |
+| AegisClassifier | Detects unsafe content across 13 risk categories (violence, hate speech, and others) using LlamaGuard | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) | `aegis_variant`, `filter_by` | HuggingFace token |
+| InstructionDataGuardClassifier | Identifies LLM poisoning attacks in instruction-response pairs | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) | `text_field`, `label_field` | HuggingFace token |
+| FineWebEduClassifier | Scores educational value from 0 to 5 (0=spam, 5=scholarly) for training data selection | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) | `label_field`, `int_field` | None |
+| FineWebMixtralEduClassifier | Scores educational value from 0 to 5 using Mixtral 8x22B annotation data | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
+| FineWebNemotronEduClassifier | Scores educational value from 0 to 5 using Nemotron-4-340B annotation data | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
+| ContentTypeClassifier | Categorizes text into 11 speech types (such as "Blogs," "News," "Academic") | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | `filter_by`, `text_field` | None |
+| PromptTaskComplexityClassifier | Labels prompts by task type (such as QA and summarization) and complexity dimensions | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) | `text_field` | None |

 ### Domain Classifier

@@ -365,6 +375,10 @@ pipeline.add_stage(writer)

 results = pipeline.run()  # Uses XennaExecutor by default
 ```

+## Custom Model Integration
+
+You can integrate your own classification models by extending `DistributedDataClassifier`. Refer to the [Text Classifiers README](https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/stages/text/classifiers#text-classifiers) for implementation details and examples.
+
 ## Performance Optimization

 NVIDIA NeMo Curator's distributed classifiers are optimized for high-throughput processing through several key features:
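The tip added in this diff (extras plus `HF_TOKEN`) can be wrapped in a small environment check run before opening the notebooks. `check_classifier_env` is a hypothetical helper, not part of NeMo Curator; the package extra and the gated-classifier names follow the tip text above.

```shell
#!/usr/bin/env bash
# Hypothetical environment check for the classification tutorial notebooks.

check_classifier_env() {
  # Gated classifiers (Aegis, InstructionDataGuard) need a Hugging Face token
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "warning: HF_TOKEN is not set; Aegis and InstructionDataGuard downloads will fail" >&2
  else
    echo "HF_TOKEN is set"
  fi

  # The notebooks need the GPU text extra installed
  if ! python3 -c "import nemo_curator" >/dev/null 2>&1; then
    echo 'hint: uv pip install "nemo-curator[text_cuda12]"' >&2
  fi
}

check_classifier_env
```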

docs/curate-video/process-data/captions-preview.md

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ pipe.run()

 :::{tab-item} Script Flags

 ```bash
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
   ... \
   --generate-captions \
   --captioning-algorithm qwen \

docs/curate-video/process-data/clipping.md

Lines changed: 2 additions & 2 deletions
@@ -84,15 +84,15 @@ pipe.run()

 ```bash
 # Fixed stride
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
   ... \
   --splitting-algorithm fixed_stride \
   --fixed-stride-split-duration 10.0 \
   --fixed-stride-min-clip-length-s 2.0 \
   --limit-clips 0

 # TransNetV2
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
   ... \
   --splitting-algorithm transnetv2 \
   --transnetv2-frame-decoder-mode pynvc \

docs/curate-video/process-data/dedup.md

Lines changed: 4 additions & 4 deletions
@@ -15,7 +15,7 @@ modality: "video-only"

 Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.

 ## Before You Start
-- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `iv2_embd_parquet/` or `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
+- Make sure you have embeddings which are written by the [`ClipWriterStage`](video-save-export) under `ce1_embd_parquet/`. For a runnable workflow, refer to the [Split and Remove Duplicates Workflow](video-tutorials-split-dedup). The embeddings must be in parquet files containing the columns `id` and `embedding`.
 - Verify local paths or configure S3-compatible credentials. Provide `storage_options` in read/write keyword arguments when reading or writing cloud paths.

@@ -24,7 +24,7 @@ Use clip-level embeddings to identify near-duplicate video clips so your dataset

 Duplicate identification operates on clip-level embeddings produced during processing:

 1. **Inputs**
-   - Parquet batches from `ClipWriterStage` under `iv2_embd_parquet/` or `ce1_embd_parquet/`
+   - Parquet batches from `ClipWriterStage` under `ce1_embd_parquet/`
    - Columns: `id`, `embedding`

 2. **Outputs**

@@ -50,13 +50,13 @@ from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy

 from nemo_curator.backends.xenna import XennaExecutor

 workflow = SemanticDeduplicationWorkflow(
-    input_path="/path/to/embeddings/",  # e.g., iv2_embd_parquet/ or ce1_embd_parquet/
+    input_path="/path/to/embeddings/",  # e.g., ce1_embd_parquet/
     output_path="/path/to/duplicates/",
     cache_path="/path/to/cache/",  # Optional: defaults to output_path
     n_clusters=1000,
     id_field="id",
     embedding_field="embedding",
-    embedding_dim=512,  # 512 for InternVideo2, varies for Cosmos-Embed1
+    embedding_dim=768,  # Embedding dimension (768 for Cosmos-Embed1, varies by model)
     input_filetype="parquet",
     eps=0.1,  # Similarity threshold: cosine_sim >= 1.0 - eps identifies duplicates
     ranking_strategy=RankingStrategy.metadata_based(
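Since this diff narrows the expected input to `ce1_embd_parquet/`, a quick sanity check of that directory before launching `SemanticDeduplicationWorkflow` can save a failed run. `check_embeddings_dir` is an illustrative helper, not part of NeMo Curator; it only confirms the directory exists and contains parquet batches.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check for the embeddings input directory.

check_embeddings_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing embeddings directory: $dir" >&2
    return 1
  fi
  local n
  n=$(find "$dir" -name '*.parquet' | wc -l)
  echo "found $n parquet file(s) under $dir"
}
```

For example, `check_embeddings_dir /data/clips/ce1_embd_parquet` (hypothetical path) should report the batch files written by `ClipWriterStage`.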

docs/curate-video/process-data/embeddings.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ pipe.run()

 ```bash
 # Cosmos-Embed1 (224p)
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
   ... \
   --generate-embeddings \
   --embedding-algorithm cosmos-embed1-224p \
