Commit bda3993

committed: continue adding more changes

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
1 parent 238a8d4 commit bda3993

File tree

10 files changed: +74 / -45 lines changed


api-design.md

Lines changed: 0 additions & 2 deletions

````diff
@@ -132,8 +132,6 @@ class Resources:
     cpus: float = 1.0  # Number of CPU cores
     gpu_memory_gb: float = 0.0  # Number of GPU memory in GB (Only for single GPU)
     gpus: float = 0.0  # Number of GPUs (Only for multi-GPU)
-    nvdecs: int = 0  # Number of NVDEC decoders
-    nvencs: int = 0  # Number of NVENC encoders
     entire_gpu: bool = False  # Whether to use the entire GPU
 ```
````
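For reference, the `Resources` fields that remain after this change can be sketched as a standalone dataclass. This is illustrative only; the actual class lives in NeMo Curator and may carry additional fields:

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Sketch of the per-stage resource request after this commit."""
    cpus: float = 1.0            # Number of CPU cores
    gpu_memory_gb: float = 0.0   # GPU memory in GB (only for single GPU)
    gpus: float = 0.0            # Number of GPUs (only for multi-GPU)
    entire_gpu: bool = False     # Whether to use the entire GPU


# The nvdecs/nvencs fields removed by this commit are no longer part of the request.
r = Resources(gpus=2.0)
print(r.cpus, r.gpus, r.entire_gpu)
```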

docs/admin/installation.md

Lines changed: 22 additions & 5 deletions

````diff
@@ -18,14 +18,15 @@ This guide covers installing NeMo Curator with support for **all modalities** an

 ### System Requirements

-For comprehensive system requirements and production deployment specifications, see [Production Deployment Requirements](deployment/requirements.md).
+For comprehensive system requirements and production deployment specifications, refer to [Production Deployment Requirements](deployment/requirements.md).

 **Quick Start Requirements:**

 - **OS**: Ubuntu 24.04/22.04/20.04 (recommended)
 - **Python**: 3.10, 3.11, or 3.12
 - **Memory**: 16GB+ RAM for basic text processing
 - **GPU** (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
+- **CUDA 12** (required for `audio_cuda12`, `video_cuda12`, `image_cuda12`, and `text_cuda12` extras)

 ### Development vs Production
````

````diff
@@ -41,9 +42,13 @@ For comprehensive system requirements and production deployment specifications,

 Choose one of the following installation methods based on your needs:

+:::{tip}
+**Docker is the recommended installation method** for video and audio workflows. The NeMo Curator container includes FFmpeg (with NVENC support) pre-configured, avoiding manual dependency setup. Refer to the [Container Installation](#container-installation) tab below.
+:::
+
 ::::{tab-set}

-:::{tab-item} PyPI Installation (Recommended)
+:::{tab-item} PyPI Installation

 Install NeMo Curator from the Python Package Index using `uv` for proper dependency resolution.

````

````diff
@@ -89,9 +94,9 @@ uv sync --all-extras --all-groups

 :::

-:::{tab-item} Container Installation
+:::{tab-item} Container Installation (Recommended for Video/Audio)

-NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed. You can run it with:
+NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed, including FFmpeg with NVENC support.

 ```bash
 # Pull the container from NGC
````

````diff
@@ -101,6 +106,14 @@ docker pull nvcr.io/nvidia/nemo-curator:{{ container_version }}
 docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:{{ container_version }}
 ```

+```{important}
+After entering the container, activate the virtual environment before running any NeMo Curator commands:
+
+    source /opt/venv/env.sh
+
+The container uses a virtual environment at `/opt/venv`. If you see `No module named nemo_curator`, the environment has not been activated.
+```
+
 Alternatively, you can build the NeMo Curator container locally using the provided Dockerfile:

 ```bash
````

````diff
@@ -115,7 +128,7 @@ docker run --gpus all -it --rm nemo-curator:latest

 **Benefits:**

-- Pre-configured environment with all dependencies
+- Pre-configured environment with all dependencies (FFmpeg, CUDA libraries)
 - Consistent runtime across different systems
 - Ideal for production deployments

````

````diff
@@ -157,6 +170,10 @@ If encoders are missing, reinstall `FFmpeg` with the required options or use the
 :::
 ::::

+```{note}
+**FFmpeg build requires CUDA toolkit (nvcc):** If you encounter `ERROR: failed checking for nvcc` during FFmpeg installation, ensure that the CUDA toolkit is installed and `nvcc` is available on your `PATH`. You can verify with `nvcc --version`. If using the NeMo Curator container, FFmpeg is pre-installed with NVENC support.
+```
+
 ---

 ## Package Extras
````
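The nvcc troubleshooting note added in this file can be automated as a small preflight check before building FFmpeg. This helper is illustrative and not part of NeMo Curator; it only assumes that a working CUDA toolkit puts `nvcc` on `PATH`:

```python
import shutil
import subprocess


def cuda_toolkit_available() -> bool:
    """Return True if `nvcc` is on PATH and responds to --version."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return False
    try:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, timeout=10
        )
    except OSError:
        return False
    return result.returncode == 0


# Run before attempting an FFmpeg build with NVENC support:
if not cuda_toolkit_available():
    print("nvcc not found: install the CUDA toolkit or use the NeMo Curator container.")
```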
Lines changed: 1 addition & 0 deletions
Lines changed: 1 addition & 0 deletions

````diff
@@ -0,0 +1 @@
+{"filename": "get-started/text.md", "lineno": 119, "status": "broken", "code": 0, "uri": "https://huggingface.co/settings/tokens", "info": "unauthorized"}
````

docs/curate-text/process-data/deduplication/fuzzy.md

Lines changed: 4 additions & 0 deletions

````diff
@@ -34,6 +34,10 @@ Ideal for detecting documents with minor differences such as formatting changes,
 - Ray cluster with GPU support (required for distributed processing)
 - Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

+```{note}
+**Running in Docker**: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that Ray workers can access the GPU. Without GPU access, you may see `CUDARuntimeError` or `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
+```
+
 ## Quick Start

 Get started with fuzzy deduplication using the following example of identifying duplicates, then remove them:
````
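Taken together, the Docker note above amounts to the following launch sequence, assembled from commands that appear elsewhere on this page (the image tag is a placeholder; substitute your container version):

```shell
# Start the container with GPU access so Ray workers can see the GPUs
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:<version>

# Inside the container, activate the virtual environment before any curation commands
source /opt/venv/env.sh
```

This is an environment-setup fragment, not a runnable script on its own; without `--gpus all`, the GPU-dependent deduplication stages fail with the CUDA errors described in the note.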

docs/curate-text/process-data/deduplication/semdedup.md

Lines changed: 4 additions & 0 deletions

````diff
@@ -42,6 +42,10 @@ Based on [SemDeDup: Data-efficient learning at web-scale through semantic dedupl
 - GPU acceleration (required for embedding generation and clustering)
 - Stable document identifiers for removal (either existing IDs or IDs managed by the workflow and removal stages)

+```{note}
+**Running in Docker**: When running semantic deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that CUDA GPUs are available. Without this flag, you will see `RuntimeError: No CUDA GPUs are available`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
+```
+
 ## Quick Start

 Get started with semantic deduplication using the following example of identifying duplicates, then remove them in one step:
````

docs/curate-text/process-data/quality-assessment/distributed-classifier.md

Lines changed: 25 additions & 11 deletions

````diff
@@ -20,7 +20,7 @@ The distributed data classification in NeMo Curator works by:

 1. **Parallel Processing**: Chunking datasets across multiple computing nodes and GPUs to accelerate classification
 2. **Pre-trained Models**: Using specialized models for different classification tasks
-3. **Batched Inference**: Optimizing throughput with intelligent batching via CrossFit integration
+3. **Batched Inference**: Optimizing throughput with intelligent batching
 4. **Consistent API**: Providing a unified interface through the `DistributedDataClassifier` base class

 The `DistributedDataClassifier` is designed to run on GPU clusters with minimal code changes regardless of which specific classifier you're using. All classifiers support filtering based on classification results and storing prediction scores as metadata.
````

````diff
@@ -29,6 +29,16 @@ The `DistributedDataClassifier` is designed to run on GPU clusters with minimal
 Distributed classification requires GPU acceleration and is not supported for CPU-only processing. As long as GPU resources are available and NeMo Curator is correctly installed, GPU acceleration is handled automatically.
 :::

+```{tip}
+**Running the tutorial notebooks**: The classification tutorial notebooks require the `text_cuda12` or `all` installation extra to include all relevant dependencies. If you encounter `ModuleNotFoundError`, reinstall with the appropriate extra:
+
+    uv pip install "nemo-curator[text_cuda12]"
+
+When using classifiers that download from Hugging Face (such as Aegis and InstructionDataGuard), set your `HF_TOKEN` environment variable to avoid rate limiting:
+
+    export HF_TOKEN="your_token_here"
+```
+
 ---

 ## Usage
````

````diff
@@ -39,16 +49,16 @@ NVIDIA NeMo Curator provides a base class `DistributedDataClassifier` that can b

 | Classifier | Purpose | Model Location | Key Parameters | Requirements |
 |---|---|---|---|---|
-| DomainClassifier | Categorize English text by domain | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) | `filter_by`, `text_field` | None |
-| MultilingualDomainClassifier | Categorize text in 52 languages by domain | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) | `filter_by`, `text_field` | None |
-| QualityClassifier | Assess document quality | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) | `filter_by`, `text_field` | None |
-| AegisClassifier | Detect unsafe content | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) | `aegis_variant`, `filter_by` | HuggingFace token |
-| InstructionDataGuardClassifier | Detect poisoning attacks | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) | `text_field`, `label_field` | HuggingFace token |
-| FineWebEduClassifier | Score educational value | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) | `label_field`, `int_field` | None |
-| FineWebMixtralEduClassifier | Score educational value (Mixtral annotations) | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
-| FineWebNemotronEduClassifier | Score educational value (Nemotron annotations) | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
-| ContentTypeClassifier | Categorize by speech type | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | `filter_by`, `text_field` | None |
-| PromptTaskComplexityClassifier | Classify prompt tasks and complexity | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) | `text_field` | None |
+| DomainClassifier | Assigns one of 26 domain labels (such as "Sports," "Science," "News") to English text | [nvidia/domain-classifier](https://huggingface.co/nvidia/domain-classifier) | `filter_by`, `text_field` | None |
+| MultilingualDomainClassifier | Assigns domain labels to text in 52 languages; same labels as DomainClassifier | [nvidia/multilingual-domain-classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) | `filter_by`, `text_field` | None |
+| QualityClassifier | Rates document quality as "Low," "Medium," or "High" using a DeBERTa model | [nvidia/quality-classifier-deberta](https://huggingface.co/nvidia/quality-classifier-deberta) | `filter_by`, `text_field` | None |
+| AegisClassifier | Detects unsafe content across 13 risk categories (violence, hate speech, and others) using LlamaGuard | [nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) | `aegis_variant`, `filter_by` | HuggingFace token |
+| InstructionDataGuardClassifier | Identifies LLM poisoning attacks in instruction-response pairs | [nvidia/instruction-data-guard](https://huggingface.co/nvidia/instruction-data-guard) | `text_field`, `label_field` | HuggingFace token |
+| FineWebEduClassifier | Scores educational value from 0 to 5 (0=spam, 5=scholarly) for training data selection | [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) | `label_field`, `int_field` | None |
+| FineWebMixtralEduClassifier | Scores educational value from 0 to 5 using Mixtral 8x22B annotation data | [nvidia/nemocurator-fineweb-mixtral-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
+| FineWebNemotronEduClassifier | Scores educational value from 0 to 5 using Nemotron-4-340B annotation data | [nvidia/nemocurator-fineweb-nemotron-4-edu-classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) | `label_field`, `int_field`, `model_inference_batch_size=1024` | None |
+| ContentTypeClassifier | Categorizes text into 11 speech types (such as "Blogs," "News," "Academic") | [nvidia/content-type-classifier-deberta](https://huggingface.co/nvidia/content-type-classifier-deberta) | `filter_by`, `text_field` | None |
+| PromptTaskComplexityClassifier | Labels prompts by task type (such as QA and summarization) and complexity dimensions | [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) | `text_field` | None |

 ### Domain Classifier

````

````diff
@@ -365,6 +375,10 @@ pipeline.add_stage(writer)
 results = pipeline.run() # Uses XennaExecutor by default
 ```

+## Custom Model Integration
+
+You can integrate your own classification models by extending `DistributedDataClassifier`. Refer to the [Text Classifiers README](https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/stages/text/classifiers#text-classifiers) for implementation details and examples.
+
 ## Performance Optimization

 NVIDIA NeMo Curator's distributed classifiers are optimized for high-throughput processing through several key features:
````
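The base-class contract described in this file (classify each record, store the prediction as metadata, optionally filter with `filter_by`) can be illustrated with a self-contained toy. This is not NeMo Curator's actual API; the class, method names, and length-based "model" below are stand-ins for the shape a custom classifier fills in:

```python
from dataclasses import dataclass, field


@dataclass
class ToyDistributedDataClassifier:
    """Toy stand-in for the DistributedDataClassifier contract."""
    text_field: str = "text"
    label_field: str = "label"
    filter_by: list = field(default_factory=list)

    def _predict(self, text: str) -> str:
        # A real subclass runs batched GPU inference here.
        return "High" if len(text) > 20 else "Low"

    def __call__(self, records: list[dict]) -> list[dict]:
        # Attach the prediction as metadata on every record.
        for rec in records:
            rec[self.label_field] = self._predict(rec[self.text_field])
        # Optionally keep only records whose label is in filter_by.
        if self.filter_by:
            records = [r for r in records if r[self.label_field] in self.filter_by]
        return records


clf = ToyDistributedDataClassifier(filter_by=["High"])
out = clf([{"text": "short"}, {"text": "a much longer document about sports"}])
print(out)  # only the record labeled "High" survives
```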

docs/curate-video/index.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -33,7 +33,7 @@ Understand how components work together so you can plan, scale, and troubleshoot
 ```

 ```{note}
-Video pipelines use the `XennaExecutor` backend by default, which provides optimized support for GPU-accelerated video processing including hardware decoders (`nvdecs`) and encoders (`nvencs`). You do not need to import or configure the executor unless you want to use an alternative backend. For more information about customizing backends, refer to [Add a Custom Stage](video-tutorials-pipeline-cust-add-stage).
+Video pipelines use the `XennaExecutor` backend by default, which provides optimized support for GPU-accelerated video processing including hardware decoders and encoders. You do not need to import or configure the executor unless you want to use an alternative backend. For more information about customizing backends, refer to [Pipeline Execution Backends](reference-execution-backends).
 ```

 ---
````

docs/curate-video/load-data/index.md

Lines changed: 14 additions & 23 deletions

````diff
@@ -18,11 +18,20 @@ Load video data for curation using NeMo Curator.

 NeMo Curator loads videos with a composite stage that discovers files and extracts metadata:

-1. `VideoReader` decomposes into a partitioning stage plus a reader stage.
-2. Local paths use `FilePartitioningStage` to list files; remote URLs (for example, `s3://`, `gcs://`, `http(s)://`) use `ClientPartitioningStage` backed by `fsspec`.
-3. For remote datasets, you can optionally supply an explicit file list using `ClientPartitioningStage.input_list_json_path`.
-4. `VideoReaderStage` downloads bytes (local or via `FSPath`) and calls `video.populate_metadata()` to extract resolution, fps, duration, encoding format, and other fields.
-5. Set `video_limit` to cap discovery; use `None` for unlimited. Set `verbose=True` to log detailed per-video information.
+`VideoReader` is a composite stage that breaks down into:
+
+1. A partitioning (file listing) stage
+   - Local paths use `FilePartitioningStage` to list files.
+   - Remote URLs (for example, `s3://`, `gcs://`) use `ClientPartitioningStage` backed by `fsspec`.
+   - Optional `input_list_json_path` allows explicit file lists under a root prefix.
+
+2. A reader stage (`VideoReaderStage`)
+   - Downloads the bytes (local or via `FSPath`) for each listed file.
+   - Calls `video.populate_metadata()` to extract resolution, fps, duration, encoding format, and other fields.
+
+You can also set:
+- `video_limit` to limit the number of files processed; use `None` for unlimited.
+- `verbose=True` to log detailed per-video information.

 ---
````
````diff
@@ -32,24 +41,6 @@ NeMo Curator loads videos with a composite stage that discovers files and extrac

 Use `VideoReader` to load videos from local paths or remote URLs.

-### Local Paths
-
-- Examples: `/data/videos/`, `/mnt/datasets/av/`
-- Uses `FilePartitioningStage` to recursively discover files.
-- Filters by extensions: `.mp4`, `.mov`, `.avi`, `.mkv`, `.webm`.
-- Set `video_limit` to cap discovery during testing (`None` means unlimited).
-
-### Remote Paths
-
-- Examples: `s3://bucket/path/`, `gcs://bucket/path/`, `https://host/path/`, and other fsspec-supported protocols such as `s3a://` and `abfs://`.
-- Uses `ClientPartitioningStage` backed by `fsspec` to list files.
-- Optional `input_list_json_path` allows explicit file lists under a root prefix.
-- Wraps entries as `FSPath` for efficient byte access during reading.
-
-```{tip}
-Use an object storage prefix (for example, `s3://my-bucket/videos/`) to stream from cloud storage. Configure credentials in your environment or client configuration.
-```
-
 ### Example

 ```python
````

docs/curate-video/process-data/captions-preview.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -68,7 +68,7 @@ pipe.run()
 :::{tab-item} Script Flags

 ```bash
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
   ... \
   --generate-captions \
   --captioning-algorithm qwen \
````

docs/curate-video/process-data/clipping.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -84,15 +84,15 @@ pipe.run()

 ```bash
 # Fixed stride
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
   ... \
   --splitting-algorithm fixed_stride \
   --fixed-stride-split-duration 10.0 \
   --fixed-stride-min-clip-length-s 2.0 \
   --limit-clips 0

 # TransNetV2
-python -m nemo_curator.examples.video.video_split_clip_example \
+python tutorials/video/getting-started/video_split_clip_example.py \
   ... \
   --splitting-algorithm transnetv2 \
   --transnetv2-frame-decoder-mode pynvc \
````
