add more pages

sarahyurick · sarahyurick · commit 2fdfe14e385b · 2026-02-11T15:07:00.000-08:00
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
diff --git a/docs/about/concepts/video/abstractions.md b/docs/about/concepts/video/abstractions.md
@@ -48,7 +48,7 @@ A stage represents a single step in your data curation workflow. Video stages ar
 Each processing stage:
 
 1. Inherits from `ProcessingStage`
-2. Declares a stable `name` and `resources: Resources` (CPU cores, GPU memory, optional NVDEC/NVENC, or more than one GPU)
+2. Declares a stable `name` and `resources: Resources` (CPU cores, GPU memory, entire GPU flag, or multiple GPUs)
 3. Defines `inputs()`/`outputs()` to document required attributes and produced attributes on tasks
 4. Implements `setup(worker_metadata)` for model initialization and `process(task)` to transform tasks
 
@@ -75,9 +75,8 @@ Refer to the stage base and resources definitions in Curator for full details.
 `Resources` support both fractional and whole‑GPU semantics:
 
 - `gpu_memory_gb`: Request a fraction of a single GPU by memory; Curator rounds to a fractional GPU share and enforces that `gpu_memory_gb` stays within one device.
-- `entire_gpu`: Request an entire GPU regardless of memory (also implies access to NVDEC/NVENC on that device).
+- `entire_gpu`: Request an entire GPU regardless of memory (also implies access to hardware decoders and encoders on that device).
 - `gpus`: Request more than one GPU for a stage that is multi‑GPU aware.
-- `nvdecs` / `nvencs`: Request hardware decode/encode units when needed.
 
 Choose one of `gpu_memory_gb` (single‑GPU fractional) or `gpus` (multi‑GPU). Combining both is not allowed.
 
diff --git a/docs/about/release-notes/index.md b/docs/about/release-notes/index.md
@@ -37,14 +37,6 @@ Enhanced features for the experimental Ray Actor Pool execution backend:
 
 Learn more in the [Execution Backends documentation](../../reference/infrastructure/execution-backends.md).
 
-### Enhanced Embedding Generation
-
-Expanded embedding support with new model integrations:
-
-- **vLLM Integration**: High-performance LLM-based embedding generation with automatic batching
-- **Sentence Transformers**: Support for popular sentence embedding models
-- **Unified API**: Consistent embedding interface across text, image, and video modalities
-
 ### YAML Configuration Support
 
 Declarative pipeline configuration for text curation workflows:
@@ -65,7 +57,6 @@ python -m nemo_curator.config.run --config_file heuristic_filter_english_pipelin
 
 New API for tracking and analyzing pipeline execution:
 
-- **WorkflowRunResult**: Structured results object capturing execution metrics
 - **Performance Metrics**: Automatic tracking of processing time, throughput, and resource usage
 - **Better Debugging**: Detailed logs and error reporting for failed stages
 
diff --git a/docs/curate-audio/tutorials/beginner.md b/docs/curate-audio/tutorials/beginner.md
@@ -55,7 +55,7 @@ cd NeMo-Curator/tutorials/audio/fleurs/
 
 ## Prerequisites
 
-* NeMo Curator installed (see [Installation Guide](docs/admin/installation.md))
+* NeMo Curator installed (see [Installation Guide](admin-installation))
 * NVIDIA GPU (required for ASR inference, minimum 16GB VRAM recommended)
 * Internet connection for dataset download
 * Basic Python knowledge
@@ -373,7 +373,7 @@ After completing this tutorial, explore:
 
 ## Related Topics
 
-- **[Audio Curation Quickstart](docs/get-started/audio.md)**: Quick introduction to audio curation
+- **[Audio Curation Quickstart](gs-audio)**: Quick introduction to audio curation
 - **[FLEURS Dataset](../load-data/fleurs-dataset.md)**: Detailed FLEURS dataset documentation
 - **[Quality Assessment](../process-data/quality-assessment/index.md)**: Comprehensive quality metrics guide
 - **[Save & Export](../save-export.md)**: Advanced export options and formats
diff --git a/docs/curate-text/process-data/deduplication/index.md b/docs/curate-text/process-data/deduplication/index.md
@@ -24,8 +24,8 @@ NeMo Curator provides three deduplication approaches, each optimized for differe
 
 ::::{tab-item} Exact
 
-**Method**: MD5 hashing  
-**Detects**: Character-for-character identical documents  
+**Method**: MD5 hashing
+**Detects**: Character-for-character identical documents
 **Speed**: Fastest
 
 Computes MD5 hashes for each document's text content and groups documents with identical hashes. Best for removing exact copies.
@@ -58,8 +58,8 @@ For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate ID
 
 ::::{tab-item} Fuzzy
 
-**Method**: MinHash + Locality Sensitive Hashing (LSH)  
-**Detects**: Near-duplicates with minor edits (~80% similarity)  
+**Method**: MinHash + Locality Sensitive Hashing (LSH)
+**Detects**: Near-duplicates with minor edits (~80% similarity)
 **Speed**: Fast
 
 Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
@@ -96,8 +96,8 @@ For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate ID
 
 ::::{tab-item} Semantic
 
-**Method**: Embeddings + clustering + pairwise similarity  
-**Detects**: Semantically similar content (paraphrases, translations)  
+**Method**: Embeddings + clustering + pairwise similarity
+**Detects**: Semantically similar content (paraphrases, translations)
 **Speed**: Moderate
 
 Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
@@ -110,7 +110,7 @@ from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplic
 
 text_workflow = TextSemanticDeduplicationWorkflow(
     input_path="/path/to/input/data",
-    output_path="/path/to/output", 
+    output_path="/path/to/output",
     cache_path="/path/to/cache",
     text_field="text",
     model_identifier="sentence-transformers/all-MiniLM-L6-v2",
@@ -126,7 +126,7 @@ text_workflow.run()
 - `TextSemanticDeduplicationWorkflow`: For raw text with automatic embedding generation
 - `SemanticDeduplicationWorkflow`: For pre-computed embeddings
 
-See {ref}`Semantic Deduplication <text-process-data-dedup-semdedup>` for details.
+See {ref}`Semantic Deduplication <text-process-data-format-sem-dedup>` for details.
 :::
 
 :::{dropdown} Advanced: Step-by-Step Semantic Deduplication
@@ -378,7 +378,7 @@ For detailed implementation guides, see:
 
 - {ref}`Exact Duplicate Removal <text-process-data-dedup-exact>`
 - {ref}`Fuzzy Duplicate Removal <text-process-data-dedup-fuzzy>`
-- {ref}`Semantic Deduplication <text-process-data-dedup-semdedup>`
+- {ref}`Semantic Deduplication <text-process-data-format-sem-dedup>`
 
 :::{dropdown} Performance Considerations
 :icon: zap
@@ -455,7 +455,7 @@ The ID Generator ensures consistent IDs across workflow stages.
 
 - **New to deduplication**: Start with {ref}`Exact Duplicate Removal <text-process-data-dedup-exact>` for the fastest approach
 - **Need near-duplicate detection**: See {ref}`Fuzzy Duplicate Removal <text-process-data-dedup-fuzzy>` for MinHash-based matching
-- **Require semantic matching**: Explore {ref}`Semantic Deduplication <text-process-data-dedup-semdedup>` for meaning-based deduplication
+- **Require semantic matching**: Explore {ref}`Semantic Deduplication <text-process-data-format-sem-dedup>` for meaning-based deduplication
 
 **For hands-on guidance**: See {ref}`Text Curation Tutorials <text-tutorials>` for step-by-step examples.