Commit 2fdfe14

add more pages
Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
1 parent 9cdf9ef commit 2fdfe14

4 files changed: 14 additions & 24 deletions

docs/about/concepts/video/abstractions.md

Lines changed: 2 additions & 3 deletions
@@ -48,7 +48,7 @@ A stage represents a single step in your data curation workflow. Video stages ar
 Each processing stage:

 1. Inherits from `ProcessingStage`
-2. Declares a stable `name` and `resources: Resources` (CPU cores, GPU memory, optional NVDEC/NVENC, or more than one GPU)
+2. Declares a stable `name` and `resources: Resources` (CPU cores, GPU memory, entire GPU flag, or multiple GPUs)
 3. Defines `inputs()`/`outputs()` to document required attributes and produced attributes on tasks
 4. Implements `setup(worker_metadata)` for model initialization and `process(task)` to transform tasks
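
The four-point stage contract in this hunk can be sketched roughly as below. This is a minimal, self-contained illustration and not Curator code: `SketchResources`, `SketchTask`, `SketchStage`, and `UppercaseCaptionStage` are hypothetical stand-ins, and the return types of `inputs()`/`outputs()` and the shape of a task are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SketchResources:
    """Hypothetical stand-in for `Resources`; field names are assumptions."""
    cpus: float = 1.0
    gpu_memory_gb: float = 0.0


@dataclass
class SketchTask:
    """Hypothetical task: a bag of named attributes."""
    data: dict = field(default_factory=dict)


class SketchStage:
    """Hypothetical stand-in for `ProcessingStage` (not the real base class)."""
    name = "base"
    resources = SketchResources()

    def inputs(self):
        return []

    def outputs(self):
        return []

    def setup(self, worker_metadata=None):
        pass

    def process(self, task):
        raise NotImplementedError


class UppercaseCaptionStage(SketchStage):
    # 2. Stable name and declared resources.
    name = "uppercase_caption"
    resources = SketchResources(cpus=1.0)

    # 3. Document required and produced task attributes.
    def inputs(self):
        return ["caption"]

    def outputs(self):
        return ["caption_upper"]

    # 4. One-time initialization (a real stage might load a model here).
    def setup(self, worker_metadata=None):
        self.transform = str.upper

    # 4. Transform a single task.
    def process(self, task):
        task.data["caption_upper"] = self.transform(task.data["caption"])
        return task
```

Calling `setup()` once and then `process()` per task mirrors the split between model initialization and per-task work described in item 4.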

@@ -75,9 +75,8 @@ Refer to the stage base and resources definitions in Curator for full details.
 `Resources` support both fractional and whole‑GPU semantics:

 - `gpu_memory_gb`: Request a fraction of a single GPU by memory; Curator rounds to a fractional GPU share and enforces that `gpu_memory_gb` stays within one device.
-- `entire_gpu`: Request an entire GPU regardless of memory (also implies access to NVDEC/NVENC on that device).
+- `entire_gpu`: Request an entire GPU regardless of memory (also implies access to hardware decoders and encoders on that device).
 - `gpus`: Request more than one GPU for a stage that is multi‑GPU aware.
-- `nvdecs` / `nvencs`: Request hardware decode/encode units when needed.

 Choose one of `gpu_memory_gb` (single‑GPU fractional) or `gpus` (multi‑GPU). Combining both is not allowed.
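
The rule that `gpu_memory_gb` and `gpus` are mutually exclusive can be illustrated with a small validation sketch. `ResourceRequest` is a hypothetical dataclass written for this example; it borrows the field names from the text above but is not Curator's `Resources` class, and the real class may validate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical mirror of the documented fields; not the real `Resources`."""
    cpus: float = 1.0
    gpu_memory_gb: float = 0.0   # fraction of a single GPU, by memory
    entire_gpu: bool = False     # whole GPU regardless of memory
    gpus: float = 0.0            # more than one GPU (multi-GPU aware stages)

    def __post_init__(self):
        # Documented rule: choose gpu_memory_gb OR gpus, never both.
        if self.gpu_memory_gb > 0 and self.gpus > 0:
            raise ValueError(
                "Specify either gpu_memory_gb (single-GPU fractional) "
                "or gpus (multi-GPU), not both."
            )
```

For example, `ResourceRequest(gpu_memory_gb=8.0)` is accepted, while `ResourceRequest(gpu_memory_gb=8.0, gpus=2)` raises `ValueError`.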

docs/about/release-notes/index.md

Lines changed: 0 additions & 9 deletions
@@ -37,14 +37,6 @@ Enhanced features for the experimental Ray Actor Pool execution backend:

 Learn more in the [Execution Backends documentation](../../reference/infrastructure/execution-backends.md).

-### Enhanced Embedding Generation
-
-Expanded embedding support with new model integrations:
-
-- **vLLM Integration**: High-performance LLM-based embedding generation with automatic batching
-- **Sentence Transformers**: Support for popular sentence embedding models
-- **Unified API**: Consistent embedding interface across text, image, and video modalities
-
 ### YAML Configuration Support

 Declarative pipeline configuration for text curation workflows:
@@ -65,7 +57,6 @@ python -m nemo_curator.config.run --config_file heuristic_filter_english_pipelin

 New API for tracking and analyzing pipeline execution:

-- **WorkflowRunResult**: Structured results object capturing execution metrics
 - **Performance Metrics**: Automatic tracking of processing time, throughput, and resource usage
 - **Better Debugging**: Detailed logs and error reporting for failed stages

docs/curate-audio/tutorials/beginner.md

Lines changed: 2 additions & 2 deletions
@@ -55,7 +55,7 @@ cd NeMo-Curator/tutorials/audio/fleurs/

 ## Prerequisites

-* NeMo Curator installed (see [Installation Guide](docs/admin/installation.md))
+* NeMo Curator installed (see [Installation Guide](admin-installation))
 * NVIDIA GPU (required for ASR inference, minimum 16GB VRAM recommended)
 * Internet connection for dataset download
 * Basic Python knowledge
@@ -373,7 +373,7 @@ After completing this tutorial, explore:

 ## Related Topics

-- **[Audio Curation Quickstart](docs/get-started/audio.md)**: Quick introduction to audio curation
+- **[Audio Curation Quickstart](gs-audio)**: Quick introduction to audio curation
 - **[FLEURS Dataset](../load-data/fleurs-dataset.md)**: Detailed FLEURS dataset documentation
 - **[Quality Assessment](../process-data/quality-assessment/index.md)**: Comprehensive quality metrics guide
 - **[Save & Export](../save-export.md)**: Advanced export options and formats

docs/curate-text/process-data/deduplication/index.md

Lines changed: 10 additions & 10 deletions
@@ -24,8 +24,8 @@ NeMo Curator provides three deduplication approaches, each optimized for differe

 ::::{tab-item} Exact

-**Method**: MD5 hashing
-**Detects**: Character-for-character identical documents
+**Method**: MD5 hashing
+**Detects**: Character-for-character identical documents
 **Speed**: Fastest

 Computes MD5 hashes for each document's text content and groups documents with identical hashes. Best for removing exact copies.
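
The hash-and-group idea described in this tab can be sketched in plain Python. This is only an illustration of the technique using the standard library; `find_exact_duplicates` is a hypothetical helper and says nothing about how Curator's exact deduplication workflow is implemented or scaled.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Group document IDs whose text has the same MD5 digest."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Only groups with more than one member contain duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Because the comparison is on the digest of the raw text, even a single trailing space makes two documents distinct, which is why near-duplicates need the fuzzy approach.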
@@ -58,8 +58,8 @@ For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate ID

 ::::{tab-item} Fuzzy

-**Method**: MinHash + Locality Sensitive Hashing (LSH)
-**Detects**: Near-duplicates with minor edits (~80% similarity)
+**Method**: MinHash + Locality Sensitive Hashing (LSH)
+**Detects**: Near-duplicates with minor edits (~80% similarity)
 **Speed**: Fast

 Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
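
As a rough illustration of the MinHash half of this method, the sketch below builds signatures from character shingles and estimates Jaccard similarity from the fraction of matching signature slots. The shingle size, the number of hash functions, and the use of salted BLAKE2b as the hash family are assumptions chosen for the example; the LSH banding step that avoids all-pairs comparison is omitted, and none of this reflects Curator's actual implementation.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Set of overlapping character k-grams (one shingle for short text)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text: str, num_perm: int = 128, k: int = 5) -> list[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two texts that differ by a typo share most shingles and therefore most signature slots, while unrelated texts share few.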
@@ -96,8 +96,8 @@ For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate ID

 ::::{tab-item} Semantic

-**Method**: Embeddings + clustering + pairwise similarity
-**Detects**: Semantically similar content (paraphrases, translations)
+**Method**: Embeddings + clustering + pairwise similarity
+**Detects**: Semantically similar content (paraphrases, translations)
 **Speed**: Moderate

 Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
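
The pairwise-similarity step can be illustrated with cosine similarity over precomputed vectors. The sketch below assumes embeddings already exist and skips both the transformer model and the clustering stage; `semantic_duplicates` and the `0.9` threshold are hypothetical choices for the example, not Curator defaults.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_duplicates(
    embeddings: dict[str, list[float]], threshold: float = 0.9
) -> list[tuple[str, str]]:
    """Return ID pairs whose embeddings are at least `threshold` similar."""
    ids = list(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                pairs.append((a, b))
    return pairs
```

Comparing every pair is quadratic in the number of documents, which is the cost that clustering first is meant to reduce: pairs only need to be compared within a cluster.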
@@ -110,7 +110,7 @@ from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplic

 text_workflow = TextSemanticDeduplicationWorkflow(
 input_path="/path/to/input/data",
-output_path="/path/to/output",
+output_path="/path/to/output",
 cache_path="/path/to/cache",
 text_field="text",
 model_identifier="sentence-transformers/all-MiniLM-L6-v2",
@@ -126,7 +126,7 @@ text_workflow.run()
 - `TextSemanticDeduplicationWorkflow`: For raw text with automatic embedding generation
 - `SemanticDeduplicationWorkflow`: For pre-computed embeddings

-See {ref}`Semantic Deduplication <text-process-data-dedup-semdedup>` for details.
+See {ref}`Semantic Deduplication <text-process-data-format-sem-dedup>` for details.
 :::

 :::{dropdown} Advanced: Step-by-Step Semantic Deduplication
@@ -378,7 +378,7 @@ For detailed implementation guides, see:

 - {ref}`Exact Duplicate Removal <text-process-data-dedup-exact>`
 - {ref}`Fuzzy Duplicate Removal <text-process-data-dedup-fuzzy>`
-- {ref}`Semantic Deduplication <text-process-data-dedup-semdedup>`
+- {ref}`Semantic Deduplication <text-process-data-format-sem-dedup>`

 :::{dropdown} Performance Considerations
 :icon: zap
@@ -455,7 +455,7 @@ The ID Generator ensures consistent IDs across workflow stages.

 - **New to deduplication**: Start with {ref}`Exact Duplicate Removal <text-process-data-dedup-exact>` for the fastest approach
 - **Need near-duplicate detection**: See {ref}`Fuzzy Duplicate Removal <text-process-data-dedup-fuzzy>` for MinHash-based matching
-- **Require semantic matching**: Explore {ref}`Semantic Deduplication <text-process-data-dedup-semdedup>` for meaning-based deduplication
+- **Require semantic matching**: Explore {ref}`Semantic Deduplication <text-process-data-format-sem-dedup>` for meaning-based deduplication

 **For hands-on guidance**: See {ref}`Text Curation Tutorials <text-tutorials>` for step-by-step examples.
