
Commit 91f5f9a

release notes change, bump version

Signed-off-by: Lawrence Lane <[email protected]>
1 parent abd2209, commit 91f5f9a

File tree

3 files changed: +8 −204 lines changed

docs/about/release-notes/index.md

Lines changed: 1 addition & 202 deletions
@@ -12,184 +12,6 @@ modality: "universal"
 
 # NeMo Curator Release Notes: {{ current_release }}
 
-This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](../../curate-video/index.md) and [audio](../../curate-audio/index.md) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
-
-**Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions.
-
-## Installation Updates
-
-- **New Docker container**: Updated Docker infrastructure with a CUDA 12.8.1 and Ubuntu 24.04 base image, available through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`)
-- **Dockerfile for custom images**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support
-- **UV source installations**: Integrated the UV package manager (v0.8.22) for faster dependency management
-- **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality:
-
-```{list-table} Available Installation Extras
-:header-rows: 1
-:widths: 25 35 40
-
-* - Extra
-  - Installation Command
-  - Description
-* - **All Modalities**
-  - `nemo-curator[all]`
-  - Complete installation with all modalities and GPU support
-* - **Text Curation**
-  - `nemo-curator[text_cuda12]`
-  - GPU-accelerated text processing with RAPIDS
-* - **Image Curation**
-  - `nemo-curator[image_cuda12]`
-  - Image processing with NVIDIA DALI
-* - **Audio Curation**
-  - `nemo-curator[audio_cuda12]`
-  - Speech recognition with NeMo ASR models
-* - **Video Curation**
-  - `nemo-curator[video_cuda12]`
-  - Video processing with GPU acceleration
-* - **Basic GPU**
-  - `nemo-curator[cuda12]`
-  - CUDA utilities without modality-specific dependencies
-```
-
-All GPU installations require the NVIDIA PyPI index:
-```bash
-uv pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[EXTRA]"
-```
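As an aside, picking one extra from the table and composing the full install command can be sketched in a couple of shell lines. The `text_cuda12` choice and the `--extra-index-url` form for the NVIDIA index are illustrative assumptions, not prescriptions from the diff:

```shell
# Sketch: compose the install command for one extra from the table.
# EXTRA and the --extra-index-url form are illustrative assumptions.
EXTRA="text_cuda12"
CMD="uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[${EXTRA}]"
echo "$CMD"
```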
-
-## New Modalities
-
-### Video
-
-NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities:
-
-- **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction
-- **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal
-- **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement
-- **Embedding generation**: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings
-- **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions
-- **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md)
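The fixed-stride splitting mentioned above is easy to picture with a small, self-contained sketch (this is not the NeMo Curator API; the function name and clip representation are assumptions for illustration):

```python
def fixed_stride_clips(duration_s: float, stride_s: float) -> list[tuple[float, float]]:
    """Split a video of `duration_s` seconds into consecutive clips of
    at most `stride_s` seconds each (fixed-stride splitting)."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + stride_s, duration_s)
        clips.append((start, end))
        start = end
    return clips

# A 10.5 s video with a 4 s stride yields three clips, the last one shorter.
print(fixed_stride_clips(10.5, 4.0))  # [(0.0, 4.0), (4.0, 8.0), (8.0, 10.5)]
```

Scene-change detection (TransNetV2) replaces the fixed boundaries with predicted shot transitions, but the clip-list output shape is the same.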
-
-### Audio
-
-New [audio curation capabilities](../../curate-audio/index.md) for speech data processing:
-
-- **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models
-- **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation
-- **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second)
-- **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage`
-- **Manifest support**: JSONL manifest format for audio file management
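As a refresher on the quality metric named above: WER is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of the standard definition (not NeMo Curator's own implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with classic Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one substitution -> 1/3
```

CER is the same computation over characters instead of words.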
-
-## Modality Refactors
-
-### Text
-
-- **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md)
-- **Improved model-based classifier throughput**: Better overlap of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization
-- **Task-centric architecture**: New `Task`-based processing model for finer-grained control
-- **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification
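The length-sorting trick above can be sketched generically: sorting texts by length before batching keeps sequences within each batch a similar size, so padding (and wasted GPU memory) stays low. A toy illustration, independent of the real classifier code; word count stands in for token length:

```python
def length_sorted_batches(texts: list[str], batch_size: int) -> list[list[str]]:
    """Group texts into batches after sorting by (approximate) token length,
    so each batch pads to a similar sequence length."""
    ordered = sorted(texts, key=lambda t: len(t.split()))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

docs = ["a b c d e f", "a", "a b", "a b c d", "a b c", "a b c d e"]
for batch in length_sorted_batches(docs, batch_size=2):
    # Padding per batch is now bounded by neighbouring lengths.
    print([len(t.split()) for t in batch])  # [1, 2], [3, 4], [5, 6]
```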
-
-### Image
-
-- **Pipeline-based architecture**: Transitioned from the legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
-- **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
-- **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md)
-- **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores
-
-Learn more about [image curation](../../curate-images/index.md).
-
-## Deduplication Improvements
-
-Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
-
-- **Exact and fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities
-- **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows
-- **New ranking strategies**: Added `RankingStrategy`, which ranks elements within each cluster to decide which point to keep during duplicate removal, including [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs
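One common ranking for semantic deduplication (used here only as an illustration, not necessarily NeMo Curator's default) keeps the item closest to its cluster centroid. A pure-Python sketch:

```python
def rank_cluster(embeddings: dict[str, list[float]]) -> list[str]:
    """Rank items in one cluster by Euclidean distance to the cluster centroid
    (closest first); duplicate removal then keeps the top-ranked item."""
    dims = range(len(next(iter(embeddings.values()))))
    centroid = [sum(v[d] for v in embeddings.values()) / len(embeddings) for d in dims]

    def dist(item: str) -> float:
        return sum((embeddings[item][d] - centroid[d]) ** 2 for d in dims) ** 0.5

    return sorted(embeddings, key=dist)

cluster = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
# doc_b sits closest to the centroid, so it would be kept.
print(rank_cluster(cluster))
```

A metadata-based `RankingStrategy` would simply replace `dist` with a key derived from source-dataset priority.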
-
-## Core Refactors
-
-The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
-
-```{mermaid}
-graph LR
-    subgraph "User Layer"
-        P[Pipeline]
-        S1[ProcessingStage X→Y]
-        S2[ProcessingStage Y→Z]
-        S3[ProcessingStage Z→W]
-        R[Resources<br/>CPU/GPU/NVDEC/NVENC]
-    end
-
-    subgraph "Orchestration Layer"
-        BE[BaseExecutor Interface]
-    end
-
-    subgraph "Backend Layer"
-        XE[XennaExecutor<br/>Production Ready]
-        RAP[RayActorPoolExecutor<br/>Experimental]
-        RDE[RayDataExecutor<br/>Experimental]
-    end
-
-    subgraph "Adaptation Layer"
-        XA[Xenna Adapter]
-        RAPA[Ray Actor Adapter]
-        RDA[Ray Data Adapter]
-    end
-
-    subgraph "Execution Layer"
-        X[Cosmos-Xenna<br/>Streaming/Batch]
-        RAY1[Ray Actor Pool<br/>Load Balancing]
-        RAY2[Ray Data API<br/>Dataset Processing]
-    end
-
-    P --> S1
-    P --> S2
-    P --> S3
-    S1 -.-> R
-    S2 -.-> R
-    S3 -.-> R
-
-    P --> BE
-    BE --> XE
-    BE --> RAP
-    BE --> RDE
-
-    XE --> XA
-    RAP --> RAPA
-    RDE --> RDA
-
-    XA --> X
-    RAPA --> RAY1
-    RDA --> RAY2
-
-    style XE fill:#90EE90
-    style RAP fill:#FFE4B5
-    style RDE fill:#FFE4B5
-    style P fill:#E6F3FF
-    style BE fill:#F0F8FF
-```
-
-### Pipelines
-
-- **New Pipeline API**: Ray-based pipeline execution with a `BaseExecutor` interface
-- **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md)
-- **Resource specification**: Configurable CPU and GPU memory requirements per stage
-- **Stage composition**: Improved stage validation and execution orchestration
-
-### Stages
-
-- **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety
-- **Resource requirements**: Built-in resource specification for CPU and GPU memory
-- **Backend adapters**: Stage adaptation layer for different Ray orchestration systems
-- **Input/output validation**: Enhanced type checking and data validation
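A generically typed stage interface of the kind described can be sketched in a few lines. The `ProcessingStage[X, Y]` name follows the bullet, but the method names, the toy stages, and the runner are assumptions for illustration, not the real API:

```python
from typing import Generic, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")

class ProcessingStage(Generic[X, Y]):
    """Illustrative base class: a stage transforms a task of type X into type Y."""
    def process(self, task: X) -> Y:
        raise NotImplementedError

class Tokenize(ProcessingStage[str, "list[str]"]):
    def process(self, task: str) -> list[str]:
        return task.split()

class Count(ProcessingStage["list[str]", int]):
    def process(self, task: list[str]) -> int:
        return len(task)

def run_pipeline(stages, task):
    """Minimal pipeline runner: feed each stage's output to the next.
    Type safety comes from each stage's output type matching the next input."""
    for stage in stages:
        task = stage.process(task)
    return task

print(run_pipeline([Tokenize(), Count()], "curate data at scale"))  # 4
```

An executor backend (Xenna, Ray Actor Pool, Ray Data) would replace this sequential loop with distributed scheduling while keeping the same stage contract.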
-
-## Tutorials
-
-- **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use the new Ray-based API
-- **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to the unified backend
-- **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio)
-- **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video)
-
-For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository.
-
 ## Synthetic Data Generation
 
 New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:
@@ -205,35 +27,12 @@ New Ray-based synthetic data generation capabilities for creating and augmenting
 
 Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).
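The idea behind LLM-driven synthetic data can be sketched with a template plus a model callable. Everything here, including the function name and the stub model, is a stand-in for illustration; a real pipeline would call an actual LLM:

```python
def generate_qa_pairs(passages: list[str], llm) -> list[dict[str, str]]:
    """For each passage, ask the model for one question answerable from it,
    producing (context, question) training pairs."""
    template = "Write one question answerable from this passage:\n{passage}"
    return [
        {"context": p, "question": llm(template.format(passage=p))}
        for p in passages
    ]

# Stub model so the sketch runs offline: always returns a trivial question.
fake_llm = lambda prompt: "What is this passage about?"
pairs = generate_qa_pairs(["NeMo Curator now runs on Ray."], fake_llm)
print(pairs[0]["context"], "->", pairs[0]["question"])
```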
 
-## Known Limitations
-
-> (Pending Refactor in Future Release)
-
-### Generation
-
-- **Hard negative mining**: Retrieval-based data generation workflows under development
-
-### PII
-
-- **PII processing**: Personally Identifiable Information removal tools are being updated for the Ray backend
-- **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development
-
-### Blending & Shuffling
-
-- **Data blending**: Multi-source dataset blending functionality being refactored
-- **Dataset shuffling**: Large-scale data shuffling operations under development
-
-## Docs Refactor
-
-- **Local preview capability**: Improved documentation build system with local preview support
-- **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md))
-- **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples
 
 ---
 
 ## What's Next
 
-The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.
+The next release will focus on ...
 
 ```{toctree}
 :hidden:

docs/conf.py

Lines changed: 2 additions & 2 deletions

@@ -29,7 +29,7 @@
 project = "NeMo-Curator"
 project_copyright = "2025, NVIDIA Corporation"
 author = "NVIDIA Corporation"
-release = "25.09"
+release = "26.02"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
@@ -122,7 +122,7 @@
     "min_python_version": "3.8",
     "recommended_cuda": "12.0+",
     "current_release": release,
-    "container_version": "25.09",
+    "container_version": "26.02",
 }
 
 # Enable figure numbering

docs/versions1.json

Lines changed: 5 additions & 0 deletions

@@ -1,6 +1,11 @@
 [
   {
     "preferred": true,
+    "version": "26.02",
+    "url": "https://docs.nvidia.com/nemo/curator/26.02/"
+  },
+  {
+    "preferred": false,
     "version": "25.09",
     "url": "https://docs.nvidia.com/nemo/curator/25.09/"
   },
