Commit a2a340d

release notes draft
Signed-off-by: Lawrence Lane <llane@nvidia.com>
1 parent aa5c213 commit a2a340d

1 file changed: +123 -185 lines changed

docs/about/release-notes/index.md

Lines changed: 123 additions & 185 deletions
@@ -12,215 +12,153 @@ modality: "universal"
1212

1313
# NeMo Curator Release Notes: {{ current_release }}
1414

15-
## Synthetic Data Generation
15+
## What's New in 26.02
1616

17-
New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:
18-
19-
- **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff
20-
- **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts
21-
- **Nemotron-CC Pipelines**: Advanced text transformation and knowledge extraction workflows:
22-
- **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose
23-
- **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training
24-
- **Distill**: Create condensed, information-dense paraphrases preserving key concepts
25-
- **Extract Knowledge**: Extract factual content as textbook-style passages
26-
- **Knowledge List**: Extract structured fact lists from documents
27-
28-
Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).
29-
30-
```{list-table} Available Installation Extras
31-
:header-rows: 1
32-
:widths: 25 35 40
33-
34-
* - Extra
35-
- Installation Command
36-
- Description
37-
* - **All Modalities**
38-
- `nemo-curator[all]`
39-
- Complete installation with all modalities and GPU support
40-
* - **Text Curation**
41-
- `nemo-curator[text_cuda12]`
42-
- GPU-accelerated text processing with RAPIDS
43-
* - **Image Curation**
44-
- `nemo-curator[image_cuda12]`
45-
- Image processing with NVIDIA DALI
46-
* - **Audio Curation**
47-
- `nemo-curator[audio_cuda12]`
48-
- Speech recognition with NeMo ASR models
49-
* - **Video Curation**
50-
- `nemo-curator[video_cuda12]`
51-
- Video processing with GPU acceleration
52-
* - **Basic GPU**
53-
- `nemo-curator[cuda12]`
54-
- CUDA utilities without modality-specific dependencies
55-
```
56-
57-
All GPU installations require the NVIDIA PyPI index:
58-
```bash
59-
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[EXTRA]
60-
```
61-
62-
## New Modalities
63-
64-
### Video
65-
66-
NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities:
67-
68-
- **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction
69-
- **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal
70-
- **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement
71-
- **Embedding generation**: Cosmos-Embed1 models for clip-level embeddings
72-
- **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions
73-
- **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md)
74-
75-
### Audio
76-
77-
New [audio curation capabilities](../../curate-audio/index.md) for speech data processing:
78-
79-
- **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models
80-
- **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation
81-
- **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second)
82-
- **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage`
83-
- **Manifest support**: JSONL manifest format for audio file management
84-
85-
## Modality Refactors
86-
87-
### Text
88-
89-
- **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md)
90-
- **Improved model-based classifier throughput**: Better overlapping of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization
91-
- **Task-centric architecture**: New `Task`-based processing model for finer-grained control
92-
- **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification
93-
94-
### Image
95-
96-
- **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
97-
- **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
98-
- **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md)
99-
- **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores
100-
101-
Learn more about [image curation](../../curate-images/index.md).
102-
103-
## Deduplication Improvements
104-
105-
Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
106-
107-
- **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities
108-
- **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows
109-
- **New ranking strategies**: Added `RankingStrategy` which allows you to rank elements within cluster centers to decide which point to prioritize during duplicate removal, supporting [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs
110-
111-
## Core Refactors
112-
113-
The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
114-
115-
```{mermaid}
116-
graph LR
117-
subgraph "User Layer"
118-
P[Pipeline]
119-
S1[ProcessingStage X→Y]
120-
S2[ProcessingStage Y→Z]
121-
S3[ProcessingStage Z→W]
122-
R[Resources<br/>CPU/GPU/NVDEC/NVENC]
123-
end
124-
125-
subgraph "Orchestration Layer"
126-
BE[BaseExecutor Interface]
127-
end
128-
129-
subgraph "Backend Layer"
130-
XE[XennaExecutor]
131-
RAP[RayActorPoolExecutor]
132-
RDE[RayDataExecutor]
133-
end
134-
135-
subgraph "Adaptation Layer"
136-
XA[Xenna Adapter]
137-
RAPA[Ray Actor Adapter]
138-
RDA[Ray Data Adapter]
139-
end
140-
141-
subgraph "Execution Layer"
142-
X[Cosmos-Xenna<br/>Streaming/Batch]
143-
RAY1[Ray Actor Pool<br/>Load Balancing]
144-
RAY2[Ray Data API<br/>Dataset Processing]
145-
end
146-
147-
P --> S1
148-
P --> S2
149-
P --> S3
150-
S1 -.-> R
151-
S2 -.-> R
152-
S3 -.-> R
153-
154-
P --> BE
155-
BE --> XE
156-
BE --> RAP
157-
BE --> RDE
158-
159-
XE --> XA
160-
RAP --> RAPA
161-
RDE --> RDA
162-
163-
XA --> X
164-
RAPA --> RAY1
165-
RDA --> RAY2
166-
167-
style P fill:#E6F3FF
168-
style BE fill:#F0F8FF
17+
### Benchmarking Infrastructure
18+
19+
New comprehensive benchmarking framework for performance monitoring and optimization:
20+
21+
- **End-to-End Pipeline Benchmarking**: Automated benchmarks for all curation modalities (text, image, video, audio)
22+
- **Performance Tracking**: Integration with MLflow for metrics tracking and Slack for notifications
23+
- **Nightly Benchmarks**: Continuous performance monitoring across:
24+
- Text pipelines: exact deduplication, fuzzy deduplication, semantic deduplication, score filters, modifiers
25+
- Image curation workflows with DALI-based processing
26+
- Video processing pipelines with scene detection and semantic deduplication
27+
- Audio ASR inference and quality assessment
28+
- **Grafana Dashboards**: Real-time monitoring of pipeline performance and resource utilization
29+
30+
### Ray Actor Pool Executor (Experimental)
31+
32+
New execution backend offering an alternative to Xenna for distributed processing:
33+
34+
- **RayActorPoolExecutor**: Experimental executor with load balancing and progress tracking
35+
- **Progress Bars**: Visual feedback for long-running actor pool operations
36+
- **Flexible Resource Allocation**: Better control over worker distribution and task scheduling
37+
38+
Learn more in the [Execution Backends documentation](../../reference/infrastructure/execution-backends.md).
39+
40+
### Enhanced Embedding Generation
41+
42+
Expanded embedding support with new model integrations:
43+
44+
- **vLLM Integration**: High-performance LLM-based embedding generation with automatic batching
45+
- **Sentence Transformers**: Support for popular sentence embedding models
46+
- **Unified API**: Consistent embedding interface across text, image, and video modalities
47+
48+
### YAML Configuration Support
49+
50+
Declarative pipeline configuration for text curation workflows:
51+
52+
- **YAML-Based Pipelines**: Define entire curation pipelines in YAML configuration files
53+
- **Pre-Built Configurations**: Ready-to-use configs for common workflows:
54+
- Code filtering, exact/fuzzy/semantic deduplication
55+
- Heuristic filtering (English and non-English)
56+
- FastText language identification
57+
- **Reproducible Workflows**: Version-controlled pipeline definitions for consistent results
58+
59+
Example:
60+
```bash
61+
python -m nemo_curator.config.run --config_file heuristic_filter_english_pipeline.yaml
16962
```
17063

171-
### Pipelines
64+
### Workflow Results API
65+
66+
New API for tracking and analyzing pipeline execution:
67+
68+
- **WorkflowRunResult**: Structured results object capturing execution metrics
69+
- **Performance Metrics**: Automatic tracking of processing time, throughput, and resource usage
70+
- **Better Debugging**: Detailed logs and error reporting for failed stages
71+
72+
## Improvements from 25.09
73+
74+
### Video Curation
17275

173-
- **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface
174-
- **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md)
175-
- **Resource specification**: Configurable CPU and GPU memory requirements per stage
176-
- **Stage composition**: Improved stage validation and execution orchestration
76+
- **Model Updates**: Removed InternVideo2 dependency; updated to more performant alternatives
77+
- **vLLM 0.14.1**: Upgraded for better video captioning compatibility and performance
78+
- **FFmpeg 8.0.1**: Latest FFmpeg with improved codec support and performance
79+
- **Enhanced Tutorials**: Improved video processing examples with real-world scenarios
17780

178-
### Stages
81+
### Audio Curation
17982

180-
- **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety
181-
- **Resource requirements**: Built-in resource specification for CPU and GPU memory
182-
- **Backend adapters**: Stage adaptation layer for different Ray orchestration systems
183-
- **Input/output validation**: Enhanced type checking and data validation
83+
- **Enhanced Documentation**: Comprehensive ASR inference and quality assessment guides
84+
- **Improved WER Filtering**: Better guidance for Word Error Rate filtering thresholds
85+
- **Manifest Handling**: More robust JSONL manifest processing for large audio datasets
18486

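For readers unfamiliar with the metric behind the WER filtering guidance above, the following is a minimal sketch of Word Error Rate computed as word-level edit distance divided by reference length, followed by a threshold filter. It is not NeMo Curator's implementation; the function names and the `text` / `pred_text` manifest keys are assumptions chosen for illustration.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(
    entries: List[Dict[str, str]],
    max_wer: float,
    ref_key: str = "text",       # hypothetical key for the reference transcript
    hyp_key: str = "pred_text",  # hypothetical key for the ASR prediction
) -> List[Dict[str, str]]:
    """Keep only entries whose WER is at or below max_wer."""
    return [
        e for e in entries
        if word_error_rate(e[ref_key], e[hyp_key]) <= max_wer
    ]


manifest = [
    {"audio_filepath": "a.wav", "text": "the cat sat", "pred_text": "the cat sat"},
    {"audio_filepath": "b.wav", "text": "the cat sat", "pred_text": "a dog stood up"},
]
kept = filter_by_wer(manifest, max_wer=0.5)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is longer than the reference, which is worth keeping in mind when choosing a threshold.
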
185-
## Tutorials
87+
### Image Curation
18688

187-
- **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API
188-
- **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend
189-
- **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio)
190-
- **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video)
89+
- **Optimized Batch Sizes**: Reduced default batch sizes for better CPU memory usage (batch_size=50, num_threads=4)
90+
- **Memory Guidance**: Added troubleshooting documentation for out-of-memory errors
91+
- **Tutorial Improvements**: Updated examples optimized for typical GPU configurations
19192

192-
For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository.
93+
### Text Curation
19394

194-
## Known Limitations
95+
- **ID Field Standardization**: Unified ID naming conventions across all deduplication workflows
96+
- **Performance Optimizations**: Fused document iterate and extract stages for reduced overhead
97+
- **Better Memory Management**: Improved handling of large-scale semantic deduplication
98+
- **Small Cluster Warnings**: Automatic warnings when n_clusters is too small for effective deduplication
99+
- **FilePartitioning Improvements**: One worker per partition for better parallelization
195100

196-
> (Pending Refactor in Future Release)
101+
### Deduplication Enhancements
197102

198-
### Generation
103+
- **Cloud Storage Support**: Fixed ParquetReader/Writer and pairwise I/O for S3, GCS, and Azure Blob
104+
- **Non-Blocking ID Generation**: Improved ID generator performance for large datasets
105+
- **Empty Batch Handling**: Better error handling for filters processing empty data batches
199106

200-
- **Synthetic data generation**: Synthetic text generation features are being refactored for Ray compatibility
201-
- **Hard negative mining**: Retrieval-based data generation workflows under development
107+
## Dependency Updates
202108

203-
### PII
109+
- **Transformers**: Pinned to 4.55.2 for stability and compatibility
110+
- **vLLM**: Updated to 0.14.1 with video pipeline compatibility fixes
111+
- **FFmpeg**: Upgraded to 8.0.1 for enhanced multimedia processing
112+
- **Security Patches**:
113+
- Addressed CVEs in aiohttp, urllib3, python-multipart, setuptools
114+
- Removed vulnerable thirdparty aiohttp file from Ray
115+
- Updated to secure dependency versions
204116

205-
- **PII processing**: Personally Identifiable Information removal tools are being updated for Ray backend
206-
- **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development
117+
## Bug Fixes
207118

208-
### Blending & Shuffling
119+
- Fixed fasttext predict call compatibility with numpy>2
120+
- Fixed broken NeMo Framework documentation links
121+
- Fixed MegatronTokenizerWriter to download only necessary tokenizer files
122+
- Fixed ID generator blocking issues for large-scale processing
123+
- Fixed vLLM API compatibility with video captioning pipeline
124+
- Fixed Gliner tutorial examples and SDG workflow bugs
125+
- Improved semantic deduplication unit test reliability
209126

210-
- **Data blending**: Multi-source dataset blending functionality being refactored
211-
- **Dataset shuffling**: Large-scale data shuffling operations under development
127+
## Infrastructure & Developer Experience
212128

213-
## Docs Refactor
129+
- **Secrets Detection**: Automated secret scanning in CI/CD workflows
130+
- **Dependabot Integration**: Automatic dependency update pull requests
131+
- **Enhanced Install Tests**: Comprehensive installation validation across environments
132+
- **AWS Runner Support**: CI/CD execution on AWS infrastructure
133+
- **Docker Optimization**: Improved layer caching and build times with uv
134+
- **Code Linting**: Standardized code quality checks with markdownlint and pre-commit hooks
135+
- **Cursor Rules**: Development guidelines and patterns for IDE assistance
136+
137+
## Breaking Changes
138+
139+
- **InternVideo2 Removed**: Video pipelines must use alternative embedding models (Cosmos-Embed1)
140+
- **ID Field Standardization**: Custom deduplication workflows may need updates to use standardized ID field names
141+
142+
## Documentation Improvements
143+
144+
- **Heuristic Filter Guide**: Comprehensive documentation for language-specific filtering strategies
145+
- **Distributed Classifier**: Enhanced GPU memory optimization guidance with length-based sequence sorting
146+
- **Installation Guide**: Clearer instructions with troubleshooting for common issues
147+
- **Memory Management**: New guidance for handling CPU/GPU memory constraints
148+
- **AWS Integration**: Updated tutorials with correct AWS credentials setup
214149

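The length-based sequence sorting mentioned in the distributed classifier guidance can be illustrated with a small, self-contained sketch. It only counts how many token slots a batch occupies when every sequence is padded to the longest one in its batch; it is not the library's code, and the helper names and example lengths are assumptions.

```python
from typing import List, Sequence


def padded_token_count(lengths: Sequence[int], batch_size: int) -> int:
    """Total token slots used when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


def sort_by_length(lengths: Sequence[int]) -> List[int]:
    """Return indices that order sequences by length.

    Keeping the indices (rather than sorting in place) lets the caller
    restore the original document order after inference.
    """
    return sorted(range(len(lengths)), key=lambda i: lengths[i])


# Hypothetical token lengths for six documents.
lengths = [5, 120, 7, 118, 6, 121]

order = sort_by_length(lengths)
sorted_lengths = [lengths[i] for i in order]

unsorted_cost = padded_token_count(lengths, batch_size=3)
sorted_cost = padded_token_count(sorted_lengths, batch_size=3)
```

Grouping sequences of similar length means short documents are no longer padded up to the longest document in a mixed batch, so fewer padded slots compete for GPU memory.
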
215-
- **Local preview capability**: Improved documentation build system with local preview support
216-
- **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md))
217-
- **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples
218150

219151
---
220152

221153
## What's Next
222154

223-
The next release will focus on code curation and math curation.
155+
Future releases will focus on:
156+
157+
- **Code Curation**: Specialized pipelines for curating code datasets
158+
- **Math Curation**: Mathematical reasoning and problem-solving data curation
159+
- **Generation Features**: Completing the Ray refactor for synthetic data generation
160+
- **PII Processing**: Enhanced privacy-preserving data curation with Ray backend
161+
- **Blending & Shuffling**: Large-scale multi-source dataset blending and shuffling operations
224162

225163
```{toctree}
226164
:hidden:
