This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](../../curate-video/index.md) and [audio](../../curate-audio/index.md) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
-
-**Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions.
-
-## Installation Updates
-
-- **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`)
-- **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support
NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities:
-
-- **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction
-- **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal
-- **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement
-- **Embedding generation**: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings
-- **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions
-- **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md)
-
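As an illustration of the fixed-stride splitting named in the list above, the following is a minimal sketch in plain Python. It is not NeMo Curator's implementation: the function name `fixed_stride_spans` and its parameters are hypothetical, and it only computes clip boundaries in seconds without touching any video data.

```python
def fixed_stride_spans(
    duration_s: float,
    clip_len_s: float,
    stride_s: float,
    min_len_s: float = 0.0,
) -> list[tuple[float, float]]:
    """Return (start, end) clip spans, in seconds, taken at a fixed stride.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    if clip_len_s <= 0 or stride_s <= 0:
        raise ValueError("clip_len_s and stride_s must be positive")

    spans = []
    i = 0
    # Multiply by the index instead of accumulating to avoid float drift.
    while i * stride_s < duration_s:
        start = i * stride_s
        end = min(start + clip_len_s, duration_s)  # last clip may be shorter
        if end - start >= min_len_s:
            spans.append((start, end))
        i += 1
    return spans


spans = fixed_stride_spans(25.0, clip_len_s=10.0, stride_s=10.0)
```

With a stride equal to the clip length the spans tile the video without overlap; a stride smaller than the clip length yields overlapping clips. The `min_len_s` argument is one possible way to drop a short trailing clip.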
-### Audio
-
-New [audio curation capabilities](../../curate-audio/index.md) for speech data processing:
-
-- **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models
-- **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation
-- **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second)
-- **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage`
-- **Manifest support**: JSONL manifest format for audio file management
-
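For readers unfamiliar with the quality metrics listed above, here is a small self-contained sketch of how WER, CER, and a words-per-second speech rate are conventionally computed. It uses the standard Levenshtein edit distance and is not NeMo Curator's code; all function names are hypothetical, and text normalization (casing, punctuation) is deliberately left out.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference token
                cur[j - 1] + 1,          # insert a hypothesis token
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def words_per_second(text: str, duration_s: float) -> float:
    """Speech rate as whitespace-separated words per second of audio."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(text.split()) / duration_s


score = wer("the cat sat on the mat", "the cat sit on mat")  # 2 edits / 6 words
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, which is worth keeping in mind when choosing a filtering threshold.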
-## Modality Refactors
-
-### Text
-
-- **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md)
-- **Improved model-based classifier throughput**: Better overlapping of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization
-- **Task-centric architecture**: New `Task`-based processing model for finer-grained control
-- **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification
-
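The length-based sequence sorting mentioned above can be illustrated with a short sketch. The idea shown here, under the assumption that each batch is padded to its longest sequence, is that grouping sequences of similar length reduces wasted padding. The helper names are hypothetical and this is not the library's implementation; it also does not model the overlap between tokenization and inference.

```python
from typing import Callable, Iterator


def length_sorted_batches(
    token_ids: list[list[int]], batch_size: int
) -> Iterator[tuple[list[int], list[list[int]]]]:
    """Yield (original_indices, batch) with similar-length sequences grouped."""
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [token_ids[i] for i in idx]


def padded_token_count(batches: list[list[list[int]]]) -> int:
    """Total tokens processed if each batch is padded to its longest sequence."""
    return sum(len(b) * max(len(seq) for seq in b) for b in batches)


def classify_in_length_order(
    token_ids: list[list[int]],
    batch_size: int,
    score_batch: Callable[[list[list[int]]], list[float]],
) -> list[float]:
    """Run score_batch on length-sorted batches, then restore input order."""
    results: list[float] = [0.0] * len(token_ids)
    for idx, batch in length_sorted_batches(token_ids, batch_size):
        for i, score in zip(idx, score_batch(batch)):
            results[i] = score
    return results


seqs = [[1] * 5, [1], [1] * 4, [1, 1]]
unsorted_cost = padded_token_count([seqs[0:2], seqs[2:4]])
sorted_cost = padded_token_count([b for _, b in length_sorted_batches(seqs, 2)])
```

In this toy input the unsorted batches cost 18 padded tokens and the sorted ones 14. Restoring the original order afterwards keeps the output aligned with the input documents.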
-### Image
-
-- **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
-- **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
-- **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md)
-- **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores
-
-Learn more about [image curation](../../curate-images/index.md).
-
-## Deduplication Improvements
-
-Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
-
-- **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities
-- **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows
-- **New ranking strategies**: Added `RankingStrategy` which allows you to rank elements within cluster centers to decide which point to prioritize during duplicate removal, supporting [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs
-
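To make the role of ranking in semantic deduplication concrete, the sketch below processes the items of one cluster in ranked order and keeps an item only if it is not too similar to anything already kept. This is an illustrative assumption about how ranking interacts with pairwise similarity, not the behavior of the `RankingStrategy` class itself; the function names, the dictionary layout, and the `source_priority` field are all hypothetical.

```python
import math
from typing import Any, Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedup_cluster(
    items: list[dict[str, Any]],
    rank_key: Callable[[dict[str, Any]], Any],
    threshold: float = 0.9,
) -> list[str]:
    """Return ids kept within one cluster; better-ranked items win ties.

    Items are visited in ascending rank_key order, so whichever item
    ranks first among a group of near-duplicates is the one retained.
    """
    kept: list[dict[str, Any]] = []
    for item in sorted(items, key=rank_key):
        if all(cosine(item["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(item)
    return [k["id"] for k in kept]


cluster = [
    {"id": "a", "embedding": [1.0, 0.0], "source_priority": 2},
    {"id": "b", "embedding": [0.99, 0.01], "source_priority": 1},
    {"id": "c", "embedding": [0.0, 1.0], "source_priority": 3},
]
# Metadata-based ranking: a lower source_priority value is preferred.
kept_by_priority = dedup_cluster(cluster, rank_key=lambda it: it["source_priority"])
```

Here `a` and `b` are near-duplicates; ranking by `source_priority` keeps `b`, while ranking by `id` would keep `a` instead. The pairwise loop is quadratic in cluster size, which is why clustering first keeps the comparison set small.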
-## Core Refactors
-
-The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
-
-```{mermaid}
-graph LR
-    subgraph "User Layer"
-        P[Pipeline]
-        S1[ProcessingStage X→Y]
-        S2[ProcessingStage Y→Z]
-        S3[ProcessingStage Z→W]
-        R[Resources<br/>CPU/GPU/NVDEC/NVENC]
-    end
-
-    subgraph "Orchestration Layer"
-        BE[BaseExecutor Interface]
-    end
-
-    subgraph "Backend Layer"
-        XE[XennaExecutor<br/>Production Ready]
-        RAP[RayActorPoolExecutor<br/>Experimental]
-        RDE[RayDataExecutor<br/>Experimental]
-    end
-
-    subgraph "Adaptation Layer"
-        XA[Xenna Adapter]
-        RAPA[Ray Actor Adapter]
-        RDA[Ray Data Adapter]
-    end
-
-    subgraph "Execution Layer"
-        X[Cosmos-Xenna<br/>Streaming/Batch]
-        RAY1[Ray Actor Pool<br/>Load Balancing]
-        RAY2[Ray Data API<br/>Dataset Processing]
-    end
-
-    P --> S1
-    P --> S2
-    P --> S3
-    S1 -.-> R
-    S2 -.-> R
-    S3 -.-> R
-
-    P --> BE
-    BE --> XE
-    BE --> RAP
-    BE --> RDE
-
-    XE --> XA
-    RAP --> RAPA
-    RDE --> RDA
-
-    XA --> X
-    RAPA --> RAY1
-    RDA --> RAY2
-
-    style XE fill:#90EE90
-    style RAP fill:#FFE4B5
-    style RDE fill:#FFE4B5
-    style P fill:#E6F3FF
-    style BE fill:#F0F8FF
-```
-
-### Pipelines
-
-- **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface
-- **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md)
-- **Resource specification**: Configurable CPU and GPU memory requirements per stage
-- **Stage composition**: Improved stage validation and execution orchestration
-
-### Stages
-
-- **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety
-- **Resource requirements**: Built-in resource specification for CPU and GPU memory
-- **Backend adapters**: Stage adaptation layer for different Ray orchestration systems
-- **Input/output validation**: Enhanced type checking and data validation
-
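The idea of a generic stage parameterized by input and output types, with per-stage resource requirements, can be sketched with the standard `typing` module. The classes below (`SketchStage`, `SketchResources`, `Tokenize`, `CountTokens`) are hypothetical stand-ins written for illustration; they are not NeMo Curator's `ProcessingStage` or its resource classes, whose actual signatures are not shown in this document.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")
Z = TypeVar("Z")


@dataclass(frozen=True)
class SketchResources:
    """Hypothetical per-stage resource request."""
    cpus: float = 1.0
    gpu_memory_gb: float = 0.0


class SketchStage(Generic[X, Y]):
    """Hypothetical stage that turns a task of type X into one of type Y."""
    resources = SketchResources()

    def process(self, task: X) -> Y:
        raise NotImplementedError


class Tokenize(SketchStage[str, list[str]]):
    def process(self, task: str) -> list[str]:
        return task.split()


class CountTokens(SketchStage[list[str], int]):
    resources = SketchResources(cpus=0.5)  # lighter stage, smaller request

    def process(self, task: list[str]) -> int:
        return len(task)


def run_two(first: SketchStage[X, Y], second: SketchStage[Y, Z], task: X) -> Z:
    """Chain two stages; a static type checker can verify Y lines up."""
    return second.process(first.process(task))


count = run_two(Tokenize(), CountTokens(), "a b c")
```

The type parameters are checked statically (for example by mypy or pyright), not at runtime: swapping the two stages in `run_two` would be flagged by a type checker because `int` is not `str`, which is the kind of mismatch a typed stage interface is meant to surface early.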
-## Tutorials
-
-- **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API
-- **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend
-- **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio)
-- **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video)
-
-For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository.
-
 ## Synthetic Data Generation
 
 New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:
@@ -205,35 +27,12 @@ New Ray-based synthetic data generation capabilities for creating and augmenting
 
 Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).
 
-## Known Limitations
-
-> (Pending Refactor in Future Release)
-
-### Generation
-
-- **Hard negative mining**: Retrieval-based data generation workflows under development
-
-### PII
-
-- **PII processing**: Personal Identifiable Information removal tools are being updated for Ray backend
-- **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development
-
-### Blending & Shuffling
-
-- **Data blending**: Multi-source dataset blending functionality being refactored
-- **Dataset shuffling**: Large-scale data shuffling operations under development
-
-## Docs Refactor
-
-- **Local preview capability**: Improved documentation build system with local preview support
-- **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md))
-- **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples
 
 ---
 
 ## What's Next
 
-The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.