NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities:
67
-
68
-
- **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction
69
-
- **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal
70
-
- **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement
71
-
- **Embedding generation**: Cosmos-Embed1 models for clip-level embeddings
72
-
- **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions
73
-
- **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md)
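
As a rough illustration of the fixed-stride splitting mentioned above, the sketch below computes clip boundaries for a video of known duration. It is not NeMo Curator's implementation: the function name `fixed_stride_spans` and its parameters are hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def fixed_stride_spans(
    duration_s: float,
    clip_len_s: float,
    stride_s: float,
    min_len_s: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans in seconds (hypothetical helper, not a Curator API)."""
    if clip_len_s <= 0 or stride_s <= 0:
        raise ValueError("clip_len_s and stride_s must be positive")
    spans: List[Tuple[float, float]] = []
    i = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while i * stride_s < duration_s:
        start = i * stride_s
        end = min(start + clip_len_s, duration_s)
        # Drop a trailing clip that is shorter than the requested minimum.
        if end - start >= min_len_s:
            spans.append((start, end))
        i += 1
    return spans


if __name__ == "__main__":
    print(fixed_stride_spans(25.0, clip_len_s=10.0, stride_s=10.0))
```

Setting `min_len_s` equal to `clip_len_s` discards the short final clip, which is one common way to keep clip lengths uniform.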
74
-
75
-
### Audio
76
-
77
-
New [audio curation capabilities](../../curate-audio/index.md) for speech data processing:
78
-
79
-
- **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models
80
-
- **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation
81
-
- **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second)
82
-
- **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage`
83
-
- **Manifest support**: JSONL manifest format for audio file management
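
For readers unfamiliar with the quality metrics listed above, here is a minimal, dependency-free sketch of how WER, CER, and speech rate are conventionally defined. It uses a plain Levenshtein distance and is not the code NeMo Curator runs; all function names are illustrative assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def words_per_second(text: str, duration_s: float) -> float:
    """Speech rate in words per second for a transcript of known duration."""
    return len(text.split()) / duration_s


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, so a filtering threshold should not assume the value is bounded by 1.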
84
-
85
-
## Modality Refactors
86
-
87
-
### Text
88
-
89
-
- **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md)
90
-
- **Improved model-based classifier throughput**: Better overlap of compute between tokenization and inference, using [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for more efficient GPU memory utilization
91
-
- **Task-centric architecture**: New `Task`-based processing model for finer-grained control
92
-
- **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification
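
To show why length-based sequence sorting helps, the sketch below groups tokenized sequences into batches after sorting by length and compares the padded token count with naive in-order batching. It illustrates the general technique only; `length_sorted_batches` and `padded_tokens` are hypothetical helpers, not functions from NeMo Curator.

```python
from typing import List


def length_sorted_batches(token_ids: List[List[int]], batch_size: int) -> List[List[int]]:
    """Return batches of sequence indices, grouped so similar lengths share a batch."""
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    return [order[s:s + batch_size] for s in range(0, len(order), batch_size)]


def padded_tokens(token_ids: List[List[int]], batches: List[List[int]]) -> int:
    """Total tokens processed when every batch is padded to its longest sequence."""
    return sum(len(b) * max(len(token_ids[i]) for i in b) for b in batches)


if __name__ == "__main__":
    # Four toy sequences with lengths 2, 9, 3, and 8.
    seqs = [[1] * 2, [1] * 9, [1] * 3, [1] * 8]
    naive = [[0, 1], [2, 3]]
    print(padded_tokens(seqs, naive), padded_tokens(seqs, length_sorted_batches(seqs, 2)))
```

Because the batches hold indices rather than the sequences themselves, predictions can be written back in the original document order after inference.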
93
-
94
-
### Image
95
-
96
-
- **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
97
-
- **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
98
-
- **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md)
99
-
- **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores
100
-
101
-
Learn more about [image curation](../../curate-images/index.md).
102
-
103
-
## Deduplication Improvements
104
-
105
-
Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
106
-
107
-
- **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities
108
-
- **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows
109
-
- **New ranking strategies**: Added `RankingStrategy`, which lets you rank elements within each cluster to decide which point to prioritize during duplicate removal, with support for [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs
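
The idea behind metadata-based ranking can be sketched in plain Python: group records by cluster, sort each group by one or more metadata fields, and treat the first-ranked record as the one to prioritize. This does not use or reproduce the real `RankingStrategy` API; the function `rank_within_clusters` and the record fields (`cluster`, `source_priority`) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rank_within_clusters(
    records: List[Dict],
    keys: Sequence[str],
    ascending: Sequence[bool],
) -> Dict[int, List[Dict]]:
    """Group records by their 'cluster' field and order each group by metadata keys."""
    clusters: Dict[int, List[Dict]] = defaultdict(list)
    for record in records:
        clusters[record["cluster"]].append(record)

    ranked: Dict[int, List[Dict]] = {}
    for cluster_id, members in clusters.items():
        ordered = list(members)
        # Stable sorts applied from the least to the most significant key.
        for key, asc in reversed(list(zip(keys, ascending))):
            ordered.sort(key=lambda r, k=key: r[k], reverse=not asc)
        ranked[cluster_id] = ordered
    return ranked


if __name__ == "__main__":
    docs = [
        {"id": "a", "cluster": 0, "source_priority": 2},
        {"id": "b", "cluster": 0, "source_priority": 1},
        {"id": "c", "cluster": 1, "source_priority": 1},
    ]
    ranked = rank_within_clusters(docs, keys=["source_priority"], ascending=[True])
    # The first element of each list is the record prioritized for that cluster.
    print({cid: members[0]["id"] for cid, members in ranked.items()})
```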
110
-
111
-
## Core Refactors
112
-
113
-
The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
114
-
115
-
```{mermaid}
116
-
graph LR
117
-
subgraph "User Layer"
118
-
P[Pipeline]
119
-
S1[ProcessingStage X→Y]
120
-
S2[ProcessingStage Y→Z]
121
-
S3[ProcessingStage Z→W]
122
-
R[Resources<br/>CPU/GPU/NVDEC/NVENC]
123
-
end
124
-
125
-
subgraph "Orchestration Layer"
126
-
BE[BaseExecutor Interface]
127
-
end
128
-
129
-
subgraph "Backend Layer"
130
-
XE[XennaExecutor]
131
-
RAP[RayActorPoolExecutor]
132
-
RDE[RayDataExecutor]
133
-
end
134
-
135
-
subgraph "Adaptation Layer"
136
-
XA[Xenna Adapter]
137
-
RAPA[Ray Actor Adapter]
138
-
RDA[Ray Data Adapter]
139
-
end
140
-
141
-
subgraph "Execution Layer"
142
-
X[Cosmos-Xenna<br/>Streaming/Batch]
143
-
RAY1[Ray Actor Pool<br/>Load Balancing]
144
-
RAY2[Ray Data API<br/>Dataset Processing]
145
-
end
146
-
147
-
P --> S1
148
-
P --> S2
149
-
P --> S3
150
-
S1 -.-> R
151
-
S2 -.-> R
152
-
S3 -.-> R
153
-
154
-
P --> BE
155
-
BE --> XE
156
-
BE --> RAP
157
-
BE --> RDE
158
-
159
-
XE --> XA
160
-
RAP --> RAPA
161
-
RDE --> RDA
162
-
163
-
XA --> X
164
-
RAPA --> RAY1
165
-
RDA --> RAY2
166
-
167
-
style P fill:#E6F3FF
168
-
style BE fill:#F0F8FF
17
+
### Benchmarking Infrastructure
18
+
19
+
New comprehensive benchmarking framework for performance monitoring and optimization:
20
+
21
+
- **End-to-End Pipeline Benchmarking**: Automated benchmarks for all curation modalities (text, image, video, audio)
22
+
- **Performance Tracking**: Integration with MLflow for metrics tracking and Slack for notifications
- **Performance Metrics**: Automatic tracking of processing time, throughput, and resource usage
70
+
- **Better Debugging**: Detailed logs and error reporting for failed stages
71
+
72
+
## Improvements from 25.09
73
+
74
+
### Video Curation
172
75
173
-
- **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface
174
-
- **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md)
175
-
- **Resource specification**: Configurable CPU and GPU memory requirements per stage
176
-
- **Stage composition**: Improved stage validation and execution orchestration
76
+
- **Model Updates**: Removed InternVideo2 dependency; updated to more performant alternatives
77
+
- **vLLM 0.14.1**: Upgraded for better video captioning compatibility and performance
78
+
- **FFmpeg 8.0.1**: Latest FFmpeg with improved codec support and performance
79
+
- **Enhanced Tutorials**: Improved video processing examples with real-world scenarios
177
80
178
-
### Stages
81
+
### Audio Curation
179
82
180
-
- **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety
181
-
- **Resource requirements**: Built-in resource specification for CPU and GPU memory
182
-
- **Backend adapters**: Stage adaptation layer for different Ray orchestration systems
183
-
- **Input/output validation**: Enhanced type checking and data validation
83
+
- **Enhanced Documentation**: Comprehensive ASR inference and quality assessment guides
84
+
- **Improved WER Filtering**: Better guidance for Word Error Rate filtering thresholds
85
+
- **Manifest Handling**: More robust JSONL manifest processing for large audio datasets
184
86
185
-
## Tutorials
87
+
### Image Curation
186
88
187
-
- **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API
188
-
- **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend
189
-
- **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio)
190
-
- **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video)
89
+
- **Optimized Batch Sizes**: Reduced default batch sizes for better CPU memory usage (`batch_size=50`, `num_threads=4`)
90
+
- **Memory Guidance**: Added troubleshooting documentation for out-of-memory errors
91
+
- **Tutorial Improvements**: Updated examples optimized for typical GPU configurations
191
92
192
-
For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository.
93
+
### Text Curation
193
94
194
-
## Known Limitations
95
+
- **ID Field Standardization**: Unified ID naming conventions across all deduplication workflows
96
+
- **Performance Optimizations**: Fused the document iterate and extract stages for reduced overhead
97
+
- **Better Memory Management**: Improved handling of large-scale semantic deduplication
98
+
- **Small Cluster Warnings**: Automatic warnings when `n_clusters` is too small for effective deduplication
99
+
- **FilePartitioning Improvements**: One worker per partition for better parallelization
195
100
196
-
> (Pending Refactor in Future Release)
101
+
### Deduplication Enhancements
197
102
198
-
### Generation
103
+
- **Cloud Storage Support**: Fixed ParquetReader/Writer and pairwise I/O for S3, GCS, and Azure Blob
104
+
- **Non-Blocking ID Generation**: Improved ID generator performance for large datasets
105
+
- **Empty Batch Handling**: Better error handling for filters processing empty data batches
199
106
200
-
- **Synthetic data generation**: Synthetic text generation features are being refactored for Ray compatibility
201
-
- **Hard negative mining**: Retrieval-based data generation workflows under development
107
+
## Dependency Updates
202
108
203
-
### PII
109
+
- **Transformers**: Pinned to 4.55.2 for stability and compatibility
110
+
- **vLLM**: Updated to 0.14.1 with video pipeline compatibility fixes
111
+
- **FFmpeg**: Upgraded to 8.0.1 for enhanced multimedia processing
112
+
- **Security Patches**:
113
+
- Addressed CVEs in aiohttp, urllib3, python-multipart, setuptools
114
+
- Removed vulnerable third-party aiohttp file from Ray
115
+
- Updated to secure dependency versions
204
116
205
-
- **PII processing**: Personally Identifiable Information removal tools are being updated for Ray backend
206
-
- **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development
117
+
## Bug Fixes
207
118
208
-
### Blending & Shuffling
119
+
- Fixed fasttext predict call compatibility with numpy>2
120
+
- Fixed broken NeMo Framework documentation links
121
+
- Fixed MegatronTokenizerWriter to download only necessary tokenizer files
122
+
- Fixed ID generator blocking issues for large-scale processing
123
+
- Fixed vLLM API compatibility with video captioning pipeline
124
+
- Fixed Gliner tutorial examples and SDG workflow bugs
125
+
- Improved semantic deduplication unit test reliability
209
126
210
-
- **Data blending**: Multi-source dataset blending functionality being refactored
211
-
- **Dataset shuffling**: Large-scale data shuffling operations under development
127
+
## Infrastructure & Developer Experience
212
128
213
-
## Docs Refactor
129
+
- **Secrets Detection**: Automated secret scanning in CI/CD workflows