Commit 22486ff

docs: isolate release notes and changelog (#1529)
* docs: isolate release notes and changelog

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* abhinav's feedback

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* feedback

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
1 parent 9360428 commit 22486ff

2 files changed: 48 additions, 27 deletions
CHANGELOG.md

Lines changed: 37 additions & 0 deletions
@@ -1,5 +1,42 @@
 # Changelog
 
+## NVIDIA NeMo Curator 1.1.0
+
+### New Features
+
+- **Stage and Pipeline Benchmarking**: Benchmarking for all modalities (text, image, video, audio)
+- **YAML Configuration**: Declarative pipeline configuration with pre-built configs for code filtering, deduplication, heuristic filtering, and FastText
+- **Pipeline Performance and Metric Logging**: Automatic tracking of processing time, throughput, and resource usage; detailed logs and error reporting for failed stages
+
+### Improvements
+
+- **Video**: Removed InternVideo2; vLLM 0.15.1, FFmpeg 8.0.1
+- **Audio**: Enhanced ASR/WER docs, robust manifest handling
+- **Image**: Optimized batch sizes (batch_size=100, num_threads=16), memory guidance
+- **Text**: Better memory management for large-scale semantic deduplication
+- **Deduplication**: Cloud storage (S3, GCS, Azure) for ParquetReader/Writer, non-blocking ID generation, empty batch handling
+
+### Dependency Updates
+
+- Transformers 4.55.2, vLLM 0.15.1, FFmpeg 8.0.1
+- Security patches: aiohttp, urllib3, python-multipart, setuptools
+
+### Bug Fixes
+
+- FastText numpy>2 compatibility, NeMo doc links, ID generator blocking, vLLM video API, Gliner/SDG tutorials, semantic dedup test reliability
+
+### Infrastructure
+
+- Secrets detection, Dependabot, enhanced install tests, AWS runner support, Docker/uv optimization, Cursor rules
+
+### Breaking Changes
+
+- **InternVideo2 Removed**: Use Cosmos-Embed1 for video embeddings
+
+### Documentation
+
+- Heuristic filter guide, distributed classifier memory guidance, installation troubleshooting, memory management, AWS credentials
+
 ## NVIDIA NeMo Curator 1.0.0
 
 This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](https://docs.nvidia.com/nemo/curator/latest/curate-video/index.html) and [audio](https://docs.nvidia.com/nemo/curator/latest/curate-audio/index.html) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.

docs/about/release-notes/index.md

Lines changed: 11 additions & 27 deletions
@@ -14,28 +14,16 @@ modality: "universal"
 
 ## What's New in 26.02
 
-### Benchmarking Infrastructure
+### Stage and Pipeline Benchmarking
 
-New comprehensive benchmarking framework for performance monitoring and optimization:
+Benchmarking framework for performance monitoring:
 
-- **End-to-End Pipeline Benchmarking**: Automated benchmarks for all curation modalities (text, image, video, audio)
-- **Performance Tracking**: Integration with MLflow for metrics tracking and Slack for notifications
-- **Nightly Benchmarks**: Continuous performance monitoring across:
+- **Stage and Pipeline Benchmarking**: Automated benchmarks for curation modalities (text, image, video, audio)
+- **Performance Tracking**: Metrics tracking across:
 - Text pipelines: exact deduplication, fuzzy deduplication, semantic deduplication, score filters, modifiers
 - Image curation workflows with DALI-based processing
-- Video processing pipelines with scene detection and semantic deduplication
+- Video processing pipelines with splitting, scene detection, captioning, and semantic deduplication
 - Audio ASR inference and quality assessment
-- **Grafana Dashboards**: Real-time monitoring of pipeline performance and resource utilization
-
-### Ray Actor Pool Executor Improvements
-
-Enhanced features for the experimental Ray Actor Pool execution backend:
-
-- **Progress Bars**: New visual feedback for long-running actor pool operations, making it easier to monitor pipeline execution
-- **Improved Load Balancing**: Better worker distribution and task scheduling
-- **Enhanced Stability**: Continued refinements to the experimental executor
-
-Learn more in the [Execution Backends documentation](../../reference/infrastructure/execution-backends.md).
 
 ### YAML Configuration Support
 
@@ -50,12 +38,12 @@ Declarative pipeline configuration for text curation workflows:
 
 Example:
 ```bash
-python -m nemo_curator.config.run --config_file heuristic_filter_english_pipeline.yaml
+python run.py --config-path ./text --config-name heuristic_filter_english_pipeline.yaml input_path=./input_dir output_path=./output_dir
 ```
 
-### Workflow Results API
+### Pipeline Performance and Metric Logging
 
-New API for tracking and analyzing pipeline execution:
+Enhanced tracking of pipeline execution:
 
 - **Performance Metrics**: Automatic tracking of processing time, throughput, and resource usage
 - **Better Debugging**: Detailed logs and error reporting for failed stages
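
Note on the updated example command in this hunk: the trailing `input_path=./input_dir output_path=./output_dir` arguments read as `key=value` overrides layered on top of the YAML file. The diff does not show how `run.py` parses them, so the following is only a minimal sketch of the general idea, using the standard library. The helper names `parse_overrides` and `apply_overrides` and the placeholder defaults are hypothetical and are not part of NeMo Curator.

```python
def parse_overrides(tokens):
    """Split `key=value` command-line tokens into a dict (hypothetical helper)."""
    overrides = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        overrides[key] = value
    return overrides


def apply_overrides(config, overrides):
    """Return a copy of `config` with top-level keys replaced by the overrides."""
    merged = dict(config)
    merged.update(overrides)
    return merged


# Placeholder defaults standing in for values a YAML file might declare.
defaults = {"input_path": None, "output_path": None}
cli_tokens = ["input_path=./input_dir", "output_path=./output_dir"]

resolved = apply_overrides(defaults, parse_overrides(cli_tokens))
print(resolved)
```

The sketch only handles flat, top-level keys; a real configuration system would typically also support nested keys and typed values.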
@@ -65,7 +53,7 @@ New API for tracking and analyzing pipeline execution:
 ### Video Curation
 
 - **Model Updates**: Removed InternVideo2 dependency; updated to more performant alternatives
-- **vLLM 0.14.1**: Upgraded for better video captioning compatibility and performance
+- **vLLM 0.15.1**: Upgraded for better video captioning compatibility and performance
 - **FFmpeg 8.0.1**: Latest FFmpeg with improved codec support and performance
 - **Enhanced Tutorials**: Improved video processing examples with real-world scenarios
 
@@ -77,15 +65,13 @@ New API for tracking and analyzing pipeline execution:
 
 ### Image Curation
 
-- **Optimized Batch Sizes**: Reduced default batch sizes for better CPU memory usage (batch_size=50, num_threads=4)
+- **Optimized Batch Sizes**: Configurable batch sizes for better CPU/GPU memory usage (batch_size=100, num_threads=16)
 - **Memory Guidance**: Added troubleshooting documentation for out-of-memory errors
 - **Tutorial Improvements**: Updated examples optimized for typical GPU configurations
 
 ### Text Curation
 
 - **Better Memory Management**: Improved handling of large-scale semantic deduplication
-- **Small Cluster Warnings**: Automatic warnings when n_clusters is too small for effective deduplication
-- **FilePartitioning Improvements**: One worker per partition for better parallelization
 
 ### Deduplication Enhancements
 
@@ -96,7 +82,7 @@ New API for tracking and analyzing pipeline execution:
 ## Dependency Updates
 
 - **Transformers**: Pinned to 4.55.2 for stability and compatibility
-- **vLLM**: Updated to 0.14.1 with video pipeline compatibility fixes
+- **vLLM**: Updated to 0.15.1 with video pipeline compatibility fixes
 - **FFmpeg**: Upgraded to 8.0.1 for enhanced multimedia processing
 - **Security Patches**:
 - Addressed CVEs in aiohttp, urllib3, python-multipart, setuptools
@@ -107,7 +93,6 @@ New API for tracking and analyzing pipeline execution:
 
 - Fixed fasttext predict call compatibility with numpy>2
 - Fixed broken NeMo Framework documentation links
-- Fixed MegatronTokenizerWriter to download only necessary tokenizer files
 - Fixed ID generator blocking issues for large-scale processing
 - Fixed vLLM API compatibility with video captioning pipeline
 - Fixed Gliner tutorial examples and SDG workflow bugs
@@ -120,7 +105,6 @@ New API for tracking and analyzing pipeline execution:
 - **Enhanced Install Tests**: Comprehensive installation validation across environments
 - **AWS Runner Support**: CI/CD execution on AWS infrastructure
 - **Docker Optimization**: Improved layer caching and build times with uv
-- **Code Linting**: Standardized code quality checks with markdownlint and pre-commit hooks
 - **Cursor Rules**: Development guidelines and patterns for IDE assistance
 
 ## Breaking Changes
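
Note on the "Pipeline Performance and Metric Logging" entries in this diff: the release notes mention tracking of processing time and throughput but do not show the mechanism. As a rough illustration of what per-stage timing and throughput mean, here is a minimal standard-library sketch. `StageMetrics` and `timed_stage` are hypothetical names invented for this example and do not reflect NeMo Curator's actual implementation.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Wall-clock time and item count for one stage (hypothetical type)."""
    name: str
    items: int = 0
    seconds: float = 0.0

    @property
    def throughput(self):
        # Items per second; 0.0 when no time has been recorded.
        return self.items / self.seconds if self.seconds > 0 else 0.0


@contextmanager
def timed_stage(name, items):
    """Measure how long the enclosed block takes and record it on the metrics."""
    metrics = StageMetrics(name=name, items=items)
    start = time.perf_counter()
    try:
        yield metrics
    finally:
        # Recorded even if the stage raises, so failed stages still report time.
        metrics.seconds = time.perf_counter() - start


docs = ["A", "B", "C"]
with timed_stage("lowercase", items=len(docs)) as m:
    result = [d.lower() for d in docs]

print(f"{m.name}: {m.items} items in {m.seconds:.6f}s ({m.throughput:.1f} items/s)")
```

Resource usage (CPU, GPU, memory) would need additional instrumentation that this sketch does not attempt.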