docs: isolate release notes and changelog (#1529)

lbliii · web-flow · commit 22486ff6e9b3 · 2026-02-19T14:53:45.000-05:00
* docs: isolate release notes and changelog

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* abhinav's feedback

Signed-off-by: Lawrence Lane <llane@nvidia.com>

* feedback

Signed-off-by: Lawrence Lane <llane@nvidia.com>

---------

Signed-off-by: Lawrence Lane <llane@nvidia.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,42 @@
 # Changelog
 
+## NVIDIA NeMo Curator 1.1.0
+
+### New Features
+
+- **Stage and Pipeline Benchmarking**: Benchmarking for all modalities (text, image, video, audio)
+- **YAML Configuration**: Declarative pipeline configuration with pre-built configs for code filtering, deduplication, heuristic filtering, and FastText
+- **Pipeline Performance and Metric Logging**: Automatic tracking of processing time, throughput, and resource usage; detailed logs and error reporting for failed stages
+
+### Improvements
+
+- **Video**: Removed InternVideo2; vLLM 0.15.1, FFmpeg 8.0.1
+- **Audio**: Enhanced ASR/WER docs, robust manifest handling
+- **Image**: Optimized batch sizes (batch_size=100, num_threads=16), memory guidance
+- **Text**: Better memory management for large-scale semantic deduplication
+- **Deduplication**: Cloud storage (S3, GCS, Azure) for ParquetReader/Writer, non-blocking ID generation, empty batch handling
+
+### Dependency Updates
+
+- Transformers 4.55.2, vLLM 0.15.1, FFmpeg 8.0.1
+- Security patches: aiohttp, urllib3, python-multipart, setuptools
+
+### Bug Fixes
+
+- FastText numpy>2 compatibility, NeMo doc links, ID generator blocking, vLLM video API, Gliner/SDG tutorials, semantic dedup test reliability
+
+### Infrastructure
+
+- Secrets detection, Dependabot, enhanced install tests, AWS runner support, Docker/uv optimization, Cursor rules
+
+### Breaking Changes
+
+- **InternVideo2 Removed**: Use Cosmos-Embed1 for video embeddings
+
+### Documentation
+
+- Heuristic filter guide, distributed classifier memory guidance, installation troubleshooting, memory management, AWS credentials
+
 ## NVIDIA NeMo Curator 1.0.0
 
 This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](https://docs.nvidia.com/nemo/curator/latest/curate-video/index.html) and [audio](https://docs.nvidia.com/nemo/curator/latest/curate-audio/index.html) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
diff --git a/docs/about/release-notes/index.md b/docs/about/release-notes/index.md
@@ -14,28 +14,16 @@ modality: "universal"
 
 ## What's New in 26.02
 
-### Benchmarking Infrastructure
+### Stage and Pipeline Benchmarking
 
-New comprehensive benchmarking framework for performance monitoring and optimization:
+Benchmarking framework for performance monitoring:
 
-- **End-to-End Pipeline Benchmarking**: Automated benchmarks for all curation modalities (text, image, video, audio)
-- **Performance Tracking**: Integration with MLflow for metrics tracking and Slack for notifications
-- **Nightly Benchmarks**: Continuous performance monitoring across:
+- **Stage and Pipeline Benchmarking**: Automated benchmarks for curation modalities (text, image, video, audio)
+- **Performance Tracking**: Metrics tracking across:
   - Text pipelines: exact deduplication, fuzzy deduplication, semantic deduplication, score filters, modifiers
   - Image curation workflows with DALI-based processing
-  - Video processing pipelines with scene detection and semantic deduplication
+  - Video processing pipelines with splitting, scene detection, captioning, and semantic deduplication
   - Audio ASR inference and quality assessment
-- **Grafana Dashboards**: Real-time monitoring of pipeline performance and resource utilization
-
-### Ray Actor Pool Executor Improvements
-
-Enhanced features for the experimental Ray Actor Pool execution backend:
-
-- **Progress Bars**: New visual feedback for long-running actor pool operations, making it easier to monitor pipeline execution
-- **Improved Load Balancing**: Better worker distribution and task scheduling
-- **Enhanced Stability**: Continued refinements to the experimental executor
-
-Learn more in the [Execution Backends documentation](../../reference/infrastructure/execution-backends.md).
 
 ### YAML Configuration Support
 
@@ -50,12 +38,12 @@ Declarative pipeline configuration for text curation workflows:
 
 Example:
 ```bash
-python -m nemo_curator.config.run --config_file heuristic_filter_english_pipeline.yaml
+python run.py --config-path ./text --config-name heuristic_filter_english_pipeline.yaml input_path=./input_dir output_path=./output_dir
 ```
 
-### Workflow Results API
+### Pipeline Performance and Metric Logging
 
-New API for tracking and analyzing pipeline execution:
+Enhanced tracking of pipeline execution:
 
 - **Performance Metrics**: Automatic tracking of processing time, throughput, and resource usage
 - **Better Debugging**: Detailed logs and error reporting for failed stages
@@ -65,7 +53,7 @@ New API for tracking and analyzing pipeline execution:
 ### Video Curation
 
 - **Model Updates**: Removed InternVideo2 dependency; updated to more performant alternatives
-- **vLLM 0.14.1**: Upgraded for better video captioning compatibility and performance
+- **vLLM 0.15.1**: Upgraded for better video captioning compatibility and performance
 - **FFmpeg 8.0.1**: Latest FFmpeg with improved codec support and performance
 - **Enhanced Tutorials**: Improved video processing examples with real-world scenarios
 
@@ -77,15 +65,13 @@ New API for tracking and analyzing pipeline execution:
 
 ### Image Curation
 
-- **Optimized Batch Sizes**: Reduced default batch sizes for better CPU memory usage (batch_size=50, num_threads=4)
+- **Optimized Batch Sizes**: Configurable batch sizes for better CPU/GPU memory usage (batch_size=100, num_threads=16)
 - **Memory Guidance**: Added troubleshooting documentation for out-of-memory errors
 - **Tutorial Improvements**: Updated examples optimized for typical GPU configurations
 
 ### Text Curation
 
 - **Better Memory Management**: Improved handling of large-scale semantic deduplication
-- **Small Cluster Warnings**: Automatic warnings when n_clusters is too small for effective deduplication
-- **FilePartitioning Improvements**: One worker per partition for better parallelization
 
 ### Deduplication Enhancements
 
@@ -96,7 +82,7 @@ New API for tracking and analyzing pipeline execution:
 ## Dependency Updates
 
 - **Transformers**: Pinned to 4.55.2 for stability and compatibility
-- **vLLM**: Updated to 0.14.1 with video pipeline compatibility fixes
+- **vLLM**: Updated to 0.15.1 with video pipeline compatibility fixes
 - **FFmpeg**: Upgraded to 8.0.1 for enhanced multimedia processing
 - **Security Patches**:
   - Addressed CVEs in aiohttp, urllib3, python-multipart, setuptools
@@ -107,7 +93,6 @@ New API for tracking and analyzing pipeline execution:
 
 - Fixed fasttext predict call compatibility with numpy>2
 - Fixed broken NeMo Framework documentation links
-- Fixed MegatronTokenizerWriter to download only necessary tokenizer files
 - Fixed ID generator blocking issues for large-scale processing
 - Fixed vLLM API compatibility with video captioning pipeline
 - Fixed Gliner tutorial examples and SDG workflow bugs
@@ -120,7 +105,6 @@ New API for tracking and analyzing pipeline execution:
 - **Enhanced Install Tests**: Comprehensive installation validation across environments
 - **AWS Runner Support**: CI/CD execution on AWS infrastructure
 - **Docker Optimization**: Improved layer caching and build times with uv
-- **Code Linting**: Standardized code quality checks with markdownlint and pre-commit hooks
 - **Cursor Rules**: Development guidelines and patterns for IDE assistance
 
 ## Breaking Changes