Llane/sdg ray docs #1347
Changes from 9 commits
8d42be8
bccea95
abd2209
91f5f9a
9a29ce7
8b6f531
a4ae7a4
2794210
b237fcb
eea5799
777ea37
2891e8f
8a3e846
0086976
ad1b77f
6642691
3cd18a7
40eedfa
0a14d61
66ff0c6
7b8b856
9687c5c
58aa4f9
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -12,214 +12,27 @@ modality: "universal" | |
|
|
||
| # NeMo Curator Release Notes: {{ current_release }} | ||
|
|
||
| This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](../../curate-video/index.md) and [audio](../../curate-audio/index.md) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads. | ||
| ## Synthetic Data Generation | ||
|
Comment on lines 13 to +15
Contributor
The removed introductory paragraph provided important context about the Dask-to-Ray architecture shift and migration guide references. Consider restoring this as the opening paragraph before the SDG section to help users understand the scope of v26.02.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Contributor
Missing critical context. The opening paragraph describing the Dask-to-Ray architecture shift and migration guide references were removed. The "Installation Updates" section header is also missing, leaving the installation table without proper context. Restore the removed content: |
||
|
|
||
| **Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions. | ||
| New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs: | ||
|
|
||
| ## Installation Updates | ||
| - **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff | ||
| - **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts | ||
| - **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows: | ||
|
||
| - **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose | ||
| - **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training | ||
| - **Distill**: Create condensed, information-dense paraphrases preserving key concepts | ||
| - **Extract Knowledge**: Extract factual content as textbook-style passages | ||
| - **Knowledge List**: Extract structured fact lists from documents | ||
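The "retry logic, and exponential backoff" behavior named in the LLM Client Infrastructure bullet can be illustrated with a minimal sketch. This is not NeMo Curator's client code: the function name `call_with_backoff` and all of its parameters are hypothetical, and the jitter policy is an assumption chosen for illustration.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.

    Hypothetical helper for illustration only; not a NeMo Curator API.
    The delay doubles after each failed attempt, is capped at max_delay,
    and gets up to 10% random jitter so concurrent clients do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the retry schedule can be inspected in tests without actually waiting.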
|
|
||
| - **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`) | ||
| - **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support | ||
| - **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management | ||
| - **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality: | ||
| Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md). | ||
Comment on lines +15 to +28
Contributor
Missing section header for Installation Updates. The table starting at line 30 contains installation extras but has no header after replacing "Installation Updates" with "Synthetic Data Generation". Add a header like |
||
|
|
||
|
Comment on lines 15 to 29
Contributor
The release notes section was reduced from 231 lines to 44 lines, removing all the comprehensive v26.02 release information. The SDG section should be added to the existing release notes, not replace them. Missing content includes:
Restore the original release notes content and add the SDG section as a new bullet point under the appropriate category (likely "New Features" or "Text Modality Updates"). |
||
| ```{list-table} Available Installation Extras | ||
| :header-rows: 1 | ||
| :widths: 25 35 40 | ||
|
|
||
| * - Extra | ||
| - Installation Command | ||
| - Description | ||
| * - **All Modalities** | ||
| - `nemo-curator[all]` | ||
| - Complete installation with all modalities and GPU support | ||
| * - **Text Curation** | ||
| - `nemo-curator[text_cuda12]` | ||
| - GPU-accelerated text processing with RAPIDS | ||
| * - **Image Curation** | ||
| - `nemo-curator[image_cuda12]` | ||
| - Image processing with NVIDIA DALI | ||
| * - **Audio Curation** | ||
| - `nemo-curator[audio_cuda12]` | ||
| - Speech recognition with NeMo ASR models | ||
| * - **Video Curation** | ||
| - `nemo-curator[video_cuda12]` | ||
| - Video processing with GPU acceleration | ||
| * - **Basic GPU** | ||
| - `nemo-curator[cuda12]` | ||
| - CUDA utilities without modality-specific dependencies | ||
| ``` | ||
|
|
||
| All GPU installations require the NVIDIA PyPI index: | ||
| ```bash | ||
| uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[EXTRA] | |
| ``` | ||
|
|
||
| ## New Modalities | ||
|
|
||
| ### Video | ||
|
|
||
| NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities: | ||
|
|
||
| - **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction | ||
| - **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal | ||
| - **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement | ||
| - **Embedding generation**: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings | ||
| - **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions | ||
| - **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md) | ||
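As a rough illustration of the fixed-stride splitting mentioned in the video list above, the sketch below computes clip boundaries for a video of known duration. The function name `fixed_stride_spans` and its parameters are hypothetical and do not reflect NeMo Curator's actual stage configuration.

```python
def fixed_stride_spans(duration_s, clip_len_s, stride_s, min_clip_s=0.0):
    """Return (start, end) clip spans in seconds for fixed-stride splitting.

    Hypothetical helper for illustration only; not a NeMo Curator API.
    A new clip starts every stride_s seconds, each clip is at most
    clip_len_s long, and clips shorter than min_clip_s are dropped.
    """
    if clip_len_s <= 0 or stride_s <= 0:
        raise ValueError("clip_len_s and stride_s must be positive")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)  # truncate the final clip
        if end - start >= min_clip_s:
            spans.append((start, end))
        start += stride_s
    return spans
```

With `stride_s` equal to `clip_len_s` the clips tile the video without overlap; a smaller stride produces overlapping clips.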
|
|
||
| ### Audio | ||
|
|
||
| New [audio curation capabilities](../../curate-audio/index.md) for speech data processing: | ||
|
|
||
| - **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models | ||
| - **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation | ||
| - **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second) | ||
| - **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage` | ||
| - **Manifest support**: JSONL manifest format for audio file management | ||
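For readers unfamiliar with the quality metrics in the audio list above, here is a minimal sketch of Word Error Rate and Character Error Rate computed from Levenshtein edit distance. It uses plain whitespace tokenization and no text normalization, which is an assumption; the function names are hypothetical and this is not the implementation NeMo Curator uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```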
|
|
||
| ## Modality Refactors | ||
|
|
||
| ### Text | ||
|
|
||
| - **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md) | ||
| - **Improved model-based classifier throughput**: Better overlapping of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization | ||
| - **Task-centric architecture**: New `Task`-based processing model for finer-grained control | ||
| - **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification | ||
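The length-based sequence sorting mentioned in the classifier bullet above can be sketched as follows. The idea shown here, grouping sequences of similar length so each batch needs less padding, is an assumption about why the sorting helps; the function names are hypothetical and not part of NeMo Curator.

```python
def length_sorted_batches(sequences, batch_size):
    """Group sequence indices into batches after sorting by sequence length.

    Hypothetical helper for illustration only; not a NeMo Curator API.
    Returning indices lets the caller restore the original order afterwards.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    return [order[s:s + batch_size] for s in range(0, len(order), batch_size)]


def padded_size(sequences, batches):
    """Total token slots used when each batch is padded to its longest sequence."""
    return sum(len(b) * max(len(sequences[i]) for i in b) for b in batches)
```

Comparing `padded_size` for sorted and unsorted batches of the same data shows how much padding the sort avoids.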
|
|
||
| ### Image | ||
|
|
||
| - **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages | ||
| - **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback | ||
| - **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md) | ||
| - **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores | ||
|
|
||
| Learn more about [image curation](../../curate-images/index.md). | ||
|
|
||
| ## Deduplication Improvements | ||
|
|
||
| Enhanced deduplication capabilities across all modalities with improved performance and flexibility: | ||
|
|
||
| - **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities | ||
| - **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows | ||
| - **New ranking strategies**: Added `RankingStrategy`, which lets you rank the elements within each cluster to decide which one to prioritize during duplicate removal, including [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs | |
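To make the ranking idea in the last bullet concrete, the sketch below keeps one item per duplicate cluster, chosen by a metadata field. It does not use or reproduce the real `RankingStrategy` interface; the function name `keep_top_ranked` and the `cluster` and `priority` field names are hypothetical.

```python
def keep_top_ranked(items, cluster_key="cluster", rank_key="priority", descending=True):
    """Keep the best-ranked item from each duplicate cluster.

    Hypothetical helper for illustration only; not the RankingStrategy API.
    items is a list of dicts that carry a cluster id and a metadata value to rank by.
    Ties keep the item that appeared first.
    """
    best = {}
    for item in items:
        cid = item[cluster_key]
        current = best.get(cid)
        if current is None:
            best[cid] = item
            continue
        if descending:
            better = item[rank_key] > current[rank_key]
        else:
            better = item[rank_key] < current[rank_key]
        if better:
            best[cid] = item
    return list(best.values())
```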
|
|
||
| ## Core Refactors | ||
|
|
||
| The architecture refactor introduces a layered system with unified interfaces and multiple execution backends: | ||
|
|
||
| ```{mermaid} | ||
| graph LR | ||
| subgraph "User Layer" | ||
| P[Pipeline] | ||
| S1[ProcessingStage X→Y] | ||
| S2[ProcessingStage Y→Z] | ||
| S3[ProcessingStage Z→W] | ||
| R[Resources<br/>CPU/GPU/NVDEC/NVENC] | ||
| end | ||
|
|
||
| subgraph "Orchestration Layer" | ||
| BE[BaseExecutor Interface] | ||
| end | ||
|
|
||
| subgraph "Backend Layer" | ||
| XE[XennaExecutor<br/>Production Ready] | ||
| RAP[RayActorPoolExecutor<br/>Experimental] | ||
| RDE[RayDataExecutor<br/>Experimental] | ||
| end | ||
|
|
||
| subgraph "Adaptation Layer" | ||
| XA[Xenna Adapter] | ||
| RAPA[Ray Actor Adapter] | ||
| RDA[Ray Data Adapter] | ||
| end | ||
|
|
||
| subgraph "Execution Layer" | ||
| X[Cosmos-Xenna<br/>Streaming/Batch] | ||
| RAY1[Ray Actor Pool<br/>Load Balancing] | ||
| RAY2[Ray Data API<br/>Dataset Processing] | ||
| end | ||
|
|
||
| P --> S1 | ||
| P --> S2 | ||
| P --> S3 | ||
| S1 -.-> R | ||
| S2 -.-> R | ||
| S3 -.-> R | ||
|
|
||
| P --> BE | ||
| BE --> XE | ||
| BE --> RAP | ||
| BE --> RDE | ||
|
|
||
| XE --> XA | ||
| RAP --> RAPA | ||
| RDE --> RDA | ||
|
|
||
| XA --> X | ||
| RAPA --> RAY1 | ||
| RDA --> RAY2 | ||
|
|
||
| style XE fill:#90EE90 | ||
| style RAP fill:#FFE4B5 | ||
| style RDE fill:#FFE4B5 | ||
| style P fill:#E6F3FF | ||
| style BE fill:#F0F8FF | ||
| ``` | ||
|
|
||
| ### Pipelines | ||
|
|
||
| - **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface | ||
| - **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md) | ||
| - **Resource specification**: Configurable CPU and GPU memory requirements per stage | ||
| - **Stage composition**: Improved stage validation and execution orchestration | ||
|
|
||
| ### Stages | ||
|
|
||
| - **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety | ||
| - **Resource requirements**: Built-in resource specification for CPU and GPU memory | ||
| - **Backend adapters**: Stage adaptation layer for different Ray orchestration systems | ||
| - **Input/output validation**: Enhanced type checking and data validation | ||
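The typed `ProcessingStage[X, Y]` pattern described in the list above can be approximated with Python generics. The sketch below is not the NeMo Curator base class: `SketchStage`, `SplitWords`, `CountWords`, and `run_stages` are hypothetical names, and resource specification and executor selection are left out.

```python
from abc import ABC, abstractmethod
from typing import Any, Generic, List, Sequence, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


class SketchStage(ABC, Generic[X, Y]):
    """Hypothetical stand-in for a typed stage: consumes an X and produces a Y."""

    @abstractmethod
    def process(self, task: X) -> Y:
        ...


class SplitWords(SketchStage[str, List[str]]):
    def process(self, task: str) -> List[str]:
        return task.split()


class CountWords(SketchStage[List[str], int]):
    def process(self, task: List[str]) -> int:
        return len(task)


def run_stages(stages: Sequence[SketchStage[Any, Any]], task: Any) -> Any:
    """Run stages in order, feeding each stage's output to the next one."""
    for stage in stages:
        task = stage.process(task)
    return task
```

Declaring the input and output types on each stage lets a static type checker flag a stage that is chained after an incompatible one.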
|
|
||
| ## Tutorials | ||
|
|
||
| - **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API | ||
| - **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend | ||
| - **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio) | ||
| - **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video) | ||
|
|
||
| For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository. | ||
|
|
||
| ## Known Limitations | ||
|
|
||
| > (Pending Refactor in Future Release) | ||
|
|
||
| ### Generation | ||
|
|
||
| - **Synthetic data generation**: Synthetic text generation features are being refactored for Ray compatibility | ||
| - **Hard negative mining**: Retrieval-based data generation workflows under development | ||
|
|
||
| ### PII | ||
|
|
||
| - **PII processing**: Personally Identifiable Information removal tools are being updated for Ray backend | |
| - **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development | ||
|
|
||
| ### Blending & Shuffling | ||
|
|
||
| - **Data blending**: Multi-source dataset blending functionality being refactored | ||
| - **Dataset shuffling**: Large-scale data shuffling operations under development | ||
|
|
||
| ## Docs Refactor | ||
|
|
||
| - **Local preview capability**: Improved documentation build system with local preview support | ||
| - **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md)) | ||
| - **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples | ||
|
|
||
| --- | ||
|
|
||
| ## What's Next | ||
|
|
||
| The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support. | ||
| The next release will focus on ... | ||
|
||
|
|
||
| ```{toctree} | ||
| :hidden: | ||
|
|
||
The opening paragraph describing the Ray architecture shift and migration guide reference was removed. This important context helps users understand the scope of the 26.02 release. Consider restoring: