-
Notifications
You must be signed in to change notification settings - Fork 213
β¨ Set up comprehensive Copilot instructions with complete API design architecture and framework overview for Python 3.12 and CUDA 12 #1031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 3 commits
Commits
Show all changes
14 commits
Select commit
Hold shift + click to select a range
560b62e
Initial plan
Copilot d53a7ca
Add comprehensive GitHub Copilot instructions for NVIDIA NeMo Curator
Copilot 6d9e461
Enhance Copilot instructions with explicit Python 3.12 and CUDA 12 focus
Copilot 8fff4c5
Address review feedback: add PyTorch, remove unsupported fallback patβ¦
Copilot df71d98
Add comprehensive API design principles and core components from api-β¦
Copilot d47c07e
Expand API design section with comprehensive background and context fβ¦
Copilot 44ba074
Remove Dask references and background sections per review feedback
Copilot da4ad29
Update task-centric architecture description to mention Ray
Copilot 7170425
Add comprehensive framework overview with key components, execution fβ¦
Copilot 29e0001
Addressing PR comments
Copilot 02ae7ee
Merge branch 'main' into copilot/fix-981-2
VibhuJawa 3cad997
Update .github/copilot-instructions.md
VibhuJawa 614029a
Apply suggestion from @sarahyurick
VibhuJawa 1ef6dd1
Update .github/copilot-instructions.md
VibhuJawa File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,263 @@ | ||
| # GitHub Copilot Instructions for NVIDIA NeMo Curator | ||
|
|
||
| ## Overview | ||
|
|
||
| NVIDIA NeMo Curator is a scalable data preprocessing tool for training large language models. This repository focuses on data curation pipelines for text, audio, video, and image modalities with support for both CPU and GPU processing. | ||
|
|
||
| ## Development Environment | ||
|
|
||
| ### Python Environment | ||
| - **Python Version**: 3.12 (recommended and tested), supports 3.10-3.12 | ||
| - **Package Manager**: [uv](https://docs.astral.sh/uv/) for fast, reliable dependency management | ||
| - **Virtual Environment**: Use uv's built-in virtual environment management | ||
| - **Note**: Python 3.11 is not supported due to RAPIDS compatibility requirements | ||
VibhuJawa marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| ### CUDA Environment (Optional) | ||
| - **CUDA Version**: 12.x (12.0+) for GPU acceleration | ||
| - **GPU Requirements**: NVIDIA GPU with Voltaβ’ architecture or higher (compute capability 7.0+) | ||
| - **GPU Libraries**: RAPIDS (cuDF, cuML), PyTorch with CUDA 12 support, CuPy | ||
| - **Note**: GPU features are optional and CPU fallbacks are provided | ||
|
|
||
| ## Key Technologies and Frameworks | ||
|
|
||
| ### Core Dependencies | ||
| - **PyTorch**: Deep learning framework with CUDA support | ||
| - **Ray**: Distributed computing framework for data processing | ||
| - **Pandas/CuDF**: Data manipulation (CPU/GPU respectively) | ||
| - **Transformers**: Hugging Face transformers library | ||
| - **LoguRu**: Structured logging | ||
VibhuJawa marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| ### Modality-Specific Libraries | ||
| - **Text Processing**: BeautifulSoup, fasttext, sentencepiece, trafilatura | ||
VibhuJawa marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| - **Audio Processing**: NeMo Toolkit ASR components | ||
| - **Video Processing**: OpenCV, PyAV, CvCuda, PyNvVideoCodec | ||
| - **Image Processing**: NVIDIA DALI for optimized data loading | ||
|
|
||
| ### Testing Framework | ||
| - **pytest**: Primary testing framework | ||
| - **pytest-asyncio**: Async testing support | ||
| - **pytest-coverage**: Code coverage measurement | ||
| - **GPU Testing**: Use `@pytest.mark.gpu` for GPU-dependent tests | ||
|
|
||
| ## Setup Instructions | ||
|
|
||
| ### Basic Setup | ||
| ```bash | ||
| # Install uv package manager | ||
| pip3 install uv | ||
|
|
||
| # Clone and setup development environment | ||
| git clone <repository-url> | ||
| cd Curator | ||
| uv sync | ||
|
|
||
| # For GPU development (requires CUDA 12.x) | ||
| uv sync --extra deduplication_cuda12x | ||
| ``` | ||
|
|
||
| ### Optional Feature Groups | ||
| ```bash | ||
| # Text processing capabilities | ||
| uv sync --extra text | ||
|
|
||
| # Video processing with GPU acceleration | ||
| uv sync --extra video --extra video_cuda | ||
|
|
||
| # All features (includes CUDA dependencies) | ||
| uv sync --extra all | ||
| ``` | ||
|
|
||
| ## Coding Standards and Patterns | ||
|
|
||
| ### Code Quality | ||
| - **Linting**: Ruff with comprehensive rule set (see pyproject.toml) | ||
| - **Line Length**: 119 characters maximum | ||
| - **Type Hints**: Use comprehensive type annotations | ||
| - **Imports**: Follow import sorting conventions | ||
|
|
||
| ### Module Structure | ||
| - **Stages**: Processing stages organized by modality (`text/`, `video/`, `audio/`, `image/`) | ||
| - **Utils**: Shared utilities for common operations | ||
| - **Datasets**: Data loading and manipulation classes | ||
| - **Modules**: Core processing algorithms | ||
|
|
||
| ### Error Handling Patterns | ||
| ```python | ||
| # Graceful degradation for optional GPU dependencies | ||
| try: | ||
| import cudf | ||
| import cvcuda | ||
| import PyNvVideoCodec as Nvc | ||
| HAS_GPU_SUPPORT = True | ||
| except ImportError: | ||
| HAS_GPU_SUPPORT = False | ||
| # Provide helpful error messages for missing dependencies | ||
| logger.warning("GPU dependencies not available, falling back to CPU processing") | ||
|
|
||
| # Example from nvcodec_utils.py | ||
| try: | ||
| import cvcuda | ||
| import nvcv | ||
| import pycuda.driver as cuda | ||
| import PyNvVideoCodec as Nvc | ||
| pixel_format_to_cvcuda_code = { | ||
| Nvc.Pixel_Format.YUV444: cvcuda.ColorConversion.YUV2RGB, | ||
| Nvc.Pixel_Format.NV12: cvcuda.ColorConversion.YUV2RGB_NV12, | ||
| } | ||
| except (ImportError, RuntimeError): | ||
| logger.warning("PyNvVideoCodec is not installed, some features will be disabled.") | ||
| Nvc = None | ||
| pixel_format_to_cvcuda_code = {} | ||
| ``` | ||
VibhuJawa marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| ### Configuration Patterns | ||
| - Use YAML configuration files for processing pipelines | ||
| - Support hierarchical configuration (CLI args > env vars > config files > defaults) | ||
| - Follow the configuration structure in `docs/admin/config/` | ||
|
|
||
| ## Testing Guidelines | ||
|
|
||
| ### Test Organization | ||
| - **Unit Tests**: Test individual functions and classes | ||
| - **Integration Tests**: Test complete processing pipelines | ||
| - **GPU Tests**: Mark with `@pytest.mark.gpu` decorator | ||
| - **Mock External Dependencies**: Use pytest mocks for external services | ||
|
|
||
| ### Test Environment Setup | ||
| ```python | ||
| # Example GPU test structure | ||
| @pytest.mark.gpu | ||
| def test_gpu_processing(): | ||
| import cudf # Only import in GPU tests | ||
| df = cudf.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]}) | ||
| assert len(df) == 3 | ||
| ``` | ||
|
|
||
| ### Test Configuration | ||
| - Tests use a unified Ray cluster configuration via `conftest.py` | ||
| - GPU availability is automatically detected using multiple methods (pynvml, nvidia-ml-py) | ||
| - Tests gracefully handle missing GPU dependencies | ||
|
|
||
| ### Running Tests | ||
| ```bash | ||
| # Run all tests (CPU only by default) | ||
| uv run pytest | ||
|
|
||
| # Run GPU tests (requires CUDA environment) | ||
| uv run pytest -m gpu | ||
|
|
||
| # Run specific test categories | ||
| uv run pytest -m "not gpu" # CPU tests only | ||
|
|
||
| # Run tests for specific modules | ||
| uv run pytest tests/stages/text/ | ||
| uv run pytest tests/stages/image/ | ||
| ``` | ||
|
|
||
| ## Build and Development | ||
|
|
||
| ### Development Workflow | ||
| ```bash | ||
| # Install development dependencies | ||
| uv sync | ||
|
|
||
| # Run linting and formatting | ||
| uv run ruff check . | ||
| uv run ruff format . | ||
|
|
||
| # Run tests | ||
| uv run pytest | ||
|
|
||
| # Run specific test modules | ||
| uv run pytest tests/utils/test_nvcodec_utils.py | ||
|
|
||
| # Build documentation (if working on docs) | ||
| make docs-html | ||
|
|
||
| # Check for dependency updates | ||
| uv lock --upgrade | ||
| ``` | ||
|
|
||
| ### Common Development Patterns | ||
|
|
||
| #### Processing Pipeline Structure | ||
| ```python | ||
| # Standard pipeline component structure | ||
| class ProcessingStage: | ||
| def __init__(self, config: Dict[str, Any]): | ||
| self.config = config | ||
|
|
||
| def __call__(self, dataset: DocumentDataset) -> DocumentDataset: | ||
| # Process and return modified dataset | ||
| pass | ||
| ``` | ||
|
|
||
| #### GPU vs CPU Handling | ||
| ```python | ||
| # Check for GPU availability and fallback gracefully | ||
| if HAS_GPU_SUPPORT and use_gpu: | ||
| # Use GPU-accelerated processing | ||
| df = cudf.DataFrame(data) | ||
| else: | ||
| # Fallback to CPU processing | ||
| df = pd.DataFrame(data) | ||
| ``` | ||
VibhuJawa marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| #### Configuration Loading | ||
| ```python | ||
| # Use hierarchical configuration loading | ||
| from nemo_curator.utils.config_utils import load_config | ||
|
|
||
| config = load_config( | ||
| config_path="config/processing.yaml", | ||
| overrides={"batch_size": 1024} | ||
| ) | ||
| ``` | ||
VibhuJawa marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
VibhuJawa marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| ## File Structure Conventions | ||
|
|
||
| - **Processing Stages**: `nemo_curator/stages/{modality}/` (text, image, audio, video) | ||
| - **Utilities**: `nemo_curator/utils/` (shared functionality) | ||
| - **Tests**: `tests/` (mirrors source structure with 162+ test files) | ||
| - **Documentation**: `docs/` (Sphinx-based documentation with MyST) | ||
| - **Examples**: `tutorials/` (Jupyter notebooks and example scripts) | ||
| - **Configuration**: `config/` (YAML configuration templates) | ||
|
|
||
| ### Key Directories | ||
| ``` | ||
| nemo_curator/ | ||
| βββ stages/ | ||
| β βββ text/ # Text processing pipelines | ||
| β βββ image/ # Image processing and filtering | ||
| β βββ audio/ # Audio processing capabilities | ||
| β βββ video/ # Video processing and analysis | ||
| βββ utils/ # Shared utilities and helpers | ||
| βββ datasets/ # Dataset loading and manipulation | ||
| βββ modules/ # Core processing algorithms | ||
| ``` | ||
|
|
||
| ## GPU Development Notes | ||
|
|
||
| ### Memory Management | ||
| - Use context managers for GPU memory allocation | ||
| - Implement memory pool management for large datasets | ||
| - Monitor GPU memory usage during development | ||
|
|
||
| ### CUDA Compatibility | ||
| - Ensure CUDA 12.x compatibility for all GPU code | ||
| - Test both single-GPU and multi-GPU scenarios | ||
| - Handle graceful degradation when GPU resources are unavailable | ||
|
|
||
| ### Performance Considerations | ||
| - Profile GPU kernels for performance bottlenecks | ||
| - Use async processing where possible with Ray | ||
| - Implement batch processing for efficient GPU utilization | ||
|
|
||
| ## Documentation Guidelines | ||
|
|
||
| - Update docstrings for all public APIs | ||
| - Include usage examples in docstrings | ||
| - Update relevant documentation in `docs/` for significant changes | ||
| - Use type hints and document parameter types and return values | ||
|
|
||
| This repository follows NVIDIA's coding standards and emphasizes scalable, robust data processing pipelines with optional GPU acceleration for high-performance computing environments. | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.