From 8d42be888e3efcc3159eed8e4d41b958b75ec458 Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Fri, 2 Jan 2026 10:25:00 -0500 Subject: [PATCH 1/5] sdg ray docs init Signed-off-by: Lawrence Lane --- docs/about/release-notes/index.md | 16 +- docs/curate-text/index.md | 11 + docs/curate-text/synthetic/index.md | 156 +++++++ docs/curate-text/synthetic/llm-client.md | 301 ++++++++++++++ docs/curate-text/synthetic/multilingual-qa.md | 299 +++++++++++++ .../synthetic/nemotron-cc/index.md | 280 +++++++++++++ .../synthetic/nemotron-cc/tasks.md | 393 ++++++++++++++++++ tutorials/synthetic/README.md | 116 +++++- 8 files changed, 1560 insertions(+), 12 deletions(-) create mode 100644 docs/curate-text/synthetic/index.md create mode 100644 docs/curate-text/synthetic/llm-client.md create mode 100644 docs/curate-text/synthetic/multilingual-qa.md create mode 100644 docs/curate-text/synthetic/nemotron-cc/index.md create mode 100644 docs/curate-text/synthetic/nemotron-cc/tasks.md diff --git a/docs/about/release-notes/index.md b/docs/about/release-notes/index.md index 5dd57edfc2..7492c9e141 100644 --- a/docs/about/release-notes/index.md +++ b/docs/about/release-notes/index.md @@ -190,13 +190,27 @@ graph LR For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository. +## Synthetic Data Generation + +New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs: + +- **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff +- **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts +- **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows: + - **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose + - **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training + - **Distill**: Create condensed, information-dense paraphrases preserving key concepts + - **Extract Knowledge**: Extract factual content as textbook-style passages + - **Knowledge List**: Extract structured fact lists from documents + +Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md). 
+ ## Known Limitations > (Pending Refactor in Future Release) ### Generation -- **Synthetic data generation**: Synthetic text generation features are being refactored for Ray compatibility - **Hard negative mining**: Retrieval-based data generation workflows under development ### PII diff --git a/docs/curate-text/index.md b/docs/curate-text/index.md index f8c0aa9576..c9a8c46275 100644 --- a/docs/curate-text/index.md +++ b/docs/curate-text/index.md @@ -191,6 +191,17 @@ Domain-specific processing for code and advanced curation tasks {bdg-secondary}`code-processing` ::: +:::{grid-item-card} {octicon}`sparkles;1.5em;sd-mr-1` Synthetic Data Generation +:link: synthetic/index +:link-type: doc +Generate and augment training data using LLMs ++++ +{bdg-secondary}`llm` +{bdg-secondary}`augmentation` +{bdg-secondary}`multilingual` +{bdg-secondary}`nemotron-cc` +::: + :::: diff --git a/docs/curate-text/synthetic/index.md b/docs/curate-text/synthetic/index.md new file mode 100644 index 0000000000..7112b0288d --- /dev/null +++ b/docs/curate-text/synthetic/index.md @@ -0,0 +1,156 @@ +--- +description: "Generate and augment training data using LLMs with NeMo Curator's synthetic data generation pipeline" +categories: ["workflows"] +tags: ["synthetic-data", "llm", "generation", "augmentation", "multilingual"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "intermediate" +content_type: "workflow" +modality: "text-only" +--- + +(synthetic-data-overview)= + +# Synthetic Data Generation + +NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, local vLLM servers, or other inference providers. + +## Use Cases + +- **Data Augmentation**: Expand limited datasets by generating diverse variations +- **Multilingual Generation**: Create Q&A pairs and text in multiple languages +- **Knowledge Extraction**: Convert raw text into structured knowledge formats +- **Quality Improvement**: Paraphrase low-quality text into higher-quality Wikipedia-style prose +- **Training Data Creation**: Generate instruction-following data for model fine-tuning + +## Core Concepts + +Synthetic data generation in NeMo Curator operates in two primary modes: + +### Generation Mode + +Create new data from scratch without requiring input documents. The `QAMultilingualSyntheticStage` demonstrates this pattern—it generates Q&A pairs based on a prompt template without needing seed documents. + +### Transformation Mode + +Improve or restructure existing data using LLM capabilities. The NemotronCC stages exemplify this approach, taking input documents and producing: + +- Paraphrased text in Wikipedia style +- Diverse Q&A pairs derived from document content +- Condensed knowledge distillations +- Extracted factual content + +## Architecture + +The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages: + +```{mermaid} +flowchart LR + A["Input Documents
(Parquet/JSONL)"] --> B["Preprocessing
(Tokenization,
Segmentation)"] + B --> C["LLM Generation
(OpenAI-compatible)"] + C --> D["Postprocessing
(Cleanup, Filtering)"] + D --> E["Output Dataset
(Parquet/JSONL)"] + + F["LLM Client
(NVIDIA API,
vLLM, TGI)"] -.->|"API Calls"| C + + classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000 + classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000 + classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000 + + class A,B,C,D stage + class E output + class F infra +``` + +## Prerequisites + +Before using synthetic data generation, ensure you have: + +1. **NVIDIA API Key** (for cloud endpoints) + - Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys) + - Set as environment variable: `export NVIDIA_API_KEY="your-key"` + +2. **NeMo Curator with text extras** + + ```bash + uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12] + ``` + +3. **Additional dependencies** (for NemotronCC pipelines) + + ```bash + pip install transformers # For tokenizer support + ``` + +## Available SDG Stages + +```{list-table} Synthetic Data Generation Stages +:header-rows: 1 +:widths: 30 40 30 + +* - Stage + - Purpose + - Input Type +* - `QAMultilingualSyntheticStage` + - Generate multilingual Q&A pairs + - Empty (generates from scratch) +* - `WikipediaParaphrasingStage` + - Rewrite text as Wikipedia-style prose + - Document text +* - `DiverseQAStage` + - Generate diverse Q&A pairs from documents + - Document text +* - `DistillStage` + - Create condensed, information-dense paraphrases + - Document text +* - `ExtractKnowledgeStage` + - Extract knowledge as textbook-style passages + - Document text +* - `KnowledgeListStage` + - Extract structured fact lists + - Document text +``` + +--- + +## Getting Started + +::::{grid} 1 1 2 2 +:gutter: 2 + +:::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` LLM Client Setup +:link: llm-client +:link-type: doc +Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints ++++ +{bdg-secondary}`configuration` +{bdg-secondary}`performance` +::: + +:::{grid-item-card} {octicon}`globe;1.5em;sd-mr-1` Multilingual Q&A Generation +:link: multilingual-qa +:link-type: doc +Generate synthetic Q&A pairs across multiple languages ++++ +{bdg-secondary}`quickstart` +{bdg-secondary}`tutorial` +::: + +:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` NemotronCC Pipelines +:link: nemotron-cc/index +:link-type: doc +Advanced text transformation and knowledge extraction workflows ++++ +{bdg-secondary}`advanced` +{bdg-secondary}`paraphrasing` +::: + +:::: + +```{toctree} +:hidden: +:maxdepth: 2 + +llm-client +multilingual-qa +nemotron-cc/index +``` diff --git a/docs/curate-text/synthetic/llm-client.md b/docs/curate-text/synthetic/llm-client.md new file mode 100644 index 0000000000..1b517e8230 --- /dev/null +++ b/docs/curate-text/synthetic/llm-client.md @@ -0,0 +1,301 @@ +--- +description: "Configure LLM clients for synthetic data generation with NVIDIA APIs or custom endpoints" +categories: ["how-to-guides"] +tags: ["llm-client", "openai", "nvidia-api", "configuration"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "beginner" +content_type: "how-to" +modality: "text-only" +--- + +(synthetic-llm-client)= +# LLM Client Configuration + +NeMo Curator's synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints. 
+ +## Overview + +Two client types are available: + +- **`AsyncOpenAIClient`**: Recommended for high-throughput batch processing with concurrent requests +- **`OpenAIClient`**: Synchronous client for simpler use cases or debugging + +For most SDG workloads, use `AsyncOpenAIClient` to maximize throughput. + +## Basic Configuration + +### NVIDIA API Endpoints + +```python +from nemo_curator.models.client.openai_client import AsyncOpenAIClient + +client = AsyncOpenAIClient( + api_key="your-nvidia-api-key", # Or use NVIDIA_API_KEY env var + base_url="https://integrate.api.nvidia.com/v1", + max_concurrent_requests=5, +) +``` + +### Environment Variables + +Set your API key as an environment variable to avoid hardcoding credentials: + +```bash +export NVIDIA_API_KEY="nvapi-..." +``` + +The client automatically uses `NVIDIA_API_KEY` or `OPENAI_API_KEY` if not explicitly provided. + +## Generation Parameters + +Configure LLM generation behavior using `GenerationConfig`: + +```python +from nemo_curator.models.client.llm_client import GenerationConfig + +config = GenerationConfig( + max_tokens=2048, + temperature=0.7, + top_p=0.95, + seed=42, # For reproducibility +) +``` + +```{list-table} Generation Parameters +:header-rows: 1 +:widths: 20 15 15 50 + +* - Parameter + - Type + - Default + - Description +* - `max_tokens` + - int + - 2048 + - Maximum tokens to generate per request +* - `temperature` + - float + - 0.0 + - Sampling temperature (0.0-2.0). Higher values increase randomness +* - `top_p` + - float + - 0.95 + - Nucleus sampling parameter (0.0-1.0) +* - `top_k` + - int + - None + - Top-k sampling (if supported by the endpoint) +* - `seed` + - int + - 0 + - Random seed for reproducibility +* - `stop` + - str/list + - None + - Stop sequences to end generation +* - `stream` + - bool + - False + - Enable streaming (not recommended for batch processing) +* - `n` + - int + - 1 + - Number of completions to generate per request +``` + +## Performance Tuning + +### Concurrency vs. Parallelism + +The `max_concurrent_requests` parameter controls how many API requests the client can have in-flight simultaneously. 
This interacts with Ray's distributed workers:

- **Client-level concurrency**: `max_concurrent_requests` limits concurrent API calls per worker
- **Worker-level parallelism**: Ray distributes tasks across multiple workers

```python
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)

# For local vLLM server with more capacity
client = AsyncOpenAIClient(
    base_url="http://localhost:8000/v1",
    max_concurrent_requests=16,  # Higher for local deployment
)
```

### Optimal Settings

```{list-table} Recommended Concurrency Settings
:header-rows: 1
:widths: 30 25 45

* - Endpoint Type
  - Recommended Setting
  - Notes
* - NVIDIA API (cloud)
  - 3-5
  - Respects rate limits; increase gradually
* - Local vLLM
  - 8-32
  - Depends on GPU memory and model size
* - Local TGI
  - 8-16
  - Adjust based on server configuration
```

### Retry Configuration

The client includes automatic retry with exponential backoff for transient errors:

```python
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,   # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,     # Request timeout
)
```

The retry logic handles:

- **Rate limit errors (429)**: Automatic backoff with jitter
- **Connection errors**: Retry with exponential delay
- **Transient failures**: Configurable retry attempts

## Using Custom Endpoints

````{tab-set}

```{tab-item} Local vLLM Server

Deploy a local vLLM server and configure the client:

**Start vLLM server:**
```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4
```

**Configure client:**
```python
client = AsyncOpenAIClient(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require API key by default
    max_concurrent_requests=16,
    timeout=300,  # Longer timeout for large models
)
```
```

```{tab-item} Text Generation Inference (TGI)

Deploy a TGI server and configure the client:

**Start TGI server:**
```bash
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.3-70B-Instruct
```

**Configure client:**
```python
client = AsyncOpenAIClient(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    max_concurrent_requests=8,
)
```
```

```{tab-item} OpenAI API

Use the official OpenAI API:

```python
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
    max_concurrent_requests=5,
)
```
```

````

## Complete Example

```python
import os
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ.get("NVIDIA_API_KEY"),
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
```

## Troubleshooting

### Rate Limit Errors

If you encounter frequent 429 errors:

1. Reduce `max_concurrent_requests`
2. Increase `base_delay` for longer backoff
3. Consider using a local deployment for high-volume workloads

### Connection Timeouts

For large models or slow networks:

```python
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from default 120 seconds
)
```

### Local Server Issues

If you experience connection errors with a local server:

- Check server resource utilization (GPU memory, CPU)
- Reduce concurrent requests
- Verify the server is running and accessible

---

## Next Steps

- {ref}`multilingual-qa-tutorial`: Generate multilingual Q&A pairs
- {ref}`nemotron-cc-overview`: Advanced text transformation pipelines
+
diff --git a/docs/curate-text/synthetic/multilingual-qa.md b/docs/curate-text/synthetic/multilingual-qa.md
new file mode 100644
index 0000000000..8417d63b34
--- /dev/null
+++ b/docs/curate-text/synthetic/multilingual-qa.md
@@ -0,0 +1,299 @@
---
description: "Generate multilingual Q&A pairs using LLMs with NeMo Curator's synthetic data pipeline"
categories: ["tutorials"]
tags: ["multilingual", "qa-generation", "synthetic-data", "quickstart"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "tutorial"
modality: "text-only"
---

(multilingual-qa-tutorial)=
# Generate Multilingual Q&A Data

This tutorial shows how to generate synthetic Q&A pairs across multiple languages using NeMo Curator's `QAMultilingualSyntheticStage`. You'll learn to configure an LLM client, create a generation pipeline, and optionally filter the output.

**Time to complete**: ~15 minutes

## What You'll Build

A pipeline that:

1. Generates Q&A pairs in multiple languages using an LLM
2. Optionally filters results by language
3. Writes output to JSONL format

## Prerequisites

- **NVIDIA API Key**: Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys)
- **NeMo Curator**: Installed with text extras

```bash
export NVIDIA_API_KEY="nvapi-..."
+``` + +## Quick Start + +```python +import os +from nemo_curator.core.client import RayClient +from nemo_curator.models.client.openai_client import AsyncOpenAIClient +from nemo_curator.models.client.llm_client import GenerationConfig +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage +from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter + +# Initialize Ray +client = RayClient(include_dashboard=False) +client.start() + +# Create LLM client +llm_client = AsyncOpenAIClient( + api_key=os.environ["NVIDIA_API_KEY"], + base_url="https://integrate.api.nvidia.com/v1", + max_concurrent_requests=5, +) + +# Create pipeline +pipeline = Pipeline(name="multilingual_qa") + +# Add synthetic generation stage +pipeline.add_stage( + QAMultilingualSyntheticStage( + prompt="Generate a Q&A pair about science in {language}.", + languages=["English", "French", "German", "Spanish"], + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + num_samples=50, + generation_config=GenerationConfig(temperature=0.9), + ) +) + +# Write output +pipeline.add_stage(JsonlWriter(path="./synthetic_qa/")) + +# Run pipeline +results = pipeline.run() + +client.stop() +``` + +## Step-by-Step Guide + +### Step 1: Configure the LLM Client + +The `AsyncOpenAIClient` enables concurrent API requests for efficient batch generation: + +```python +from nemo_curator.models.client.openai_client import AsyncOpenAIClient +from nemo_curator.models.client.llm_client import GenerationConfig + +llm_client = AsyncOpenAIClient( + api_key=os.environ["NVIDIA_API_KEY"], + base_url="https://integrate.api.nvidia.com/v1", + max_concurrent_requests=5, # Adjust based on rate limits + max_retries=3, # Retry on transient failures + base_delay=1.0, # Backoff delay in seconds +) + +# Configure generation parameters +generation_config = GenerationConfig( + temperature=0.9, # Higher for more diverse outputs + top_p=0.95, + max_tokens=2048, + seed=None, # None for non-deterministic generation +) +``` + +### Step 2: Define the Prompt Template + +The prompt template must include a `{language}` placeholder. The stage randomly selects a language for each sample: + +```python +# Simple Q&A prompt +prompt = "Generate a Q&A pair about science in {language}." + +# Structured prompt with language prefixes +prompt = """ +Generate a short question and a short answer in the general science domain in {language}. +Begin with the language name using the 2-letter code in square brackets, +e.g. [EN] for English, [FR] for French, [DE] for German. 
+""" +``` + +### Step 3: Create the Pipeline + +```python +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage + +pipeline = Pipeline( + name="multilingual_qa_generation", + description="Generate synthetic Q&A pairs in multiple languages", +) + +pipeline.add_stage( + QAMultilingualSyntheticStage( + prompt=prompt, + languages=["English", "French", "German", "Spanish", "Italian"], + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + num_samples=100, + generation_config=generation_config, + ) +) +``` + +### Step 4: Add Language Filtering (Optional) + +If your prompt includes language prefixes, you can filter to keep only specific languages: + +```python +from nemo_curator.stages.text.filters.doc_filter import DocumentFilter +from nemo_curator.stages.text.modules.score_filter import ScoreFilter + + +class BeginsWithLanguageFilter(DocumentFilter): + """Filter documents based on language prefix codes.""" + + def __init__(self, languages: list[str]): + self.name = "begins_with_language_filter" + self.languages = languages + + def score_document(self, text: str) -> float: + if not self.languages: + return 1.0 + return 1.0 if text.startswith(tuple(self.languages)) else 0.0 + + def keep_document(self, score: float) -> bool: + return score == 1.0 + + +# Add filter to keep only English outputs +pipeline.add_stage( + ScoreFilter( + BeginsWithLanguageFilter(languages=["[EN]"]), + text_field="text", + ), +) +``` + +### Step 5: Configure Output + +Write results to JSONL or Parquet format: + +```python +from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter +from nemo_curator.stages.text.io.writer.parquet import ParquetWriter + +# JSONL output +pipeline.add_stage(JsonlWriter(path="./output/synthetic_qa/")) + +# Or Parquet output +# pipeline.add_stage(ParquetWriter(path="./output/synthetic_qa/")) +``` + +### Step 6: Run the Pipeline + +```python +from nemo_curator.core.client import RayClient + +# Initialize Ray +client = RayClient(include_dashboard=False) +client.start() + +# Execute pipeline +print(pipeline.describe()) +results = pipeline.run() + +# Print results summary +if results: + for result in results: + if hasattr(result, "data") and result.data: + for file_path in result.data: + print(f"Generated: {file_path}") + +client.stop() +``` + +## CLI Usage + +The tutorial script supports command-line arguments: + +```bash +cd tutorials/synthetic + +# Basic usage +python synthetic_data_generation_example.py --num-samples 50 + +# Custom languages and model +python synthetic_data_generation_example.py \ + --num-samples 100 \ + --languages English French German \ + --model-name meta/llama-3.3-70b-instruct \ + --temperature 0.9 + +# Skip language filtering +python synthetic_data_generation_example.py \ + --num-samples 50 \ + --no-filter-languages +``` + +### Available Arguments + +```{list-table} +:header-rows: 1 +:widths: 25 15 60 + +* - Argument + - Default + - Description +* - `--api-key` + - env var + - NVIDIA API key (or set NVIDIA_API_KEY) +* - `--base-url` + - NVIDIA API + - Base URL for the API endpoint +* - `--model-name` + - llama-3.3-70b + - Model to use for generation +* - `--languages` + - EN, FR, DE, ES, IT + - Languages to generate Q&A pairs for +* - `--num-samples` + - 100 + - Number of samples to generate +* - `--temperature` + - 0.9 + - Sampling temperature +* - `--output-path` + - ./synthetic_output + - Output directory +* - `--no-filter-languages` + - False + - Disable language 
filtering +``` + +## Sample Output + +Generated documents contain a `text` field with the LLM response: + +```json +{"text": "[EN] Question: What causes ocean tides? Answer: Ocean tides are primarily caused by the gravitational pull of the Moon and Sun on Earth's water bodies."} +{"text": "[FR] Question: Qu'est-ce que la photosynthèse? Answer: La photosynthèse est le processus par lequel les plantes convertissent la lumière du soleil en énergie."} +{"text": "[DE] Question: Was ist der größte Planet in unserem Sonnensystem? Answer: Jupiter ist der größte Planet in unserem Sonnensystem."} +``` + +## Tips for Diverse Output + +1. **Use higher temperature** (0.7-1.0) for more varied outputs +2. **Avoid fixed seeds** for non-deterministic generation +3. **Include clear instructions** in the prompt for consistent formatting +4. **Filter post-generation** to ensure quality standards + +--- + +## Next Steps + +- {ref}`synthetic-llm-client`: Advanced client configuration and performance tuning +- {ref}`nemotron-cc-overview`: Advanced pipelines for text transformation and knowledge extraction + diff --git a/docs/curate-text/synthetic/nemotron-cc/index.md b/docs/curate-text/synthetic/nemotron-cc/index.md new file mode 100644 index 0000000000..2f2665f7f4 --- /dev/null +++ b/docs/curate-text/synthetic/nemotron-cc/index.md @@ -0,0 +1,280 @@ +--- +description: "Advanced synthetic data generation using NemotronCC pipelines for text transformation and knowledge extraction" +categories: ["workflows"] +tags: ["nemotron-cc", "paraphrasing", "knowledge-extraction", "distillation"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "advanced" +content_type: "workflow" +modality: "text-only" +--- + +(nemotron-cc-overview)= +# NemotronCC Pipelines + +NemotronCC provides advanced synthetic data generation workflows for transforming and extracting knowledge from existing text documents. Unlike simple generation, these pipelines use sophisticated preprocessing, LLM-based transformation, and postprocessing to create high-quality training data. + +## The Composable Pipeline Pattern + +NemotronCC stages follow a composable pattern with three distinct phases: + +1. **Preprocessing**: Segment documents, filter by length, and prepare inputs for the LLM +2. **Generation**: Apply task-specific prompts to transform text using the LLM +3. **Postprocessing**: Clean outputs, remove formatting artifacts, and filter low-quality results + +This separation enables fine-grained control over each phase while providing reusable helper functions for common patterns. + +## Pipeline Architecture + +```{mermaid} +flowchart TB + subgraph "Preprocessing" + A[Input Documents] --> B[Token Count Filter] + B --> C[Document Splitter] + C --> D[Segment Filter] + D --> E[Document Joiner] + end + + subgraph "LLM Generation" + E --> F[Task-Specific Stage
WikiPara/DiverseQA/Distill/etc.] + end + + subgraph "Postprocessing" + F --> G[Token Count Filter] + G --> H[Markdown Remover] + H --> I[Task-Specific Cleanup] + I --> J[Quality Filter] + end + + J --> K[Output Dataset] +``` + +## Available Tasks + +NemotronCC provides five specialized generation tasks, each designed for specific data transformation needs: + +```{list-table} NemotronCC Task Types +:header-rows: 1 +:widths: 20 25 30 25 + +* - Task + - Stage Class + - Purpose + - Use Case +* - Wikipedia Paraphrasing + - `WikipediaParaphrasingStage` + - Rewrite text as Wikipedia-style prose + - Improving noisy web data +* - Diverse QA + - `DiverseQAStage` + - Generate diverse Q&A pairs + - Reading comprehension training +* - Distill + - `DistillStage` + - Create condensed, informative paraphrases + - Knowledge distillation +* - Extract Knowledge + - `ExtractKnowledgeStage` + - Extract factual content as passages + - Knowledge base creation +* - Knowledge List + - `KnowledgeListStage` + - Extract structured fact lists + - Fact extraction +``` + +## Quality-Based Processing Strategy + +NemotronCC pipelines are designed to process data based on quality scores. The typical approach: + +### High-Quality Data Pipeline + +For documents with high quality scores, use tasks that leverage the existing quality: +- **DiverseQA**: Generate Q&A pairs from well-structured content +- **Distill**: Create condensed versions preserving key information +- **ExtractKnowledge**: Extract factual passages +- **KnowledgeList**: Extract structured facts + +```python +from nemo_curator.stages.text.modules.score_filter import Filter + +# Filter for high-quality documents (score > 11) +pipeline.add_stage( + Filter( + filter_fn=lambda x: int(x) > 11, + filter_field="quality_score", + ), +) +``` + +### Low-Quality Data Pipeline + +For documents with lower quality scores, use Wikipedia Paraphrasing to improve text quality: + +```python +# Filter for low-quality documents (score <= 11) +pipeline.add_stage( + Filter( + filter_fn=lambda x: int(x) <= 11, + filter_field="quality_score", + ), +) +``` + +## Using Helper Functions + +The recommended approach is to use the helper functions in `nemotron_cc_pipelines.py`: + +```python +from nemotron_cc_pipelines import ( + add_preprocessing_pipeline, + add_diverse_qa_postprocessing_pipeline, +) +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAStage + +pipeline = Pipeline(name="diverse_qa_pipeline") + +# Add preprocessing +pipeline = add_preprocessing_pipeline( + pipeline=pipeline, + text_field="text", + system_prompt=SYSTEM_PROMPT, + user_prompt_template=PROMPT_TEMPLATE, + min_document_tokens=30, + min_segment_tokens=30, + max_input_tokens=1000, + args=args, # Contains tokenizer config +) + +# Add generation stage +pipeline.add_stage( + DiverseQAStage( + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + generation_config=generation_config, + input_field="text", + output_field="diverse_qa", + ) +) + +# Add postprocessing +pipeline = add_diverse_qa_postprocessing_pipeline( + pipeline=pipeline, + llm_response_field="diverse_qa", + args=args, +) +``` + +## Task Configuration + +Each task has specific token count and preprocessing requirements: + +```{list-table} Task Configuration Defaults +:header-rows: 1 +:widths: 25 15 15 20 25 + +* - Task + - Min Doc Tokens + - Min Segment Tokens + - Max Input Tokens + - Max Output Tokens +* - Diverse QA + - 30 + - 30 + - 1000 + - 600 +* - Distill + - 30 + - 10 + - 
2000 + - 1600 +* - Extract Knowledge + - 30 + - 30 + - 1400 + - 1400 +* - Knowledge List + - 30 + - 30 + - 1000 + - 600 +* - Wikipedia Paraphrasing + - 5 + - 5 + - 512 + - 512 +``` + +## Quick Example + +```python +import os +from transformers import AutoTokenizer +from nemo_curator.core.client import RayClient +from nemo_curator.backends.xenna import XennaExecutor +from nemo_curator.models.client.openai_client import AsyncOpenAIClient +from nemo_curator.models.client.llm_client import GenerationConfig +from nemo_curator.pipeline import Pipeline +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAStage +from nemo_curator.stages.text.io.reader.parquet import ParquetReader +from nemo_curator.stages.text.io.writer.parquet import ParquetWriter + +# Initialize +client = RayClient(include_dashboard=False) +client.start() +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct") + +# Create LLM client +llm_client = AsyncOpenAIClient( + api_key=os.environ["NVIDIA_API_KEY"], + base_url="https://integrate.api.nvidia.com/v1", + max_concurrent_requests=5, +) + +# Build pipeline +pipeline = Pipeline(name="nemotron_cc_diverse_qa") +pipeline.add_stage(ParquetReader(file_paths=["./input_data/*.parquet"])) +# ... add preprocessing stages ... +pipeline.add_stage( + DiverseQAStage( + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + generation_config=GenerationConfig(temperature=0.5, top_p=0.9), + input_field="text", + output_field="diverse_qa", + ) +) +# ... add postprocessing stages ... +pipeline.add_stage(ParquetWriter(path="./output/")) + +# Execute +executor = XennaExecutor() +results = pipeline.run(executor) + +client.stop() +``` + +--- + +## Detailed Reference + +::::{grid} 1 +:gutter: 2 + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Task Reference +:link: tasks +:link-type: doc +Detailed reference for each NemotronCC stage, prompts, and post-processing ++++ +{bdg-secondary}`reference` +{bdg-secondary}`api` +::: + +:::: + +```{toctree} +:hidden: + +tasks +``` + diff --git a/docs/curate-text/synthetic/nemotron-cc/tasks.md b/docs/curate-text/synthetic/nemotron-cc/tasks.md new file mode 100644 index 0000000000..8c00fa843b --- /dev/null +++ b/docs/curate-text/synthetic/nemotron-cc/tasks.md @@ -0,0 +1,393 @@ +--- +description: "Reference documentation for NemotronCC synthetic data generation tasks and stages" +categories: ["reference"] +tags: ["nemotron-cc", "stages", "api-reference"] +personas: ["data-scientist-focused", "mle-focused"] +difficulty: "advanced" +content_type: "reference" +modality: "text-only" +--- + +(nemotron-cc-tasks)= +# NemotronCC Task Reference + +This reference documents each NemotronCC synthetic data generation stage, including prompt templates, configuration options, and post-processing details. + +## WikipediaParaphrasingStage + +Rewrites low-quality text in Wikipedia-style prose, improving readability and structure. + +### Purpose + +Transform noisy or poorly-written web data into high-quality, encyclopedic text suitable for training language models. 
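
As a rough illustration of the intended transformation, consider this invented before/after pair (hypothetical text, not captured model output):

```text
Input:  "ok so this phone batery lasts like 2 days tops and charges fast, totally recomend"
Output: "The phone's battery lasts up to approximately two days on a single
charge, supports fast charging, and is generally recommended by reviewers."
```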
+ +### Configuration + +```python +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import WikipediaParaphrasingStage + +stage = WikipediaParaphrasingStage( + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + generation_config=generation_config, + input_field="text", + output_field="rephrased", +) +``` + +### Prompt Template + +The stage uses a system prompt establishing the assistant persona and a user prompt requesting paraphrasing: + +```text +System: A chat between a curious user and an artificial intelligence assistant. +The assistant gives helpful, detailed, and polite answers to the questions. + +User: For the following paragraph give me a diverse paraphrase of the same in +high quality English language as in sentences on Wikipedia. Begin your answer +on a separate line with "Here is a paraphrased version:". + +Text: {document} +``` + +See the [full prompt in source](https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/synthetic/nemotron_cc/prompts.py). + +### Post-Processing + +The Wikipedia post-processing pipeline: +1. Filters by token count (max 510 tokens) +2. Removes markdown formatting +3. Validates prefix "Here is a paraphrased version:" +4. Removes the prefix from output +5. Removes quotation marks +6. Joins document segments +7. Filters documents below 50 tokens + +--- + +## DiverseQAStage + +Generates diverse question-answer pairs from document content. + +### Purpose + +Create reading comprehension training data with varied question types and cognitive complexity levels. + +### Configuration + +```python +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAStage + +stage = DiverseQAStage( + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + generation_config=generation_config, + input_field="text", + output_field="diverse_qa", +) +``` + +### Prompt Template + +The stage requests up to 8 diverse Q&A pairs with specific formatting: + +```text +Task: Read the text, ask questions and answer them. + +Follow these instructions: +1. Ask diverse questions that require different cognitive skills +2. Ask questions in various forms: + - Yes/No questions + - Open-ended questions (what, how, when, where, why, who) + - Multi-choice questions with options + - Comparison questions + - Reading comprehension questions + - Problem-solving questions +3. Focus on factual information and key concepts +4. Use clear and concise language +5. Use plain text (no Markdown) +6. Format: Question: [question] Answer: [answer] + +Text: {document} +``` + +### Post-Processing with DiverseQAPostProcessingStage + +The `DiverseQAPostProcessingStage` performs specialized parsing: + +```python +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DiverseQAPostProcessingStage + +post_stage = DiverseQAPostProcessingStage( + input_field="text", + qa_field="diverse_qa", + tokenizer=tokenizer, # For length-based sampling + prefix="Here are the questions and answers based on the provided text:", + max_num_pairs=10, +) +``` + +**Post-processing logic:** +1. Parse Q&A pairs from bullet-formatted output +2. Merge question and answer lines +3. Shuffle pairs randomly +4. Sample pairs based on input document length (using tokenizer) +5. 
Concatenate original document with selected Q&A pairs + +The number of Q&A pairs sampled is proportional to input length: +```python +num_pairs = random.randint(1, max(1, int(max_num_pairs * num_tokens / 150))) +``` + +--- + +## DistillStage + +Creates condensed, information-dense paraphrases while preserving key concepts. + +### Purpose + +Generate training data that captures essential knowledge in a more accessible format, suitable for knowledge distillation. + +### Configuration + +```python +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import DistillStage + +stage = DistillStage( + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + generation_config=generation_config, + input_field="text", + output_field="distill", +) +``` + +### Prompt Template + +```text +System: You are an artificial intelligence assistant. You carefully provide +accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. + +User: Your task is to read and paraphrase the provided text following these instructions: +- Create a condensed but accurate and informative version +- Preserve crucial information, key concepts, important values, factual details +- Retain technical terms and specialized vocabulary +- Retain examples and explanations of reasoning +- Only include information present in the original text +- Write in plain text without formatting + +Text: {document} + +Task: Paraphrase in high-quality English. Begin with "Paraphrased Text:". +``` + +### Post-Processing + +1. Filter by token count (max 1598 tokens) +2. Remove markdown formatting +3. Validate "Paraphrased Text:" prefix +4. Remove the prefix +5. Remove quotation marks +6. Filter documents below 50 tokens + +--- + +## ExtractKnowledgeStage + +Extracts and rewrites knowledge as textbook-style passages. + +### Purpose + +Convert raw text into educational-quality passages organized by domain, suitable for building knowledge bases. + +### Configuration + +```python +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import ExtractKnowledgeStage + +stage = ExtractKnowledgeStage( + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + generation_config=generation_config, + input_field="text", + output_field="extract_knowledge", +) +``` + +### Prompt Template + +```text +Your task is to rewrite knowledge from the provided text following these instructions: +- Rewrite as passages using easy-to-understand, high-quality English + like sentences in textbooks and Wikipedia +- Focus on content in disciplines: humanities, social sciences, natural sciences, + technology, engineering, math, law, business, management, art, education, + agricultural sciences, politics, and history +- Disregard content without useful facts or knowledge +- Retain examples and supporting evidence +- Do not add or alter details +- Write in plain text +- Do not add titles or comments + +Text: {document} + +Task: Rewrite facts and knowledge as passages following the instructions. +``` + +### Post-Processing + +1. Filter by token count (max 1398 tokens) +2. Remove markdown formatting +3. Remove passage labels ("Passage:", "Passage 1:", etc.) +4. Filter documents below 50 tokens + +--- + +## KnowledgeListStage + +Extracts structured fact lists from documents. + +### Purpose + +Generate bullet-pointed factual content for structured knowledge extraction. 
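
For intuition, an invented sample of the list format this task targets (hypothetical output, not from the model):

```text
- The Amazon rainforest spans roughly 5.5 million square kilometers.
- About 60% of the rainforest lies within Brazil.
- It hosts an estimated 390 billion individual trees.
```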
+ +### Configuration + +```python +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import KnowledgeListStage + +stage = KnowledgeListStage( + client=llm_client, + model_name="meta/llama-3.3-70b-instruct", + generation_config=generation_config, + input_field="text", + output_field="knowledge_list", +) +``` + +### Prompt Template + +```text +Review the text and extract the key information. Follow these instructions: +- Provide a concise and organized list of factual information +- Include concrete details, key concepts, and important statistics +- Ensure each point is clear, specific, and supported by the original text +- Ensure extracted text is information-dense +- Do not add titles or headings + +Text: {document} + +Task: Extract factual information, concrete details, and key concepts. +``` + +### Post-Processing with KnowledgeListPostProcessingStage + +```python +from nemo_curator.stages.synthetic.nemotron_cc.nemotron_cc import KnowledgeListPostProcessingStage + +post_stage = KnowledgeListPostProcessingStage( + input_field="knowledge_list", +) +``` + +**Post-processing logic:** +1. Remove leading bullet markers ("- ") +2. Normalize indentation +3. Join lines with newlines + +--- + +## Customizing Prompts + +To use custom prompts while maintaining NemotronCC infrastructure, subclass `BaseSyntheticStage`: + +```python +from dataclasses import dataclass +from nemo_curator.stages.synthetic.nemotron_cc.base import BaseSyntheticStage + + +@dataclass +class CustomSyntheticStage(BaseSyntheticStage): + system_prompt: str = "You are a helpful assistant specialized in..." + prompt: str = """Your custom prompt template here. + +Text: {document} + +Instructions: ...""" + input_field: str = "text" + output_field: str = "custom_output" + + @property + def name(self) -> str: + return "CustomSyntheticStage" +``` + +The `{document}` placeholder is replaced with the content from `input_field`. 
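
A stage defined this way plugs into a pipeline like the built-in tasks. The following is a minimal usage sketch, assuming `llm_client`, `generation_config`, and `pipeline` are configured as in the earlier NemotronCC quick example:

```python
# Minimal sketch: a custom stage drops in like any built-in NemotronCC stage.
# `llm_client`, `generation_config`, and `pipeline` are assumed to be set up
# as shown in the earlier NemotronCC quick example.
pipeline.add_stage(
    CustomSyntheticStage(
        client=llm_client,
        model_name="meta/llama-3.3-70b-instruct",
        generation_config=generation_config,
        input_field="text",
        output_field="custom_output",
    )
)
```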
+ +--- + +## Complete Configuration Example + +```python +TASK_CONFIG = { + "diverse_qa": { + "system_prompt": NEMOTRON_CC_SYSTEM_PROMPT, + "prompt_template": DIVERSE_QA_PROMPT_TEMPLATE, + "min_document_tokens": 30, + "min_segment_tokens": 30, + "max_input_tokens": 1000, + "max_output_tokens": 600, + }, + "distill": { + "system_prompt": NEMOTRON_CC_DISTILL_SYSTEM_PROMPT, + "prompt_template": DISTILL_PROMPT_TEMPLATE, + "min_document_tokens": 30, + "min_segment_tokens": 10, + "max_input_tokens": 2000, + "max_output_tokens": 1600, + }, + "extract_knowledge": { + "system_prompt": NEMOTRON_CC_SYSTEM_PROMPT, + "prompt_template": EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE, + "min_document_tokens": 30, + "min_segment_tokens": 30, + "max_input_tokens": 1400, + "max_output_tokens": 1400, + }, + "knowledge_list": { + "system_prompt": NEMOTRON_CC_SYSTEM_PROMPT, + "prompt_template": KNOWLEDGE_LIST_PROMPT_TEMPLATE, + "min_document_tokens": 30, + "min_segment_tokens": 30, + "max_input_tokens": 1000, + "max_output_tokens": 600, + }, + "wikipedia_paraphrasing": { + "system_prompt": NEMOTRON_CC_SYSTEM_PROMPT, + "prompt_template": WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE, + "min_document_tokens": 5, + "min_segment_tokens": 5, + "max_input_tokens": 512, + "max_output_tokens": 512, + }, +} + +GENERATION_CONFIG = { + "MAX_INPUT_TOKENS": 2000, + "MAX_OUTPUT_TOKENS": 1600, + "TOP_K": 0, + "TOP_P": 0.9, + "TEMPERATURE": 0.5, +} +``` + +--- + +## Source Code References + +- **Prompts**: [`nemo_curator/stages/synthetic/nemotron_cc/prompts.py`](https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/synthetic/nemotron_cc/prompts.py) +- **Stages**: [`nemo_curator/stages/synthetic/nemotron_cc/nemotron_cc.py`](https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/synthetic/nemotron_cc/nemotron_cc.py) +- **Base Class**: [`nemo_curator/stages/synthetic/nemotron_cc/base.py`](https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/synthetic/nemotron_cc/base.py) +- **Pipeline Helpers**: [`tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py`](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py) + diff --git a/tutorials/synthetic/README.md b/tutorials/synthetic/README.md index 7765d169c3..c96a9b895e 100644 --- a/tutorials/synthetic/README.md +++ b/tutorials/synthetic/README.md @@ -1,36 +1,130 @@ # Synthetic Data Generation Tutorials -Hands-on tutorials for generating synthetic data with NeMo Curator. Complete working examples with detailed explanations. +Hands-on tutorials for generating synthetic data with NeMo Curator using Ray-based distributed processing. +## Documentation + +For comprehensive documentation, refer to the [Synthetic Data Generation Guide](../../docs/curate-text/synthetic/index.md). ## Getting Started ### Prerequisites -To run these tutorials, you'll need an NVIDIA API key. You can obtain one from: -- **NVIDIA Build**: https://build.nvidia.com/settings/api-keys +- **NVIDIA API Key**: Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys) +- **NeMo Curator**: Installed with text extras (`pip install nemo-curator[text_cuda12]`) ### Setup -Set your API key as an environment variable: - ```bash export NVIDIA_API_KEY="your-api-key-here" ``` -Alternatively, you can pass it directly using the `--api-key` argument when running the examples. 
+## Available Tutorials + +| Tutorial | Description | Difficulty | +|----------|-------------|------------| +| [Multilingual Q&A](synthetic_data_generation_example.py) | Generate Q&A pairs in multiple languages | Beginner | +| [NemotronCC High-Quality](nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py) | Advanced SDG for high-quality data (DiverseQA, Distill, ExtractKnowledge, KnowledgeList) | Advanced | +| [NemotronCC Low-Quality](nemotron_cc/nemotron_cc_sdg_low_quality_example_pipeline.py) | Improve low-quality data via Wikipedia-style paraphrasing | Advanced | -### Quick Example +## Quick Examples + +### Basic Multilingual Q&A ```bash # Generate 20 synthetic Q&A pairs in multiple languages python synthetic_data_generation_example.py --num-samples 20 + +# Customize languages and disable filtering +python synthetic_data_generation_example.py \ + --num-samples 50 \ + --languages English French German Spanish \ + --no-filter-languages ``` +### NemotronCC Pipelines -## Available Tutorials +```bash +# Run DiverseQA pipeline with mock data (requires tokenizer access) +python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \ + --task diverse_qa \ + --tokenizer meta-llama/Llama-3.3-70B-Instruct \ + --mock + +# Run Distill pipeline +python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \ + --task distill \ + --tokenizer meta-llama/Llama-3.3-70B-Instruct \ + --mock + +# Run Wikipedia Paraphrasing for low-quality data +python nemotron_cc/nemotron_cc_sdg_low_quality_example_pipeline.py \ + --tokenizer meta-llama/Llama-3.3-70B-Instruct \ + --mock +``` + +### Using Real Data + +```bash +# Process Parquet input files +python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \ + --task diverse_qa \ + --tokenizer meta-llama/Llama-3.3-70B-Instruct \ + --input-parquet-path ./my_data/*.parquet \ + --output-path ./synthetic_output \ + --output-format parquet +``` + +## Command-Line Arguments + +### Common Arguments + +| Argument | Default | Description | +|----------|---------|-------------| +| `--api-key` | env var | NVIDIA API key | +| `--base-url` | NVIDIA API | Base URL for API endpoint | +| `--model-name` | llama-3.3-70b | Model to use for generation | +| `--output-path` | ./synthetic_output | Output directory | +| `--max-concurrent-requests` | 3 | Concurrent API requests | +| `--temperature` | 0.9 (QA) / 0.5 (NemotronCC) | Sampling temperature | + +### NemotronCC-Specific Arguments + +| Argument | Default | Description | +|----------|---------|-------------| +| `--task` | diverse_qa | Task type (diverse_qa, distill, extract_knowledge, knowledge_list) | +| `--tokenizer` | required | HuggingFace tokenizer name | +| `--mock` | False | Use built-in test data | +| `--input-parquet-path` | None | Input Parquet file path/glob | +| `--output-format` | parquet | Output format (jsonl, parquet) | + +## Example Output + +### Multilingual Q&A + +```json +{"text": "[EN] Question: What causes ocean tides? Answer: Ocean tides are primarily caused by the gravitational pull of the Moon and Sun on Earth's water bodies."} +{"text": "[FR] Question: Qu'est-ce que la photosynthèse? Answer: La photosynthèse est le processus par lequel les plantes convertissent la lumière du soleil en énergie."} +``` + +### DiverseQA + +The output contains the original text followed by generated Q&A pairs: + +```text +The Amazon rainforest contains an unparalleled diversity of plant and animal species... + +Question: What makes the Amazon rainforest unique in terms of biodiversity? 
+Answer: The Amazon rainforest contains an unparalleled diversity of plant and animal species. + +Question: True or False: The Amazon rainforest has limited species diversity. +Answer: False. The Amazon rainforest contains an unparalleled diversity of species. +``` + +--- -| Tutorial | Description | Files | -|----------|-------------|-------| -| **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` | +## Additional Resources +- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md) +- [NemotronCC Pipeline Documentation](../../docs/curate-text/synthetic/nemotron-cc/index.md) +- [Task Reference](../../docs/curate-text/synthetic/nemotron-cc/tasks.md) From bccea95616d39df573d1dcb8471e660fedc56b58 Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Fri, 2 Jan 2026 10:48:58 -0500 Subject: [PATCH 2/5] header, tab fixes Signed-off-by: Lawrence Lane --- docs/curate-text/synthetic/index.md | 2 +- docs/curate-text/synthetic/llm-client.md | 16 ++++++++-------- docs/curate-text/synthetic/nemotron-cc/index.md | 4 ++++ docs/index.md | 1 + 4 files changed, 14 insertions(+), 9 deletions(-) diff --git a/docs/curate-text/synthetic/index.md b/docs/curate-text/synthetic/index.md index 7112b0288d..cf1d81e086 100644 --- a/docs/curate-text/synthetic/index.md +++ b/docs/curate-text/synthetic/index.md @@ -112,7 +112,7 @@ Before using synthetic data generation, ensure you have: --- -## Getting Started +## Topics ::::{grid} 1 1 2 2 :gutter: 2 diff --git a/docs/curate-text/synthetic/llm-client.md b/docs/curate-text/synthetic/llm-client.md index 1b517e8230..7a30e7307b 100644 --- a/docs/curate-text/synthetic/llm-client.md +++ b/docs/curate-text/synthetic/llm-client.md @@ -166,9 +166,9 @@ The retry logic handles: ## Using Custom Endpoints -````{tab-set} +::::{tab-set} -```{tab-item} Local vLLM Server +:::{tab-item} Local vLLM Server Deploy a local vLLM server and configure the client: @@ -189,9 +189,9 @@ client = AsyncOpenAIClient( timeout=300, # Longer timeout for large models ) ``` -``` +::: -```{tab-item} Text Generation Inference (TGI) +:::{tab-item} Text Generation Inference (TGI) Deploy a TGI server and configure the client: @@ -210,9 +210,9 @@ client = AsyncOpenAIClient( max_concurrent_requests=8, ) ``` -``` +::: -```{tab-item} OpenAI API +:::{tab-item} OpenAI API Use the official OpenAI API: @@ -223,9 +223,9 @@ client = AsyncOpenAIClient( max_concurrent_requests=5, ) ``` -``` +::: -```` +:::: ## Complete Example diff --git a/docs/curate-text/synthetic/nemotron-cc/index.md b/docs/curate-text/synthetic/nemotron-cc/index.md index 2f2665f7f4..ca73912d07 100644 --- a/docs/curate-text/synthetic/nemotron-cc/index.md +++ b/docs/curate-text/synthetic/nemotron-cc/index.md @@ -124,6 +124,10 @@ pipeline.add_stage( The recommended approach is to use the helper functions in `nemotron_cc_pipelines.py`: +:::{note} +The `nemotron_cc_pipelines` helper functions are provided in the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py), not as part of the installed package. Copy this file to your project or reference the patterns when building custom pipelines. 
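
For example, you can vendor the module next to your own script (the URL below follows GitHub's raw-content pattern for the file linked above):

```bash
# Copy the tutorial helper module into the current project directory.
# The raw.githubusercontent.com URL mirrors the repository path linked above.
curl -O https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py
```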
+::: + ```python from nemotron_cc_pipelines import ( add_preprocessing_pipeline, diff --git a/docs/index.md b/docs/index.md index 25227c1ce3..fc4390ebf3 100644 --- a/docs/index.md +++ b/docs/index.md @@ -223,6 +223,7 @@ curate-text/index.md Tutorials Load Data Process Data +Synthetic Data :::: ::::{toctree} From abd22090bb8298d139f171f5345fbb49bc9655b9 Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Fri, 2 Jan 2026 11:00:27 -0500 Subject: [PATCH 3/5] style guide Signed-off-by: Lawrence Lane --- docs/curate-text/synthetic/multilingual-qa.md | 2 +- docs/curate-text/synthetic/nemotron-cc/tasks.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/curate-text/synthetic/multilingual-qa.md b/docs/curate-text/synthetic/multilingual-qa.md index 8417d63b34..e1b72779f2 100644 --- a/docs/curate-text/synthetic/multilingual-qa.md +++ b/docs/curate-text/synthetic/multilingual-qa.md @@ -116,7 +116,7 @@ prompt = "Generate a Q&A pair about science in {language}." prompt = """ Generate a short question and a short answer in the general science domain in {language}. Begin with the language name using the 2-letter code in square brackets, -e.g. [EN] for English, [FR] for French, [DE] for German. +for example, [EN] for English, [FR] for French, [DE] for German. """ ``` diff --git a/docs/curate-text/synthetic/nemotron-cc/tasks.md b/docs/curate-text/synthetic/nemotron-cc/tasks.md index 8c00fa843b..cb4b0e5069 100644 --- a/docs/curate-text/synthetic/nemotron-cc/tasks.md +++ b/docs/curate-text/synthetic/nemotron-cc/tasks.md @@ -50,7 +50,7 @@ on a separate line with "Here is a paraphrased version:". Text: {document} ``` -See the [full prompt in source](https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/synthetic/nemotron_cc/prompts.py). +Refer to the [full prompt in source](https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/synthetic/nemotron_cc/prompts.py). ### Post-Processing @@ -89,7 +89,7 @@ stage = DiverseQAStage( ### Prompt Template -The stage requests up to 8 diverse Q&A pairs with specific formatting: +The stage requests up to eight diverse Q&A pairs with specific formatting: ```text Task: Read the text, ask questions and answer them. From 91f5f9ae8c942c1db648c079a01b35ad7ee9f250 Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Fri, 2 Jan 2026 11:04:33 -0500 Subject: [PATCH 4/5] release notes change, bump version Signed-off-by: Lawrence Lane --- docs/about/release-notes/index.md | 203 +----------------------------- docs/conf.py | 4 +- docs/versions1.json | 5 + 3 files changed, 8 insertions(+), 204 deletions(-) diff --git a/docs/about/release-notes/index.md b/docs/about/release-notes/index.md index 7492c9e141..91487bc6f7 100644 --- a/docs/about/release-notes/index.md +++ b/docs/about/release-notes/index.md @@ -12,184 +12,6 @@ modality: "universal" # NeMo Curator Release Notes: {{ current_release }} -This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](../../curate-video/index.md) and [audio](../../curate-audio/index.md) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads. - -**Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide ` for step-by-step instructions and the {ref}`Migration FAQ ` for common questions. 
- -## Installation Updates - -- **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`) -- **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support -- **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management -- **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality: - - ```{list-table} Available Installation Extras - :header-rows: 1 - :widths: 25 35 40 - - * - Extra - - Installation Command - - Description - * - **All Modalities** - - `nemo-curator[all]` - - Complete installation with all modalities and GPU support - * - **Text Curation** - - `nemo-curator[text_cuda12]` - - GPU-accelerated text processing with RAPIDS - * - **Image Curation** - - `nemo-curator[image_cuda12]` - - Image processing with NVIDIA DALI - * - **Audio Curation** - - `nemo-curator[audio_cuda12]` - - Speech recognition with NeMo ASR models - * - **Video Curation** - - `nemo-curator[video_cuda12]` - - Video processing with GPU acceleration - * - **Basic GPU** - - `nemo-curator[cuda12]` - - CUDA utilities without modality-specific dependencies - ``` - - All GPU installations require the NVIDIA PyPI index: - ```bash - uv pip install https://pypi.nvidia.com nemo-curator[EXTRA] - ``` - -## New Modalities - -### Video - -NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities: - -- **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction -- **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal -- **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement -- **Embedding generation**: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings -- **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions -- **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md) - -### Audio - -New [audio curation capabilities](../../curate-audio/index.md) for speech data processing: - -- **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models -- **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation -- **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second) -- **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage` -- **Manifest support**: JSONL manifest format for audio file management - -## Modality Refactors - 
-### Text
-
-- **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md)
-- **Improved model-based classifier throughput**: Better overlapping of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization
-- **Task-centric architecture**: New `Task`-based processing model for finer-grained control
-- **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification
-
-### Image
-
-- **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
-- **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
-- **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md)
-- **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores
-
-Learn more about [image curation](../../curate-images/index.md).
-
-## Deduplication Improvements
-
-Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
-
-- **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities
-- **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows
-- **New ranking strategies**: Added `RankingStrategy`, which ranks elements within cluster centers to decide which points to keep during duplicate removal, supporting [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs
-
-## Core Refactors
-
-The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
-
-```{mermaid}
-graph LR
-    subgraph "User Layer"
-        P[Pipeline]
-        S1[ProcessingStage X→Y]
-        S2[ProcessingStage Y→Z]
-        S3[ProcessingStage Z→W]
-        R[Resources<br/>CPU/GPU/NVDEC/NVENC]
-    end
-
-    subgraph "Orchestration Layer"
-        BE[BaseExecutor Interface]
-    end
-
-    subgraph "Backend Layer"
-        XE[XennaExecutor<br/>Production Ready]
-        RAP[RayActorPoolExecutor<br/>Experimental]
-        RDE[RayDataExecutor<br/>Experimental]
-    end
-
-    subgraph "Adaptation Layer"
-        XA[Xenna Adapter]
-        RAPA[Ray Actor Adapter]
-        RDA[Ray Data Adapter]
-    end
-
-    subgraph "Execution Layer"
-        X[Cosmos-Xenna<br/>Streaming/Batch]
-        RAY1[Ray Actor Pool<br/>Load Balancing]
-        RAY2[Ray Data API<br/>Dataset Processing]
-    end
-
-    P --> S1
-    P --> S2
-    P --> S3
-    S1 -.-> R
-    S2 -.-> R
-    S3 -.-> R
-
-    P --> BE
-    BE --> XE
-    BE --> RAP
-    BE --> RDE
-
-    XE --> XA
-    RAP --> RAPA
-    RDE --> RDA
-
-    XA --> X
-    RAPA --> RAY1
-    RDA --> RAY2
-
-    style XE fill:#90EE90
-    style RAP fill:#FFE4B5
-    style RDE fill:#FFE4B5
-    style P fill:#E6F3FF
-    style BE fill:#F0F8FF
-```
-
-### Pipelines
-
-- **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface
-- **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md)
-- **Resource specification**: Configurable CPU and GPU memory requirements per stage
-- **Stage composition**: Improved stage validation and execution orchestration
-
-### Stages
-
-- **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety
-- **Resource requirements**: Built-in resource specification for CPU and GPU memory
-- **Backend adapters**: Stage adaptation layer for different Ray orchestration systems
-- **Input/output validation**: Enhanced type checking and data validation
-
-## Tutorials
-
-- **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API
-- **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend
-- **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio)
-- **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video)
-
-For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository.
-
 ## Synthetic Data Generation
 
 New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:
@@ -205,35 +27,12 @@ New Ray-based synthetic data generation capabilities for creating and augmenting
 
 Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).
 
-## Known Limitations
-
-> (Pending Refactor in Future Release)
-
-### Generation
-
-- **Hard negative mining**: Retrieval-based data generation workflows under development
-
-### PII
-
-- **PII processing**: Personally Identifiable Information removal tools are being updated for Ray backend
-- **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development
-
-### Blending & Shuffling
-
-- **Data blending**: Multi-source dataset blending functionality being refactored
-- **Dataset shuffling**: Large-scale data shuffling operations under development
-
-## Docs Refactor
-
-- **Local preview capability**: Improved documentation build system with local preview support
-- **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md))
-- **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples
 
 ---
 
 ## What's Next
 
-The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.
+The next release will focus on ...
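To make the `ProcessingStage[X, Y]` / `Pipeline` redesign described in the Core Refactors hunk concrete, the following is a minimal, self-contained sketch of the shape of that model. It is an illustration only, not the `nemo_curator` API: every class body, default value, and the sequential `run` loop are assumptions for exposition, with only the names `ProcessingStage`, `Pipeline`, `Resources`, and `add_stage` taken from the notes above.

```python
from dataclasses import dataclass, field
from typing import Generic, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


@dataclass
class Resources:
    """Per-stage resource request (the notes also mention NVDEC/NVENC)."""
    cpus: float = 1.0
    gpus: float = 0.0


class ProcessingStage(Generic[X, Y]):
    """Typed stage: consumes a task of type X, produces a task of type Y."""
    resources: Resources = Resources()

    def process(self, task: X) -> Y:
        raise NotImplementedError


@dataclass
class DocumentBatch:
    """A 'Task': the unit of data that flows between stages."""
    texts: list[str] = field(default_factory=list)


class LowercaseStage(ProcessingStage[DocumentBatch, DocumentBatch]):
    resources = Resources(cpus=0.5)

    def process(self, task: DocumentBatch) -> DocumentBatch:
        return DocumentBatch([t.lower() for t in task.texts])


class Pipeline:
    """Composes stages; this toy runs them in-process, sequentially."""

    def __init__(self) -> None:
        self.stages: list[ProcessingStage] = []

    def add_stage(self, stage: ProcessingStage) -> "Pipeline":
        self.stages.append(stage)
        return self

    def run(self, task):
        for stage in self.stages:
            task = stage.process(task)
        return task


pipeline = Pipeline()
pipeline.add_stage(LowercaseStage())
print(pipeline.run(DocumentBatch(["Hello WORLD"])).texts)  # ['hello world']
```

In the real system, an executor backend (Xenna or one of the Ray executors) would replace the sequential loop, scheduling each stage across workers according to its declared `Resources`.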
```{toctree} :hidden: diff --git a/docs/conf.py b/docs/conf.py index 02e763b92e..a619cfc8b5 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -29,7 +29,7 @@ project = "NeMo-Curator" project_copyright = "2025, NVIDIA Corporation" author = "NVIDIA Corporation" -release = "25.09" +release = "26.02" # -- General configuration --------------------------------------------------- # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration @@ -122,7 +122,7 @@ "min_python_version": "3.8", "recommended_cuda": "12.0+", "current_release": release, - "container_version": "25.09", + "container_version": "26.02", } # Enable figure numbering diff --git a/docs/versions1.json b/docs/versions1.json index 9fd5dcd52d..d9f09cf338 100644 --- a/docs/versions1.json +++ b/docs/versions1.json @@ -1,6 +1,11 @@ [ { "preferred": true, + "version": "26.02", + "url": "https://docs.nvidia.com/nemo/curator/26.02/" + }, + { + "preferred": false, "version": "25.09", "url": "https://docs.nvidia.com/nemo/curator/25.09/" }, From 9a29ce7ba55372818722791457eb25f286f5c431 Mon Sep 17 00:00:00 2001 From: Lawrence Lane Date: Fri, 2 Jan 2026 15:43:03 -0500 Subject: [PATCH 5/5] feedback Signed-off-by: Lawrence Lane --- docs/curate-text/synthetic/llm-client.md | 96 ++----------------- docs/curate-text/synthetic/multilingual-qa.md | 2 +- .../synthetic/nemotron-cc/index.md | 2 +- tutorials/synthetic/README.md | 38 ++------ 4 files changed, 18 insertions(+), 120 deletions(-) diff --git a/docs/curate-text/synthetic/llm-client.md b/docs/curate-text/synthetic/llm-client.md index 7a30e7307b..b626bc6529 100644 --- a/docs/curate-text/synthetic/llm-client.md +++ b/docs/curate-text/synthetic/llm-client.md @@ -118,32 +118,6 @@ client = AsyncOpenAIClient( base_url="https://integrate.api.nvidia.com/v1", max_concurrent_requests=3, # Conservative for cloud APIs ) - -# For local vLLM server with more capacity -client = AsyncOpenAIClient( - base_url="http://localhost:8000/v1", - max_concurrent_requests=16, # Higher for local deployment -) -``` - -### Optimal Settings - -```{list-table} Recommended Concurrency Settings -:header-rows: 1 -:widths: 30 25 45 - -* - Endpoint Type - - Recommended Setting - - Notes -* - NVIDIA API (cloud) - - 3-5 - - Respects rate limits; increase gradually -* - Local vLLM - - 8-32 - - Depends on GPU memory and model size -* - Local TGI - - 8-16 - - Adjust based on server configuration ``` ### Retry Configuration @@ -164,68 +138,25 @@ The retry logic handles: - **Connection errors**: Retry with exponential delay - **Transient failures**: Configurable retry attempts -## Using Custom Endpoints - -::::{tab-set} - -:::{tab-item} Local vLLM Server - -Deploy a local vLLM server and configure the client: - -**Start vLLM server:** -```bash -vllm serve meta-llama/Llama-3.3-70B-Instruct \ - --host 0.0.0.0 \ - --port 8000 \ - --tensor-parallel-size 4 -``` - -**Configure client:** -```python -client = AsyncOpenAIClient( - base_url="http://localhost:8000/v1", - api_key="not-needed", # vLLM doesn't require API key by default - max_concurrent_requests=16, - timeout=300, # Longer timeout for large models -) -``` -::: - -:::{tab-item} Text Generation Inference (TGI) - -Deploy a TGI server and configure the client: +## Using Other OpenAI-Compatible Endpoints -**Start TGI server:** -```bash -docker run --gpus all -p 8080:80 \ - ghcr.io/huggingface/text-generation-inference:latest \ - --model-id meta-llama/Llama-3.3-70B-Instruct -``` +The `AsyncOpenAIClient` works with any OpenAI-compatible API endpoint. 
Simply configure the `base_url` and `api_key` parameters: -**Configure client:** ```python +# OpenAI API client = AsyncOpenAIClient( - base_url="http://localhost:8080/v1", - api_key="not-needed", - max_concurrent_requests=8, + base_url="https://api.openai.com/v1", + api_key="sk-...", # Or set OPENAI_API_KEY env var + max_concurrent_requests=5, ) -``` -::: -:::{tab-item} OpenAI API - -Use the official OpenAI API: - -```python +# Any OpenAI-compatible endpoint client = AsyncOpenAIClient( - base_url="https://api.openai.com/v1", - api_key="sk-...", # Or set OPENAI_API_KEY env var + base_url="http://your-endpoint/v1", + api_key="your-api-key", max_concurrent_requests=5, ) ``` -::: - -:::: ## Complete Example @@ -277,7 +208,7 @@ If you encounter frequent 429 errors: ### Connection Timeouts -For large models or slow networks: +For slow networks or high-latency endpoints: ```python client = AsyncOpenAIClient( base_url="...", @@ -285,13 +216,6 @@ client = AsyncOpenAIClient( ) ``` -### Local Server Issues - -If experiencing connection errors with local servers: -- Check server resource utilization (GPU memory, CPU) -- Reduce concurrent requests -- Verify the server is running and accessible - --- ## Next Steps diff --git a/docs/curate-text/synthetic/multilingual-qa.md b/docs/curate-text/synthetic/multilingual-qa.md index e1b72779f2..fa16fb1a64 100644 --- a/docs/curate-text/synthetic/multilingual-qa.md +++ b/docs/curate-text/synthetic/multilingual-qa.md @@ -254,7 +254,7 @@ python synthetic_data_generation_example.py \ - NVIDIA API - Base URL for the API endpoint * - `--model-name` - - llama-3.3-70b + - meta/llama-3.3-70b-instruct - Model to use for generation * - `--languages` - EN, FR, DE, ES, IT diff --git a/docs/curate-text/synthetic/nemotron-cc/index.md b/docs/curate-text/synthetic/nemotron-cc/index.md index ca73912d07..d5577bc4c7 100644 --- a/docs/curate-text/synthetic/nemotron-cc/index.md +++ b/docs/curate-text/synthetic/nemotron-cc/index.md @@ -125,7 +125,7 @@ pipeline.add_stage( The recommended approach is to use the helper functions in `nemotron_cc_pipelines.py`: :::{note} -The `nemotron_cc_pipelines` helper functions are provided in the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py), not as part of the installed package. Copy this file to your project or reference the patterns when building custom pipelines. +The `nemotron_cc_pipelines` helper functions are provided in the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/synthetic/nemotron_cc/nemotron_cc_pipelines.py), not as part of the installed package. Copy the `nemotron_cc_pipelines.py` file to your project or reference the patterns when building custom pipelines. 
 :::
 
 ```python
diff --git a/tutorials/synthetic/README.md b/tutorials/synthetic/README.md
index c96a9b895e..19b6b3a026 100644
--- a/tutorials/synthetic/README.md
+++ b/tutorials/synthetic/README.md
@@ -45,19 +45,13 @@ python synthetic_data_generation_example.py \
 ### NemotronCC Pipelines
 
 ```bash
-# Run DiverseQA pipeline with mock data (requires tokenizer access)
+# High-quality processing: Run any task (diverse_qa, distill, extract_knowledge, knowledge_list)
 python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \
     --task diverse_qa \
     --tokenizer meta-llama/Llama-3.3-70B-Instruct \
     --mock
 
-# Run Distill pipeline
-python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \
-    --task distill \
-    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
-    --mock
-
-# Run Wikipedia Paraphrasing for low-quality data
+# Low-quality processing: Wikipedia-style paraphrasing to improve text quality
 python nemotron_cc/nemotron_cc_sdg_low_quality_example_pipeline.py \
     --tokenizer meta-llama/Llama-3.3-70B-Instruct \
     --mock
@@ -77,27 +71,17 @@ python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \
 
 ## Command-Line Arguments
 
-### Common Arguments
+Refer to each script's `--help` output for the complete list of available arguments. The most common arguments are:
 
 | Argument | Default | Description |
 |----------|---------|-------------|
 | `--api-key` | env var | NVIDIA API key |
 | `--base-url` | NVIDIA API | Base URL for API endpoint |
-| `--model-name` | llama-3.3-70b | Model to use for generation |
+| `--model-name` | meta/llama-3.3-70b-instruct | Model to use for generation |
 | `--output-path` | ./synthetic_output | Output directory |
 | `--max-concurrent-requests` | 3 | Concurrent API requests |
 | `--temperature` | 0.9 (QA) / 0.5 (NemotronCC) | Sampling temperature |
 
-### NemotronCC-Specific Arguments
-
-| Argument | Default | Description |
-|----------|---------|-------------|
-| `--task` | diverse_qa | Task type (diverse_qa, distill, extract_knowledge, knowledge_list) |
-| `--tokenizer` | required | HuggingFace tokenizer name |
-| `--mock` | False | Use built-in test data |
-| `--input-parquet-path` | None | Input Parquet file path/glob |
-| `--output-format` | parquet | Output format (jsonl, parquet) |
-
 ## Example Output
 
 ### Multilingual Q&A
 
@@ -107,19 +91,9 @@ python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \
 {"text": "[FR] Question: Qu'est-ce que la photosynthèse? Answer: La photosynthèse est le processus par lequel les plantes convertissent la lumière du soleil en énergie."}
 ```
 
-### DiverseQA
-
-The output contains the original text followed by generated Q&A pairs:
+### NemotronCC
 
-```text
-The Amazon rainforest contains an unparalleled diversity of plant and animal species...
-
-Question: What makes the Amazon rainforest unique in terms of biodiversity?
-Answer: The Amazon rainforest contains an unparalleled diversity of plant and animal species.
-
-Question: True or False: The Amazon rainforest has limited species diversity.
-Answer: False. The Amazon rainforest contains an unparalleled diversity of species.
-```
+See the [NemotronCC documentation](../../docs/curate-text/synthetic/nemotron-cc/index.md) for the output format of each task type.
 
 ---
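The retry behavior described in the `llm-client.md` changes above (429 handling with exponential backoff behind a `max_concurrent_requests` cap) follows a standard asyncio pattern. The sketch below is illustrative only, not the `AsyncOpenAIClient` implementation; `RateLimitError`, `call_with_limits`, and the retry knobs are hypothetical stand-ins.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the API."""


# Caps in-flight requests, analogous to max_concurrent_requests=3.
semaphore = asyncio.Semaphore(3)


async def call_with_limits(call, max_retries: int = 5, base_delay: float = 1.0):
    async with semaphore:
        for attempt in range(max_retries):
            try:
                return await call()
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff with jitter so concurrent workers
                # don't retry in lockstep.
                await asyncio.sleep(base_delay * 2**attempt + random.uniform(0, 0.5))


async def flaky_request():
    """Demo request that is rate-limited twice, then succeeds."""
    flaky_request.calls = getattr(flaky_request, "calls", 0) + 1
    if flaky_request.calls < 3:
        raise RateLimitError
    return "ok"


print(asyncio.run(call_with_limits(flaky_request)))  # "ok" after two backoff waits
```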