Commit 4468ad3 (parent 2fdfe14)

add new sdg docs

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

5 files changed: +1471 −0 lines changed

Lines changed: 154 additions & 0 deletions

---
description: "Generate and augment training data using LLMs with NeMo Curator's synthetic data generation pipeline"
categories: ["workflows"]
tags: ["synthetic-data", "llm", "generation", "augmentation", "multilingual"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "workflow"
modality: "text-only"
---

(synthetic-data-overview)=

# Synthetic Data Generation

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, local vLLM servers, or other inference providers.

## Use Cases

- **Data Augmentation**: Expand limited datasets by generating diverse variations
- **Multilingual Generation**: Create Q&A pairs and text in multiple languages
- **Knowledge Extraction**: Convert raw text into structured knowledge formats
- **Quality Improvement**: Paraphrase low-quality text into higher-quality Wikipedia-style prose
- **Training Data Creation**: Generate instruction-following data for model fine-tuning

## Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

### Generation Mode

Create new data from scratch without requiring input documents. The `QAMultilingualSyntheticStage` demonstrates this pattern: it generates Q&A pairs based on a prompt template without needing seed documents.

### Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

- Paraphrased text in Wikipedia style
- Diverse Q&A pairs derived from document content
- Condensed knowledge distillations
- Extracted factual content

## Architecture

The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:

```{mermaid}
flowchart LR
    A["Input Documents<br/>(Parquet/JSONL)"] --> B["Preprocessing<br/>(Tokenization,<br/>Segmentation)"]
    B --> C["LLM Generation<br/>(OpenAI-compatible)"]
    C --> D["Postprocessing<br/>(Cleanup, Filtering)"]
    D --> E["Output Dataset<br/>(Parquet/JSONL)"]

    F["LLM Client<br/>(NVIDIA API,<br/>vLLM, TGI)"] -.->|"API Calls"| C

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D stage
    class E output
    class F infra
```

## Prerequisites

Before using synthetic data generation, ensure you have:

1. **NVIDIA API Key** (for cloud endpoints)
   - Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys)
   - Set as an environment variable: `export NVIDIA_API_KEY="your-key"`

2. **NeMo Curator with text extras**

   ```bash
   uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
   ```

:::{note}
Nemotron-CC pipelines use the `transformers` library for tokenization, which is included in NeMo Curator's core dependencies.
:::

## Available SDG Stages

```{list-table} Synthetic Data Generation Stages
:header-rows: 1
:widths: 30 40 30

* - Stage
  - Purpose
  - Input Type
* - `QAMultilingualSyntheticStage`
  - Generate multilingual Q&A pairs
  - Empty (generates from scratch)
* - `WikipediaParaphrasingStage`
  - Rewrite text as Wikipedia-style prose
  - Document text
* - `DiverseQAStage`
  - Generate diverse Q&A pairs from documents
  - Document text
* - `DistillStage`
  - Create condensed, information-dense paraphrases
  - Document text
* - `ExtractKnowledgeStage`
  - Extract knowledge as textbook-style passages
  - Document text
* - `KnowledgeListStage`
  - Extract structured fact lists
  - Document text
```

---

## Topics

::::{grid} 1 1 2 2
:gutter: 2

:::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` LLM Client Setup
:link: llm-client
:link-type: doc
Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints
+++
{bdg-secondary}`configuration`
{bdg-secondary}`performance`
:::

:::{grid-item-card} {octicon}`globe;1.5em;sd-mr-1` Multilingual Q&A Generation
:link: multilingual-qa
:link-type: doc
Generate synthetic Q&A pairs across multiple languages
+++
{bdg-secondary}`quickstart`
{bdg-secondary}`tutorial`
:::

:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Nemotron-CC Pipelines
:link: nemotron-cc/index
:link-type: doc
Advanced text transformation and knowledge extraction workflows
+++
{bdg-secondary}`advanced`
{bdg-secondary}`paraphrasing`
:::

::::

```{toctree}
:hidden:
:maxdepth: 2

llm-client
multilingual-qa
nemotron-cc/index
```

Lines changed: 233 additions & 0 deletions

---
description: "Configure LLM clients for synthetic data generation with NVIDIA APIs or custom endpoints"
categories: ["how-to-guides"]
tags: ["llm-client", "openai", "nvidia-api", "configuration"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "text-only"
---

(synthetic-llm-client)=
# LLM Client Configuration

NeMo Curator's synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.

## Overview

Two client types are available:

- **`AsyncOpenAIClient`**: Recommended for high-throughput batch processing with concurrent requests
- **`OpenAIClient`**: Synchronous client for simpler use cases or debugging

For most SDG workloads, use `AsyncOpenAIClient` to maximize throughput.

## Basic Configuration

### NVIDIA API Endpoints

```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Or use the NVIDIA_API_KEY env var
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)
```

### Environment Variables

Set your API key as an environment variable to avoid hardcoding credentials:

```bash
export NVIDIA_API_KEY="nvapi-..."
```

The underlying OpenAI client automatically uses the `OPENAI_API_KEY` environment variable if no `api_key` is provided. For NVIDIA APIs, explicitly pass the key:

```python
import os

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
```

## Generation Parameters

Configure LLM generation behavior using `GenerationConfig`:

```python
from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)
```

```{list-table} Generation Parameters
:header-rows: 1
:widths: 20 15 15 50

* - Parameter
  - Type
  - Default
  - Description
* - `max_tokens`
  - int
  - 2048
  - Maximum tokens to generate per request
* - `temperature`
  - float
  - 0.0
  - Sampling temperature (0.0-2.0). Higher values increase randomness
* - `top_p`
  - float
  - 0.95
  - Nucleus sampling parameter (0.0-1.0)
* - `top_k`
  - int
  - None
  - Top-k sampling (if supported by the endpoint)
* - `seed`
  - int
  - 0
  - Random seed for reproducibility
* - `stop`
  - str/list
  - None
  - Stop sequences to end generation
* - `stream`
  - bool
  - False
  - Enable streaming (not recommended for batch processing)
* - `n`
  - int
  - 1
  - Number of completions to generate per request
```
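
To build intuition for how `top_p` trims the sampling distribution, here is a minimal, library-independent sketch of nucleus (top-p) filtering over a toy next-token distribution. This illustrates the sampling concept only; it is not NeMo Curator or endpoint code, and the token probabilities are made up for the example:

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize (nucleus sampling)."""
    kept, total = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    return {t: p / total for t, p in kept.items()}

# Toy next-token distribution
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xyzzy": 0.05}
filtered = top_p_filter(probs, top_p=0.9)
print(filtered)  # low-probability tail token "xyzzy" is cut; rest renormalized
```

Lower `top_p` values keep a smaller nucleus and produce more conservative output, which is why it is often tuned together with `temperature`.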

## Performance Tuning

### Concurrency vs. Parallelism

The `max_concurrent_requests` parameter controls how many API requests the client can have in-flight simultaneously. This interacts with Ray's distributed workers:

- **Client-level concurrency**: `max_concurrent_requests` limits concurrent API calls per worker
- **Worker-level parallelism**: Ray distributes tasks across multiple workers

```python
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
```
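
A client-level cap like this is typically implemented with a semaphore. The following self-contained sketch uses plain `asyncio` (not NeMo Curator internals) to show the pattern that `max_concurrent_requests` describes, with a peak-concurrency counter standing in for real API calls:

```python
import asyncio

async def call_api(i: int, sem: asyncio.Semaphore, in_flight: list[int]) -> int:
    # The semaphore caps how many "requests" run at once.
    async with sem:
        in_flight[0] += 1
        in_flight[1] = max(in_flight[1], in_flight[0])  # track peak concurrency
        await asyncio.sleep(0.01)  # stand-in for the HTTP round trip
        in_flight[0] -= 1
        return i

async def main(max_concurrent_requests: int = 3, n_requests: int = 10) -> int:
    sem = asyncio.Semaphore(max_concurrent_requests)
    in_flight = [0, 0]  # [current, peak]
    await asyncio.gather(*(call_api(i, sem, in_flight) for i in range(n_requests)))
    return in_flight[1]

peak = asyncio.run(main())
print(f"peak concurrency: {peak}")  # never exceeds the semaphore limit of 3
```

With multiple Ray workers, each worker applies its own cap, so the total in-flight requests against the endpoint is roughly `workers * max_concurrent_requests`.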

### Retry Configuration

The client includes automatic retry with exponential backoff for transient errors:

```python
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,   # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,     # Request timeout
)
```

The retry logic handles:

- **Rate limit errors (429)**: Automatic backoff with jitter
- **Connection errors**: Retry with exponential delay
- **Transient failures**: Configurable retry attempts
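
The schedule behind `max_retries` and `base_delay` can be pictured with a small, standalone sketch of exponential backoff with jitter. This illustrates the general pattern; the library's exact delays and jitter distribution may differ:

```python
import random

def backoff_delays(max_retries: int, base_delay: float,
                   jitter: float = 0.1, seed: int = 0) -> list[float]:
    """Delay before each retry attempt: base_delay * 2**attempt, plus a
    small random jitter so many clients don't retry in lockstep."""
    rng = random.Random(seed)
    return [base_delay * (2 ** attempt) + rng.uniform(0, jitter)
            for attempt in range(max_retries)]

delays = backoff_delays(max_retries=3, base_delay=1.0)
print([round(d, 2) for d in delays])  # roughly [1.x, 2.x, 4.x] seconds
```

Doubling the delay between attempts gives a rate-limited endpoint time to recover, while the jitter spreads out retries from concurrent workers.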

## Using Other OpenAI-Compatible Endpoints

The `AsyncOpenAIClient` works with any OpenAI-compatible API endpoint. Simply configure the `base_url` and `api_key` parameters:

```python
# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set the OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)
```

## Complete Example

```python
import os

from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ.get("NVIDIA_API_KEY"),
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
```

## Troubleshooting

### Rate Limit Errors

If you encounter frequent 429 errors:

1. Reduce `max_concurrent_requests`
2. Increase `base_delay` for longer backoff
3. Consider using a local deployment for high-volume workloads

### Connection Timeouts

For slow networks or high-latency endpoints:

```python
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from the default of 120 seconds
)
```

---

## Next Steps

- {ref}`multilingual-qa-tutorial`: Generate multilingual Q&A pairs
- {ref}`nemotron-cc-overview`: Advanced text transformation pipelines
