Commit 4468ad3 (parent 2fdfe14)

add new sdg docs

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

5 files changed: +1471 −0 lines changed

Lines changed: 154 additions & 0 deletions

---
description: "Generate and augment training data using LLMs with NeMo Curator's synthetic data generation pipeline"
categories: ["workflows"]
tags: ["synthetic-data", "llm", "generation", "augmentation", "multilingual"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "workflow"
modality: "text-only"
---

(synthetic-data-overview)=

# Synthetic Data Generation

NeMo Curator provides synthetic data generation (SDG) capabilities for creating and augmenting training data using Large Language Models (LLMs). These pipelines integrate with OpenAI-compatible APIs, enabling you to use NVIDIA NIM endpoints, local vLLM servers, or other inference providers.

## Use Cases

- **Data Augmentation**: Expand limited datasets by generating diverse variations
- **Multilingual Generation**: Create Q&A pairs and text in multiple languages
- **Knowledge Extraction**: Convert raw text into structured knowledge formats
- **Quality Improvement**: Paraphrase low-quality text into higher-quality Wikipedia-style prose
- **Training Data Creation**: Generate instruction-following data for model fine-tuning

## Core Concepts

Synthetic data generation in NeMo Curator operates in two primary modes:

### Generation Mode

Create new data from scratch without requiring input documents. The `QAMultilingualSyntheticStage` demonstrates this pattern: it generates Q&A pairs based on a prompt template without needing seed documents.

### Transformation Mode

Improve or restructure existing data using LLM capabilities. The Nemotron-CC stages exemplify this approach, taking input documents and producing:

- Paraphrased text in Wikipedia style
- Diverse Q&A pairs derived from document content
- Condensed knowledge distillations
- Extracted factual content

## Architecture

The following diagram shows how SDG pipelines process data through preprocessing, LLM generation, and postprocessing stages:

```{mermaid}
flowchart LR
    A["Input Documents<br/>(Parquet/JSONL)"] --> B["Preprocessing<br/>(Tokenization,<br/>Segmentation)"]
    B --> C["LLM Generation<br/>(OpenAI-compatible)"]
    C --> D["Postprocessing<br/>(Cleanup, Filtering)"]
    D --> E["Output Dataset<br/>(Parquet/JSONL)"]

    F["LLM Client<br/>(NVIDIA API,<br/>vLLM, TGI)"] -.->|"API Calls"| C

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D stage
    class E output
    class F infra
```

## Prerequisites

Before using synthetic data generation, ensure you have:

1. **NVIDIA API Key** (for cloud endpoints)
   - Obtain from [NVIDIA Build](https://build.nvidia.com/settings/api-keys)
   - Set as an environment variable: `export NVIDIA_API_KEY="your-key"`

2. **NeMo Curator with text extras**

   ```bash
   uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
   ```

:::{note}
Nemotron-CC pipelines use the `transformers` library for tokenization, which is included in NeMo Curator's core dependencies.
:::

## Available SDG Stages

```{list-table} Synthetic Data Generation Stages
:header-rows: 1
:widths: 30 40 30

* - Stage
  - Purpose
  - Input Type
* - `QAMultilingualSyntheticStage`
  - Generate multilingual Q&A pairs
  - Empty (generates from scratch)
* - `WikipediaParaphrasingStage`
  - Rewrite text as Wikipedia-style prose
  - Document text
* - `DiverseQAStage`
  - Generate diverse Q&A pairs from documents
  - Document text
* - `DistillStage`
  - Create condensed, information-dense paraphrases
  - Document text
* - `ExtractKnowledgeStage`
  - Extract knowledge as textbook-style passages
  - Document text
* - `KnowledgeListStage`
  - Extract structured fact lists
  - Document text
```

---

## Topics

::::{grid} 1 1 2 2
:gutter: 2

:::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` LLM Client Setup
:link: llm-client
:link-type: doc
Configure OpenAI-compatible clients for NVIDIA APIs and custom endpoints
+++
{bdg-secondary}`configuration`
{bdg-secondary}`performance`
:::

:::{grid-item-card} {octicon}`globe;1.5em;sd-mr-1` Multilingual Q&A Generation
:link: multilingual-qa
:link-type: doc
Generate synthetic Q&A pairs across multiple languages
+++
{bdg-secondary}`quickstart`
{bdg-secondary}`tutorial`
:::

:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Nemotron-CC Pipelines
:link: nemotron-cc/index
:link-type: doc
Advanced text transformation and knowledge extraction workflows
+++
{bdg-secondary}`advanced`
{bdg-secondary}`paraphrasing`
:::

::::

```{toctree}
:hidden:
:maxdepth: 2

llm-client
multilingual-qa
nemotron-cc/index
```

Lines changed: 233 additions & 0 deletions

---
description: "Configure LLM clients for synthetic data generation with NVIDIA APIs or custom endpoints"
categories: ["how-to-guides"]
tags: ["llm-client", "openai", "nvidia-api", "configuration"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "text-only"
---

(synthetic-llm-client)=
# LLM Client Configuration

NeMo Curator's synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.

## Overview

Two client types are available:

- **`AsyncOpenAIClient`**: Recommended for high-throughput batch processing with concurrent requests
- **`OpenAIClient`**: Synchronous client for simpler use cases or debugging

For most SDG workloads, use `AsyncOpenAIClient` to maximize throughput.

## Basic Configuration

### NVIDIA API Endpoints

```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Or use the NVIDIA_API_KEY env var
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)
```

### Environment Variables

Set your API key as an environment variable to avoid hardcoding credentials:

```bash
export NVIDIA_API_KEY="nvapi-..."
```

The underlying OpenAI client automatically uses the `OPENAI_API_KEY` environment variable if no `api_key` is provided. For NVIDIA APIs, explicitly pass the key:

```python
import os

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
```

## Generation Parameters

Configure LLM generation behavior using `GenerationConfig`:

```python
from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)
```

```{list-table} Generation Parameters
:header-rows: 1
:widths: 20 15 15 50

* - Parameter
  - Type
  - Default
  - Description
* - `max_tokens`
  - int
  - 2048
  - Maximum tokens to generate per request
* - `temperature`
  - float
  - 0.0
  - Sampling temperature (0.0-2.0). Higher values increase randomness
* - `top_p`
  - float
  - 0.95
  - Nucleus sampling parameter (0.0-1.0)
* - `top_k`
  - int
  - None
  - Top-k sampling (if supported by the endpoint)
* - `seed`
  - int
  - 0
  - Random seed for reproducibility
* - `stop`
  - str/list
  - None
  - Stop sequences to end generation
* - `stream`
  - bool
  - False
  - Enable streaming (not recommended for batch processing)
* - `n`
  - int
  - 1
  - Number of completions to generate per request
```
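
To build intuition for how `top_p` trims the sampling distribution, here is a minimal, library-independent sketch of nucleus (top-p) filtering over a toy next-token distribution. This illustrates the sampling concept only; it is not NeMo Curator or endpoint code, and the token probabilities are made up for the example:

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize (nucleus sampling)."""
    kept, total = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    return {t: p / total for t, p in kept.items()}

# Toy next-token distribution
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "xyzzy": 0.05}
filtered = top_p_filter(probs, top_p=0.9)
print(filtered)  # low-probability tail token "xyzzy" is cut; rest renormalized
```

Lower `top_p` values keep a smaller nucleus and produce more conservative output, which is why it is often tuned together with `temperature`.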

## Performance Tuning

### Concurrency vs. Parallelism

The `max_concurrent_requests` parameter controls how many API requests the client can have in-flight simultaneously. This interacts with Ray's distributed workers:

- **Client-level concurrency**: `max_concurrent_requests` limits concurrent API calls per worker
- **Worker-level parallelism**: Ray distributes tasks across multiple workers

```python
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
```
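
A client-level cap like this is typically implemented with a semaphore. The following self-contained sketch uses plain `asyncio` (not NeMo Curator internals) to show the pattern that `max_concurrent_requests` describes, with a peak-concurrency counter standing in for real API calls:

```python
import asyncio

async def call_api(i: int, sem: asyncio.Semaphore, in_flight: list[int]) -> int:
    # The semaphore caps how many "requests" run at once.
    async with sem:
        in_flight[0] += 1
        in_flight[1] = max(in_flight[1], in_flight[0])  # track peak concurrency
        await asyncio.sleep(0.01)  # stand-in for the HTTP round trip
        in_flight[0] -= 1
        return i

async def main(max_concurrent_requests: int = 3, n_requests: int = 10) -> int:
    sem = asyncio.Semaphore(max_concurrent_requests)
    in_flight = [0, 0]  # [current, peak]
    await asyncio.gather(*(call_api(i, sem, in_flight) for i in range(n_requests)))
    return in_flight[1]

peak = asyncio.run(main())
print(f"peak concurrency: {peak}")  # never exceeds the semaphore limit of 3
```

With multiple Ray workers, each worker applies its own cap, so the total in-flight requests against the endpoint is roughly `workers * max_concurrent_requests`.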

### Retry Configuration

The client includes automatic retry with exponential backoff for transient errors:

```python
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,   # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,     # Request timeout
)
```

The retry logic handles:

- **Rate limit errors (429)**: Automatic backoff with jitter
- **Connection errors**: Retry with exponential delay
- **Transient failures**: Configurable retry attempts
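
The schedule behind `max_retries` and `base_delay` can be pictured with a small, standalone sketch of exponential backoff with jitter. This illustrates the general pattern; the library's exact delays and jitter distribution may differ:

```python
import random

def backoff_delays(max_retries: int, base_delay: float,
                   jitter: float = 0.1, seed: int = 0) -> list[float]:
    """Delay before each retry attempt: base_delay * 2**attempt, plus a
    small random jitter so many clients don't retry in lockstep."""
    rng = random.Random(seed)
    return [base_delay * (2 ** attempt) + rng.uniform(0, jitter)
            for attempt in range(max_retries)]

delays = backoff_delays(max_retries=3, base_delay=1.0)
print([round(d, 2) for d in delays])  # roughly [1.x, 2.x, 4.x] seconds
```

Doubling the delay between attempts gives a rate-limited endpoint time to recover, while the jitter spreads out retries from concurrent workers.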

## Using Other OpenAI-Compatible Endpoints

The `AsyncOpenAIClient` works with any OpenAI-compatible API endpoint. Simply configure the `base_url` and `api_key` parameters:

```python
# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set the OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)
```

## Complete Example

```python
import os

from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ.get("NVIDIA_API_KEY"),
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
```

## Troubleshooting

### Rate Limit Errors

If you encounter frequent 429 errors:

1. Reduce `max_concurrent_requests`
2. Increase `base_delay` for longer backoff
3. Consider using a local deployment for high-volume workloads

### Connection Timeouts

For slow networks or high-latency endpoints:

```python
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from the default of 120 seconds
)
```

---

## Next Steps

- {ref}`multilingual-qa-tutorial`: Generate multilingual Q&A pairs
- {ref}`nemotron-cc-overview`: Advanced text transformation pipelines
