---
description: "Configure LLM clients for synthetic data generation with NVIDIA APIs or custom endpoints"
categories: ["how-to-guides"]
tags: ["llm-client", "openai", "nvidia-api", "configuration"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "text-only"
---

(synthetic-llm-client)=
# LLM Client Configuration

NeMo Curator's synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.

## Overview

Two client types are available:

- **`AsyncOpenAIClient`**: Recommended for high-throughput batch processing with concurrent requests
- **`OpenAIClient`**: Synchronous client for simpler use cases or debugging

For most SDG workloads, use `AsyncOpenAIClient` to maximize throughput.
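To see why the async client helps for batch workloads, here is a minimal, library-independent sketch. The `mock_request` coroutine is a stand-in for a real API call: issuing many slow requests concurrently takes roughly the time of one request, while a synchronous client pays the full latency for each.

```python
import asyncio
import time

async def mock_request(i: int) -> str:
    # Stand-in for one LLM API call with ~50 ms of network latency
    await asyncio.sleep(0.05)
    return f"response-{i}"

async def sequential(n: int) -> list:
    # One request at a time, as a synchronous client would behave
    return [await mock_request(i) for i in range(n)]

async def concurrent(n: int) -> list:
    # All requests in flight at once, as the async client allows
    return await asyncio.gather(*(mock_request(i) for i in range(n)))

start = time.perf_counter()
asyncio.run(sequential(10))
seq_time = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(concurrent(10))
conc_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, concurrent: {conc_time:.2f}s")
```

With 10 simulated requests, the concurrent run finishes in roughly one request's latency instead of ten.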

## Basic Configuration

### NVIDIA API Endpoints

```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Or use the NVIDIA_API_KEY env var
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)
```

### Environment Variables

Set your API key as an environment variable to avoid hardcoding credentials:

```bash
export NVIDIA_API_KEY="nvapi-..."
```

The underlying OpenAI client automatically uses the `OPENAI_API_KEY` environment variable if no `api_key` is provided. For NVIDIA APIs, explicitly pass the key:

```python
import os

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
```

## Generation Parameters

Configure LLM generation behavior using `GenerationConfig`:

```python
from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)
```

```{list-table} Generation Parameters
:header-rows: 1
:widths: 20 15 15 50

* - Parameter
  - Type
  - Default
  - Description
* - `max_tokens`
  - int
  - 2048
  - Maximum tokens to generate per request
* - `temperature`
  - float
  - 0.0
  - Sampling temperature (0.0-2.0). Higher values increase randomness
* - `top_p`
  - float
  - 0.95
  - Nucleus sampling parameter (0.0-1.0)
* - `top_k`
  - int
  - None
  - Top-k sampling (if supported by the endpoint)
* - `seed`
  - int
  - 0
  - Random seed for reproducibility
* - `stop`
  - str/list
  - None
  - Stop sequences that end generation
* - `stream`
  - bool
  - False
  - Enable streaming (not recommended for batch processing)
* - `n`
  - int
  - 1
  - Number of completions to generate per request
```
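These fields correspond to standard OpenAI chat-completions request parameters. The sketch below uses a simplified stand-in dataclass and a hypothetical `to_payload` helper (neither is part of NeMo Curator) to show how the defaults in the table would translate into a request body, with unset optional parameters omitted so the endpoint applies its own defaults:

```python
from dataclasses import asdict, dataclass
from typing import Optional, Union

@dataclass
class GenConfig:
    # Simplified stand-in mirroring the defaults in the table above
    max_tokens: int = 2048
    temperature: float = 0.0
    top_p: float = 0.95
    top_k: Optional[int] = None
    seed: int = 0
    stop: Optional[Union[str, list]] = None
    stream: bool = False
    n: int = 1

def to_payload(cfg: GenConfig) -> dict:
    # Drop parameters left at None; the endpoint then uses its own defaults
    return {k: v for k, v in asdict(cfg).items() if v is not None}

payload = to_payload(GenConfig(temperature=0.7, seed=42))
print(sorted(payload))  # top_k and stop are omitted
```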

## Performance Tuning

### Concurrency vs. Parallelism

The `max_concurrent_requests` parameter controls how many API requests the client can have in flight simultaneously. This interacts with Ray's distributed workers:

- **Client-level concurrency**: `max_concurrent_requests` limits concurrent API calls per worker
- **Worker-level parallelism**: Ray distributes tasks across multiple workers

```python
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
```
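Conceptually, a per-worker concurrency cap behaves like a semaphore wrapped around each API call. The sketch below illustrates the mechanism with plain `asyncio` (it is not the library's actual implementation). Note that with multiple Ray workers, the total in-flight load on the endpoint is roughly `num_workers * max_concurrent_requests`, which is the figure to compare against your rate limit:

```python
import asyncio

async def limited_call(sem: asyncio.Semaphore, i: int) -> int:
    async with sem:  # blocks while the maximum number of calls is already running
        await asyncio.sleep(0.01)  # stand-in for one API request
        return i

async def run_batch(n_requests: int, max_concurrent: int) -> list:
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(
        *(limited_call(sem, i) for i in range(n_requests))
    )

results = asyncio.run(run_batch(20, max_concurrent=5))
print(len(results))  # 20 results, but at most 5 requests in flight at a time
```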

### Retry Configuration

The client includes automatic retries with exponential backoff for transient errors:

```python
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,   # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,     # Request timeout in seconds
)
```

The retry logic handles:

- **Rate limit errors (429)**: Automatic backoff with jitter
- **Connection errors**: Retry with exponential delay
- **Transient failures**: Configurable retry attempts
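The backoff schedule can be sketched as follows. This assumes the common full-jitter formula, `base_delay * 2**attempt` capped at a maximum and then randomized, which may differ in detail from the library's internals:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    # Exponential growth capped at max_delay, with full jitter so that
    # many workers retrying at once do not hit the endpoint in lockstep
    capped = min(base_delay * 2 ** attempt, max_delay)
    return random.uniform(0, capped)

for attempt in range(4):
    cap = min(1.0 * 2 ** attempt, 60.0)
    print(f"attempt {attempt}: wait up to {cap:.0f}s")
```

Raising `base_delay` stretches the whole schedule, which is why it is the first knob to turn when an endpoint keeps returning 429s.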

## Using Other OpenAI-Compatible Endpoints

The `AsyncOpenAIClient` works with any OpenAI-compatible API endpoint. Configure the `base_url` and `api_key` parameters to match your provider:

```python
# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set the OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)
```

## Complete Example

```python
import os

from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure the client
client = AsyncOpenAIClient(
    api_key=os.environ.get("NVIDIA_API_KEY"),
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use the client in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
```

## Troubleshooting

### Rate Limit Errors

If you encounter frequent 429 errors:

1. Reduce `max_concurrent_requests`
2. Increase `base_delay` for longer backoff
3. Consider using a local deployment for high-volume workloads

### Connection Timeouts

For slow networks or high-latency endpoints, increase the request timeout:

```python
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from the default of 120 seconds
)
```

---

## Next Steps

- {ref}`multilingual-qa-tutorial`: Generate multilingual Q&A pairs
- {ref}`nemotron-cc-overview`: Advanced text transformation pipelines