|
1 | 1 | # 🎨 NeMo Data Designer |
2 | | -[CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml) [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) |
3 | 2 |
|
4 | | -Create synthetic datasets from scratch. |
| 3 | +[CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml) |
| 4 | +[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) |
| 5 | +[Python](https://www.python.org/downloads/) [NeMo Microservices](https://docs.nvidia.com/nemo/microservices/latest/index.html) |
5 | 6 |
|
6 | | -## Installation |
| 7 | +**Generate high-quality synthetic datasets from scratch or using your own seed data.** |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Welcome! |
| 12 | + |
| 13 | +Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data. |
| 14 | + |
| 15 | +## What can you do with Data Designer? |
| 16 | + |
| 17 | +- **Generate diverse data** using statistical samplers, LLMs, or existing seed datasets |
| 18 | +- **Control relationships** between fields with dependency-aware generation |
| 19 | +- **Validate quality** with built-in Python, SQL, and custom local and remote validators |
| 20 | +- **Score outputs** using LLM-as-a-judge for quality assessment |
| 21 | +- **Iterate quickly** with preview mode before full-scale generation |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## Quick Start |
| 26 | + |
| 27 | +### 1. Install |
| 28 | + |
| 29 | +```bash |
| 30 | +pip install data-designer |
| 31 | +``` |
| 32 | + |
| 33 | +Or install from source: |
7 | 34 |
|
8 | 35 | ```bash |
9 | 36 | git clone https://github.com/NVIDIA-NeMo/DataDesigner.git |
10 | 37 | cd DataDesigner |
11 | 38 | make install |
12 | 39 | ``` |
13 | 40 |
|
14 | | -Test your installation: |
| 41 | +### 2. Set your API key |
| 42 | + |
| 43 | +Get your API key from [build.nvidia.com](https://build.nvidia.com) or [OpenAI](https://platform.openai.com/api-keys): |
15 | 44 |
|
16 | 45 | ```bash |
17 | | -make test |
| 46 | +export NVIDIA_API_KEY="your-api-key-here" |
| 47 | +# Or use OpenAI |
| 48 | +export OPENAI_API_KEY="your-openai-api-key-here" |
18 | 49 | ``` |
19 | 50 |
|
20 | | -## Example Usage |
| 51 | +### 3. Generate your first dataset |
21 | 52 |
|
22 | 53 | ```python |
23 | 54 | from data_designer.essentials import ( |
24 | 55 | CategorySamplerParams, |
25 | 56 | DataDesigner, |
26 | 57 | DataDesignerConfigBuilder, |
27 | | - InferenceParameters, |
28 | 58 | LLMTextColumnConfig, |
29 | | - ModelConfig, |
30 | 59 | PersonSamplerParams, |
31 | 60 | SamplerColumnConfig, |
32 | 61 | SamplerType, |
33 | | - SubcategorySamplerParams, |
34 | | - UniformSamplerParams, |
35 | | -) |
36 | | - |
37 | | -data_designer = DataDesigner(artifact_path="./artifacts") |
38 | | - |
39 | | -# The model ID is from build.nvidia.com. |
40 | | -MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2" |
41 | | - |
42 | | -# We choose this alias to be descriptive for our use case. |
43 | | -MODEL_ALIAS = "nemotron-nano-v2" |
44 | | - |
45 | | -# This sets reasoning to False for the nemotron-nano-v2 model. |
46 | | -SYSTEM_PROMPT = "/no_think" |
47 | | - |
48 | | -model_configs = [ |
49 | | - ModelConfig( |
50 | | - alias=MODEL_ALIAS, |
51 | | - model=MODEL_ID, |
52 | | - inference_parameters=InferenceParameters( |
53 | | - temperature=0.5, |
54 | | - top_p=1.0, |
55 | | - max_tokens=1024, |
56 | | - ), |
57 | | - ) |
58 | | -] |
59 | | - |
60 | | -config_builder = DataDesignerConfigBuilder(model_configs=model_configs) |
61 | | - |
62 | | - |
63 | | -config_builder.add_column( |
64 | | - SamplerColumnConfig( |
65 | | - name="customer", |
66 | | - sampler_type=SamplerType.PERSON, |
67 | | - params=PersonSamplerParams(age_range=[18, 70]), |
68 | | - ) |
69 | 62 | ) |
70 | 63 |
|
| 64 | +# Initialize with default settings |
| 65 | +data_designer = DataDesigner() |
| 66 | +config_builder = DataDesignerConfigBuilder() |
71 | 67 |
|
| 68 | +# Add a product category |
72 | 69 | config_builder.add_column( |
73 | 70 | SamplerColumnConfig( |
74 | 71 | name="product_category", |
75 | 72 | sampler_type=SamplerType.CATEGORY, |
76 | 73 | params=CategorySamplerParams( |
77 | | - values=[ |
78 | | - "Electronics", |
79 | | - "Clothing", |
80 | | - "Home & Kitchen", |
81 | | - "Books", |
82 | | - "Home Office", |
83 | | - ], |
| 74 | + values=["Electronics", "Clothing", "Home & Kitchen", "Books"], |
84 | 75 | ), |
85 | 76 | ) |
86 | 77 | ) |
87 | 78 |
|
| 79 | +# Generate personalized customer reviews |
88 | 80 | config_builder.add_column( |
89 | 81 | LLMTextColumnConfig( |
90 | | - name="customer_review", |
91 | | - prompt=( |
92 | | - "You are a customer named {{ customer.first_name }} from {{ customer.city }}, " |
93 | | - "{{ customer.state }}. Tell me about your experience working in the " |
94 | | - "{{ product_category }} department of our company." |
95 | | - ), |
96 | | - system_prompt=SYSTEM_PROMPT, |
97 | | - model_alias=MODEL_ALIAS, |
| 82 | + name="review", |
| 83 | + model_alias="nvidia-text", |
| 84 | + prompt="""Write a brief product review for a {{ product_category }} item you recently purchased.""", |
98 | 85 | ) |
99 | 86 | ) |
100 | 87 |
|
101 | | -preview = data_designer.preview(config_builder) |
102 | | - |
| 88 | +# Preview your dataset |
| 89 | +preview = data_designer.preview(config_builder=config_builder) |
103 | 90 | preview.display_sample_record() |
104 | 91 | ``` |
105 | 92 |
|
106 | | -## A note about about Person Sampling |
| 93 | +**That's it!** You've created a dataset. |
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## What's next? |
107 | 98 |
|
108 | | -> **Note:** The below usage is only temporary. The library's support for the Nemotron-Personas datasets will be evolve as we prepare to open source. |
| 99 | +### 📚 Learn more |
109 | 100 |
|
110 | | -The PII and persona managed datasets have been updated for 25.11. If you want to use our Nemotron-Personas datasets for person / persona sampling, you need to do the following. |
| 101 | +- **[Quick Start Guide](https://nvidia-nemo.github.io/DataDesigner)** – Detailed walkthrough with more examples |
| 102 | +- **[Tutorial Notebooks](https://nvidia-nemo.github.io/DataDesigner/notebooks/1-the-basics/)** – Step-by-step interactive tutorials |
| 103 | +- **[Column Types](https://nvidia-nemo.github.io/DataDesigner/concepts/columns/)** – Explore samplers, LLM columns, validators, and more |
| 104 | +- **[Model Configuration](https://nvidia-nemo.github.io/DataDesigner/models/model-configs/)** – Configure custom models and providers |
| 105 | + |
| 106 | +### 🔧 Configure models via CLI |
111 | 107 |
|
112 | | -Download the datasets from NGC: |
113 | 108 | ```bash |
114 | | -ngc registry resource download-version --org nvidian nvidian/nemo-llm/nemotron-personas-datasets:0.0.6-slim |
| 109 | +data-designer config providers # Configure model providers |
| 110 | +data-designer config models # Set up your model configurations |
| 111 | +data-designer config list # View current settings |
115 | 112 | ``` |
116 | 113 |
|
117 | | -The "slim" version is smaller for fast development. Remove the "-slim" to get the full datasets. |
| 114 | +### 🤝 Get involved |
118 | 115 |
|
119 | | -Tell `DataDesigner` where to find the datasets: |
120 | | -```python |
121 | | -data_designer = DataDesigner(artifact_path="./artifacts", blob_storage_path="/path/to/nemotron-personas-datasets") |
122 | | -``` |
| 116 | +- **[Contributing Guide](https://nvidia-nemo.github.io/DataDesigner/CONTRIBUTING.md)** – Help improve Data Designer |
| 117 | +- **[GitHub Issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues)** – Report bugs or request features |
| 118 | +- **[GitHub Discussions](https://github.com/NVIDIA-NeMo/DataDesigner/discussions)** – Ask questions and share ideas |
| 119 | + |
| 120 | +--- |
| 121 | + |
| 122 | +## License |
| 123 | + |
| 124 | +Apache License 2.0 – see [LICENSE](LICENSE) for details. |
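
Step 2 of the new Quick Start relies on `NVIDIA_API_KEY` or `OPENAI_API_KEY` being exported before the Python example runs. A minimal sketch of a pre-flight check that uses only the standard library is shown below. The helper name `find_api_keys` is hypothetical and not part of Data Designer, and the check only confirms that a variable is set, not that the key is valid.

```python
import os

# Variable names taken from step 2 of the Quick Start in the diff above.
KEY_NAMES = ("NVIDIA_API_KEY", "OPENAI_API_KEY")


def find_api_keys(env=None):
    """Return the names of the known key variables that are set and non-empty.

    Hypothetical helper for illustration; it is not part of Data Designer.
    """
    env = os.environ if env is None else env
    return [name for name in KEY_NAMES if env.get(name, "").strip()]


if __name__ == "__main__":
    found = find_api_keys()
    if found:
        print("API key variables set:", ", ".join(found))
    else:
        print("No API key found; export one of:", ", ".join(KEY_NAMES))
```

Running this before `data_designer.preview(...)` may surface a missing key earlier than a failed model call would.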