Commit 0be3c10

kirit93, nabinchha, and johnnygreco authored
chore: Updated Readme (#51)
* Updated Readme
* Update README.md

  Co-authored-by: Nabin Mulepati <[email protected]>
* Updated links
* Update README.md

---------

Co-authored-by: Nabin Mulepati <[email protected]>
Co-authored-by: Johnny Greco <[email protected]>
1 parent 7cba674 commit 0be3c10

1 file changed, +73 -71 lines changed

README.md

Lines changed: 73 additions & 71 deletions
@@ -1,122 +1,124 @@
11
# 🎨 NeMo Data Designer
2-
[![CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
32

4-
Create synthetic datasets from scratch.
3+
[![CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml/badge.svg)](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml)
4+
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
5+
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![NeMo Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html)
56

6-
## Installation
7+
**Generate high-quality synthetic datasets from scratch or using your own seed data.**
8+
9+
---
10+
11+
## Welcome!
12+
13+
Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.
14+
15+
## What can you do with Data Designer?
16+
17+
- **Generate diverse data** using statistical samplers, LLMs, or existing seed datasets
18+
- **Control relationships** between fields with dependency-aware generation
19+
- **Validate quality** with built-in Python, SQL, and custom local and remote validators
20+
- **Score outputs** using LLM-as-a-judge for quality assessment
21+
- **Iterate quickly** with preview mode before full-scale generation
22+
23+
---
24+
25+
## Quick Start
26+
27+
### 1. Install
28+
29+
```bash
30+
pip install data-designer
31+
```
32+
33+
Or install from source:
734

835
```bash
936
git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
1037
cd DataDesigner
1138
make install
1239
```
1340

14-
Test your installation:
41+
### 2. Set your API key
42+
43+
Get your API key from [build.nvidia.com](https://build.nvidia.com) or [OpenAI](https://platform.openai.com/api-keys):
1544

1645
```bash
17-
make test
46+
export NVIDIA_API_KEY="your-api-key-here"
47+
# Or use OpenAI
48+
export OPENAI_API_KEY="your-openai-api-key-here"
1849
```
1950

20-
## Example Usage
51+
### 3. Generate your first dataset
2152

2253
```python
2354
from data_designer.essentials import (
2455
CategorySamplerParams,
2556
DataDesigner,
2657
DataDesignerConfigBuilder,
27-
InferenceParameters,
2858
LLMTextColumnConfig,
29-
ModelConfig,
3059
PersonSamplerParams,
3160
SamplerColumnConfig,
3261
SamplerType,
33-
SubcategorySamplerParams,
34-
UniformSamplerParams,
35-
)
36-
37-
data_designer = DataDesigner(artifact_path="./artifacts")
38-
39-
# The model ID is from build.nvidia.com.
40-
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"
41-
42-
# We choose this alias to be descriptive for our use case.
43-
MODEL_ALIAS = "nemotron-nano-v2"
44-
45-
# This sets reasoning to False for the nemotron-nano-v2 model.
46-
SYSTEM_PROMPT = "/no_think"
47-
48-
model_configs = [
49-
ModelConfig(
50-
alias=MODEL_ALIAS,
51-
model=MODEL_ID,
52-
inference_parameters=InferenceParameters(
53-
temperature=0.5,
54-
top_p=1.0,
55-
max_tokens=1024,
56-
),
57-
)
58-
]
59-
60-
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)
61-
62-
63-
config_builder.add_column(
64-
SamplerColumnConfig(
65-
name="customer",
66-
sampler_type=SamplerType.PERSON,
67-
params=PersonSamplerParams(age_range=[18, 70]),
68-
)
6962
)
7063

64+
# Initialize with default settings
65+
data_designer = DataDesigner()
66+
config_builder = DataDesignerConfigBuilder()
7167

68+
# Add a product category
7269
config_builder.add_column(
7370
SamplerColumnConfig(
7471
name="product_category",
7572
sampler_type=SamplerType.CATEGORY,
7673
params=CategorySamplerParams(
77-
values=[
78-
"Electronics",
79-
"Clothing",
80-
"Home & Kitchen",
81-
"Books",
82-
"Home Office",
83-
],
74+
values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
8475
),
8576
)
8677
)
8778

79+
# Generate personalized customer reviews
8880
config_builder.add_column(
8981
LLMTextColumnConfig(
90-
name="customer_review",
91-
prompt=(
92-
"You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
93-
"{{ customer.state }}. Tell me about your experience working in the "
94-
"{{ product_category }} department of our company."
95-
),
96-
system_prompt=SYSTEM_PROMPT,
97-
model_alias=MODEL_ALIAS,
82+
name="review",
83+
model_alias="nvidia-text",
84+
prompt="""Write a brief product review for a {{ product_category }} item you recently purchased.""",
9885
)
9986
)
10087

101-
preview = data_designer.preview(config_builder)
102-
88+
# Preview your dataset
89+
preview = data_designer.preview(config_builder=config_builder)
10390
preview.display_sample_record()
10491
```
10592

106-
## A note about about Person Sampling
93+
**That's it!** You've created a dataset.
94+
95+
---
96+
97+
## What's next?
10798

108-
> **Note:** The below usage is only temporary. The library's support for the Nemotron-Personas datasets will be evolve as we prepare to open source.
99+
### 📚 Learn more
109100

110-
The PII and persona managed datasets have been updated for 25.11. If you want to use our Nemotron-Personas datasets for person / persona sampling, you need to do the following.
101+
- **[Quick Start Guide](https://nvidia-nemo.github.io/DataDesigner)** – Detailed walkthrough with more examples
102+
- **[Tutorial Notebooks](https://nvidia-nemo.github.io/DataDesigner/notebooks/1-the-basics/)** – Step-by-step interactive tutorials
103+
- **[Column Types](https://nvidia-nemo.github.io/DataDesigner/concepts/columns/)** – Explore samplers, LLM columns, validators, and more
104+
- **[Model Configuration](https://nvidia-nemo.github.io/DataDesigner/models/model-configs/)** – Configure custom models and providers
105+
106+
### 🔧 Configure models via CLI
111107

112-
Download the datasets from NGC:
113108
```bash
114-
ngc registry resource download-version --org nvidian nvidian/nemo-llm/nemotron-personas-datasets:0.0.6-slim
109+
data-designer config providers # Configure model providers
110+
data-designer config models # Set up your model configurations
111+
data-designer config list # View current settings
115112
```
116113

117-
The "slim" version is smaller for fast development. Remove the "-slim" to get the full datasets.
114+
### 🤝 Get involved
118115

119-
Tell `DataDesigner` where to find the datasets:
120-
```python
121-
data_designer = DataDesigner(artifact_path="./artifacts", blob_storage_path="/path/to/nemotron-personas-datasets")
122-
```
116+
- **[Contributing Guide](https://nvidia-nemo.github.io/DataDesigner/CONTRIBUTING.md)** – Help improve Data Designer
117+
- **[GitHub Issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues)** – Report bugs or request features
118+
- **[GitHub Discussions](https://github.com/NVIDIA-NeMo/DataDesigner/discussions)** – Ask questions and share ideas
119+
120+
---
121+
122+
## License
123+
124+
Apache License 2.0 – see [LICENSE](LICENSE) for details.
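
Review note on step 2 of the new Quick Start: the README asks the reader to export `NVIDIA_API_KEY` or `OPENAI_API_KEY` before running the example. A small pre-flight check like the one below can confirm that at least one of them is set. This is only a sketch using the Python standard library. The two variable names come from the diff; the helper `find_api_keys` and the provider labels are hypothetical and not part of Data Designer.

```python
import os

# Variable names are taken from the README's Quick Start.
# The provider labels are illustrative only, not Data Designer identifiers.
API_KEY_VARS = {"NVIDIA_API_KEY": "nvidia", "OPENAI_API_KEY": "openai"}


def find_api_keys(environ=None):
    """Return the labels of providers whose API key variable is set and non-empty."""
    env = os.environ if environ is None else environ
    return [label for var, label in API_KEY_VARS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    found = find_api_keys()
    if found:
        print("API key(s) found for:", ", ".join(found))
    else:
        print("No API key set; export NVIDIA_API_KEY or OPENAI_API_KEY first.")
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to exercise without touching the real environment.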

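
Review note on step 3 of the new Quick Start: the `review` column's prompt refers to the sampled `product_category` column through a Jinja-style `{{ product_category }}` placeholder. The sketch below illustrates the idea in plain Python: sample one category value, then substitute it into the prompt. It is not Data Designer's implementation, which the diff does not show. The helpers `referenced_columns` and `render` are hypothetical, and the regular expression only handles simple names, not dotted access such as the `{{ customer.first_name }}` used in the removed example.

```python
import random
import re

# Matches simple Jinja-style placeholders such as "{{ product_category }}".
# Dotted access and filters are deliberately out of scope for this sketch.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def referenced_columns(prompt):
    """List the column names a prompt template refers to, in order of appearance."""
    return PLACEHOLDER.findall(prompt)


def render(prompt, record):
    """Fill each placeholder with the matching value from a record."""
    return PLACEHOLDER.sub(lambda match: str(record[match.group(1)]), prompt)


# Values and prompt text are copied from the README example in the diff.
values = ["Electronics", "Clothing", "Home & Kitchen", "Books"]
prompt = "Write a brief product review for a {{ product_category }} item you recently purchased."

# Stand-in for a category sampler: pick one value uniformly at random.
rng = random.Random(0)
record = {"product_category": rng.choice(values)}

print(referenced_columns(prompt))
print(render(prompt, record))
```

Extracting the referenced names first shows why the category column has to be generated before the review column: the prompt cannot be rendered until the value it depends on exists.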