|
2 | 2 | "cells": [ |
3 | 3 | { |
4 | 4 | "cell_type": "markdown", |
5 | | - "id": "a4ac4d55", |
| 5 | + "id": "39d7d274", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | 8 | "# 🎨 Data Designer Tutorial: The Basics\n", |
|
14 | 14 | }, |
15 | 15 | { |
16 | 16 | "cell_type": "markdown", |
17 | | - "id": "9e9f3c47", |
| 17 | + "id": "60f1d002", |
18 | 18 | "metadata": {}, |
19 | 19 | "source": [ |
20 | 20 | "### ⚡ Colab Setup\n", |
|
25 | 25 | { |
26 | 26 | "cell_type": "code", |
27 | 27 | "execution_count": null, |
28 | | - "id": "41b31194", |
| 28 | + "id": "99c42292", |
29 | 29 | "metadata": {}, |
30 | 30 | "outputs": [], |
31 | 31 | "source": [ |
32 | | - "!pip install -qU data-designer" |
| 32 | + "%%capture\n", |
| 33 | + "!pip install -U data-designer" |
33 | 34 | ] |
34 | 35 | }, |
35 | 36 | { |
36 | 37 | "cell_type": "code", |
37 | 38 | "execution_count": null, |
38 | | - "id": "502b3aba", |
| 39 | + "id": "2c959ca9", |
39 | 40 | "metadata": {}, |
40 | 41 | "outputs": [], |
41 | 42 | "source": [ |
|
52 | 53 | }, |
53 | 54 | { |
54 | 55 | "cell_type": "markdown", |
55 | | - "id": "8c512fbc", |
| 56 | + "id": "bc185897", |
56 | 57 | "metadata": {}, |
57 | 58 | "source": [ |
58 | 59 | "### 📦 Import the essentials\n", |
|
63 | 64 | { |
64 | 65 | "cell_type": "code", |
65 | 66 | "execution_count": null, |
66 | | - "id": "8fae521f", |
| 67 | + "id": "dc3a2d9d", |
67 | 68 | "metadata": {}, |
68 | 69 | "outputs": [], |
69 | 70 | "source": [ |
|
84 | 85 | }, |
85 | 86 | { |
86 | 87 | "cell_type": "markdown", |
87 | | - "id": "e71d0256", |
| 88 | + "id": "36c5f571", |
88 | 89 | "metadata": {}, |
89 | 90 | "source": [ |
90 | 91 | "### ⚙️ Initialize the Data Designer interface\n", |
91 | 92 | "\n", |
92 | 93 | "- `DataDesigner` is the main object responsible for managing the data generation process.\n", |
93 | 94 | "\n", |
94 | | - "- When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/concepts/models/default-model-settings/) are used.\n" |
| 95 | + "- When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used.\n" |
95 | 96 | ] |
96 | 97 | }, |
97 | 98 | { |
98 | 99 | "cell_type": "code", |
99 | 100 | "execution_count": null, |
100 | | - "id": "68fc7172", |
| 101 | + "id": "61b23c70", |
101 | 102 | "metadata": {}, |
102 | 103 | "outputs": [], |
103 | 104 | "source": [ |
|
106 | 107 | }, |
107 | 108 | { |
108 | 109 | "cell_type": "markdown", |
109 | | - "id": "9a821a27", |
| 110 | + "id": "3c9b7cb6", |
110 | 111 | "metadata": {}, |
111 | 112 | "source": [ |
112 | 113 | "### 🎛️ Define model configurations\n", |
|
115 | 116 | "\n", |
116 | 117 | "- The \"model alias\" is used to reference the model in the Data Designer config (as we will see below).\n", |
117 | 118 | "\n", |
118 | | - "- The \"model provider\" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/concepts/models/default-model-settings/) docs for more details).\n", |
| 119 | + "- The \"model provider\" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details).\n", |
119 | 120 | "\n", |
120 | 121 | "- By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider.\n" |
121 | 122 | ] |
122 | 123 | }, |
123 | 124 | { |
124 | 125 | "cell_type": "code", |
125 | 126 | "execution_count": null, |
126 | | - "id": "a9515141", |
| 127 | + "id": "b86f6217", |
127 | 128 | "metadata": {}, |
128 | 129 | "outputs": [], |
129 | 130 | "source": [ |
|
155 | 156 | }, |
156 | 157 | { |
157 | 158 | "cell_type": "markdown", |
158 | | - "id": "3b940ab9", |
| 159 | + "id": "1f089871", |
159 | 160 | "metadata": {}, |
160 | 161 | "source": [ |
161 | 162 | "### 🏗️ Initialize the Data Designer Config Builder\n", |
|
170 | 171 | { |
171 | 172 | "cell_type": "code", |
172 | 173 | "execution_count": null, |
173 | | - "id": "ec21da7e", |
| 174 | + "id": "3d666193", |
174 | 175 | "metadata": {}, |
175 | 176 | "outputs": [], |
176 | 177 | "source": [ |
|
179 | 180 | }, |
180 | 181 | { |
181 | 182 | "cell_type": "markdown", |
182 | | - "id": "85b2324e", |
| 183 | + "id": "e88c8881", |
183 | 184 | "metadata": {}, |
184 | 185 | "source": [ |
185 | 186 | "## 🎲 Getting started with sampler columns\n", |
|
196 | 197 | { |
197 | 198 | "cell_type": "code", |
198 | 199 | "execution_count": null, |
199 | | - "id": "f49f435e", |
| 200 | + "id": "79fb85c6", |
200 | 201 | "metadata": {}, |
201 | 202 | "outputs": [], |
202 | 203 | "source": [ |
|
205 | 206 | }, |
206 | 207 | { |
207 | 208 | "cell_type": "markdown", |
208 | | - "id": "f582b642", |
| 209 | + "id": "5106cc10", |
209 | 210 | "metadata": {}, |
210 | 211 | "source": [ |
211 | 212 | "Let's start designing our product review dataset by adding product category and subcategory columns.\n" |
|
214 | 215 | { |
215 | 216 | "cell_type": "code", |
216 | 217 | "execution_count": null, |
217 | | - "id": "8cfc43b1", |
| 218 | + "id": "22b97af1", |
218 | 219 | "metadata": {}, |
219 | 220 | "outputs": [], |
220 | 221 | "source": [ |
|
295 | 296 | }, |
296 | 297 | { |
297 | 298 | "cell_type": "markdown", |
298 | | - "id": "2d0eea21", |
| 299 | + "id": "4857b085", |
299 | 300 | "metadata": {}, |
300 | 301 | "source": [ |
301 | 302 | "Next, let's add samplers to generate data related to the customer and their review.\n" |
|
304 | 305 | { |
305 | 306 | "cell_type": "code", |
306 | 307 | "execution_count": null, |
307 | | - "id": "b5e65724", |
| 308 | + "id": "9e90b3cb", |
308 | 309 | "metadata": {}, |
309 | 310 | "outputs": [], |
310 | 311 | "source": [ |
|
341 | 342 | }, |
342 | 343 | { |
343 | 344 | "cell_type": "markdown", |
344 | | - "id": "e6788771", |
| 345 | + "id": "b36a153b", |
345 | 346 | "metadata": {}, |
346 | 347 | "source": [ |
347 | 348 | "## 🦜 LLM-generated columns\n", |
|
356 | 357 | { |
357 | 358 | "cell_type": "code", |
358 | 359 | "execution_count": null, |
359 | | - "id": "a2705cd9", |
| 360 | + "id": "4da88fe6", |
360 | 361 | "metadata": {}, |
361 | 362 | "outputs": [], |
362 | 363 | "source": [ |
|
393 | 394 | }, |
394 | 395 | { |
395 | 396 | "cell_type": "markdown", |
396 | | - "id": "e3dd2f69", |
| 397 | + "id": "5f1b9ac8", |
397 | 398 | "metadata": {}, |
398 | 399 | "source": [ |
399 | 400 | "### 🔁 Iteration is key – preview the dataset!\n", |
|
410 | 411 | { |
411 | 412 | "cell_type": "code", |
412 | 413 | "execution_count": null, |
413 | | - "id": "c6e43147", |
| 414 | + "id": "543e2f9c", |
414 | 415 | "metadata": {}, |
415 | 416 | "outputs": [], |
416 | 417 | "source": [ |
|
420 | 421 | { |
421 | 422 | "cell_type": "code", |
422 | 423 | "execution_count": null, |
423 | | - "id": "fab77d01", |
| 424 | + "id": "26136a8a", |
424 | 425 | "metadata": {}, |
425 | 426 | "outputs": [], |
426 | 427 | "source": [ |
|
431 | 432 | { |
432 | 433 | "cell_type": "code", |
433 | 434 | "execution_count": null, |
434 | | - "id": "875ee6a6", |
| 435 | + "id": "aca4360d", |
435 | 436 | "metadata": {}, |
436 | 437 | "outputs": [], |
437 | 438 | "source": [ |
|
441 | 442 | }, |
442 | 443 | { |
443 | 444 | "cell_type": "markdown", |
444 | | - "id": "87b59e4b", |
| 445 | + "id": "35ca0470", |
445 | 446 | "metadata": {}, |
446 | 447 | "source": [ |
447 | 448 | "### 📊 Analyze the generated data\n", |
|
454 | 455 | { |
455 | 456 | "cell_type": "code", |
456 | 457 | "execution_count": null, |
457 | | - "id": "5d347f4c", |
| 458 | + "id": "d55b402d", |
458 | 459 | "metadata": {}, |
459 | 460 | "outputs": [], |
460 | 461 | "source": [ |
|
464 | 465 | }, |
465 | 466 | { |
466 | 467 | "cell_type": "markdown", |
467 | | - "id": "d2fb84f2", |
| 468 | + "id": "245b48cf", |
468 | 469 | "metadata": {}, |
469 | 470 | "source": [ |
470 | 471 | "### 🆙 Scale up!\n", |
|
477 | 478 | { |
478 | 479 | "cell_type": "code", |
479 | 480 | "execution_count": null, |
480 | | - "id": "71a31e85", |
| 481 | + "id": "fc803eb0", |
481 | 482 | "metadata": {}, |
482 | 483 | "outputs": [], |
483 | 484 | "source": [ |
|
487 | 488 | { |
488 | 489 | "cell_type": "code", |
489 | 490 | "execution_count": null, |
490 | | - "id": "501e9092", |
| 491 | + "id": "881c2043", |
491 | 492 | "metadata": {}, |
492 | 493 | "outputs": [], |
493 | 494 | "source": [ |
|
500 | 501 | { |
501 | 502 | "cell_type": "code", |
502 | 503 | "execution_count": null, |
503 | | - "id": "6f217b4a", |
| 504 | + "id": "d79860d4", |
504 | 505 | "metadata": {}, |
505 | 506 | "outputs": [], |
506 | 507 | "source": [ |
|
512 | 513 | }, |
513 | 514 | { |
514 | 515 | "cell_type": "markdown", |
515 | | - "id": "4da82b0f", |
| 516 | + "id": "b4b45176", |
516 | 517 | "metadata": {}, |
517 | 518 | "source": [ |
518 | 519 | "## ⏭️ Next Steps\n", |
519 | 520 | "\n", |
520 | 521 | "Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:\n", |
521 | 522 | "\n", |
522 | | - "- [Structured outputs and jinja expressions](/notebooks/2-structured-outputs-and-jinja-expressions/)\n", |
| 523 | + "- [Structured outputs and jinja expressions](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)\n", |
523 | 524 | "\n", |
524 | | - "- [Seeding synthetic data generation with an external dataset](/notebooks/3-seeding-with-a-dataset/)\n" |
| 525 | + "- [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)\n", |
| 526 | + "\n", |
| 527 | + "- [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/)\n" |
525 | 528 | ] |
526 | 529 | } |
527 | 530 | ], |
|