
Commit 8370e4a

feat: support native embedding generation (#106)
* Add generation type to ModelConfig
* pass tests
* added generate_text_embeddings
* tests
* remove sensitive=True old artifact no longer needed
* Slight refactor
* slight refactor
* Added embedding generator
* chunk_separator -> chunk_pattern
* update tests
* rename for consistency
* Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters
* Remove purpose from consolidated kwargs
* WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters
* Type as WithModelGeneration
* Add image generation modality
* update return type for generate_kwargs
* make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters
* remove regex based chunking from embedding generator
* Remove image generation for now
* more tests and updates
* column_type_is_llm_generated -> column_type_is_model_generated
* change set to list: fix flaky tests
* CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type
* Update docs
* fix deprecation warning originating from cli model settings
* update display of inference parameters in cli list
* save prog on inference parameter
* updates for the config builder
* update cli readme
* update cli for inference parameters
* update inference parameter names
* flip order of vars
* WithCompletion -> WithChatCompletion
* specify InferenceParamsT
* Update columns.md with EmbeddingColumnConfig info
* make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout
* DRY out some stuff in field.py
* Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency
* Add nvidia-embedding and openai-embedding to default model configs
* Fix typo in docs
* Make generate colab notebooks
* fine-tune -> adjust
1 parent 68533c7 commit 8370e4a
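
The commit message describes making generation_type a discriminator field on the inference parameters and adding max_parallel_requests and timeout. The sketch below illustrates that general pattern using only the standard library. It is not the project's implementation: the class names BaseInferenceParams and EmbeddingInferenceParams, the discriminator values "chat-completion" and "embedding", the dimensions field, all default values, and the parse_inference_params helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Literal, Optional, Union


@dataclass
class BaseInferenceParams:
    # Settings shared by every generation type (defaults are assumed).
    max_parallel_requests: int = 4
    timeout: Optional[int] = None


@dataclass
class ChatCompletionInferenceParams(BaseInferenceParams):
    # The discriminator value is an assumption, not taken from the library.
    generation_type: Literal["chat-completion"] = "chat-completion"
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_tokens: Optional[int] = None


@dataclass
class EmbeddingInferenceParams(BaseInferenceParams):
    generation_type: Literal["embedding"] = "embedding"
    dimensions: Optional[int] = None  # hypothetical embedding-only field


InferenceParams = Union[ChatCompletionInferenceParams, EmbeddingInferenceParams]

# Maps each discriminator value to the class that should be built for it.
_PARAMS_BY_TYPE = {
    "chat-completion": ChatCompletionInferenceParams,
    "embedding": EmbeddingInferenceParams,
}


def parse_inference_params(raw: dict[str, Any]) -> InferenceParams:
    """Pick the parameter class from the generation_type discriminator."""
    # Assumed fallback: configs without a discriminator are chat completion.
    generation_type = raw.get("generation_type", "chat-completion")
    try:
        params_cls = _PARAMS_BY_TYPE[generation_type]
    except KeyError:
        raise ValueError(f"unknown generation_type: {generation_type!r}") from None
    return params_cls(**raw)
```

With this shape, a plain dictionary loaded from a config file resolves to the right parameter class without inspecting which keys happen to be present, which is the usual motivation for a discriminator field.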

64 files changed, +2016 -834 lines changed

docs/colab_notebooks/1-the-basics.ipynb

Lines changed: 33 additions & 33 deletions
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "markdown",
-"id": "39d7d274",
+"id": "3599c474",
 "metadata": {},
 "source": [
 "# 🎨 Data Designer Tutorial: The Basics\n",
@@ -14,7 +14,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "60f1d002",
+"id": "ee8bed13",
 "metadata": {},
 "source": [
 "### ⚡ Colab Setup\n",
@@ -25,7 +25,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "99c42292",
+"id": "f43069d1",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -36,7 +36,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "2c959ca9",
+"id": "c136bf4f",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -53,7 +53,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "bc185897",
+"id": "48739393",
 "metadata": {},
 "source": [
 "### 📦 Import the essentials\n",
@@ -64,15 +64,15 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "dc3a2d9d",
+"id": "e459cd98",
 "metadata": {},
 "outputs": [],
 "source": [
 "from data_designer.essentials import (\n",
 " CategorySamplerParams,\n",
+" ChatCompletionInferenceParams,\n",
 " DataDesigner,\n",
 " DataDesignerConfigBuilder,\n",
-" InferenceParameters,\n",
 " LLMTextColumnConfig,\n",
 " ModelConfig,\n",
 " PersonFromFakerSamplerParams,\n",
@@ -85,7 +85,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "36c5f571",
+"id": "b705d204",
 "metadata": {},
 "source": [
 "### ⚙️ Initialize the Data Designer interface\n",
@@ -98,7 +98,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "61b23c70",
+"id": "aee62c85",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -107,7 +107,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "3c9b7cb6",
+"id": "ae65c557",
 "metadata": {},
 "source": [
 "### 🎛️ Define model configurations\n",
@@ -124,7 +124,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "b86f6217",
+"id": "1079200d",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -145,7 +145,7 @@
 " alias=MODEL_ALIAS,\n",
 " model=MODEL_ID,\n",
 " provider=MODEL_PROVIDER,\n",
-" inference_parameters=InferenceParameters(\n",
+" inference_parameters=ChatCompletionInferenceParams(\n",
 " temperature=0.5,\n",
 " top_p=1.0,\n",
 " max_tokens=1024,\n",
@@ -156,7 +156,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "1f089871",
+"id": "9f15426a",
 "metadata": {},
 "source": [
 "### 🏗️ Initialize the Data Designer Config Builder\n",
@@ -171,7 +171,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "3d666193",
+"id": "79b8212c",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -180,7 +180,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "e88c8881",
+"id": "cd1d9e09",
 "metadata": {},
 "source": [
 "## 🎲 Getting started with sampler columns\n",
@@ -197,7 +197,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "79fb85c6",
+"id": "b3f469d6",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -206,7 +206,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "5106cc10",
+"id": "e44adc6c",
 "metadata": {},
 "source": [
 "Let's start designing our product review dataset by adding product category and subcategory columns.\n"
@@ -215,7 +215,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "22b97af1",
+"id": "82b32804",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -296,7 +296,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "4857b085",
+"id": "bd65456c",
 "metadata": {},
 "source": [
 "Next, let's add samplers to generate data related to the customer and their review.\n"
@@ -305,7 +305,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "9e90b3cb",
+"id": "6d6d4eef",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -342,7 +342,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "b36a153b",
+"id": "eb7b415c",
 "metadata": {},
 "source": [
 "## 🦜 LLM-generated columns\n",
@@ -357,7 +357,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "4da88fe6",
+"id": "ed811560",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -394,7 +394,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "5f1b9ac8",
+"id": "fdc0a2c8",
 "metadata": {},
 "source": [
 "### 🔁 Iteration is key – preview the dataset!\n",
@@ -411,7 +411,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "543e2f9c",
+"id": "59987c81",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -421,7 +421,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "26136a8a",
+"id": "0823ca7f",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -432,7 +432,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "aca4360d",
+"id": "eca4f0bc",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -442,7 +442,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "35ca0470",
+"id": "edd57f85",
 "metadata": {},
 "source": [
 "### 📊 Analyze the generated data\n",
@@ -455,7 +455,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "d55b402d",
+"id": "5c681eee",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -465,7 +465,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "245b48cf",
+"id": "14bf06f2",
 "metadata": {},
 "source": [
 "### 🆙 Scale up!\n",
@@ -478,7 +478,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "fc803eb0",
+"id": "b7ffead1",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -488,7 +488,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "881c2043",
+"id": "aa966388",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -501,7 +501,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "d79860d4",
+"id": "98e1085c",
 "metadata": {},
 "outputs": [],
 "source": [
@@ -513,7 +513,7 @@
 },
 {
 "cell_type": "markdown",
-"id": "b4b45176",
+"id": "e0b9c65a",
 "metadata": {},
 "source": [
 "## ⏭️ Next Steps\n",
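
Apart from regenerated cell ids, the only substantive change in this notebook is the rename of InferenceParameters to ChatCompletionInferenceParams in the import list and in the ModelConfig call. For notebooks outside the repository that still use the old name, a small script along the following lines could apply the same rename. This is a sketch and not part of the commit: rename_in_notebook is a hypothetical helper, and it assumes the notebook has already been loaded into a dict (for example with json.load).

```python
import re

# Old and new class names, taken from the diff.
OLD_NAME = "InferenceParameters"
NEW_NAME = "ChatCompletionInferenceParams"

# Word boundaries keep names such as BaseInferenceParameters untouched.
_OLD_NAME_RE = re.compile(rf"\b{re.escape(OLD_NAME)}\b")


def rename_in_notebook(notebook: dict) -> int:
    """Rename OLD_NAME to NEW_NAME in code cells, in place.

    Returns the number of source lines that changed.
    """
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # leave markdown and raw cells alone
        source = cell.get("source", [])
        # A cell's "source" may be one string or a list of line strings.
        is_string = isinstance(source, str)
        lines = source.splitlines(keepends=True) if is_string else list(source)
        new_lines = [_OLD_NAME_RE.sub(NEW_NAME, line) for line in lines]
        changed += sum(old != new for old, new in zip(lines, new_lines))
        cell["source"] = "".join(new_lines) if is_string else new_lines
    return changed
```

The helper only substitutes the name. It does not re-sort the import list, so the new name stays where the old one was rather than moving to its alphabetical position as it does in the diff.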
