[Enhancement] Support for multiple datasets as source, sink, and in prompts (#72)
* support for multiple datasets in SyGra
* constants and warning fix
* documentation
* vstack support
* bug fix and unit test set 1
* unit test set 2
* feat: add metadata for multi-source dataset
* fix: resolve metadata for single dataset
* formatting
* working example
* doc merge into new format
* document update for yaml example
---------
Co-authored-by: Surajit Dasgupta <[email protected]>
README.md (3 additions, 1 deletion):
SyGra supports extensibility and ease of implementation; most tasks are defined as graph configuration YAML files. Each task consists of two major components: a graph configuration and Python code that defines conditions and processors.

The YAML file contains several parts:
- **Data configuration**: Configure a file, Hugging Face dataset, or ServiceNow instance as the source and sink for the task.
- **Data transformation**: Configuration to transform the data into the format used by the graph.
- **Node configuration**: Configure nodes and their properties, including preprocessors and postprocessors.
- **Edge configuration**: Connect the configured nodes, with or without conditions.
- **Output configuration**: Configuration for transforming the data before writing it to the sink.

The data configuration supports source and sink settings, each of which can be a single configuration or a list. When a list of dataset configurations is provided, the datasets can be merged column-wise or row-wise. Dataset keys or columns are accessed in the prompt with an alias prefix, and the results can be written to multiple output datasets in a single flow.
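As a minimal sketch (aliases, file paths, and the Hugging Face repo below are placeholders, following the schema of the full example later in this README), a data configuration with a list of sources and a single-entry sink list could look like this:

```yaml
data_config:
  source:
    - alias: ds1               # primary dataset read from disk (placeholder path)
      join_type: primary
      file_path: "data/records.json"
      file_type: "json"
    - alias: ds2               # secondary dataset joined column-wise
      join_type: random
      type: "hf"
      repo_id: "org/example-dataset"   # placeholder Hugging Face repo
      split: "train"
  sink:
    - alias: ds1
      file_path: "data/output.json"
      file_type: "json"
```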
A node is defined by the node module, which supports types such as LLM call, multiple LLM call, lambda node, and sampler node.
LLM-based nodes require a model configured in `models.yaml` and runtime parameters. Sampler nodes pick random samples from static YAML lists. For custom node types, you can implement new nodes in the platform.
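For orientation, here is an abridged LLM node definition in the style of the full example at the end of this section (prompt and post-processor settings are omitted):

```yaml
graph_config:
  nodes:
    incident_generator:
      node_type: llm           # LLM-based node; "gpt-5" must be configured in models.yaml
      model:
        name: gpt-5
        temperature: 0.1
        max_tokens: 1024
      input_key: input
      output_keys:
        - description
        - short_description
```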
SyGra allows a data generation engineer to connect multiple datasets, merge them into one, and write the output to multiple datasets. This use case is especially useful when working with multiple tables in a ServiceNow instance.

Let's look at the following scenario: we have a ServiceNow instance whose incident table contains 5 records, and we want to generate many unique incident records across a variety of domains.

First, we configure two datasets. The first fetches the incident records and applies a transform (`CombineRecords`) to create a single record containing 5 few-shot examples; let's call it `ds1` (its alias name).
Second, we load domain and subdomain values from a file (CSV or JSON); let's call it `ds2`. Assume the file has 100,000 records but we pick only 1,000. We then join the incident table (1 record) with the file data as additional columns.

Here we can use a `cross` join, which multiplies the two datasets to create the final dataset.
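A rough sketch of the `data_config` for this scenario is shown below. It follows the schema of the full example later in this section; the file-source keys, the `limit` on the file dataset, and the placement of the `CombineRecords` transform are assumptions for illustration, not verified configuration.

```yaml
data_config:
  source:
    - alias: ds1               # incident records from the ServiceNow instance
      join_type: primary
      type: servicenow
      table: incident
      limit: 5
      # a CombineRecords transform is applied to ds1 to fold the 5 incidents
      # into a single record of few-shot examples (transform config omitted)
    - alias: ds2               # domain / sub-domain pairs from a file
      join_type: cross         # multiply with the primary dataset
      file_path: "data/domains.csv"   # placeholder path
      file_type: "csv"
      limit: 1000              # assumed; pick 1,000 of the 100,000 records
  sink:
    - alias: ds1               # write only the generated incident records
      file_path: "data/generated_incidents.json"   # placeholder path
      file_type: "json"
```

With the cross join, the single combined incident record is repeated for each of the 1,000 domain rows, yielding 1,000 prompt inputs.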
The resulting dataset contains columns or keys prefixed with the dataset's alias name, so the column `description` becomes `ds1->description` and `domain` becomes `ds2->domain`.
In the graph YAML file, these variables can be used with the alias prefix, for example `{ds2->domain}`.
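Purely as an illustration of the templating (the surrounding `prompt` keys below are hypothetical and not taken from this README), the alias-prefixed variables could appear in a node's prompt like this:

```yaml
# Hypothetical prompt block: only the {alias->column} placeholders follow
# the documented syntax; the key layout and column names are assumed.
prompt:
  - system: |
      You generate unique ServiceNow incident records.
  - user: |
      Few-shot examples: {ds1->description}
      Target domain: {ds2->domain}
      Create one new, unique incident for this domain.
```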
We also need to define sinks with alias names. In this case we only need one sink, with the alias `ds1`, because we are generating only incident records (`ds1`); however, multiple sink configurations can be provided to write data to different datasets.
Here is an example task with multiple datasets: `tasks/examples/multiple_dataset`
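For reference, a sink list with more than one alias might look like the following sketch (file paths are placeholders; only the single-sink form appears in the example below):

```yaml
data_config:
  sink:
    - alias: ds1               # generated incident records
      file_path: "data/new_incidents.json"
      file_type: "json"
    - alias: ds2               # e.g. the enriched domain records
      file_path: "data/domains_out.json"
      file_type: "json"
```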
Extra parameters supported when the dataset configuration is a list:

* `alias`: Gives a name to the dataset so that its keys can be accessed in the prompt with the alias prefix. The format for accessing them in a prompt is `alias_name->column`.
* `join_type`: Supports various join types such as `primary`, `cross`, `sequential`, `random`, and `column`.
  * Horizontal or column-based: One dataset must have `join_type: primary`; the other datasets can then be joined in various ways:
    * `sequential`: Records from this dataset are picked sequentially and each is merged horizontally with one record from the primary dataset. If the primary dataset is smaller, the remaining records are truncated; otherwise the record index rotates (wraps around).
    * `random`: For each record of the primary dataset, one random record from this dataset is merged horizontally with it.
    * `cross`: This dataset is cross-joined with the primary dataset: every record from this dataset is merged horizontally with every primary record. So if this dataset has 10 records and the primary has 100, the final dataset has 1,000 records.
    * `column`: One column of this dataset (`join_key`) is matched against one column of the primary dataset (`primary_key`), like an RDBMS table join on a foreign key (see the first sketch after this list).
  * Vertical stack or row-based: This type of join is possible when the datasets have matching columns. The `join_key` should be set to `vstack` for every dataset in the list (see the second sketch after this list). A dataset transformation (rename column) can be applied to make the column names match across datasets.
    During a vstack, the merged dataset keeps the common column names and the alias is not prefixed to them; use the variable names directly in the prompt, without the alias prefix.
    The sink should be a single configuration if no aliasing is done in the Python code.
* `primary_key`: The column of the primary dataset that should match the other dataset's `join_key` column when the join type is `column`.
* `join_key`: The column of the other dataset that should match the primary dataset's `primary_key` column when the join type is `column`.
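As referenced in the list above, here is a hedged sketch of a `column` join; the column names, the file-source keys, and the exact placement of `primary_key` and `join_key` inside each dataset entry are assumptions, and the real example lives at `tasks/examples/multiple_dataset`:

```yaml
data_config:
  source:
    - alias: inc
      join_type: primary
      primary_key: assignment_group   # column matched against the other dataset (assumed placement)
      type: servicenow
      table: incident
    - alias: grp
      join_type: column               # RDBMS-style join on a key column
      join_key: group_name            # values must match inc's assignment_group column
      file_path: "data/groups.csv"    # placeholder file source
      file_type: "csv"
```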
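Likewise, a hedged sketch of a vertical (vstack) merge as described above; the file paths are placeholders and the rename transformation is only noted in a comment:

```yaml
data_config:
  source:
    - alias: tickets_2023
      join_key: vstack               # every dataset in the list uses vstack
      file_path: "data/tickets_2023.json"
      file_type: "json"
    - alias: tickets_2024
      join_key: vstack
      file_path: "data/tickets_2024.json"
      file_type: "json"
      # a rename-column transformation could be applied here so the columns
      # line up with tickets_2023 (transformation config omitted)
  sink:
    # a single sink configuration, since merged columns keep their common
    # names without an alias prefix
    file_path: "data/tickets_all.json"
    file_type: "json"
```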
##### Example graph YAML for horizontal join
- Here each primary row is merged (column-wise) with one random row from the secondary dataset, generating only 10 records.
- If the secondary's `join_type` is changed to `cross`, each primary row is joined with each secondary row, generating 10 × n rows.
- If the secondary's `join_type` is changed to `sequential`, each primary row is joined with one secondary row in order, generating 10 rows.
- An example for `join_type: column` is given at `tasks/examples/multiple_dataset`.
```yaml
# This graph config explains how incidents can be created for a role.
# As it is a random horizontal join, the output record count is the same as the incident table (10).
data_config:
  id_column: sys_id
  source:
    - alias: inc
      join_type: primary

      type: servicenow
      table: incident

      filters:
        active: "true"
        priority: ["1", "2"]

      fields:
        - sys_id
        - short_description
        - description

      limit: 10

      order_by: sys_created_on
      order_desc: true

    - alias: roles
      join_type: random  # join the secondary row randomly into primary

      type: "hf"
      repo_id: "fazni/roles-based-on-skills"
      config_name: "default"
      split: "test"

  sink:
    - alias: new_inc
      #type: "disk"
      file_path: "data/new_inc.json"
      file_type: "json"

graph_config:
  nodes:
    incident_generator:
      node_type: llm
      model:
        name: gpt-5
        temperature: 0.1
        max_tokens: 1024
        structured_output:
          schema:
            fields:
              description:
                type: str
                description: "Incident detailed description"
              short_description:
                type: str
                description: "Short summary of the incident in one line"

      input_key: input
      output_keys:
        - description
        - short_description

      # the post processor below just parses the string and returns a dict with the description and short_description keys
```