Commit e93e451

Fix Entity Relationship Graph & Add Extra Data Dictionary CLI Command (#102)
1 parent 210ee09 commit e93e451

16 files changed (+957, -601 lines changed)

text_2_sql/data_dictionary/README.md

Lines changed: 150 additions & 28 deletions
@@ -1,72 +1,161 @@
11
# Data Dictionary
22

3-
## entities.json
3+
## Schema Store JSON
44

55
To give the LLM knowledge of the database, a data dictionary containing the metadata for all the SQL views / tables is used. Whilst the LLM could query the database at runtime to discover the schemas, storing them in a text file reduces the overall latency of the system and allows the metadata for each table to be adjusted as a form of prompt engineering.
66

77
Below is a sample entry for a view / table that we wish to expose to the LLM. The Microsoft SQL Server [Adventure Works Database](https://learn.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16) is used as a sample.
88

99
```json
1010
{
11-
"Entity": "SalesLT.SalesOrderDetail",
12-
"Definition": "The SalesLT.SalesOrderDetail entity contains detailed information about individual items within sales orders. This entity includes data on the sales order ID, the specific details of each order item such as quantity, product ID, unit price, and any discounts applied. It also includes calculated fields such as the line total for each order item. This entity can be used to answer questions related to the specifics of sales transactions, such as which products were purchased in each order, the quantity of each product ordered, and the total price of each order item.",
13-
"EntityName": "Sales Line Items Information",
14-
"Database": "AdventureWorksLT",
15-
"Warehouse": null,
11+
"Entity": "SalesOrderDetail",
12+
"Definition": null,
13+
"Schema": "SalesLT",
14+
"EntityName": null,
15+
"Database": "text2sql-adventure-works",
1616
"EntityRelationships": [
1717
{
18-
"ForeignEntity": "SalesLT.Product",
18+
"ForeignEntity": "Product",
19+
"ForeignSchema": "SalesLT",
1920
"ForeignKeys": [
2021
{
2122
"Column": "ProductID",
2223
"ForeignColumn": "ProductID"
2324
}
24-
]
25+
],
26+
"ForeignDatabase": "text2sql-adventure-works",
27+
"FQN": "text2sql-adventure-works.SalesLT.SalesOrderDetail",
28+
"ForeignFQN": "text2sql-adventure-works.SalesLT.Product"
2529
},
2630
{
27-
"ForeignEntity": "SalesLT.SalesOrderHeader",
31+
"ForeignEntity": "SalesOrderHeader",
32+
"ForeignSchema": "SalesLT",
2833
"ForeignKeys": [
2934
{
3035
"Column": "SalesOrderID",
3136
"ForeignColumn": "SalesOrderID"
3237
}
33-
]
38+
],
39+
"ForeignDatabase": "text2sql-adventure-works",
40+
"FQN": "text2sql-adventure-works.SalesLT.SalesOrderDetail",
41+
"ForeignFQN": "text2sql-adventure-works.SalesLT.SalesOrderHeader"
3442
}
3543
],
3644
"CompleteEntityRelationshipsGraph": [
37-
"SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
38-
"SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductModel -> SalesLT.ProductModelProductDescription -> SalesLT.ProductDescription",
39-
"SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Address -> SalesLT.CustomerAddress -> SalesLT.Customer",
40-
"SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer -> SalesLT.CustomerAddress -> SalesLT.Address"
45+
"text2sql-adventure-works.SalesLT.SalesOrderDetail -> text2sql-adventure-works.SalesLT.Product -> text2sql-adventure-works.SalesLT.ProductCategory -> Product",
46+
"text2sql-adventure-works.SalesLT.SalesOrderDetail -> text2sql-adventure-works.SalesLT.Product -> text2sql-adventure-works.SalesLT.ProductModel -> Product",
47+
"text2sql-adventure-works.SalesLT.SalesOrderDetail -> text2sql-adventure-works.SalesLT.Product -> text2sql-adventure-works.SalesLT.ProductModel -> text2sql-adventure-works.SalesLT.ProductModelProductDescription -> text2sql-adventure-works.SalesLT.ProductDescription -> ProductModelProductDescription",
48+
"text2sql-adventure-works.SalesLT.SalesOrderDetail -> text2sql-adventure-works.SalesLT.SalesOrderHeader -> SalesOrderDetail",
49+
"text2sql-adventure-works.SalesLT.SalesOrderDetail -> text2sql-adventure-works.SalesLT.SalesOrderHeader -> text2sql-adventure-works.SalesLT.Address -> CustomerAddress",
50+
"text2sql-adventure-works.SalesLT.SalesOrderDetail -> text2sql-adventure-works.SalesLT.SalesOrderHeader -> text2sql-adventure-works.SalesLT.Customer -> CustomerAddress"
4151
],
4252
"Columns": [
4353
{
4454
"Name": "SalesOrderID",
4555
"DataType": "int",
46-
"Definition": "The SalesOrderID column in the SalesLT.SalesOrderDetail entity contains unique numerical identifiers for each sales order. Each value represents a specific sales order, ensuring that each order can be individually referenced and tracked. The values are in a sequential numeric format, indicating the progression and uniqueness of each sales transaction within the database.",
47-
"AllowedValues": null,
56+
"Definition": null,
4857
"SampleValues": [
49-
71938,
50-
71784,
51-
71935,
52-
71923,
58+
71898,
59+
71831,
60+
71899,
61+
71796,
5362
71946
5463
]
5564
},
5665
{
5766
"Name": "SalesOrderDetailID",
5867
"DataType": "int",
59-
"Definition": "The SalesOrderDetailID column in the SalesLT.SalesOrderDetail entity contains unique identifier values for each sales order detail record. The values are numeric and are used to distinguish each order detail entry within the database. These identifiers are essential for maintaining data integrity and enabling efficient querying and data manipulation within the sales order system.",
60-
"AllowedValues": null,
68+
"Definition": null,
69+
"SampleValues": [
70+
110691,
71+
113288,
72+
112940,
73+
112979,
74+
111078
75+
]
76+
},
77+
{
78+
"Name": "OrderQty",
79+
"DataType": "smallint",
80+
"Definition": null,
81+
"SampleValues": [
82+
15,
83+
23,
84+
16,
85+
7,
86+
5
87+
]
88+
},
89+
{
90+
"Name": "ProductID",
91+
"DataType": "int",
92+
"Definition": null,
93+
"SampleValues": [
94+
889,
95+
780,
96+
793,
97+
795,
98+
974
99+
]
100+
},
101+
{
102+
"Name": "UnitPrice",
103+
"DataType": "money",
104+
"Definition": null,
61105
"SampleValues": [
62-
110735,
63-
113231,
64-
110686,
65-
113257,
66-
113307
106+
"602.3460",
107+
"32.9940",
108+
"323.9940",
109+
"149.8740",
110+
"20.2942"
111+
]
112+
},
113+
{
114+
"Name": "UnitPriceDiscount",
115+
"DataType": "money",
116+
"Definition": null,
117+
"SampleValues": [
118+
"0.4000",
119+
"0.1000",
120+
"0.0500",
121+
"0.0200",
122+
"0.0000"
123+
]
124+
},
125+
{
126+
"Name": "LineTotal",
127+
"DataType": "numeric",
128+
"Definition": null,
129+
"SampleValues": [
130+
"66.428908",
131+
"2041.188000",
132+
"64.788000",
133+
"1427.592000",
134+
"5102.970000"
135+
]
136+
},
137+
{
138+
"Name": "rowguid",
139+
"DataType": "uniqueidentifier",
140+
"Definition": null,
141+
"SampleValues": [
142+
"09E7A695-3260-483E-91F8-A980441B9DD6",
143+
"C9FCF326-D1B9-44A4-B29D-2D1888F6B0FD",
144+
"5CA4F84A-BAFE-485C-B7AD-897F741F76CE",
145+
"E11CF974-4DCC-4A5C-98C3-2DE92DD2A15D",
146+
"E7C11996-8D83-4515-BFBD-7E380CDB6252"
147+
]
148+
},
149+
{
150+
"Name": "ModifiedDate",
151+
"DataType": "datetime",
152+
"Definition": null,
153+
"SampleValues": [
154+
"2008-06-01 00:00:00"
67155
]
68156
}
69-
]
157+
],
158+
"FQN": "text2sql-adventure-works.SalesLT.SalesOrderDetail"
70159
}
71160
```
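
As a rough illustration of how the relationship metadata in the entry above might be consumed, the Python sketch below renders one `EntityRelationships` item as a SQL `JOIN` clause. The helper name `build_join_clause` and the bracket quoting are assumptions made for this example; they are not part of this repository, and the real Text2SQL component may use the metadata differently.

```python
import json

# A trimmed copy of the sample entry above; only the keys used below are kept.
SAMPLE_ENTRY = json.loads("""
{
  "Entity": "SalesOrderDetail",
  "Schema": "SalesLT",
  "EntityRelationships": [
    {
      "ForeignEntity": "Product",
      "ForeignSchema": "SalesLT",
      "ForeignKeys": [{"Column": "ProductID", "ForeignColumn": "ProductID"}]
    }
  ]
}
""")


def build_join_clause(entry: dict, relationship: dict) -> str:
    """Hypothetical helper: render one relationship as a T-SQL style JOIN."""
    local = f"[{entry['Schema']}].[{entry['Entity']}]"
    foreign = f"[{relationship['ForeignSchema']}].[{relationship['ForeignEntity']}]"
    # One equality per foreign key pair, combined with AND for composite keys.
    conditions = " AND ".join(
        f"{local}.[{key['Column']}] = {foreign}.[{key['ForeignColumn']}]"
        for key in relationship["ForeignKeys"]
    )
    return f"JOIN {foreign} ON {conditions}"


if __name__ == "__main__":
    for rel in SAMPLE_ENTRY["EntityRelationships"]:
        print(build_join_clause(SAMPLE_ENTRY, rel))
```

Keeping `Entity` and `Schema` as separate keys, as the new format does, means the qualified name can be assembled without splitting a dotted string.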
72161

@@ -85,6 +174,32 @@ Below is a sample entry for a view / table that we which to expose to the LLM. T
85174

86175
A full data dictionary must be built for all the views / tables you wish to expose to the LLM. The metadata provided directly influences the accuracy of the Text2SQL component.
87176

177+
## Column Value Store JSONL
178+
179+
To aid LLM understanding, the dimension tables within a star schema are indexed if they contain string-based values. This allows the LLM to use search to understand the context of the question asked. For example, if a user asks 'What are the total sales on VE-C304-S', we can use search to determine that 'VE-C304-S' is in fact a Product Number and which entity it belongs to.
180+
181+
This avoids having to index the fact tables, saving storage, and allows us to still use the SQL queries to slice and dice the data accordingly.
182+
183+
```json
184+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "WB-H098", "Synonyms": []}
185+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "VE-C304-S", "Synonyms": []}
186+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "VE-C304-M", "Synonyms": []}
187+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "VE-C304-L", "Synonyms": []}
188+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TT-T092", "Synonyms": []}
189+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TT-R982", "Synonyms": []}
190+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TT-M928", "Synonyms": []}
191+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TI-T723", "Synonyms": []}
192+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TI-R982", "Synonyms": []}
193+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TI-R628", "Synonyms": []}
194+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TI-R092", "Synonyms": []}
195+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TI-M823", "Synonyms": []}
196+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TI-M602", "Synonyms": []}
197+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TI-M267", "Synonyms": []}
198+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TG-W091-S", "Synonyms": []}
199+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TG-W091-M", "Synonyms": []}
200+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "TG-W091-L", "Synonyms": []}
201+
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "ST-1401", "Synonyms": []}
202+
```
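
To make the lookup idea concrete, here is a minimal Python sketch that parses JSONL records like those above and finds the ones whose `Value` (or a synonym) appears in a user question. It uses plain case-insensitive substring matching as a stand-in for the search index described in this section, and the function names `load_column_values` and `find_matches` are hypothetical.

```python
import json

# Two records copied from the sample above, standing in for the JSONL file.
SAMPLE_JSONL = """\
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "WB-H098", "Synonyms": []}
{"Entity": "Product", "Schema": "SalesLT", "Database": "text2sql-adventure-works", "FQN": "text2sql-adventure-works.SalesLT.Product.ProductNumber", "Column": "ProductNumber", "Value": "VE-C304-S", "Synonyms": []}
"""


def load_column_values(jsonl_text: str) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def find_matches(question: str, records: list[dict]) -> list[dict]:
    """Return records whose Value or any synonym occurs in the question.

    Assumption: naive substring matching replaces a real search service here.
    """
    lowered = question.lower()
    matches = []
    for record in records:
        candidates = [record["Value"], *record["Synonyms"]]
        if any(candidate.lower() in lowered for candidate in candidates):
            matches.append(record)
    return matches


if __name__ == "__main__":
    records = load_column_values(SAMPLE_JSONL)
    for hit in find_matches("What are the total sales on VE-C304-S", records):
        print(hit["Value"], "->", hit["FQN"])
```

Each hit carries the `FQN` of the column, which identifies both the entity and the column that the value belongs to.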
88203

89204
## Indexing
90205

@@ -122,6 +237,13 @@ You can pass the following command line arguements:
122237

123238
- `-- output_directory` or `-o`: Optional directory that the script will write the output files to.
124239
- `-- single_file` or `-s`: Optional flag that writes all schemas to a single file.
240+
- `-- generate_definitions` or `-gen`: Optional flag that uses OpenAI to generate descriptions.
241+
242+
If you need control over the following, run the file directly:
243+
244+
- `entities`: A list of entities to extract. Defaults to None.
245+
- `excluded_entities`: A list of entities to exclude.
246+
- `excluded_schemas`: A list of schemas to exclude.
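
The sketch below shows one plausible way the three options above could combine when deciding whether to extract an entity. It is an assumption for illustration only: the function `should_extract` is hypothetical, and the actual script may apply these filters differently (for example, in how it treats schema-qualified names).

```python
from typing import Optional


def should_extract(
    entity: str,
    schema: str,
    entities: Optional[list[str]] = None,
    excluded_entities: Optional[list[str]] = None,
    excluded_schemas: Optional[list[str]] = None,
) -> bool:
    """Hypothetical filter: decide whether one entity should be extracted."""
    # Assumed semantics: exclusions take priority over the allow-list.
    if excluded_schemas and schema in excluded_schemas:
        return False
    if excluded_entities and entity in excluded_entities:
        return False
    # Assumed semantics: `entities=None` means "no allow-list, extract everything".
    if entities is not None and entity not in entities:
        return False
    return True


if __name__ == "__main__":
    print(should_extract("Product", "SalesLT"))
    print(should_extract("Product", "SalesLT", excluded_schemas=["SalesLT"]))
    print(should_extract("Customer", "SalesLT", entities=["Product"]))
```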
125247

126248
> [!IMPORTANT]
127249
>
