Commit 51346f8

Udpate text2sql
1 parent 54fb6c8 commit 51346f8

2 files changed, 58 additions and 16 deletions

deploy_ai_search/README.md

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
The associated scripts in this portion of the repository contains pre-built scripts to deploy the skillset with Azure Document Intelligence.
44

5-
## Steps for Rag Documents Index Deployment
5+
## Steps for Rag Documents Index Deployment (For Unstructured RAG)
66

77
1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key-based authentication.
88
2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here in the skills needed to enrich the data source.
@@ -13,7 +13,7 @@ The associated scripts in this portion of the repository contains pre-built scri
1313
- `rebuild`. Whether to delete and rebuild the index.
1414
- `suffix`. Optional parameter that applies a suffix to the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.
1515

16-
## Steps for Text2SQL Index Deployment
16+
## Steps for Text2SQL Index Deployment (For Structured RAG)
1717

1818
### Schema Store Index
1919

@@ -29,7 +29,7 @@ The associated scripts in this portion of the repository contains pre-built scri
2929
### Query Cache Index
3030

3131
1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key-based authentication.
32-
2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache, it is expected that application code will write directly to it.**
32+
2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache; application code is expected to write directly to it. See the Text2SQL README for details of the different cache strategies.**
3333
3. Run `deploy.py` with the following args:
3434

3535
- `index_type text_2_sql_query_cache`. This selects the `Text2SQLQueryCacheAISearch` sub class.

text_2_sql/README.md

Lines changed: 55 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -29,6 +29,7 @@ A common way to perform Text2SQL generation _(Iteration 1)_ is to provide the co
2929

3030
- More tables / views significantly increases the number of tokens used within the prompt and the cost of inference.
3131
- More schema information can cause confusion with the LLM. In our original use case, when exceeding 5 complex tables / views, we found that the LLM could get confused between which columns belonged to which entity and as such, would generate invalid SQL queries.
32+
- Entity relationships between different tables are challenging for the LLM to understand.
3233

3334
To solve these issues, a Multi-Shot approach was developed. Below are the iterations of development on the Text2SQL query component.
3435

@@ -43,6 +44,8 @@ All approaches limit the number of tokens used and avoids filling the prompt wit
4344

4445
Using Auto-Function calling capabilities, the LLM is able to retrieve from the plugin the full schema information for the views / tables that it considers useful for answering the question. Once retrieved, the full SQL query can then be generated. The schemas for multiple views / tables can be retrieved to allow the LLM to perform joins and other complex queries.
4546

47+
To improve the scalability and accuracy of SQL query generation, the entity relationships within the database are stored within the vector store. This allows the LLM to use an **entity relationship graph** to navigate complex system joins. See `./data_dictionary` for more details.
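
As a rough illustration of how stored relationship information could support join navigation, the sketch below works on a trimmed copy of the `EntityRelationships` and `CompleteEntityRelationshipsGraph` fields from the sample data dictionary entry. The helper names `find_join_path` and `direct_join_clause` are hypothetical and are not part of the repository. In the approach described above the LLM reads these fields itself, so this only shows the kind of reasoning the stored graph enables.

```python
from typing import Optional

# Trimmed copy of the relationship fields from the sample data dictionary entry.
ENTITY = {
    "Entity": "SalesLT.SalesOrderDetail",
    "EntityRelationships": [
        {
            "ForeignEntity": "SalesLT.Product",
            "ForeignKeys": [{"Column": "ProductID", "ForeignColumn": "ProductID"}],
        },
        {
            "ForeignEntity": "SalesLT.SalesOrderHeader",
            "ForeignKeys": [{"Column": "SalesOrderID", "ForeignColumn": "SalesOrderID"}],
        },
    ],
    "CompleteEntityRelationshipsGraph": [
        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer"
        " -> SalesLT.CustomerAddress -> SalesLT.Address",
    ],
}


def find_join_path(entity: dict, target: str) -> Optional[list[str]]:
    """Return the shortest stored path from the entity to `target`, or None."""
    best = None
    for path in entity["CompleteEntityRelationshipsGraph"]:
        hops = [hop.strip() for hop in path.split("->")]
        if target in hops:
            # Keep only the hops needed to reach the target entity.
            candidate = hops[: hops.index(target) + 1]
            if best is None or len(candidate) < len(best):
                best = candidate
    return best


def direct_join_clause(entity: dict, foreign_entity: str) -> Optional[str]:
    """Build a JOIN clause for a directly related entity from its foreign keys."""
    for rel in entity["EntityRelationships"]:
        if rel["ForeignEntity"] == foreign_entity:
            conditions = " AND ".join(
                f"{entity['Entity']}.{fk['Column']} = {foreign_entity}.{fk['ForeignColumn']}"
                for fk in rel["ForeignKeys"]
            )
            return f"JOIN {foreign_entity} ON {conditions}"
    return None
```

For example, `find_join_path(ENTITY, "SalesLT.ProductCategory")` walks the first stored path, and `direct_join_clause(ENTITY, "SalesLT.Product")` produces a join on `ProductID`.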
48+
4649
For the query cache enabled approach, AI Search is used as a vector based cache, but any other cache that supports vector queries could be used, such as Redis.
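
To make the idea of a vector based cache concrete, here is a minimal in-memory sketch. It is not the AI Search implementation used in the repository: the `InMemoryQueryCache` class, the `threshold` value of 0.9, and the toy two-dimensional vectors are all assumptions for illustration. A real deployment would obtain the vectors from an embedding model and query a vector store.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryQueryCache:
    """Hypothetical cache mapping question embeddings to previously generated SQL."""

    def __init__(self, threshold: float = 0.9) -> None:
        # Minimum similarity for a cached entry to count as a hit (assumed value).
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def add(self, question_vector: Sequence[float], sql: str) -> None:
        """Store the SQL generated for a question alongside its embedding."""
        self._entries.append((list(question_vector), sql))

    def lookup(self, question_vector: Sequence[float]) -> Optional[str]:
        """Return the SQL of the most similar cached question, or None on a miss."""
        best_sql, best_score = None, self.threshold
        for vector, sql in self._entries:
            score = cosine_similarity(question_vector, vector)
            if score >= best_score:
                best_sql, best_score = sql, score
        return best_sql


cache = InMemoryQueryCache(threshold=0.9)
cache.add([1.0, 0.0], "SELECT COUNT(*) FROM SalesLT.SalesOrderDetail")
hit = cache.lookup([0.99, 0.05])   # close to the cached question vector
miss = cache.lookup([0.0, 1.0])    # orthogonal vector, below the threshold
```

On a miss, the application would fall back to the full generation flow and then write the new question and SQL pair into the cache.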
4750

4851
### Full Logical Flow for Vector Based Approach
@@ -161,24 +164,63 @@ Below is a sample entry for a view / table that we which to expose to the LLM. T
161164

162165
```json
163166
{
164-
"EntityName": "Get All Categories",
165-
"Entity": "vGetAllCategories",
166-
"Description": "This view provides a comprehensive list of all product categories and their corresponding subcategories in the SalesLT schema of the AdventureWorksLT database. It is used to understand the hierarchical structure of product categories, facilitating product organization and categorization.",
167-
"Columns": [
167+
"Entity": "SalesLT.SalesOrderDetail",
168+
"Definition": "The SalesLT.SalesOrderDetail entity contains detailed information about individual items within sales orders. This entity includes data on the sales order ID, the specific details of each order item such as quantity, product ID, unit price, and any discounts applied. It also includes calculated fields such as the line total for each order item. This entity can be used to answer questions related to the specifics of sales transactions, such as which products were purchased in each order, the quantity of each product ordered, and the total price of each order item.",
169+
"EntityName": "Sales Line Items Information",
170+
"Database": "AdventureWorksLT",
171+
"Warehouse": null,
172+
"EntityRelationships": [
168173
{
169-
"Definition": "A unique identifier for each product category. This ID is used to reference specific categories.",
170-
"Name": "ProductCategoryID",
171-
"Type": "INT"
174+
"ForeignEntity": "SalesLT.Product",
175+
"ForeignKeys": [
176+
{
177+
"Column": "ProductID",
178+
"ForeignColumn": "ProductID"
179+
}
180+
]
172181
},
173182
{
174-
"Definition": "The name of the parent product category. This represents the top-level category under which subcategories are grouped.",
175-
"Name": "ParentProductCategoryName",
176-
"Type": "NVARCHAR(50)"
183+
"ForeignEntity": "SalesLT.SalesOrderHeader",
184+
"ForeignKeys": [
185+
{
186+
"Column": "SalesOrderID",
187+
"ForeignColumn": "SalesOrderID"
188+
}
189+
]
190+
}
191+
],
192+
"CompleteEntityRelationshipsGraph": [
193+
"SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
194+
"SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductModel -> SalesLT.ProductModelProductDescription -> SalesLT.ProductDescription",
195+
"SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Address -> SalesLT.CustomerAddress -> SalesLT.Customer",
196+
"SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer -> SalesLT.CustomerAddress -> SalesLT.Address"
197+
],
198+
"Columns": [
199+
{
200+
"Name": "SalesOrderID",
201+
"DataType": "int",
202+
"Definition": "The SalesOrderID column in the SalesLT.SalesOrderDetail entity contains unique numerical identifiers for each sales order. Each value represents a specific sales order, ensuring that each order can be individually referenced and tracked. The values are in a sequential numeric format, indicating the progression and uniqueness of each sales transaction within the database.",
203+
"AllowedValues": null,
204+
"SampleValues": [
205+
71938,
206+
71784,
207+
71935,
208+
71923,
209+
71946
210+
]
177211
},
178212
{
179-
"Definition": "The name of the product category. This can refer to either a top-level category or a subcategory, depending on the context.",
180-
"Name": "ProductCategoryName",
181-
"Type": "NVARCHAR(50)"
213+
"Name": "SalesOrderDetailID",
214+
"DataType": "int",
215+
"Definition": "The SalesOrderDetailID column in the SalesLT.SalesOrderDetail entity contains unique identifier values for each sales order detail record. The values are numeric and are used to distinguish each order detail entry within the database. These identifiers are essential for maintaining data integrity and enabling efficient querying and data manipulation within the sales order system.",
216+
"AllowedValues": null,
217+
"SampleValues": [
218+
110735,
219+
113231,
220+
110686,
221+
113257,
222+
113307
223+
]
182224
}
183225
]
184226
}
