Update text2sql

BenConstable9 · BenConstable9 · commit 51346f81761e · 2024-11-15T12:07:07.000Z
diff --git a/deploy_ai_search/README.md b/deploy_ai_search/README.md
@@ -2,7 +2,7 @@
 
 The associated scripts in this portion of the repository contains pre-built scripts to deploy the skillset with Azure Document Intelligence.
 
-## Steps for Rag Documents Index Deployment
+## Steps for Rag Documents Index Deployment (For Unstructured RAG)
 
 1. Update `.env` file with the associated values. Not all values are required dependent on whether you are using System / User Assigned Identities or a Key based authentication.
 2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here in the skills needed to enrich the data source.
@@ -13,7 +13,7 @@ The associated scripts in this portion of the repository contains pre-built scri
     - `rebuild`. Whether to delete and rebuild the index.
     - `suffix`. Optional parameter that will apply a suffix onto the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.
 
-## Steps for Text2SQL Index Deployment
+## Steps for Text2SQL Index Deployment (For Structured RAG)
 
 ### Schema Store Index
 
@@ -29,7 +29,7 @@ The associated scripts in this portion of the repository contains pre-built scri
 ### Query Cache Index
 
 1. Update `.env` file with the associated values. Not all values are required dependent on whether you are using System / User Assigned Identities or a Key based authentication.
-2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache, it is expected that application code will write directly to it.**
+2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache; application code is expected to write directly to it. See the Text2SQL README for details on the different cache strategies.**
 3. Run `deploy.py` with the following args:
 
     - `index_type text_2_sql_query_cache`. This selects the `Text2SQLQueryCacheAISearch` sub class.
diff --git a/text_2_sql/README.md b/text_2_sql/README.md
@@ -29,6 +29,7 @@ A common way to perform Text2SQL generation _(Iteration 1)_ is to provide the co
 
 - More tables / views significantly increases the number of tokens used within the prompt and the cost of inference.
 - More schema information can cause confusion with the LLM. In our original use case, when exceeding 5 complex tables / views, we found that the LLM could get confused between which columns belonged to which entity and as such, would generate invalid SQL queries.
+- Entity relationships between different tables are challenging for the LLM to understand.
 
 To solve these issues, a Multi-Shot approach is developed. Below are the iterations of development on the Text2SQL query component.
 
@@ -43,6 +44,8 @@ All approaches limit the number of tokens used and avoids filling the prompt wit
 
 Using Auto-Function calling capabilities, the LLM is able to retrieve from the plugin the full schema information for the views / tables that it considers useful for answering the question. Once retrieved, the full SQL query can then be generated. The schemas for multiple views / tables can be retrieved to allow the LLM to perform joins and other complex queries.
 
+To improve the scalability and accuracy of SQL query generation, the entity relationships within the database are stored in the vector store. This allows the LLM to use an **entity relationship graph** to navigate complex joins. See `./data_dictionary` for more details.
+
 For the query cache enabled approach, AI Search is used as a vector based cache, but any other cache that supports vector queries could be used, such as Redis.
 
 ### Full Logical Flow for Vector Based Approach
@@ -161,24 +164,63 @@ Below is a sample entry for a view / table that we which to expose to the LLM. T
 
 ```json
 {
-    "EntityName": "Get All Categories",
-    "Entity": "vGetAllCategories",
-    "Description": "This view provides a comprehensive list of all product categories and their corresponding subcategories in the SalesLT schema of the AdventureWorksLT database. It is used to understand the hierarchical structure of product categories, facilitating product organization and categorization.",
-    "Columns": [
+    "Entity": "SalesLT.SalesOrderDetail",
+    "Definition": "The SalesLT.SalesOrderDetail entity contains detailed information about individual items within sales orders. This entity includes data on the sales order ID, the specific details of each order item such as quantity, product ID, unit price, and any discounts applied. It also includes calculated fields such as the line total for each order item. This entity can be used to answer questions related to the specifics of sales transactions, such as which products were purchased in each order, the quantity of each product ordered, and the total price of each order item.",
+    "EntityName": "Sales Line Items Information",
+    "Database": "AdventureWorksLT",
+    "Warehouse": null,
+    "EntityRelationships": [
         {
-            "Definition": "A unique identifier for each product category. This ID is used to reference specific categories.",
-            "Name": "ProductCategoryID",
-            "Type": "INT"
+            "ForeignEntity": "SalesLT.Product",
+            "ForeignKeys": [
+                {
+                    "Column": "ProductID",
+                    "ForeignColumn": "ProductID"
+                }
+            ]
         },
         {
-            "Definition": "The name of the parent product category. This represents the top-level category under which subcategories are grouped.",
-            "Name": "ParentProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "ForeignEntity": "SalesLT.SalesOrderHeader",
+            "ForeignKeys": [
+                {
+                    "Column": "SalesOrderID",
+                    "ForeignColumn": "SalesOrderID"
+                }
+            ]
+        }
+    ],
+    "CompleteEntityRelationshipsGraph": [
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductModel -> SalesLT.ProductModelProductDescription -> SalesLT.ProductDescription",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Address -> SalesLT.CustomerAddress -> SalesLT.Customer",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer -> SalesLT.CustomerAddress -> SalesLT.Address"
+    ],
+    "Columns": [
+        {
+            "Name": "SalesOrderID",
+            "DataType": "int",
+            "Definition": "The SalesOrderID column in the SalesLT.SalesOrderDetail entity contains unique numerical identifiers for each sales order. Each value represents a specific sales order, ensuring that each order can be individually referenced and tracked. The values are in a sequential numeric format, indicating the progression and uniqueness of each sales transaction within the database.",
+            "AllowedValues": null,
+            "SampleValues": [
+                71938,
+                71784,
+                71935,
+                71923,
+                71946
+            ]
         },
         {
-            "Definition": "The name of the product category. This can refer to either a top-level category or a subcategory, depending on the context.",
-            "Name": "ProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "Name": "SalesOrderDetailID",
+            "DataType": "int",
+            "Definition": "The SalesOrderDetailID column in the SalesLT.SalesOrderDetail entity contains unique identifier values for each sales order detail record. The values are numeric and are used to distinguish each order detail entry within the database. These identifiers are essential for maintaining data integrity and enabling efficient querying and data manipulation within the sales order system.",
+            "AllowedValues": null,
+            "SampleValues": [
+                110735,
+                113231,
+                110686,
+                113257,
+                113307
+            ]
         }
     ]
 }