
Commit 2440e73

Update the Documentation for Caching (#51)
1 parent 785594c commit 2440e73

19 files changed: +2183 / -2096 lines

deploy_ai_search/.env

Lines changed: 2 additions & 2 deletions
@@ -11,9 +11,9 @@ AIService__AzureSearchOptions__Key=<searchServiceKey if not using identity>
 AIService__AzureSearchOptions__UsePrivateEndpoint=<true/false>
 AIService__AzureSearchOptions__Identity__FQName=<fully qualified name of the identity if using user assigned identity>
 StorageAccount__FQEndpoint=<Fully qualified endpoint in form ResourceId=resourceId if using identity based connections>
-StorageAccount__ConnectionString=<connectionString if using non managed identity>
+StorageAccount__ConnectionString=<connectionString if using non managed identity. In format: DefaultEndpointsProtocol=https;AccountName=<STG NAME>;AccountKey=<ACCOUNT KEY>;EndpointSuffix=core.windows.net>
 StorageAccount__RagDocuments__Container=<containerName>
-StorageAccount__Text2Sql__Container=<containerName>
+StorageAccount__Text2SqlSchemaStore__Container=<containerName>
 OpenAI__ApiKey=<openAIKey if using non managed identity>
 OpenAI__Endpoint=<openAIEndpoint>
 OpenAI__EmbeddingModel=<openAIEmbeddingModelName>

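The renamed `StorageAccount__Text2SqlSchemaStore__Container` key and the documented connection string format are consumed by application code at runtime. As a minimal sketch (not part of this commit, assuming key based rather than managed identity authentication and the `azure-storage-blob` package), the settings might be read like this:

```python
# Illustrative only: reads the .env keys documented above and builds a blob client
# for the schema store container. Assumes key based auth (no managed identity).
import os

from azure.storage.blob import BlobServiceClient

# Connection string in the documented format:
# DefaultEndpointsProtocol=https;AccountName=<STG NAME>;AccountKey=<ACCOUNT KEY>;EndpointSuffix=core.windows.net
connection_string = os.environ["StorageAccount__ConnectionString"]
schema_store_container = os.environ["StorageAccount__Text2SqlSchemaStore__Container"]

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(schema_store_container)

# List the data dictionary files that the schema store indexer will pick up.
for blob in container_client.list_blobs():
    print(blob.name)
```
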
deploy_ai_search/README.md

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@

 This portion of the repository contains pre-built scripts to deploy the skillset with Azure Document Intelligence.

-## Steps for Rag Documents Index Deployment
+## Steps for Rag Documents Index Deployment (For Unstructured RAG)

 1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key based authentication.
 2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here to the skills needed to enrich the data source.
@@ -13,7 +13,7 @@
 - `rebuild`. Whether to delete and rebuild the index.
 - `suffix`. Optional parameter that applies a suffix to the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.

-## Steps for Text2SQL Index Deployment
+## Steps for Text2SQL Index Deployment (For Structured RAG)

 ### Schema Store Index

@@ -29,7 +29,7 @@
 ### Query Cache Index

 1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key based authentication.
-2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache; application code is expected to write directly to it.**
+2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache; application code is expected to write directly to it. See the Text2SQL README for the different cache strategies.**
 3. Run `deploy.py` with the following args:

 - `index_type text_2_sql_query_cache`. This selects the `Text2SQLQueryCacheAISearch` subclass.

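Because the query cache index ships without an indexer or skillset, the application has to populate it itself. The following is a hedged sketch of what that write could look like with the `azure-search-documents` SDK; the index name and the `Id` / `Question` / `SqlQuery` / `QuestionEmbedding` field names are assumptions, so substitute the schema that `text_2_sql_query_cache.py` actually deploys.

```python
# Hedged sketch: upsert one question/SQL pair into the query cache index.
# Index name and field names below are assumed, not taken from text_2_sql_query_cache.py.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<searchService>.search.windows.net",
    index_name="text-2-sql-query-cache-index",  # assumed index name
    credential=AzureKeyCredential("<searchServiceKey>"),
)

cache_entry = {
    "Id": "1",
    "Question": "What is the total value of each sales order?",
    "SqlQuery": "SELECT SalesOrderID, SUM(LineTotal) AS OrderTotal "
                "FROM SalesLT.SalesOrderDetail GROUP BY SalesOrderID;",
    # Embedding of the question text, produced by the configured OpenAI embedding model.
    "QuestionEmbedding": [0.0] * 1536,  # placeholder vector
}

# merge_or_upload_documents upserts, so re-asking the same question refreshes its entry.
search_client.merge_or_upload_documents(documents=[cache_entry])
```

Key based authentication is shown for brevity; a credential from `azure-identity` (for example `DefaultAzureCredential`) works equally well when the identity has the required search role.
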
deploy_ai_search/text_2_sql_schema_store.py

Lines changed: 0 additions & 5 deletions
@@ -111,11 +111,6 @@ def get_index_fields(self) -> list[SearchableField]:
                 collection=True,
                 searchable=False,
             ),
-            SearchableField(
-                name="JoinableEntities",
-                type=SearchFieldDataType.String,
-                collection=True,
-            ),
         ],
     ),
     SearchableField(

text_2_sql/README.md

Lines changed: 64 additions & 13 deletions
@@ -29,6 +29,7 @@

 - More tables / views significantly increase the number of tokens used within the prompt and the cost of inference.
 - More schema information can cause confusion with the LLM. In our original use case, when exceeding 5 complex tables / views, we found that the LLM could get confused about which columns belonged to which entity and, as such, would generate invalid SQL queries.
+- Entity relationships between different tables are challenging for the LLM to understand.

 To solve these issues, a Multi-Shot approach was developed. Below are the iterations of development on the Text2SQL query component.

@@ -43,6 +44,8 @@

 Using Auto-Function calling capabilities, the LLM is able to retrieve from the plugin the full schema information for the views / tables that it considers useful for answering the question. Once retrieved, the full SQL query can then be generated. The schemas for multiple views / tables can be retrieved to allow the LLM to perform joins and other complex queries.

+To improve the scalability and accuracy of SQL query generation, the entity relationships within the database are stored within the vector store. This allows the LLM to use the **entity relationship graph** to navigate complex joins. See `./data_dictionary` for more details.
+
 For the query cache enabled approach, AI Search is used as a vector based cache, but any other cache that supports vector queries could be used, such as Redis.

 ### Full Logical Flow for Vector Based Approach
5558

5659
![Vector Based with Query Cache Logical Flow.](./images/Text2SQL%20Query%20Cache.png "Vector Based with Query Cache Logical Flow")
5760

61+
### Caching Strategy
62+
63+
The cache strategy implementation is a simple way to prove that the system works. You can adopt several different strategies for cache population. Below are some of the strategies that could be used:
64+
65+
- **Pre-population:** Run an offline pipeline to generate SQL queries for the known questions that you expect from the user to prevent a 'cold start' problem.
66+
- **Chat History Management Pipeline:** Run a real-time pipeline that logs the chat history to a database. Within this pipeline, analyse questions that are Text2SQL and generate the cache entry.
67+
- **Positive Indication System:** Only update the cache when a user positively reacts to a question e.g. a thumbs up from the UI or doesn't ask a follow up question.
68+
- **Always update:** Always add all questions into the cache when they are asked. The sample code in the repository currently implements this approach, but this could lead to poor SQL queries reaching the cache. One of the other caching strategies would be better production version.
69+
5870
### Comparison of Iterations
5971
| | Common Text2SQL Approach | Prompt Based Multi-Shot Text2SQL Approach | Vector Based Multi-Shot Text2SQL Approach | Vector Based Multi-Shot Text2SQL Approach With Query Cache |
6072
|-|-|-|-|-|
@@ -152,24 +164,63 @@

 ```json
 {
-    "EntityName": "Get All Categories",
-    "Entity": "vGetAllCategories",
-    "Description": "This view provides a comprehensive list of all product categories and their corresponding subcategories in the SalesLT schema of the AdventureWorksLT database. It is used to understand the hierarchical structure of product categories, facilitating product organization and categorization.",
-    "Columns": [
+    "Entity": "SalesLT.SalesOrderDetail",
+    "Definition": "The SalesLT.SalesOrderDetail entity contains detailed information about individual items within sales orders. This entity includes data on the sales order ID, the specific details of each order item such as quantity, product ID, unit price, and any discounts applied. It also includes calculated fields such as the line total for each order item. This entity can be used to answer questions related to the specifics of sales transactions, such as which products were purchased in each order, the quantity of each product ordered, and the total price of each order item.",
+    "EntityName": "Sales Line Items Information",
+    "Database": "AdventureWorksLT",
+    "Warehouse": null,
+    "EntityRelationships": [
         {
-            "Definition": "A unique identifier for each product category. This ID is used to reference specific categories.",
-            "Name": "ProductCategoryID",
-            "Type": "INT"
+            "ForeignEntity": "SalesLT.Product",
+            "ForeignKeys": [
+                {
+                    "Column": "ProductID",
+                    "ForeignColumn": "ProductID"
+                }
+            ]
         },
         {
-            "Definition": "The name of the parent product category. This represents the top-level category under which subcategories are grouped.",
-            "Name": "ParentProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "ForeignEntity": "SalesLT.SalesOrderHeader",
+            "ForeignKeys": [
+                {
+                    "Column": "SalesOrderID",
+                    "ForeignColumn": "SalesOrderID"
+                }
+            ]
+        }
+    ],
+    "CompleteEntityRelationshipsGraph": [
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductModel -> SalesLT.ProductModelProductDescription -> SalesLT.ProductDescription",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Address -> SalesLT.CustomerAddress -> SalesLT.Customer",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer -> SalesLT.CustomerAddress -> SalesLT.Address"
+    ],
+    "Columns": [
+        {
+            "Name": "SalesOrderID",
+            "DataType": "int",
+            "Definition": "The SalesOrderID column in the SalesLT.SalesOrderDetail entity contains unique numerical identifiers for each sales order. Each value represents a specific sales order, ensuring that each order can be individually referenced and tracked. The values are in a sequential numeric format, indicating the progression and uniqueness of each sales transaction within the database.",
+            "AllowedValues": null,
+            "SampleValues": [
+                71938,
+                71784,
+                71935,
+                71923,
+                71946
+            ]
         },
         {
-            "Definition": "The name of the product category. This can refer to either a top-level category or a subcategory, depending on the context.",
-            "Name": "ProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "Name": "SalesOrderDetailID",
+            "DataType": "int",
+            "Definition": "The SalesOrderDetailID column in the SalesLT.SalesOrderDetail entity contains unique identifier values for each sales order detail record. The values are numeric and are used to distinguish each order detail entry within the database. These identifiers are essential for maintaining data integrity and enabling efficient querying and data manipulation within the sales order system.",
+            "AllowedValues": null,
+            "SampleValues": [
+                110735,
+                113231,
+                110686,
+                113257,
+                113307
+            ]
         }
     ]
 }

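Complementing the caching strategies described in the README diff above, the lookup side of the query cache can be sketched as a vector search for a semantically similar question before any new SQL is generated. This is not code from the repository: it assumes `azure-search-documents` >= 11.4, and the index name, field names and similarity threshold are illustrative placeholders that should match whatever the query cache index actually defines.

```python
# Hedged sketch of a query cache lookup. Index/field names and the 0.90 threshold
# are assumptions, not values taken from the repository.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<searchService>.search.windows.net",
    index_name="text-2-sql-query-cache-index",  # assumed index name
    credential=AzureKeyCredential("<searchServiceKey>"),
)


def lookup_cached_query(question_embedding: list[float]) -> str | None:
    """Return a previously generated SQL query if a similar question is cached."""
    results = search_client.search(
        search_text=None,
        vector_queries=[
            VectorizedQuery(
                vector=question_embedding,
                k_nearest_neighbors=1,
                fields="QuestionEmbedding",  # assumed vector field name
            )
        ],
        select=["Question", "SqlQuery"],
        top=1,
    )
    for result in results:
        # Treat a high similarity score as a cache hit; the threshold is a tuning choice.
        if result["@search.score"] >= 0.90:
            return result["SqlQuery"]
    return None
```

On a cache miss, the flow falls back to schema store retrieval and SQL generation and, depending on the strategy chosen, may write the newly generated query back into the cache.
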
text_2_sql/data_dictionary/README.md

Lines changed: 56 additions & 15 deletions
@@ -8,24 +8,63 @@

 ```json
 {
-    "EntityName": "Get All Categories",
-    "Entity": "vGetAllCategories",
-    "Description": "This view provides a comprehensive list of all product categories and their corresponding subcategories in the SalesLT schema of the AdventureWorksLT database. It is used to understand the hierarchical structure of product categories, facilitating product organization and categorization.",
-    "Columns": [
+    "Entity": "SalesLT.SalesOrderDetail",
+    "Definition": "The SalesLT.SalesOrderDetail entity contains detailed information about individual items within sales orders. This entity includes data on the sales order ID, the specific details of each order item such as quantity, product ID, unit price, and any discounts applied. It also includes calculated fields such as the line total for each order item. This entity can be used to answer questions related to the specifics of sales transactions, such as which products were purchased in each order, the quantity of each product ordered, and the total price of each order item.",
+    "EntityName": "Sales Line Items Information",
+    "Database": "AdventureWorksLT",
+    "Warehouse": null,
+    "EntityRelationships": [
         {
-            "Definition": "A unique identifier for each product category. This ID is used to reference specific categories.",
-            "Name": "ProductCategoryID",
-            "Type": "INT"
+            "ForeignEntity": "SalesLT.Product",
+            "ForeignKeys": [
+                {
+                    "Column": "ProductID",
+                    "ForeignColumn": "ProductID"
+                }
+            ]
         },
         {
-            "Definition": "The name of the parent product category. This represents the top-level category under which subcategories are grouped.",
-            "Name": "ParentProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "ForeignEntity": "SalesLT.SalesOrderHeader",
+            "ForeignKeys": [
+                {
+                    "Column": "SalesOrderID",
+                    "ForeignColumn": "SalesOrderID"
+                }
+            ]
+        }
+    ],
+    "CompleteEntityRelationshipsGraph": [
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductModel -> SalesLT.ProductModelProductDescription -> SalesLT.ProductDescription",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Address -> SalesLT.CustomerAddress -> SalesLT.Customer",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer -> SalesLT.CustomerAddress -> SalesLT.Address"
+    ],
+    "Columns": [
+        {
+            "Name": "SalesOrderID",
+            "DataType": "int",
+            "Definition": "The SalesOrderID column in the SalesLT.SalesOrderDetail entity contains unique numerical identifiers for each sales order. Each value represents a specific sales order, ensuring that each order can be individually referenced and tracked. The values are in a sequential numeric format, indicating the progression and uniqueness of each sales transaction within the database.",
+            "AllowedValues": null,
+            "SampleValues": [
+                71938,
+                71784,
+                71935,
+                71923,
+                71946
+            ]
         },
         {
-            "Definition": "The name of the product category. This can refer to either a top-level category or a subcategory, depending on the context.",
-            "Name": "ProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "Name": "SalesOrderDetailID",
+            "DataType": "int",
+            "Definition": "The SalesOrderDetailID column in the SalesLT.SalesOrderDetail entity contains unique identifier values for each sales order detail record. The values are numeric and are used to distinguish each order detail entry within the database. These identifiers are essential for maintaining data integrity and enabling efficient querying and data manipulation within the sales order system.",
+            "AllowedValues": null,
+            "SampleValues": [
+                110735,
+                113231,
+                110686,
+                113257,
+                113307
+            ]
         }
     ]
 }
@@ -34,13 +73,15 @@
 ## Property Definitions
 - **EntityName** is a human readable name for the entity.
 - **Entity** is the actual name for the entity that is used in the SQL query.
-- **Description** provides a comprehensive description of what information the entity contains.
+- **Definition** provides a comprehensive description of what information the entity contains.
 - **Columns** contains a list of the columns exposed for querying. Each column contains:
   - **Definition** a short definition of what information the column contains. Here you can add extra metadata to **prompt engineer** the LLM to select the right columns or interpret the data in the column correctly.
   - **Name** is the actual column name.
-  - **Type** is the datatype for the column.
+  - **DataType** is the datatype for the column.
   - **SampleValues (optional)** is a list of sample values that are in the column. This is useful for instructing the LLM on what format the data may be in.
   - **AllowedValues (optional)** is a list of the absolute allowed values for the column. This instructs the LLM to only use these values if filtering against this column.
+- **EntityRelationships** contains a mapping of the immediate relationships to this entity, including the foreign keys to join against.
+- **CompleteEntityRelationshipsGraph** contains a directed graph of how this entity relates to all the others in the database. The LLM can use this to work out the joins to make.

 A full data dictionary must be built for all the views / tables you wish to expose to the LLM. The metadata provided directly influences the accuracy of the Text2SQL component.

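To make the documented layout concrete, below is a small, hedged sketch (not taken from the repository) of how a data dictionary entry could be modelled and loaded in Python before being written to the schema store index. The field names mirror the JSON sample and property definitions above; the class names and loader are illustrative assumptions.

```python
# Illustrative data model for a data dictionary entry; field names follow the JSON
# sample above, everything else (class names, loader) is an assumption.
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ForeignKey:
    Column: str
    ForeignColumn: str


@dataclass
class EntityRelationship:
    ForeignEntity: str
    ForeignKeys: list[ForeignKey] = field(default_factory=list)


@dataclass
class ColumnDefinition:
    Name: str
    DataType: str
    Definition: str
    AllowedValues: list[Any] | None = None
    SampleValues: list[Any] | None = None


@dataclass
class DataDictionaryEntry:
    Entity: str
    EntityName: str
    Definition: str
    Database: str | None = None
    Warehouse: str | None = None
    EntityRelationships: list[EntityRelationship] = field(default_factory=list)
    CompleteEntityRelationshipsGraph: list[str] = field(default_factory=list)
    Columns: list[ColumnDefinition] = field(default_factory=list)


def load_entry(path: str) -> DataDictionaryEntry:
    """Parse a single data dictionary JSON file into the dataclass model."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return DataDictionaryEntry(
        Entity=raw["Entity"],
        EntityName=raw["EntityName"],
        Definition=raw["Definition"],
        Database=raw.get("Database"),
        Warehouse=raw.get("Warehouse"),
        EntityRelationships=[
            EntityRelationship(
                ForeignEntity=rel["ForeignEntity"],
                ForeignKeys=[ForeignKey(**fk) for fk in rel.get("ForeignKeys", [])],
            )
            for rel in raw.get("EntityRelationships", [])
        ],
        CompleteEntityRelationshipsGraph=raw.get("CompleteEntityRelationshipsGraph", []),
        Columns=[ColumnDefinition(**col) for col in raw.get("Columns", [])],
    )
```
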