
Commit 2440e73

Update the Documentation for Caching (#51)
1 parent 785594c commit 2440e73

19 files changed: +2183 / -2096 lines

deploy_ai_search/.env

Lines changed: 2 additions & 2 deletions
@@ -11,9 +11,9 @@ AIService__AzureSearchOptions__Key=<searchServiceKey if not using identity>
 AIService__AzureSearchOptions__UsePrivateEndpoint=<true/false>
 AIService__AzureSearchOptions__Identity__FQName=<fully qualified name of the identity if using user assigned identity>
 StorageAccount__FQEndpoint=<Fully qualified endpoint in form ResourceId=resourceId if using identity based connections>
-StorageAccount__ConnectionString=<connectionString if using non managed identity>
+StorageAccount__ConnectionString=<connectionString if using non managed identity. In format: DefaultEndpointsProtocol=https;AccountName=<STG NAME>;AccountKey=<ACCOUNT KEY>;EndpointSuffix=core.windows.net>
 StorageAccount__RagDocuments__Container=<containerName>
-StorageAccount__Text2Sql__Container=<containerName>
+StorageAccount__Text2SqlSchemaStore__Container=<containerName>
 OpenAI__ApiKey=<openAIKey if using non managed identity>
 OpenAI__Endpoint=<openAIEndpoint>
 OpenAI__EmbeddingModel=<openAIEmbeddingModelName>

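The renamed `StorageAccount__Text2SqlSchemaStore__Container` key and the documented connection string format are consumed by application code at runtime. As a minimal sketch (not part of this commit, assuming key based rather than managed identity authentication and the `azure-storage-blob` package), the settings might be read like this:

```python
# Illustrative only: reads the .env keys documented above and builds a blob client
# for the schema store container. Assumes key based auth (no managed identity).
import os

from azure.storage.blob import BlobServiceClient

# Connection string in the documented format:
# DefaultEndpointsProtocol=https;AccountName=<STG NAME>;AccountKey=<ACCOUNT KEY>;EndpointSuffix=core.windows.net
connection_string = os.environ["StorageAccount__ConnectionString"]
schema_store_container = os.environ["StorageAccount__Text2SqlSchemaStore__Container"]

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(schema_store_container)

# List the data dictionary files that the schema store indexer will pick up.
for blob in container_client.list_blobs():
    print(blob.name)
```
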
deploy_ai_search/README.md

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@

 This portion of the repository contains pre-built scripts to deploy the skillset with Azure Document Intelligence.

-## Steps for Rag Documents Index Deployment
+## Steps for Rag Documents Index Deployment (For Unstructured RAG)

 1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key based authentication.
 2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here to the skills needed to enrich the data source.
@@ -13,7 +13,7 @@
 - `rebuild`. Whether to delete and rebuild the index.
 - `suffix`. Optional parameter that applies a suffix to the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.

-## Steps for Text2SQL Index Deployment
+## Steps for Text2SQL Index Deployment (For Structured RAG)

 ### Schema Store Index

@@ -29,7 +29,7 @@
 ### Query Cache Index

 1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key based authentication.
-2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache; application code is expected to write directly to it.**
+2. Adjust `text_2_sql_query_cache.py` with any changes to the index. **There is no provided indexer or skillset for this cache; application code is expected to write directly to it. See the Text2SQL README for the different cache strategies.**
 3. Run `deploy.py` with the following args:

 - `index_type text_2_sql_query_cache`. This selects the `Text2SQLQueryCacheAISearch` subclass.

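Because the query cache index ships without an indexer or skillset, the application has to populate it itself. The following is a hedged sketch of what that write could look like with the `azure-search-documents` SDK; the index name and the `Id` / `Question` / `SqlQuery` / `QuestionEmbedding` field names are assumptions, so substitute the schema that `text_2_sql_query_cache.py` actually deploys.

```python
# Hedged sketch: upsert one question/SQL pair into the query cache index.
# Index name and field names below are assumed, not taken from text_2_sql_query_cache.py.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<searchService>.search.windows.net",
    index_name="text-2-sql-query-cache-index",  # assumed index name
    credential=AzureKeyCredential("<searchServiceKey>"),
)

cache_entry = {
    "Id": "1",
    "Question": "What is the total value of each sales order?",
    "SqlQuery": "SELECT SalesOrderID, SUM(LineTotal) AS OrderTotal "
                "FROM SalesLT.SalesOrderDetail GROUP BY SalesOrderID;",
    # Embedding of the question text, produced by the configured OpenAI embedding model.
    "QuestionEmbedding": [0.0] * 1536,  # placeholder vector
}

# merge_or_upload_documents upserts, so re-asking the same question refreshes its entry.
search_client.merge_or_upload_documents(documents=[cache_entry])
```

Key based authentication is shown for brevity; a credential from `azure-identity` (for example `DefaultAzureCredential`) works equally well when the identity has the required search role.
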
deploy_ai_search/text_2_sql_schema_store.py

Lines changed: 0 additions & 5 deletions
@@ -111,11 +111,6 @@ def get_index_fields(self) -> list[SearchableField]:
                 collection=True,
                 searchable=False,
             ),
-            SearchableField(
-                name="JoinableEntities",
-                type=SearchFieldDataType.String,
-                collection=True,
-            ),
         ],
     ),
     SearchableField(

text_2_sql/README.md

Lines changed: 64 additions & 13 deletions
@@ -29,6 +29,7 @@

 - More tables / views significantly increase the number of tokens used within the prompt and the cost of inference.
 - More schema information can cause confusion with the LLM. In our original use case, when exceeding 5 complex tables / views, we found that the LLM could get confused about which columns belonged to which entity and, as such, would generate invalid SQL queries.
+- Entity relationships between different tables are challenging for the LLM to understand.

 To solve these issues, a Multi-Shot approach was developed. Below are the iterations of development on the Text2SQL query component.

@@ -43,6 +44,8 @@

 Using Auto-Function calling capabilities, the LLM is able to retrieve from the plugin the full schema information for the views / tables that it considers useful for answering the question. Once retrieved, the full SQL query can then be generated. The schemas for multiple views / tables can be retrieved to allow the LLM to perform joins and other complex queries.

+To improve the scalability and accuracy of SQL query generation, the entity relationships within the database are stored within the vector store. This allows the LLM to use the **entity relationship graph** to navigate complex joins. See `./data_dictionary` for more details.
+
 For the query cache enabled approach, AI Search is used as a vector based cache, but any other cache that supports vector queries could be used, such as Redis.

 ### Full Logical Flow for Vector Based Approach
5558

5659
![Vector Based with Query Cache Logical Flow.](./images/Text2SQL%20Query%20Cache.png "Vector Based with Query Cache Logical Flow")
5760

61+
### Caching Strategy
62+
63+
The cache strategy implementation is a simple way to prove that the system works. You can adopt several different strategies for cache population. Below are some of the strategies that could be used:
64+
65+
- **Pre-population:** Run an offline pipeline to generate SQL queries for the known questions that you expect from the user to prevent a 'cold start' problem.
66+
- **Chat History Management Pipeline:** Run a real-time pipeline that logs the chat history to a database. Within this pipeline, analyse questions that are Text2SQL and generate the cache entry.
67+
- **Positive Indication System:** Only update the cache when a user positively reacts to a question e.g. a thumbs up from the UI or doesn't ask a follow up question.
68+
- **Always update:** Always add all questions into the cache when they are asked. The sample code in the repository currently implements this approach, but this could lead to poor SQL queries reaching the cache. One of the other caching strategies would be better production version.
69+
5870
### Comparison of Iterations
5971
| | Common Text2SQL Approach | Prompt Based Multi-Shot Text2SQL Approach | Vector Based Multi-Shot Text2SQL Approach | Vector Based Multi-Shot Text2SQL Approach With Query Cache |
6072
|-|-|-|-|-|
@@ -152,24 +164,63 @@

 ```json
 {
-    "EntityName": "Get All Categories",
-    "Entity": "vGetAllCategories",
-    "Description": "This view provides a comprehensive list of all product categories and their corresponding subcategories in the SalesLT schema of the AdventureWorksLT database. It is used to understand the hierarchical structure of product categories, facilitating product organization and categorization.",
-    "Columns": [
+    "Entity": "SalesLT.SalesOrderDetail",
+    "Definition": "The SalesLT.SalesOrderDetail entity contains detailed information about individual items within sales orders. This entity includes data on the sales order ID, the specific details of each order item such as quantity, product ID, unit price, and any discounts applied. It also includes calculated fields such as the line total for each order item. This entity can be used to answer questions related to the specifics of sales transactions, such as which products were purchased in each order, the quantity of each product ordered, and the total price of each order item.",
+    "EntityName": "Sales Line Items Information",
+    "Database": "AdventureWorksLT",
+    "Warehouse": null,
+    "EntityRelationships": [
         {
-            "Definition": "A unique identifier for each product category. This ID is used to reference specific categories.",
-            "Name": "ProductCategoryID",
-            "Type": "INT"
+            "ForeignEntity": "SalesLT.Product",
+            "ForeignKeys": [
+                {
+                    "Column": "ProductID",
+                    "ForeignColumn": "ProductID"
+                }
+            ]
         },
         {
-            "Definition": "The name of the parent product category. This represents the top-level category under which subcategories are grouped.",
-            "Name": "ParentProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "ForeignEntity": "SalesLT.SalesOrderHeader",
+            "ForeignKeys": [
+                {
+                    "Column": "SalesOrderID",
+                    "ForeignColumn": "SalesOrderID"
+                }
+            ]
+        }
+    ],
+    "CompleteEntityRelationshipsGraph": [
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductModel -> SalesLT.ProductModelProductDescription -> SalesLT.ProductDescription",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Address -> SalesLT.CustomerAddress -> SalesLT.Customer",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer -> SalesLT.CustomerAddress -> SalesLT.Address"
+    ],
+    "Columns": [
+        {
+            "Name": "SalesOrderID",
+            "DataType": "int",
+            "Definition": "The SalesOrderID column in the SalesLT.SalesOrderDetail entity contains unique numerical identifiers for each sales order. Each value represents a specific sales order, ensuring that each order can be individually referenced and tracked. The values are in a sequential numeric format, indicating the progression and uniqueness of each sales transaction within the database.",
+            "AllowedValues": null,
+            "SampleValues": [
+                71938,
+                71784,
+                71935,
+                71923,
+                71946
+            ]
         },
         {
-            "Definition": "The name of the product category. This can refer to either a top-level category or a subcategory, depending on the context.",
-            "Name": "ProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "Name": "SalesOrderDetailID",
+            "DataType": "int",
+            "Definition": "The SalesOrderDetailID column in the SalesLT.SalesOrderDetail entity contains unique identifier values for each sales order detail record. The values are numeric and are used to distinguish each order detail entry within the database. These identifiers are essential for maintaining data integrity and enabling efficient querying and data manipulation within the sales order system.",
+            "AllowedValues": null,
+            "SampleValues": [
+                110735,
+                113231,
+                110686,
+                113257,
+                113307
+            ]
         }
     ]
 }

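Complementing the caching strategies described in the README diff above, the lookup side of the query cache can be sketched as a vector search for a semantically similar question before any new SQL is generated. This is not code from the repository: it assumes `azure-search-documents` >= 11.4, and the index name, field names and similarity threshold are illustrative placeholders that should match whatever the query cache index actually defines.

```python
# Hedged sketch of a query cache lookup. Index/field names and the 0.90 threshold
# are assumptions, not values taken from the repository.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<searchService>.search.windows.net",
    index_name="text-2-sql-query-cache-index",  # assumed index name
    credential=AzureKeyCredential("<searchServiceKey>"),
)


def lookup_cached_query(question_embedding: list[float]) -> str | None:
    """Return a previously generated SQL query if a similar question is cached."""
    results = search_client.search(
        search_text=None,
        vector_queries=[
            VectorizedQuery(
                vector=question_embedding,
                k_nearest_neighbors=1,
                fields="QuestionEmbedding",  # assumed vector field name
            )
        ],
        select=["Question", "SqlQuery"],
        top=1,
    )
    for result in results:
        # Treat a high similarity score as a cache hit; the threshold is a tuning choice.
        if result["@search.score"] >= 0.90:
            return result["SqlQuery"]
    return None
```

On a cache miss, the flow falls back to schema store retrieval and SQL generation and, depending on the strategy chosen, may write the newly generated query back into the cache.
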
text_2_sql/data_dictionary/README.md

Lines changed: 56 additions & 15 deletions
@@ -8,24 +8,63 @@

 ```json
 {
-    "EntityName": "Get All Categories",
-    "Entity": "vGetAllCategories",
-    "Description": "This view provides a comprehensive list of all product categories and their corresponding subcategories in the SalesLT schema of the AdventureWorksLT database. It is used to understand the hierarchical structure of product categories, facilitating product organization and categorization.",
-    "Columns": [
+    "Entity": "SalesLT.SalesOrderDetail",
+    "Definition": "The SalesLT.SalesOrderDetail entity contains detailed information about individual items within sales orders. This entity includes data on the sales order ID, the specific details of each order item such as quantity, product ID, unit price, and any discounts applied. It also includes calculated fields such as the line total for each order item. This entity can be used to answer questions related to the specifics of sales transactions, such as which products were purchased in each order, the quantity of each product ordered, and the total price of each order item.",
+    "EntityName": "Sales Line Items Information",
+    "Database": "AdventureWorksLT",
+    "Warehouse": null,
+    "EntityRelationships": [
         {
-            "Definition": "A unique identifier for each product category. This ID is used to reference specific categories.",
-            "Name": "ProductCategoryID",
-            "Type": "INT"
+            "ForeignEntity": "SalesLT.Product",
+            "ForeignKeys": [
+                {
+                    "Column": "ProductID",
+                    "ForeignColumn": "ProductID"
+                }
+            ]
         },
         {
-            "Definition": "The name of the parent product category. This represents the top-level category under which subcategories are grouped.",
-            "Name": "ParentProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "ForeignEntity": "SalesLT.SalesOrderHeader",
+            "ForeignKeys": [
+                {
+                    "Column": "SalesOrderID",
+                    "ForeignColumn": "SalesOrderID"
+                }
+            ]
+        }
+    ],
+    "CompleteEntityRelationshipsGraph": [
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductCategory",
+        "SalesLT.SalesOrderDetail -> SalesLT.Product -> SalesLT.ProductModel -> SalesLT.ProductModelProductDescription -> SalesLT.ProductDescription",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Address -> SalesLT.CustomerAddress -> SalesLT.Customer",
+        "SalesLT.SalesOrderDetail -> SalesLT.SalesOrderHeader -> SalesLT.Customer -> SalesLT.CustomerAddress -> SalesLT.Address"
+    ],
+    "Columns": [
+        {
+            "Name": "SalesOrderID",
+            "DataType": "int",
+            "Definition": "The SalesOrderID column in the SalesLT.SalesOrderDetail entity contains unique numerical identifiers for each sales order. Each value represents a specific sales order, ensuring that each order can be individually referenced and tracked. The values are in a sequential numeric format, indicating the progression and uniqueness of each sales transaction within the database.",
+            "AllowedValues": null,
+            "SampleValues": [
+                71938,
+                71784,
+                71935,
+                71923,
+                71946
+            ]
         },
         {
-            "Definition": "The name of the product category. This can refer to either a top-level category or a subcategory, depending on the context.",
-            "Name": "ProductCategoryName",
-            "Type": "NVARCHAR(50)"
+            "Name": "SalesOrderDetailID",
+            "DataType": "int",
+            "Definition": "The SalesOrderDetailID column in the SalesLT.SalesOrderDetail entity contains unique identifier values for each sales order detail record. The values are numeric and are used to distinguish each order detail entry within the database. These identifiers are essential for maintaining data integrity and enabling efficient querying and data manipulation within the sales order system.",
+            "AllowedValues": null,
+            "SampleValues": [
+                110735,
+                113231,
+                110686,
+                113257,
+                113307
+            ]
         }
     ]
 }
@@ -34,13 +73,15 @@
 ## Property Definitions
 - **EntityName** is a human readable name for the entity.
 - **Entity** is the actual name for the entity that is used in the SQL query.
-- **Description** provides a comprehensive description of what information the entity contains.
+- **Definition** provides a comprehensive description of what information the entity contains.
 - **Columns** contains a list of the columns exposed for querying. Each column contains:
   - **Definition** a short definition of what information the column contains. Here you can add extra metadata to **prompt engineer** the LLM to select the right columns or interpret the data in the column correctly.
   - **Name** is the actual column name.
-  - **Type** is the datatype for the column.
+  - **DataType** is the datatype for the column.
   - **SampleValues (optional)** is a list of sample values that are in the column. This is useful for instructing the LLM on what format the data may be in.
   - **AllowedValues (optional)** is a list of the absolute allowed values for the column. This instructs the LLM to only use these values if filtering against this column.
+- **EntityRelationships** contains a mapping of the immediate relationships to this entity, including the foreign keys to join against.
+- **CompleteEntityRelationshipsGraph** contains a directed graph of how this entity relates to all the others in the database. The LLM can use this to work out the joins to make.

 A full data dictionary must be built for all the views / tables you wish to expose to the LLM. The metadata provided directly influences the accuracy of the Text2SQL component.

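To make the documented layout concrete, below is a small, hedged sketch (not taken from the repository) of how a data dictionary entry could be modelled and loaded in Python before being written to the schema store index. The field names mirror the JSON sample and property definitions above; the class names and loader are illustrative assumptions.

```python
# Illustrative data model for a data dictionary entry; field names follow the JSON
# sample above, everything else (class names, loader) is an assumption.
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ForeignKey:
    Column: str
    ForeignColumn: str


@dataclass
class EntityRelationship:
    ForeignEntity: str
    ForeignKeys: list[ForeignKey] = field(default_factory=list)


@dataclass
class ColumnDefinition:
    Name: str
    DataType: str
    Definition: str
    AllowedValues: list[Any] | None = None
    SampleValues: list[Any] | None = None


@dataclass
class DataDictionaryEntry:
    Entity: str
    EntityName: str
    Definition: str
    Database: str | None = None
    Warehouse: str | None = None
    EntityRelationships: list[EntityRelationship] = field(default_factory=list)
    CompleteEntityRelationshipsGraph: list[str] = field(default_factory=list)
    Columns: list[ColumnDefinition] = field(default_factory=list)


def load_entry(path: str) -> DataDictionaryEntry:
    """Parse a single data dictionary JSON file into the dataclass model."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return DataDictionaryEntry(
        Entity=raw["Entity"],
        EntityName=raw["EntityName"],
        Definition=raw["Definition"],
        Database=raw.get("Database"),
        Warehouse=raw.get("Warehouse"),
        EntityRelationships=[
            EntityRelationship(
                ForeignEntity=rel["ForeignEntity"],
                ForeignKeys=[ForeignKey(**fk) for fk in rel.get("ForeignKeys", [])],
            )
            for rel in raw.get("EntityRelationships", [])
        ],
        CompleteEntityRelationshipsGraph=raw.get("CompleteEntityRelationshipsGraph", []),
        Columns=[ColumnDefinition(**col) for col in raw.get("Columns", [])],
    )
```
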