
Commit a67a695 ("Merge main")
2 parents: 1ef26cd + 2a013d8


44 files changed: +487 / -1501 lines

.pre-commit-config.yaml

Lines changed: 6 additions & 6 deletions

@@ -45,12 +45,12 @@ repos:
         args: [--fix, --ignore, UP007]
         exclude: samples

-  - repo: https://github.com/astral-sh/uv-pre-commit
-    # uv version.
-    rev: 0.5.20
-    hooks:
-      # Update the uv lockfile
-      - id: uv-lock
+  # - repo: https://github.com/astral-sh/uv-pre-commit
+  #   # uv version.
+  #   rev: 0.5.20
+  #   hooks:
+  #     # Update the uv lockfile
+  #     - id: uv-lock

   - repo: local
     hooks:

deploy_ai_search/README.md

Lines changed: 3 additions & 3 deletions

@@ -1,8 +1,8 @@
 # AI Search Indexing Pre-built Index Setup

-The associated scripts in this portion of the repository contains pre-built scripts to deploy the skillset with Azure Document Intelligence.
+The associated scripts in this portion of the repository contain pre-built scripts to deploy the skillsets needed for both Text2SQL and Image Processing.

-## Steps for Rag Documents Index Deployment (For Unstructured RAG)
+## Steps for Rag Documents Index Deployment (For Image Processing)

 1. Update the `.env` file with the associated values. Not all values are required, depending on whether you are using System / User Assigned Identities or key-based authentication.
 2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here to the skills needed to enrich the data source.

@@ -13,7 +13,7 @@
 - `rebuild`. Whether to delete and rebuild the index.
 - `suffix`. Optional parameter that will apply a suffix onto the deployed index and indexer. This is useful if you want to deploy a test version before overwriting the main version.

-## Steps for Text2SQL Index Deployment (For Structured RAG)
+## Steps for Text2SQL Index Deployment (For Text2SQL)

 ### Schema Store Index
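The README above describes `rebuild` and `suffix` options for the deployment scripts. As a rough sketch of the behaviour being described (the function name and option handling here are hypothetical illustrations, not the repository's deployment code), applying a suffix and rebuilding an Azure AI Search index might look like this:

```python
# Hypothetical sketch of the `rebuild` / `suffix` behaviour described above.
# See the scripts in deploy_ai_search for the real implementation.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import SearchIndex


def deploy_index(index: SearchIndex, rebuild: bool = False, suffix: str | None = None) -> None:
    client = SearchIndexClient(
        endpoint=os.environ["AIService__AzureSearchOptions__Endpoint"],
        credential=AzureKeyCredential(os.environ["AIService__AzureSearchOptions__Key"]),
    )

    # An optional suffix lets a test index live alongside the main one.
    if suffix:
        index.name = f"{index.name}-{suffix}"

    # `rebuild` deletes any existing index before recreating it.
    if rebuild:
        client.delete_index(index.name)

    client.create_or_update_index(index)
```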

deploy_ai_search/src/deploy_ai_search/text_2_sql_column_value_store.py

Lines changed: 2 additions & 1 deletion

@@ -85,9 +85,10 @@ def get_index_fields(self) -> list[SearchableField]:
             name="Warehouse",
             type=SearchFieldDataType.String,
         ),
-        SimpleField(
+        SearchableField(
             name="Column",
             type=SearchFieldDataType.String,
+            hidden=False,
         ),
         SearchableField(
             name="Value",

image_processing/requirements.txt

Lines changed: 0 additions & 1238 deletions
Large diffs are not rendered by default.

text_2_sql/.env.example

Lines changed: 44 additions & 0 deletions

@@ -0,0 +1,44 @@
+# Environment variables for Text2SQL
+IdentityType=<identityType> # system_assigned or user_assigned or key
+
+Text2Sql__DatabaseEngine=<DatabaseEngine> # TSQL or PostgreSQL or Snowflake or Databricks
+Text2Sql__UseQueryCache=<Determines if the Query Cache will be used to speed up query generation. Defaults to True.> # True or False
+Text2Sql__PreRunQueryCache=<Determines if the results from the Query Cache will be pre-run to speed up answer generation. Defaults to True.> # True or False
+Text2Sql__UseColumnValueStore=<Determines if the Column Value Store will be used for schema selection. Defaults to True.> # True or False
+
+# Open AI Connection Details
+OpenAI__CompletionDeployment=<openAICompletionDeploymentId. Used for data dictionary creator>
+OpenAI__MiniCompletionDeployment=<OpenAI__MiniCompletionDeploymentId. Used for agentic text2sql>
+OpenAI__Endpoint=<openAIEndpoint>
+OpenAI__ApiKey=<openAIKey if using non identity based connection>
+OpenAI__ApiVersion=<openAIApiVersion>
+
+# Azure AI Search Connection Details
+AIService__AzureSearchOptions__Endpoint=<AI search endpoint>
+AIService__AzureSearchOptions__Key=<AI search key if using non identity based connection>
+AIService__AzureSearchOptions__Text2SqlSchemaStore__Index=<Schema store index name. Default is created as "text-2-sql-schema-store-index">
+AIService__AzureSearchOptions__Text2SqlSchemaStore__SemanticConfig=<Schema store semantic config. Default is created as "text-2-sql-schema-store-semantic-config">
+AIService__AzureSearchOptions__Text2SqlQueryCache__Index=<Query cache index name. Default is created as "text-2-sql-query-cache-index">
+AIService__AzureSearchOptions__Text2SqlQueryCache__SemanticConfig=<Query cache semantic config. Default is created as "text-2-sql-query-cache-semantic-config">
+AIService__AzureSearchOptions__Text2SqlColumnValueStore__Index=<Column value store index name. Default is created as "text-2-sql-column-value-store-index">
+
+# TSQL
+Text2Sql__Tsql__ConnectionString=<Tsql databaseConnectionString if using Tsql Data Source>
+Text2Sql__Tsql__Database=<Tsql database if using Tsql Data Source>
+
+# PostgreSQL Specific Connection Details
+Text2Sql__Postgresql__ConnectionString=<Postgresql databaseConnectionString if using Postgresql Data Source>
+Text2Sql__Postgresql__Database=<Postgresql database if using Postgresql Data Source>
+
+# Snowflake Specific Connection Details
+Text2Sql__Snowflake__User=<snowflakeUser if using Snowflake Data Source>
+Text2Sql__Snowflake__Password=<snowflakePassword if using Snowflake Data Source>
+Text2Sql__Snowflake__Account=<snowflakeAccount if using Snowflake Data Source>
+Text2Sql__Snowflake__Warehouse=<snowflakeWarehouse if using Snowflake Data Source>
+Text2Sql__Snowflake__Database=<snowflakeDatabase if using Snowflake Data Source>
+
+# Databricks Specific Connection Details
+Text2Sql__Databricks__Catalog=<databricksCatalog if using Databricks Data Source with Unity Catalog>
+Text2Sql__Databricks__ServerHostname=<databricksServerHostname if using Databricks Data Source with Unity Catalog>
+Text2Sql__Databricks__HttpPath=<databricksHttpPath if using Databricks Data Source with Unity Catalog>
+Text2Sql__Databricks__AccessToken=<databricks AccessToken if using Databricks Data Source with Unity Catalog>
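The double-underscore naming above maps environment variables onto nested settings (for example `AIService__AzureSearchOptions__Endpoint`). A minimal sketch of reading a few of these values in Python, assuming `python-dotenv` is available; the repository may load them through its own settings layer:

```python
# Minimal sketch: reading a few of the settings defined in text_2_sql/.env.
# Assumes python-dotenv; the repository may use its own settings loader.
import os

from dotenv import load_dotenv

load_dotenv("text_2_sql/.env")

database_engine = os.environ["Text2Sql__DatabaseEngine"]  # e.g. "TSQL"
use_query_cache = os.environ.get("Text2Sql__UseQueryCache", "True") == "True"

# "__" separates nesting levels: AIService -> AzureSearchOptions -> Endpoint.
search_endpoint = os.environ["AIService__AzureSearchOptions__Endpoint"]

if database_engine == "TSQL":
    connection_string = os.environ["Text2Sql__Tsql__ConnectionString"]
```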

text_2_sql/GETTING_STARTED.md

Lines changed: 21 additions & 6 deletions

@@ -2,10 +2,25 @@
 To get started, perform the following steps:

+**Execute the following commands in the `deploy_ai_search` directory:**
+
 1. Setup Azure OpenAI in your subscription with **gpt-4o-mini** & an embedding model, alongside a SQL Server sample database, AI Search and a storage account.
-2. Clone this repository and deploy the AI Search text2sql indexes from `deploy_ai_search`.
-3. Run `uv sync` within the text_2_sql directory to install dependencies.
-4. Configure the .env file based on the provided sample
-5. Generate a data dictionary for your target server using the instructions in `data_dictionary`.
-6. Upload these data dictionaries to the relevant contains in your storage account. Wait for them to be automatically indexed.
-7. Navigate to `autogen` directory to view the AutoGen implementation. Follow the steps in `Iteration 5 - Agentic Vector Based Text2SQL.ipynb` to get started.
+2. Create your `.env` file based on the provided sample `deploy_ai_search/.env.example`. Place this file at `deploy_ai_search/.env`.
+3. Clone this repository and deploy the AI Search text2sql indexes from `deploy_ai_search`. See the instructions in the **Steps for Text2SQL Index Deployment (For Structured RAG)** section of `deploy_ai_search/README.md`.
+
+**Execute the following commands in the `text_2_sql_core` directory:**
+
+4. Create your `.env` file based on the provided sample `text_2_sql/.env.example`. Place this file at `text_2_sql/.env`.
+5. Run `uv sync` within the text_2_sql directory to install dependencies.
+   - Install the optional dependencies if you need a database connector other than TSQL: `uv sync --extra <DATABASE ENGINE>`
+   - See the supported connectors in `text_2_sql_core/src/text_2_sql_core/connectors`.
+6. Create your `.env` file based on the provided sample `text_2_sql/.env.example`. Place this file at `text_2_sql/.env`.
+7. Generate a data dictionary for your target server using the instructions in the **Running** section of `data_dictionary/README.md`.
+8. Upload these generated data dictionary files to the relevant containers in your storage account. Wait for them to be automatically indexed with the included skillsets.
+
+**Execute the following commands in the `autogen` directory:**
+
+9. Run `uv sync` within the text_2_sql directory to install dependencies.
+   - Install the optional dependencies if you need a database connector other than TSQL: `uv sync --extra <DATABASE ENGINE>`
+   - See the supported connectors in `text_2_sql_core/src/text_2_sql_core/connectors`.
+10. Navigate to the `autogen` directory to view the AutoGen implementation. Follow the steps in `Iteration 5 - Agentic Vector Based Text2SQL.ipynb` to get started.
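Step 8 in the new instructions uploads the generated data dictionary files to a storage container so the deployed skillsets can index them. A rough sketch of that upload using `azure-storage-blob`; the container name, local path and connection string are placeholders, not values defined by the repository:

```python
# Rough sketch of step 8: uploading generated data dictionary JSON files.
# Container name, local path and connection string are placeholders.
from pathlib import Path

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage connection string>")
container = service.get_container_client("<data-dictionary-container>")

for file in Path("<path to generated data dictionaries>").glob("*.json"):
    # Each uploaded file is picked up by the indexer and enriched by the skillset.
    with open(file, "rb") as data:
        container.upload_blob(name=file.name, data=data, overwrite=True)
```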

text_2_sql/README.md

Lines changed: 17 additions & 17 deletions

@@ -54,7 +54,20 @@
 ![Vector Based with Query Cache Logical Flow.](./images/Agentic%20Text2SQL%20Query%20Cache.png "Agentic Vector Based with Query Cache Logical Flow")

-#### Parallel execution
+## Agents
+
+This agentic system contains the following agents:
+
+- **Query Cache Agent:** Responsible for checking the cache for previously asked questions.
+- **Query Decomposition Agent:** Responsible for decomposing complex questions into sub-questions that can be answered with SQL.
+- **Schema Selection Agent:** Responsible for extracting key terms from the question and checking the index store for the queries.
+- **SQL Query Generation Agent:** Responsible for using the previously extracted schemas and generated SQL queries to answer the question. This agent can request more schemas if needed. This agent will run the query.
+- **SQL Query Verification Agent:** Responsible for verifying that the SQL query and results will answer the question.
+- **Answer Generation Agent:** Responsible for taking the database results and generating the final answer for the user.
+
+The combination of these agents allows the system to answer complex questions, whilst staying under the token limits when including the database schemas. The query cache ensures that previously asked questions can be answered quickly to avoid degrading the user experience.
+
+### Parallel execution

 After the first agent has rewritten and decomposed the user input, we execute each of the individual questions in parallel for the quickest time to generate an answer.

@@ -189,22 +202,9 @@
 }
 ```

-See `./data_dictionary` for more details on how the data dictionary is structured and ways to **automatically generate it**.
-
-## Agentic Vector Based Approach (Iteration 5)
-
-This approach builds on the the Vector Based SQL Plugin approach that was previously developed, but adds a agentic approach to the solution.
-
-This agentic system contains the following agents:
-
-- **Query Cache Agent:** Responsible for checking the cache for previously asked questions.
-- **Query Decomposition Agent:** Responsible for decomposing complex questions, into sub questions that can be answered with SQL.
-- **Schema Selection Agent:** Responsible for extracting key terms from the question and checking the index store for the queries.
-- **SQL Query Generation Agent:** Responsible for using the previously extracted schemas and generated SQL queries to answer the question. This agent can request more schemas if needed. This agent will run the query.
-- **SQL Query Verification Agent:** Responsible for verifying that the SQL query and results question will answer the question.
-- **Answer Generation Agent:** Responsible for taking the database results and generating the final answer for the user.
-
-The combination of this agent allows the system to answer complex questions, whilst staying under the token limits when including the database schemas. The query cache ensures that previously asked questions, can be answered quickly to avoid degrading user experience.
+> [!NOTE]
+>
+> - See `./data_dictionary` for more details on how the data dictionary is structured and ways to **automatically generate it**.

 ## Tips for good Text2SQL performance.
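The "Parallel execution" section moved above states that the decomposed sub-questions are executed concurrently. A generic asyncio sketch of that pattern follows; this is not the repository's implementation, and the helper below is only a stand-in for schema selection, SQL generation and execution:

```python
# Generic illustration of running decomposed sub-questions in parallel.
# Not the repository's code; answer_sub_question is a placeholder.
import asyncio


async def answer_sub_question(sub_question: str) -> str:
    # Stand-in for schema selection, SQL generation and query execution.
    await asyncio.sleep(0.1)
    return f"result for: {sub_question}"


async def answer(sub_questions: list[str]) -> list[str]:
    # Launch every sub-question at once and gather the results.
    return await asyncio.gather(*(answer_sub_question(q) for q in sub_questions))


results = asyncio.run(answer(["total sales by region", "total sales by year"]))
```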

text_2_sql/__init__.py

Whitespace-only changes.

text_2_sql/autogen/Iteration 5 - Agentic Vector Based Text2SQL.ipynb

Lines changed: 14 additions & 5 deletions

@@ -35,11 +35,13 @@
     "\n",
     "### Dependencies\n",
     "\n",
-    "To install dependencies for this demo:\n",
+    "To install dependencies for this demo, navigate to the autogen directory:\n",
     "\n",
-    "`uv sync --package autogen_text_2_sql`\n",
+    "`uv sync`\n",
     "\n",
-    "`uv add --editable text_2_sql_core`"
+    "If you need a different connector to TSQL:\n",
+    "\n",
+    "`uv sync --extra <DATABASE ENGINE>`"
    ]
   },
   {
@@ -87,6 +89,13 @@
     "agentic_text_2_sql = AutoGenText2Sql(use_case=\"Analysing sales data\")"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -100,7 +109,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "async for message in agentic_text_2_sql.process_user_message(UserMessagePayload(user_message=\"What is the total number of sales?\")):\n",
+    "async for message in agentic_text_2_sql.process_user_message(UserMessagePayload(user_message=\"what are the total sales\")):\n",
     "    logging.info(\"Received %s Message from Text2SQL System\", message)"
    ]
   },
@@ -128,7 +137,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-  "version": "3.12.7"
+  "version": "3.12.8"
  }
 },
 "nbformat": 4,

text_2_sql/autogen/README.md

Lines changed: 1 addition & 1 deletion

@@ -134,7 +134,7 @@
 ## Query Cache Implementation Details

-The vector based with query cache uses the `fetch_queries_from_cache()` method to fetch the most relevant previous query and injects it into the prompt before the initial LLM call. The use of Auto-Function Calling here is avoided to reduce the response time as the cache index will always be used first.
+The vector based approach with query cache uses the `fetch_sql_queries_with_schemas_from_cache()` method to fetch the most relevant previous query and injects it into the prompt before the initial LLM call. The use of Auto-Function Calling here is avoided to reduce the response time, as the cache index will always be used first.

 If the score of the top result is higher than the defined threshold, the query will be executed against the target data source and the results included in the prompt. This allows us to prompt the LLM to evaluate whether it can use these results to answer the question, **without further SQL Query generation**, to speed up the process.
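The cache-first flow described above can be summarised in a short sketch. Only the method name `fetch_sql_queries_with_schemas_from_cache()` comes from the diff; the connector interface, signature, return shape and the query-execution helper are assumptions:

```python
# Sketch of the cache-first flow described above. The connector interface,
# return shape and threshold handling are assumptions, not the real API.
async def answer_with_cache(connector, question: str, threshold: float = 0.8) -> dict:
    cached = await connector.fetch_sql_queries_with_schemas_from_cache(question)  # assumed signature

    if cached and cached[0]["score"] >= threshold:
        # Pre-run the cached SQL so the results can be injected into the prompt,
        # letting the LLM answer without generating a new query.
        results = await connector.query_execution(cached[0]["sql_query"])  # hypothetical helper
        return {"from_cache": True, "results": results}

    # Below the threshold: fall back to full agentic SQL generation.
    return {"from_cache": False, "results": None}
```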
