
Commit f973c3e

Add Vector Based Text2SQL Code and Approach (#14)
* Add deployment for text2sql index
* Add entities for new version
* Work in progress
* Add deployment for text2sql index
* Add entities for new version
* Work in progress
* refactor the location
* Update envs
* Update the envs
* Update envs
* Finish indexer building
* Update the location of the dictionary
* Update data dictionary
* Update the plugin to load jsonl
* Update content
* Move to use a skillset for indexing
* Update the scripts
* Improve the readme
* Update the naming
* Fix bad replacement
* Update the readmes
* Update readme
* Update the readme
* Run the vector example
* Update the env
* Update the code
* Update readme
* Update main readme
1 parent 45f2023 commit f973c3e

31 files changed: +2395 −987 lines changed

README.md

Lines changed: 6 additions & 4 deletions
@@ -6,11 +6,13 @@ It is intended that the plugins and skills provided in this repository, are adap
 ## Components
 
-- `./text2sql` contains an Multi-Shot implementation for Text2SQL generation and querying which can be used to answer questions backed by a database as a knowledge base.
-- `./ai_search_with_adi_function_app` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these.
-- `./deploy_ai_search` provides an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search.
+- `./text_2_sql` contains two Multi-Shot implementations for Text2SQL generation and querying which can be used to answer questions backed by a database as a knowledge base. A prompt based and a vector based approach are shown, both of which exhibit great performance in answering SQL queries. With these plugins, your RAG application can now access and pull data from any SQL table exposed to it to answer questions.
+- `./adi_function_app` contains code for linking Azure Document Intelligence with AI Search to process complex documents with charts and images, and uses multi-modal models (gpt4o) to interpret and understand these. With this custom skill, the RAG application can draw insights from complex charts and images during the vector search.
+- `./deploy_ai_search` provides an easy Python based utility for deploying an index, indexer and corresponding skillset for AI Search and for Text2SQL.
 
-The above components have been successfully used on production RAG projects to increase the quality of responses. The code provided in this repo is a sample of the implementation and should be adjusted before being used in production.
+The above components have been successfully used on production RAG projects to increase the quality of responses.
+
+_The code provided in this repo is a sample of the implementation and should be adjusted before being used in production._
 
 ## High Level Implementation

adi_function_app/.env

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+FunctionApp__ClientId=<clientId of the function app if using user assigned managed identity>
+IdentityType=<identityType> # system_assigned or user_assigned or key
+OpenAI__ApiKey=<openAIKey if using non managed identity>
+OpenAI__Endpoint=<openAIEndpoint>
+OpenAI__MultiModalDeployment=<openAIEmbeddingDeploymentId>
+OpenAI__ApiVersion=<openAIApiVersion>
+AIService__DocumentIntelligence__Endpoint=<documentIntelligenceEndpoint>
+AIService__DocumentIntelligence__Key=<documentIntelligenceKey if not using identity>
+AIService__Language__Endpoint=<languageEndpoint>
+AIService__Language__Key=<languageKey if not using identity>
+StorageAccount__Endpoint=<Endpoint if using identity based connections>
+StorageAccount__ConnectionString=<connectionString if using non managed identity>
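For orientation, here is a minimal sketch of how a function app could turn the `IdentityType` setting above into a credential — the helper name and fallback logic are illustrative assumptions, not code from this commit:

```python
# Illustrative sketch only: one way the IdentityType setting could drive
# credential selection. The helper name and fallback behaviour are assumptions.
import os

from azure.core.credentials import AzureKeyCredential
from azure.identity.aio import DefaultAzureCredential, ManagedIdentityCredential


def get_document_intelligence_credential():
    identity_type = os.environ.get("IdentityType", "key")
    if identity_type == "user_assigned":
        # User assigned managed identity needs the client id of that identity.
        return ManagedIdentityCredential(
            client_id=os.environ["FunctionApp__ClientId"]
        )
    if identity_type == "system_assigned":
        return DefaultAzureCredential()
    # Fall back to key based authentication.
    return AzureKeyCredential(os.environ["AIService__DocumentIntelligence__Key"])
```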

adi_function_app/README.md

Lines changed: 2 additions & 2 deletions
@@ -43,11 +43,11 @@ The properties returned from the ADI Custom Skill are then used to perform the f
 ## Deploying AI Search Setup
 
-To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./ai_search/README.md`.
+To deploy the pre-built index and associated indexer / skillset setup, see instructions in `./deploy_ai_search/README.md`.
 
 ## ADI Custom Skill
 
-Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_deploy_ai_search` HTTP endpoint.
+Deploy the associated function app and required resources. You can then experiment with the custom skill by sending an HTTP request in the AI Search JSON format to the `/adi_2_ai_search` HTTP endpoint.
 
 To use with an index, either use the utility to configure an indexer in the provided form, or integrate the skill with your skillset pipeline.
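To make the request format concrete, here is an illustrative call in the AI Search custom Web API skill shape (`values` / `recordId` / `data`); the keys inside `data` and the function route are assumptions, so check the function code for its exact contract:

```python
# Illustrative request only: the outer values/recordId/data envelope is the
# standard AI Search custom skill format, but the "data" keys and the route
# are assumptions about this particular function app.
import requests

payload = {
    "values": [
        {
            "recordId": "0",
            "data": {
                "source": "https://<storageAccount>.blob.core.windows.net/<container>/report.pdf"
            },
        }
    ]
}

response = requests.post(
    "https://<functionAppEndpoint>/api/adi_2_ai_search",
    params={"code": "<functionAppKey>"},  # function key, if key auth is used
    json=payload,
    timeout=300,
)
print(response.json())
```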

adi_function_app/adi_2_ai_search.py

Lines changed: 6 additions & 4 deletions
@@ -188,11 +188,11 @@ async def understand_image_with_gptv(image_base64, caption, tries_left=3):
                 "role": "user",
                 "content": [
                     {
-                        "type": "text",
+                        "Type": "text",
                         "text": user_input,
                     },
                     {
-                        "type": "image_url",
+                        "Type": "image_url",
                         "image_url": {
                             "url": f"data:image/png;base64,{image_base64}"
                         },
@@ -371,10 +371,12 @@ async def analyse_document(file_path: str) -> AnalyzeResult:
             managed_identity_client_id=os.environ["FunctionApp__ClientId"]
         )
     else:
-        credential = AzureKeyCredential(os.environ["AIService__Services__Key"])
+        credential = AzureKeyCredential(
+            os.environ["AIService__DocumentIntelligence__Key"]
+        )
 
     async with DocumentIntelligenceClient(
-        endpoint=os.environ["AIService__Services__Endpoint"],
+        endpoint=os.environ["AIService__DocumentIntelligence__Endpoint"],
         credential=credential,
     ) as document_intelligence_client:
         poller = await document_intelligence_client.begin_analyze_document(
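For context on the hunk above, here is a minimal sketch of the kind of multimodal chat completion that `understand_image_with_gptv` could issue, using the environment variables from `adi_function_app/.env`; the use of the `openai` `AsyncAzureOpenAI` client, key based auth, and the standard lowercase `type` content keys are assumptions rather than this repo's exact implementation:

```python
# Illustrative sketch only, assuming the openai SDK's AsyncAzureOpenAI client
# and key based auth; deployment/env names come from adi_function_app/.env.
import os

from openai import AsyncAzureOpenAI


async def describe_image(image_base64: str, user_input: str) -> str:
    client = AsyncAzureOpenAI(
        api_key=os.environ["OpenAI__ApiKey"],
        azure_endpoint=os.environ["OpenAI__Endpoint"],
        api_version=os.environ["OpenAI__ApiVersion"],
    )
    response = await client.chat.completions.create(
        model=os.environ["OpenAI__MultiModalDeployment"],
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_input},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_base64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content
```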

adi_function_app/key_phrase_extraction.py

Lines changed: 2 additions & 2 deletions
@@ -45,9 +45,9 @@ async def extract_key_phrases_from_text(
             managed_identity_client_id=os.environ.get("FunctionApp__ClientId")
         )
     else:
-        credential = AzureKeyCredential(os.environ.get("AIService__Services__Key"))
+        credential = AzureKeyCredential(os.environ.get("AIService__Language__Key"))
     text_analytics_client = TextAnalyticsClient(
-        endpoint=os.environ.get("AIService__Services__Endpoint"),
+        endpoint=os.environ.get("AIService__Language__Endpoint"),
         credential=credential,
     )
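A minimal sketch of how the Language client configured above can be used to pull key phrases — the async context manager usage and the placeholder document are illustrative, not lifted from this repo:

```python
# Illustrative usage of the Text Analytics (Language) async client; the
# key based credential and sample text are placeholders.
import os

from azure.ai.textanalytics.aio import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential


async def extract_key_phrases(text: str) -> list[str]:
    async with TextAnalyticsClient(
        endpoint=os.environ["AIService__Language__Endpoint"],
        credential=AzureKeyCredential(os.environ["AIService__Language__Key"]),
    ) as client:
        results = await client.extract_key_phrases([text])
        # One result per input document; skip any that errored.
        return [
            phrase
            for result in results
            if not result.is_error
            for phrase in result.key_phrases
        ]
```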

deploy_ai_search/.env

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+FunctionApp__Endpoint=<functionAppEndpoint>
+FunctionApp__Key=<functionAppKey>
+FunctionApp__PreEmbeddingCleaner__FunctionName=pre_embedding_cleaner
+FunctionApp__ADI__FunctionName=adi_2_ai_search
+FunctionApp__KeyPhraseExtractor__FunctionName=key_phrase_extractor
+FunctionApp__AppRegistrationResourceId=<App registration in form api://appRegistrationclientId if using identity based connections>
+IdentityType=<identityType> # system_assigned or user_assigned or key
+AIService__AzureSearchOptions__Endpoint=<searchServiceEndpoint>
+AIService__AzureSearchOptions__Identity__ClientId=<clientId if using user assigned identity>
+AIService__AzureSearchOptions__Key=<searchServiceKey if not using identity>
+AIService__AzureSearchOptions__UsePrivateEndpoint=<true/false>
+AIService__AzureSearchOptions__Identity__FQName=<fully qualified name of the identity if using user assigned identity>
+StorageAccount__FQEndpoint=<Fully qualified endpoint in form ResourceId=resourceId if using identity based connections>
+StorageAccount__ConnectionString=<connectionString if using non managed identity>
+StorageAccount__RagDocuments__Container=<containerName>
+StorageAccount__Text2Sql__Container=<containerName>
+OpenAI__ApiKey=<openAIKey if using non managed identity>
+OpenAI__Endpoint=<openAIEndpoint>
+OpenAI__EmbeddingModel=<openAIEmbeddingModelName>
+OpenAI__EmbeddingDeployment=<openAIEmbeddingDeploymentId>
+OpenAI__EmbeddingDimensions=1536
+Text2Sql__DatabaseName=<databaseName>

deploy_ai_search/README.md

Lines changed: 13 additions & 3 deletions
@@ -1,18 +1,28 @@
-# AI Search Indexing with Azure Document Intelligence - Pre-built Index Setup
+# AI Search Indexing Pre-built Index Setup
 
 The associated scripts in this portion of the repository contain pre-built scripts to deploy the skillset with Azure Document Intelligence.
 
-## Steps
+## Steps for Rag Documents Index Deployment
 
 1. Update `.env` file with the associated values. Not all values are required dependent on whether you are using System / User Assigned Identities or a Key based authentication.
 2. Adjust `rag_documents.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here in the skills needed to enrich the data source.
 3. Run `deploy.py` with the following args (an example invocation is sketched after this list):
 
-- `indexer_type rag`. This selects the `rag_documents` sub class.
+- `indexer_type rag`. This selects the `RagDocumentsAISearch` sub class.
 - `enable_page_chunking True`. This determines whether page wise chunking is applied in ADI, or whether the inbuilt skill is used for TextSplit. **Page wise analysis in ADI is recommended to avoid splitting tables / figures across multiple chunks, when the chunking is performed.**
 - `rebuild`. Whether to delete and rebuild the index.
 - `suffix`. Optional parameter that will apply a suffix onto the deployed index and indexer. This is useful if you want to deploy a test version, before overwriting the main version.
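For illustration, a full RAG documents deployment invocation might look like `python deploy.py --indexer_type rag --enable_page_chunking True --rebuild True --suffix test`. The `--flag` syntax and boolean handling are assumptions about the `deploy.py` argument parser rather than confirmed usage, so check the parser definitions for the exact names.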

+## Steps for Text2SQL Index Deployment
+
+1. Update `.env` file with the associated values. Not all values are required dependent on whether you are using System / User Assigned Identities or a Key based authentication.
+2. Adjust `text_2_sql.py` with any changes to the index / indexer. The `get_skills()` method implements the skills pipeline. Make any adjustments here in the skills needed to enrich the data source.
+3. Run `deploy.py` with the following args (an example invocation is sketched after this list):
+
+- `indexer_type text_2_sql`. This selects the `Text2SQLAISearch` sub class.
+- `rebuild`. Whether to delete and rebuild the index.
+- `suffix`. Optional parameter that will apply a suffix onto the deployed index and indexer. This is useful if you want to deploy a test version, before overwriting the main version.
+
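Similarly, an illustrative Text2SQL deployment invocation might look like `python deploy.py --indexer_type text_2_sql --rebuild True --suffix test`; again, the exact flag syntax is an assumption and should be verified against `deploy.py`.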
 ## ai_search.py & environment.py
 
 This includes a variety of helper files and scripts to deploy the index setup. This is useful for CI/CD to avoid having to write JSON files manually or use the UI to deploy the pipeline.

deploy_ai_search/ai_search.py

Lines changed: 26 additions & 3 deletions
@@ -25,6 +25,7 @@
     SynonymMap,
     SplitSkill,
     SearchIndexerIndexProjections,
+    BlobIndexerParsingMode,
 )
 from azure.core.exceptions import HttpResponseError
 from azure.search.documents.indexes import SearchIndexerClient, SearchIndexClient
@@ -66,12 +67,16 @@ def __init__(
         self.environment = AISearchEnvironment(indexer_type=self.indexer_type)
 
         self._search_indexer_client = SearchIndexerClient(
-            self.environment.ai_search_endpoint, self.environment.ai_search_credential
+            endpoint=self.environment.ai_search_endpoint,
+            credential=self.environment.ai_search_credential,
         )
         self._search_index_client = SearchIndexClient(
-            self.environment.ai_search_endpoint, self.environment.ai_search_credential
+            endpoint=self.environment.ai_search_endpoint,
+            credential=self.environment.ai_search_credential,
         )
 
+        self.parsing_mode = BlobIndexerParsingMode.DEFAULT
+
     @property
     def indexer_name(self):
         """Get the indexer name for the indexer."""
@@ -156,7 +161,16 @@ def get_data_source(self) -> SearchIndexerDataSourceConnection:
         if self.get_indexer() is None:
             return None
 
-        data_deletion_detection_policy = NativeBlobSoftDeleteDeletionDetectionPolicy()
+        if self.parsing_mode in [
+            BlobIndexerParsingMode.DEFAULT,
+            BlobIndexerParsingMode.TEXT,
+            BlobIndexerParsingMode.JSON,
+        ]:
+            data_deletion_detection_policy = (
+                NativeBlobSoftDeleteDeletionDetectionPolicy()
+            )
+        else:
+            data_deletion_detection_policy = None
 
         data_change_detection_policy = HighWaterMarkChangeDetectionPolicy(
             high_water_mark_column_name="metadata_storage_last_modified"
@@ -268,6 +282,10 @@ def get_text_split_skill(self, context, source) -> SplitSkill:
     def get_adi_skill(self, chunk_by_page=False) -> WebApiSkill:
         """Get the custom skill for adi.
 
+        Args:
+        -----
+            chunk_by_page (bool, optional): Whether to chunk by page. Defaults to False.
+
         Returns:
         --------
             WebApiSkill: The custom skill for adi"""
@@ -528,6 +546,11 @@ def run_indexer(self):
 
     def reset_indexer(self):
         """This function runs the indexer."""
+
+        if self.get_indexer() is None:
+            logging.warning("Indexer not defined. Skipping reset operation.")
+
+            return
         self._search_indexer_client.reset_indexer(self.indexer_name)
 
         logging.info("%s reset.", self.indexer_name)
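The `Text2SqlAISearch` subclass referenced by `deploy.py` is not shown in this commit view, but the new `parsing_mode` attribute suggests how a JSON Lines data dictionary could be indexed. A minimal sketch, assuming the base class name, constructor arguments and JSONL storage format (all assumptions, not the repo's confirmed code):

```python
# Illustrative subclass only: shows how parsing_mode might be overridden for a
# JSON Lines data source. The real Text2SqlAISearch implementation may differ.
from azure.search.documents.indexes.models import BlobIndexerParsingMode

from ai_search import AISearch  # assumed base class name in ai_search.py
from environment import IndexerType


class Text2SqlAISearch(AISearch):
    def __init__(self, suffix: str | None = None, rebuild: bool = False):
        # Assumed: indexer_type must be set before the base __init__ builds the environment.
        self.indexer_type = IndexerType.TEXT_2_SQL
        super().__init__(suffix=suffix, rebuild=rebuild)

        # JSON Lines parsing treats each line of the data dictionary as one document;
        # get_data_source() above then skips the native blob soft delete policy.
        self.parsing_mode = BlobIndexerParsingMode.JSON_LINES
```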

deploy_ai_search/deploy.py

Lines changed: 5 additions & 0 deletions
@@ -2,6 +2,7 @@
 # Licensed under the MIT License.
 import argparse
 from rag_documents import RagDocumentsAISearch
+from text_2_sql import Text2SqlAISearch
 
 
 def deploy_config(arguments: argparse.Namespace):
@@ -15,6 +16,10 @@ def deploy_config(arguments: argparse.Namespace):
             rebuild=arguments.rebuild,
             enable_page_by_chunking=arguments.enable_page_chunking,
         )
+    elif arguments.indexer_type == "text_2_sql":
+        index_config = Text2SqlAISearch(
+            suffix=arguments.suffix, rebuild=arguments.rebuild
+        )
     else:
         raise ValueError("Invalid Indexer Type")
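The argument parser itself is outside this hunk; the sketch below shows a plausible argparse setup consistent with the flags documented in `deploy_ai_search/README.md` (flag names, boolean handling and defaults are assumptions, not the repo's confirmed definitions). The resulting namespace would then be passed to `deploy_config`.

```python
# Plausible parser for the documented flags; the real definitions in deploy.py
# are not shown in this hunk and may differ.
import argparse


def str2bool(value: str) -> bool:
    """Interpret 'True'/'False' style command line values."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Deploy an AI Search index, indexer and skillset."
    )
    parser.add_argument("--indexer_type", choices=["rag", "text_2_sql"], required=True)
    parser.add_argument("--enable_page_chunking", type=str2bool, default=False)
    parser.add_argument("--rebuild", type=str2bool, default=False)
    parser.add_argument("--suffix", default=None)
    return parser.parse_args()
```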

deploy_ai_search/environment.py

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ class IndexerType(Enum):
     """The type of the indexer"""
 
     RAG_DOCUMENTS = "rag-documents"
+    TEXT_2_SQL = "text-2-sql"
 
 
 class IdentityType(Enum):
