diff --git a/README.md b/README.md index 05a77ce9c..23501eb05 100644 --- a/README.md +++ b/README.md @@ -2,8 +2,7 @@ # Generic Build your own copilot Solution Accelerator -MENU: [**USER STORY**](#user-story) \| [**ONE-CLICK DEPLOY**](#one-click-deploy) \| [**SUPPORTING DOCUMENTS**](#supporting-documents) \| -[**CUSTOMER TRUTH**](#customer-truth) +MENU: [**USER STORY**](#user-story) \| [**ONE-CLICK DEPLOY**](#one-click-deploy) \| [**SUPPORTING DOCUMENTS**](#supporting-documents)

@@ -11,100 +10,94 @@ MENU: [**USER STORY**](#user-story) \| [**ONE-CLICK DEPLOY**](#one-click-deploy) User story

-**Solution accelerator overview** +### Overview This solution accelerator is a powerful tool that helps you create your own AI assistant(s). The accelerator can be used by any customer looking for reusable architecture and code snippets to build an AI assistant(s) with their own enterprise data. -It leverages Azure OpenAI Service and Azure AI Search, to identify relevant documents, summarize unstructured information, and generate Word document templates using your own data. +It leverages Azure AI Foundry, Azure OpenAI Service and Azure AI Search, to identify relevant documents, summarize unstructured information, and generate Word document templates using your own data. -**Scenario** +### Key features -This example focuses on a generic use case - chat with your own data, generate a document template using your own data, and exporting the document in a docx format. - -The sample data is sourced from generic AI-generated promissory notes. -The documents are intended for use as sample data only. - -
+![Key Features](/docs/images/keyfeatures.png) -**Key features** +Below is an image of the solution. -![Key Features](/docs/images/keyfeatures.png) +![Landing Page](/docs/images/landing_page.png) -
+### Scenario -**Below is an image of the solution accelerator.** +This example focuses on a generic use case: chat with your own data, generate a document template from your own data, and export the document in docx format. -![Landing Page](/docs/images/landing_page.png) +The sample data is sourced from generic AI-generated promissory notes. +The documents are intended for use as sample data only. +### Solution accelerator architecture +![image](/docs/images/architecture.png) -

+


One-click deploy

+| [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/Generic-Build-your-own-copilot-Solution-Accelerator) | [![Open in Dev Containers](https://img.shields.io/static/v1?style=for-the-badge&label=Dev%20Containers&message=Open&color=blue&logo=visualstudiocode)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/microsoft/Generic-Build-your-own-copilot-Solution-Accelerator) | [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2FGeneric-Build-your-own-copilot-Solution-Accelerator%2Fmain%2Finfra%2Fmain.json) | +|---|---|---| + ### Prerequisites -To use this solution accelerator, you will need access to an [Azure subscription](https://azure.microsoft.com/free/) with permission to create resource groups and resources. While not required, a prior understanding of Azure OpenAI and Azure AI Search will be helpful. +To deploy this solution accelerator, ensure you have access to an [Azure subscription](https://azure.microsoft.com/free/) with the necessary permissions to create **resource groups and resources**. Follow the steps in [Azure Account Set Up](./docs/AzureAccountSetUp.md) -For additional training and support, please see: +Check the [Azure Products by Region](https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?products=all®ions=all) page and select a **region** where the following services are available: -1. [Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/) -2. [Azure AI Search](https://learn.microsoft.com/en-us/azure/search/) -3. 
[Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-studio/) +- Azure AI Foundry +- Azure OpenAI Services +- Azure AI Search +- Embedding Deployment Capacity +- GPT Model Capacity +- [Azure Semantic Search](./docs/AzureSemanticSearchRegion.md) -### Solution accelerator architecture -![image](/docs/images/architecture.png) -

-
-QUICK DEPLOY -

+ + -[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/Generic-Build-your-own-copilot-Solution-Accelerator) -[![Open in Dev Containers](https://img.shields.io/static/v1?style=for-the-badge&label=Dev%20Containers&message=Open&color=blue&logo=visualstudiocode)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/microsoft/Generic-Build-your-own-copilot-Solution-Accelerator) -[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2FGeneric-Build-your-own-copilot-Solution-Accelerator%2Fmain%2Finfra%2Fmain.json) > Note: Some features contained in this repository are in private preview. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/en-us/support/legal/preview-supplemental-terms). -### **How to install/deploy** +### Configurable Deployment Settings -1. Please check the link [Azure Products by Region]( -https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?products=all®ions=all) and choose a region where Azure AI Search, Azure OpenAI Service, and Azure AI Foundry are available. If you are using the included sample data set, verify Document Intelligence (Form Recognizer) is available. +When you start the deployment, most parameters will have **default values**, but you can update the following settings: -2. Click the following deployment button to create the required resources for this accelerator in your Azure Subscription. +| **Setting** | **Description** | **Default value** | +|------------|----------------| ------------| +| **Azure Region** | The region where resources will be created. 
| East US| +| **Environment Name** | A **3-20 character alphanumeric value** used to generate a unique ID to prefix the resources. | byctemplate | +| **Secondary Location** | A **less busy** region for **Azure SQL and CosmosDB**, useful in case of availability constraints. | eastus2 | +| **Deployment Type** | Select from a drop-down list. | GlobalStandard | +| **GPT Model** | Choose from **gpt-4, gpt-4o**. | gpt-4o | +| **GPT Model Deployment Capacity** | Configure capacity for **GPT models**. | 30k | +| **Embedding Model** | Default: **text-embedding-ada-002**. | text-embedding-ada-002 | +| **Embedding Model Capacity** | Set the capacity for **embedding models**. | 80k | - [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2FGeneric-Build-your-own-copilot-Solution-Accelerator%2Fmain%2Finfra%2Fmain.json) -3. You will need to select an Azure Subscription, create/select a Resource group, and Region. If your intention is to deploy this solution accelerator and the corresponding sample data set, the default settings will suffice. +### [Optional] Quota Recommendations +By default, the **GPT model capacity** in deployment is set to **30k tokens**. +> **We recommend increasing the capacity to 100k tokens for optimal performance.** -If you are using your own data, the next step is optional. +To adjust quota settings, follow these [steps](./docs/AzureGPTQuotaSettings.md). -4. Follow steps in [Sample data guide](./scripts/SAMPLE_DATA.md) to ingest the sample Promissory Note PDFs into the search index. -If you want to enable authentication, you will need to add an identity provider. +**โš ๏ธ Warning:** **Insufficient quota can cause deployment errors.** Please ensure you have the recommended capacity or request additional capacity before deploying this solution. 
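The environment-name constraint in the table above (a 3-20 character alphanumeric value) can be sanity-checked before deploying. A minimal sketch; the function name and regex are illustrative, not part of the accelerator:

```python
import re

def is_valid_env_name(name: str) -> bool:
    # Environment name: 3-20 alphanumeric characters, per the settings table above.
    return re.fullmatch(r"[A-Za-z0-9]{3,20}", name) is not None

print(is_valid_env_name("byctemplate"))  # True
print(is_valid_env_name("byc-app"))      # False: hyphen is not alphanumeric
```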
-#### Add an identity provider -After deployment, you will need to add an identity provider to provide authentication support in your app. See [this tutorial](https://learn.microsoft.com/en-us/azure/app-service/scenario-secure-app-authentication-app-service) for more information. - -If you don't add an identity provider, the chat functionality will allow anyone to access the chat functionality of your app. **This is not recommended for production apps.** To enable this restriction, you can add `AUTH_ENABLED=True` to the environment variables. This will enable authentication and prevent unauthorized access to the chat functionality of your app. - -To add further access controls, update the logic in `getUserInfoList` in `frontend/src/pages/chat/Chat.tsx`. - -#### Recommended practices -1. **For enhanced relevance and accuracy**, we recommend implementing [Azure hybrid search](https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview) over full-text search. Azure hybrid search provides superior relevance, accuracy, support for complex queries, improved user experience, scalability, performance, advanced features, and integration with AI services. These advantages make it the ideal choice for modern applications that require robust and intelligent search capabilities. -2. **Importance of prompt engineering**. Prompt engineering is a critical aspect of working with AI models, especially when leveraging advanced capabilities such as those provided by Azure AI services. Proper prompt engineering ensures that the AI models generate accurate, relevant, and contextually appropriate responses. It involves carefully crafting and refining prompts to guide the model's behavior and output effectively. Neglecting prompt engineering can result in suboptimal performance, irrelevant outputs, and increased frustration for users. 
Therefore, it is essential to invest time and effort in prompt engineering to fully harness the potential of AI models - -### **Options** +### Deployment Options Pick from the options below to see step-by-step instructions for: GitHub Codespaces, VS Code Dev Containers, Local Environments, and Bicep deployments.
Deploy in GitHub Codespaces - ### GitHub Codpespaces +### GitHub Codespaces -You can run this solution accelerator virtually by using GitHub Codespaces. The button will open a web-based VS Code instance in your browser: +You can run this solution using GitHub Codespaces. The button will open a web-based VS Code instance in your browser: 1. Open the solution accelerator (this may take several minutes): @@ -120,7 +113,7 @@ You can run this solution accelerator virtually by using GitHub Codespaces. The ### VS Code Dev Containers -A related option is VS Code Dev Containers, which will open the project in your local VS Code using the [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers): +You can run this solution in VS Code Dev Containers, which will open the project in your local VS Code using the [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers): 1. Start Docker Desktop (install it if not already installed) 2. Open the project: @@ -159,23 +152,102 @@ If you're not using one of the above options for opening the project, then you'l
-### Local deployment -Review the local deployment [README](./docs/README_LOCAL.md). -
-

-
-Supporting documents -

+
+ Deploy with Bicep/ARM template -Supporting documents coming soon. -
-

-
-Customer truth +### Bicep + + Click the following deployment button to create the required resources for this solution directly in your Azure Subscription. + + [![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2FGeneric-Build-your-own-copilot-Solution-Accelerator%2Fmain%2Finfra%2Fmain.json) + +

+ + +### Deploying + +Once you've opened the project in [Codespaces](#github-codespaces) or in [Dev Containers](#vs-code-dev-containers) or [locally](#local-environment), you can deploy it to Azure by following the steps below. + +To change the azd parameters from the default values, follow the steps [here](./docs/CustomizingAzdParameters.md). + + +1. Log in to Azure: + + ```shell + azd auth login + ``` + + To authenticate with the Azure Developer CLI (`azd`) against a specific tenant, use the following command with your **Tenant ID**: + + ```sh + azd auth login --tenant-id + ``` + +2. Provision and deploy all the resources: + + ```shell + azd up + ``` + +3. Provide an `azd` environment name (like "bycapp") +4. Select a subscription from your Azure account, and select a location which has quota for all the resources. + * This deployment will take *7-10 minutes* to provision the resources in your account and set up the solution with sample data. + * If you get an error or timeout with deployment, changing the location can help, as there may be availability constraints for the resources. + +5. Once the deployment has completed successfully, open the [Azure Portal](https://portal.azure.com/), go to the deployed resource group, find the App Service and get the app URL from `Default domain`. + +6. You can now delete the resources by running `azd down`, if you are done trying out the application. + +

+Additional Steps

-Customer stories coming soon. -
+1. **Add App Authentication** + + Follow steps in [App Authentication](./docs/AppAuthentication.md) to configure authentication in the app service. + + Note: Authentication changes can take up to 10 minutes. + +2. **Deleting Resources After a Failed Deployment** + + Follow steps in [Delete Resource Group](./docs/DeleteResourceGroup.md) if your deployment fails and you need to clean up the resources. + + + + +

@@ -183,11 +255,53 @@ Customer stories coming soon. Responsible AI Transparency FAQ

-Please refer to [Transparency FAQ](./docs/TRANSPARENCY_FAQ.md) for responsible AI transparency details of this solution accelerator. +Please refer to [Transparency FAQ](./TRANSPARENCY_FAQ.md) for responsible AI transparency details of this solution accelerator. -
-
---- +

+Supporting documentation +

+ +### Costs + +Pricing varies per region and usage, so it isn't possible to predict exact costs for your usage. +The majority of the Azure resources used in this infrastructure are on usage-based pricing tiers. +However, Azure Container Registry has a fixed cost per registry per day. + +You can try the [Azure pricing calculator](https://azure.microsoft.com/en-us/pricing/calculator) for the resources: + +* Azure AI Foundry: Free tier. [Pricing](https://azure.microsoft.com/pricing/details/ai-studio/) +* Azure AI Search: Standard tier, S1. Pricing is based on the number of documents and operations. [Pricing](https://azure.microsoft.com/pricing/details/search/) +* Azure Storage Account: Standard tier, LRS. Pricing is based on storage and operations. [Pricing](https://azure.microsoft.com/pricing/details/storage/blobs/) +* Azure Key Vault: Standard tier. Pricing is based on the number of operations. [Pricing](https://azure.microsoft.com/pricing/details/key-vault/) +* Azure AI Services: S0 tier, defaults to gpt-4o and text-embedding-ada-002 models. Pricing is based on token count. [Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/) +* Azure Container App: Consumption tier with 0.5 CPU, 1GiB memory/storage. Pricing is based on resource allocation, and each month allows for a certain amount of free usage. [Pricing](https://azure.microsoft.com/pricing/details/container-apps/) +* Azure Container Registry: Basic tier. [Pricing](https://azure.microsoft.com/pricing/details/container-registry/) +* Log analytics: Pay-as-you-go tier. Costs based on data ingested. 
[Pricing](https://azure.microsoft.com/pricing/details/monitor/) +* Azure Cosmos DB: [Pricing](https://azure.microsoft.com/en-us/pricing/details/cosmos-db/autoscale-provisioned/) +* Azure functions: Consumption tier [Pricing](https://azure.microsoft.com/en-us/pricing/details/functions/) + +โš ๏ธ To avoid unnecessary costs, remember to take down your app if it's no longer in use, +either by deleting the resource group in the Portal or running `azd down`. + +### Security guidelines + +This template uses Azure Key Vault to store all connections to communicate between resources. + +This template also uses [Managed Identity](https://learn.microsoft.com/entra/identity/managed-identities-azure-resources/overview) for local development and deployment. + +To ensure continued best practices in your own repository, we recommend that anyone creating solutions based on our templates ensure that the [Github secret scanning](https://docs.github.com/code-security/secret-scanning/about-secret-scanning) setting is enabled. + +You may want to consider additional security measures, such as: + +* Enabling Microsoft Defender for Cloud to [secure your Azure resources](https://learn.microsoft.com/azure/security-center/defender-for-cloud). +* Protecting the Azure Container Apps instance with a [firewall](https://learn.microsoft.com/azure/container-apps/waf-app-gateway) and/or [Virtual Network](https://learn.microsoft.com/azure/container-apps/networking?tabs=workload-profiles-env%2Cazure-cli). + + + +### Additional Resources +1. [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/) +2. [Azure AI Search](https://learn.microsoft.com/en-us/azure/search/) +3. 
[Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-studio/) ## Disclaimers diff --git a/docs/TRANSPARENCY_FAQ.md b/TRANSPARENCY_FAQ.md similarity index 100% rename from docs/TRANSPARENCY_FAQ.md rename to TRANSPARENCY_FAQ.md diff --git a/docs/AzureAccountSetUp.md b/docs/AzureAccountSetUp.md new file mode 100644 index 000000000..22ffa836f --- /dev/null +++ b/docs/AzureAccountSetUp.md @@ -0,0 +1,14 @@ +## Azure account setup + +1. Sign up for a [free Azure account](https://azure.microsoft.com/free/) and create an Azure Subscription. +2. Check that you have the necessary permissions: + * Your Azure account must have `Microsoft.Authorization/roleAssignments/write` permissions, such as [Role Based Access Control Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#role-based-access-control-administrator-preview), [User Access Administrator](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#user-access-administrator), or [Owner](https://learn.microsoft.com/azure/role-based-access-control/built-in-roles#owner). + * Your Azure account also needs `Microsoft.Resources/deployments/write` permissions on the subscription level. + +You can view the permissions for your account and subscription by following the steps below: +- Navigate to the [Azure Portal](https://portal.azure.com/) and click on `Subscriptions` under 'Navigation' +- Select the subscription you are using for this accelerator from the list. + - If you try to search for your subscription and it does not come up, make sure no filters are selected. +- Select `Access control (IAM)` and you can see the roles that are assigned to your account for this subscription. + - If you want to see more information about the roles, you can go to the `Role assignments` + tab and search by your account name and then click the role you want to view more information about. 
\ No newline at end of file diff --git a/docs/AzureSemanticSearchRegion.md b/docs/AzureSemanticSearchRegion.md new file mode 100644 index 000000000..d35911400 --- /dev/null +++ b/docs/AzureSemanticSearchRegion.md @@ -0,0 +1,7 @@ +## Select a region where Semantic Search is available before proceeding with the deployment. + +Steps to check Semantic Search availability: +1. Open the [Semantic Search Availability](https://learn.microsoft.com/en-us/azure/search/search-region-support) page. +2. Scroll down to the **"Availability by Region"** section. +3. Use the table to find supported regions for **Azure AI Search** and its **Semantic Search** feature. +4. If your target region is not listed, choose a supported region for deployment. \ No newline at end of file diff --git a/docs/CustomizingAzdParameters.md b/docs/CustomizingAzdParameters.md new file mode 100644 index 000000000..91350eb79 --- /dev/null +++ b/docs/CustomizingAzdParameters.md @@ -0,0 +1,43 @@ +## [Optional]: Customizing resource names + +By default, this template uses the environment name as the prefix to prevent naming collisions within Azure. The parameters below show the default values. You only need to run the statements below if you need to change the values. + + +> To override any of the parameters, run `azd env set ` before running `azd up`. On the first azd command, it will prompt you for the environment name. Be sure to choose a unique 3-20 character alphanumeric name. + + +Change the Secondary Location (example: eastus2, westus2, etc.) 
+ +```shell +azd env set AZURE_ENV_SECONDARY_LOCATION eastus2 +``` + +Change the Model Deployment Type (allowed values: Standard, GlobalStandard) + +```shell +azd env set AZURE_ENV_MODEL_DEPLOYMENT_TYPE GlobalStandard +``` + +Set the Model Name (allowed values: gpt-4o, gpt-4) + +```shell +azd env set AZURE_ENV_MODEL_NAME gpt-4o +``` + +Change the Model Capacity (choose a number based on available GPT model capacity in your subscription) + +```shell +azd env set AZURE_ENV_MODEL_CAPACITY 30 +``` + +Change the Embedding Model + +```shell +azd env set AZURE_ENV_EMBEDDING_MODEL_NAME text-embedding-ada-002 +``` + +Change the Embedding Deployment Capacity (choose a number based on available embedding model capacity in your subscription) + +```shell +azd env set AZURE_ENV_EMBEDDING_MODEL_CAPACITY 80 +``` \ No newline at end of file diff --git a/docs/DeleteResourceGroup.md b/docs/DeleteResourceGroup.md new file mode 100644 index 000000000..0a3c3b351 --- /dev/null +++ b/docs/DeleteResourceGroup.md @@ -0,0 +1,53 @@ +# Deleting Resources After a Failed Deployment in Azure Portal + +If your deployment fails and you need to clean up the resources manually, follow these steps in the Azure Portal. + +--- + +## **1. Navigate to the Azure Portal** +1. Open [Azure Portal](https://portal.azure.com/). +2. Sign in with your Azure account. + +--- + +## **2. Find the Resource Group** +1. In the search bar at the top, type **"Resource groups"** and select it. +2. Locate the **resource group** associated with the failed deployment. + +![Resource Groups](Images/resourcegroup.png) + +![Resource Groups](Images/resource-groups.png) + +--- + +## **3. Delete the Resource Group** +1. Click on the **resource group name** to open it. +2. Click the **Delete resource group** button at the top. + +![Delete Resource Group](Images/DeleteRG.png) + +3. Type the resource group name in the confirmation box and click **Delete**. 
+ +๐Ÿ“Œ **Note:** Deleting a resource group will remove all resources inside it. + +--- + +## **4. Delete Individual Resources (If Needed)** +If you donโ€™t want to delete the entire resource group, follow these steps: + +1. Open **Azure Portal** and go to the **Resource groups** section. +2. Click on the specific **resource group**. +3. Select the **resource** you want to delete (e.g., App Service, Storage Account). +4. Click **Delete** at the top. + +![Delete Individual Resource](Images/deleteservices.png) + +--- + +## **5. Verify Deletion** +- After a few minutes, refresh the **Resource groups** page. +- Ensure the deleted resource or group no longer appears. + +๐Ÿ“Œ **Tip:** If a resource fails to delete, check if it's **locked** under the **Locks** section and remove the lock. + + diff --git a/infra/scripts/index_scripts/01_create_search_index.py b/infra/scripts/index_scripts/01_create_search_index.py new file mode 100644 index 000000000..eb86aa924 --- /dev/null +++ b/infra/scripts/index_scripts/01_create_search_index.py @@ -0,0 +1,97 @@ +from azure.keyvault.secrets import SecretClient +from azure.identity import DefaultAzureCredential + +key_vault_name = 'kv_to-be-replaced' +index_name = "pdf_index" + +def get_secrets_from_kv(kv_name, secret_name): + + # Set the name of the Azure Key Vault + key_vault_name = kv_name + credential = DefaultAzureCredential() + + # Create a secret client object using the credential and Key Vault name + secret_client = SecretClient(vault_url=f"https://{key_vault_name}.vault.azure.net/", credential=credential) + + # Retrieve the secret value + return(secret_client.get_secret(secret_name).value) + +search_endpoint = get_secrets_from_kv(key_vault_name,"AZURE-SEARCH-ENDPOINT") +search_key = get_secrets_from_kv(key_vault_name,"AZURE-SEARCH-KEY") + +# Create the search index +def create_search_index(): + from azure.core.credentials import AzureKeyCredential + search_credential = AzureKeyCredential(search_key) + + from 
azure.search.documents.indexes import SearchIndexClient + from azure.search.documents.indexes.models import ( + SimpleField, + SearchFieldDataType, + SearchableField, + SearchField, + VectorSearch, + HnswAlgorithmConfiguration, + VectorSearchProfile, + SemanticConfiguration, + SemanticPrioritizedFields, + SemanticField, + SemanticSearch, + SearchIndex + ) + + # Create a search index client + index_client = SearchIndexClient(endpoint=search_endpoint, credential=search_credential) + + # Define the index fields; content is full-text searchable and contentVector holds the embeddings + fields = [ + SimpleField(name="id", type=SearchFieldDataType.String, key=True), + SimpleField(name="chunk_id", type=SearchFieldDataType.String), + SearchableField(name="content", type=SearchFieldDataType.String), + SimpleField(name="sourceurl", type=SearchFieldDataType.String), + SearchField(name="contentVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), + searchable=True, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile" + ) + ] + + # Configure the vector search configuration + vector_search = VectorSearch( + algorithms=[ + HnswAlgorithmConfiguration( + name="myHnsw" + ) + ], + profiles=[ + VectorSearchProfile( + name="myHnswProfile", + algorithm_configuration_name="myHnsw", + ) + ] + ) + + semantic_config = SemanticConfiguration( + name="my-semantic-config", + prioritized_fields=SemanticPrioritizedFields( + keywords_fields=[SemanticField(field_name="chunk_id")], + content_fields=[SemanticField(field_name="content")] + ) 
+ ) + + # Create the semantic settings with the configuration + semantic_search = SemanticSearch(configurations=[semantic_config]) + + # Create the search index with the semantic settings + index = SearchIndex(name=index_name, fields=fields, + vector_search=vector_search, semantic_search=semantic_search) + result = index_client.create_or_update_index(index) + print(f' {result.name} created') + +create_search_index() \ No newline at end of file diff --git a/infra/scripts/index_scripts/02_process_data.py b/infra/scripts/index_scripts/02_process_data.py new file mode 100644 index 000000000..9583bb436 --- /dev/null +++ b/infra/scripts/index_scripts/02_process_data.py @@ -0,0 +1,203 @@ +import json +from azure.core.credentials import AzureKeyCredential +from azure.identity import DefaultAzureCredential, get_bearer_token_provider +from azure.keyvault.secrets import SecretClient +from openai import AzureOpenAI +import pandas as pd +import re +import time + +key_vault_name = 'kv_to-be-replaced' + +file_system_client_name = "data" +directory = 'pdf' + + + +def get_secrets_from_kv(kv_name, secret_name): + # Set the name of the Azure Key Vault + key_vault_name = kv_name + credential = DefaultAzureCredential() + + # Create a secret client object using the credential and Key Vault name + secret_client = SecretClient(vault_url=f"https://{key_vault_name}.vault.azure.net/", credential=credential) + return(secret_client.get_secret(secret_name).value) + + +search_endpoint = get_secrets_from_kv(key_vault_name,"AZURE-SEARCH-ENDPOINT") +search_key = get_secrets_from_kv(key_vault_name,"AZURE-SEARCH-KEY") + +openai_api_key = get_secrets_from_kv(key_vault_name,"AZURE-OPENAI-KEY") +openai_api_base = get_secrets_from_kv(key_vault_name,"AZURE-OPENAI-ENDPOINT") +openai_api_version = get_secrets_from_kv(key_vault_name,"AZURE-OPENAI-PREVIEW-API-VERSION") +deployment = get_secrets_from_kv(key_vault_name,"AZURE-OPEN-AI-DEPLOYMENT-MODEL") #"gpt-4o-mini" + + +# Function: Get Embeddings +def 
get_embeddings(text: str,openai_api_base,openai_api_version,openai_api_key): + model_id = "text-embedding-ada-002" + client = AzureOpenAI( + api_version=openai_api_version, + azure_endpoint=openai_api_base, + api_key = openai_api_key + ) + + embedding = client.embeddings.create(input=text, model=model_id).data[0].embedding + + return embedding + +# Function: Clean Spaces with Regex - +def clean_spaces_with_regex(text): + # Use a regular expression to replace multiple spaces with a single space + cleaned_text = re.sub(r'\s+', ' ', text) + # Use a regular expression to replace consecutive dots with a single dot + cleaned_text = re.sub(r'\.{2,}', '.', cleaned_text) + return cleaned_text + +def chunk_data(text): + tokens_per_chunk = 1024 #500 + text = clean_spaces_with_regex(text) + SENTENCE_ENDINGS = [".", "!", "?"] + WORDS_BREAKS = ['\n', '\t', '}', '{', ']', '[', ')', '(', ' ', ':', ';', ','] + + sentences = text.split('. ') # Split text into sentences + chunks = [] + current_chunk = '' + current_chunk_token_count = 0 + + # Iterate through each sentence + for sentence in sentences: + # Split sentence into tokens + tokens = sentence.split() + + # Check if adding the current sentence exceeds tokens_per_chunk + if current_chunk_token_count + len(tokens) <= tokens_per_chunk: + # Add the sentence to the current chunk + if current_chunk: + current_chunk += '. 
' + sentence + else: + current_chunk += sentence + current_chunk_token_count += len(tokens) + else: + # Add current chunk to chunks list and start a new chunk + chunks.append(current_chunk) + current_chunk = sentence + current_chunk_token_count = len(tokens) + + # Add the last chunk + if current_chunk: + chunks.append(current_chunk) + + return chunks + +from azure.search.documents import SearchClient +from azure.storage.filedatalake import ( + DataLakeServiceClient, + DataLakeDirectoryClient, + FileSystemClient +) + + +account_name = get_secrets_from_kv(key_vault_name, "ADLS-ACCOUNT-NAME") + +account_url = f"https://{account_name}.dfs.core.windows.net" + +credential = DefaultAzureCredential() +service_client = DataLakeServiceClient(account_url, credential=credential,api_version='2023-01-03') + +file_system_client = service_client.get_file_system_client(file_system_client_name) + +directory_name = directory +paths = file_system_client.get_paths(path=directory_name) +print(paths) + +index_name = "pdf_index" + + +from azure.search.documents.indexes import SearchIndexClient +from azure.search.documents.indexes.models import ( + SimpleField, + SearchFieldDataType, + SearchableField, + SearchField, + VectorSearch, + HnswAlgorithmConfiguration, + VectorSearchProfile, + SemanticConfiguration, + SemanticPrioritizedFields, + SemanticField, + SemanticSearch, + SearchIndex +) +search_credential = AzureKeyCredential(search_key) + +search_client = SearchClient(search_endpoint, index_name, search_credential) +index_client = SearchIndexClient(endpoint=search_endpoint, credential=search_credential) + + +def prepare_search_doc(content, document_id): + chunks = chunk_data(content) + chunk_num = 0 + for chunk in chunks: + chunk_num += 1 + chunk_id = document_id + '_' + str(chunk_num).zfill(2) + + try: + v_contentVector = get_embeddings(str(chunk),openai_api_base,openai_api_version,openai_api_key) + except: + time.sleep(30) + try: + v_contentVector = 
get_embeddings(str(chunk),openai_api_base,openai_api_version,openai_api_key) + except Exception: + v_contentVector = [] + result = { + "id": chunk_id, + "chunk_id": chunk_id, + "content": chunk, + "sourceurl": path.name.split('/')[-1], + "contentVector": v_contentVector + } + # Yield one search document per chunk so every chunk is indexed + yield result + +docs = [] +counter = 0 +import pypdf +from io import BytesIO + +for path in paths: + file_client = file_system_client.get_file_client(path.name) + pdf_file = file_client.download_file() + + stream = BytesIO() + pdf_file.readinto(stream) + pdf_reader = pypdf.PdfReader(stream) + filename = path.name.split('/')[-1] + document_id = filename.split('_')[1].replace('.pdf','') + + text = '' + num_pages = len(pdf_reader.pages) + for page_num in range(num_pages): + page = pdf_reader.pages[page_num] + text += page.extract_text() + + # prepare_search_doc yields one search document per chunk + docs.extend(prepare_search_doc(text, document_id)) + + counter += 1 + if docs and counter % 10 == 0: + result = search_client.upload_documents(documents=docs) + docs = [] + print(f' {str(counter)} uploaded') + +if docs: + results = search_client.upload_documents(documents=docs) + + + + \ No newline at end of file
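For reference, the sentence-packing strategy used by `chunk_data` in `02_process_data.py` can be exercised on its own. This is a simplified sketch of the same idea (greedily pack `'. '`-separated sentences until a whitespace-token budget is reached), not the accelerator's exact code; the function name and the tiny budget are illustrative:

```python
def chunk_text(text: str, tokens_per_chunk: int = 1024) -> list[str]:
    # Split into sentences on '. ' and pack them greedily, mirroring
    # the chunk_data approach in 02_process_data.py.
    sentences = text.split('. ')
    chunks, current, count = [], '', 0
    for sentence in sentences:
        tokens = sentence.split()
        if count + len(tokens) <= tokens_per_chunk:
            # Sentence fits: append it to the current chunk.
            current = f"{current}. {sentence}" if current else sentence
            count += len(tokens)
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current, count = sentence, len(tokens)
    if current:
        chunks.append(current)
    return chunks

# With a small budget, each sentence lands in its own chunk.
print(chunk_text("one two three. four five six. seven eight nine", tokens_per_chunk=4))
# ['one two three', 'four five six', 'seven eight nine']
```

Note that a whitespace split only approximates model tokens; the script uses the same approximation, so real chunks may exceed the model's token limit slightly.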