Commit a5cbe18: Remove duplicate samples
1 parent 023dc1b commit a5cbe18

8 files changed: +92, -805 lines

README.md

Lines changed: 22 additions & 1 deletion

```diff
@@ -1,4 +1,25 @@
-# ChatGPT-like app with your data using Azure OpenAI and Azure AI Search (Python)
+<!--
+---
+name: RAG chat app with your data (Python)
+description: Chat with your domain data using Azure OpenAI and Azure AI Search.
+languages:
+- python
+- typescript
+- bicep
+- azdeveloper
+products:
+- azure-openai
+- azure-cognitive-search
+- azure-app-service
+- azure
+page_type: sample
+urlFragment: azure-search-openai-demo
+---
+-->
+
+# RAG chat app with Azure OpenAI and Azure AI Search (Python)
+
+This solution creates a ChatGPT-like frontend experience over your own documents using RAG (Retrieval Augmented Generation). It uses Azure OpenAI Service to access GPT models, and Azure AI Search for data indexing and retrieval.
 
 This solution's backend is written in Python. There are also [**JavaScript**](https://aka.ms/azai/js/code), [**.NET**](https://aka.ms/azai/net/code), and [**Java**](https://aka.ms/azai/java/code) samples based on this one. Learn more about [developing AI apps using Azure AI Services](https://aka.ms/azai).
 
```

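The retrieve-then-generate (RAG) pattern the README describes can be sketched in a few lines of self-contained Python. This is an illustrative mock only: the toy keyword retriever and prompt format below are stand-ins, while the actual app retrieves with Azure AI Search and generates with Azure OpenAI.

```python
# Minimal retrieve-then-generate sketch. Everything here is an illustrative
# stand-in for the app's real Azure AI Search + Azure OpenAI calls.

def retrieve(query: str, index: dict[str, str], top: int = 3) -> list[str]:
    """Toy keyword retriever: rank indexed chunks by query-term overlap."""
    terms = set(query.lower().split())
    scored = sorted(
        index.items(),
        key=lambda kv: len(terms & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top]]

def build_prompt(query: str, sources: list[str]) -> str:
    """Ground the model's answer in the retrieved chunks."""
    context = "\n".join(f"- {s}" for s in sources)
    return f"Answer using ONLY these sources:\n{context}\n\nQuestion: {query}"

index = {
    "doc1": "Employees get 20 vacation days per year",
    "doc2": "The cafeteria opens at 8am",
}
prompt = build_prompt("How many vacation days?", retrieve("vacation days", index, top=1))
```

The key design point is that the prompt sent to the model is assembled from retrieved document chunks, so answers can be grounded in (and cited against) your own data rather than the model's training set.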
docs/data_ingestion.md

Lines changed: 26 additions & 24 deletions

```diff
@@ -1,19 +1,34 @@
-# Indexing documents for the Chat App
+# RAG chat: Data ingestion
 
-This guide provides more details for using the `prepdocs` script to index documents for the Chat App.
+The [azure-search-openai-demo](/) project can set up a full RAG chat app on Azure AI Search and OpenAI so that you can chat on custom data, like internal enterprise data or domain-specific knowledge sets. For full instructions on setting up the project, consult the [main README](/README.md), and then return here for detailed instructions on the data ingestion component.
 
-- [Supported document formats](#supported-document-formats)
-- [Overview of the manual indexing process](#overview-of-the-manual-indexing-process)
+The chat app provides two ways to ingest data: manual indexing and integrated vectorization. This document explains the differences between the two approaches and provides an overview of the manual indexing process.
+
+- [Manual indexing process](#manual-indexing-process)
+- [Supported document formats](#supported-document-formats)
 - [Chunking](#chunking)
 - [Categorizing data for enhanced search](#enhancing-search-functionality-with-data-categorization)
 - [Indexing additional documents](#indexing-additional-documents)
 - [Removing documents](#removing-documents)
-- [Overview of Integrated Vectorization](#overview-of-integrated-vectorization)
+- [Integrated Vectorization](#integrated-vectorization)
 - [Indexing of additional documents](#indexing-of-additional-documents)
 - [Removal of documents](#removal-of-documents)
 - [Scheduled indexing](#scheduled-indexing)
 
-## Supported document formats
+## Manual indexing process
+
+The [`prepdocs.py`](../app/backend/prepdocs.py) script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. You can pass additional arguments directly to the script, for example `scripts/prepdocs.ps1 --removeall`. Whenever `azd up` or `azd provision` is run, the script is called automatically.
+
+![Diagram of the indexing process](images/diagram_prepdocs.png)
+
+The script uses the following steps to index documents:
+
+1. If it doesn't yet exist, create a new index in Azure AI Search.
+2. Upload the PDFs to Azure Blob Storage.
+3. Split the PDFs into chunks of text.
+4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
+
+### Supported document formats
 
 In order to ingest a document format, we need a tool that can turn it into text. By default, use Azure Document Intelligence (DI in the table below), but we also have local parsers for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to decrease charges.
 
```
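The four-step indexing flow that this hunk moves into the new "Manual indexing process" section can be sketched as plain Python. This is a hypothetical mock, not the project's code: every helper name below is invented, and the real `prepdocs.py` performs these steps through the Azure Search, Blob Storage, and OpenAI SDKs.

```python
# Hypothetical sketch of the four-step prepdocs flow. The helpers stand in for
# Azure SDK calls; the function and index names here are invented for illustration.

def ensure_index(name: str, indexes: set[str]) -> None:
    # Step 1: create the search index only if it doesn't exist yet.
    if name not in indexes:
        indexes.add(name)

def upload_blob(pdf: str, blobs: list[str]) -> None:
    # Step 2: upload the source PDF to blob storage.
    blobs.append(pdf)

def split_into_chunks(text: str, size: int = 20) -> list[str]:
    # Step 3: naive fixed-size split; the real splitter is token- and
    # sentence-aware (see the Chunking section).
    return [text[i:i + size] for i in range(0, len(text), size)]

def index_chunks(chunks: list[str], embed, docs: list[dict]) -> None:
    # Step 4: upload each chunk, with an embedding when vectors are enabled
    # (the default).
    for c in chunks:
        docs.append({"content": c, "embedding": embed(c)})

indexes, blobs, docs = set(), [], []
ensure_index("gptkbindex", indexes)
upload_blob("benefits.pdf", blobs)
chunks = split_into_chunks("Employees get twenty vacation days per year.")
index_chunks(chunks, embed=lambda c: [0.0], docs=docs)
```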
```diff
@@ -29,33 +44,20 @@ In order to ingest a document format, we need a tool that can turn it into text.
 
 The Blob indexer used by the Integrated Vectorization approach also supports a few [additional formats](https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-document-formats).
 
-## Overview of the manual indexing process
-
-The [`prepdocs.py`](../app/backend/prepdocs.py) script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. You can pass additional arguments directly to the script, for example `scripts/prepdocs.ps1 --removeall`. Whenever `azd up` or `azd provision` is run, the script is called automatically.
+### Chunking
 
-![Diagram of the indexing process](images/diagram_prepdocs.png)
+We're often asked why we need to break up the PDFs into chunks when Azure AI Search supports searching large documents.
 
-The script uses the following steps to index documents:
+Chunking allows us to limit the amount of information we send to OpenAI due to token limits. By breaking up the content, it allows us to easily find potential chunks of text that we can inject into OpenAI. The method of chunking we use leverages a sliding window of text such that sentences that end one chunk will start the next. This allows us to reduce the chance of losing the context of the text.
 
-1. If it doesn't yet exist, create a new index in Azure AI Search.
-2. Upload the PDFs to Azure Blob Storage.
-3. Split the PDFs into chunks of text.
-4. Upload the chunks to Azure AI Search. If using vectors (the default), also compute the embeddings and upload those alongside the text.
+If needed, you can modify the chunking algorithm in `scripts/prepdocslib/textsplitter.py`.
 
 ### Enhancing search functionality with data categorization
 
 To enhance search functionality, categorize data during the ingestion process with the `--category` argument, for example `scripts/prepdocs.ps1 --category ExampleCategoryName`. This argument specifies the category to which the data belongs, enabling you to filter search results based on these categories.
 
 After running the script with the desired category, ensure these categories are added to the 'Include Category' dropdown list. This can be found in the developer settings of [`Chat.tsx`](../app/frontend/src/pages/chat/Chat.tsx) and [`Ask.tsx`](../app/frontend/src/pages/ask/Ask.tsx). The default option for this dropdown is "All". By including specific categories, you can refine your search results more effectively.
 
-### Chunking
-
-We're often asked why we need to break up the PDFs into chunks when Azure AI Search supports searching large documents.
-
-Chunking allows us to limit the amount of information we send to OpenAI due to token limits. By breaking up the content, it allows us to easily find potential chunks of text that we can inject into OpenAI. The method of chunking we use leverages a sliding window of text such that sentences that end one chunk will start the next. This allows us to reduce the chance of losing the context of the text.
-
-If needed, you can modify the chunking algorithm in `scripts/prepdocslib/textsplitter.py`.
-
 ### Indexing additional documents
 
 To upload more PDFs, put them in the data/ folder and run `./scripts/prepdocs.sh` or `./scripts/prepdocs.ps1`.
@@ -70,7 +72,7 @@ To remove all documents, use `scripts/prepdocs.sh --removeall` or `scripts/prepd
 
 You can also remove individual documents by using the `--remove` flag. Open either `scripts/prepdocs.sh` or `scripts/prepdocs.ps1` and replace `/data/*` with `/data/YOUR-DOCUMENT-FILENAME-GOES-HERE.pdf`. Then run `scripts/prepdocs.sh --remove` or `scripts/prepdocs.ps1 --remove`.
 
-## Overview of Integrated Vectorization
+## Integrated Vectorization
 
 Azure AI Search includes an [integrated vectorization feature](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in/ba-p/3960809#:~:text=Integrated%20vectorization%20is%20a%20new%20feature%20of%20Azure,pull-indexers%2C%20and%20vectorization%20of%20text%20queries%20through%20vectorizers), a cloud-based approach to data ingestion. Integrated vectorization takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.
 
```
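The sliding-window chunking that `docs/data_ingestion.md` describes (text that ends one chunk also starts the next, so context is not lost at boundaries) can be illustrated with a simplified character-based splitter. The real, more sophisticated implementation lives in `scripts/prepdocslib/textsplitter.py` and is sentence- and token-aware; the character-level splitting and sizes below are assumptions for illustration only.

```python
# Simplified sliding-window chunker: each window advances by less than its full
# width, so consecutive chunks share an overlap region. This is an illustration,
# not the project's actual textsplitter.py logic.

def sliding_window_chunks(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    step = chunk_size - overlap  # advance less than a full chunk each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "".join(chr(97 + i % 26) for i in range(250))  # 250 chars of sample text
chunks = sliding_window_chunks(text, chunk_size=100, overlap=20)
```

Because the window advances by `chunk_size - overlap`, the last 20 characters of each chunk reappear as the first 20 characters of the next one, which is what keeps boundary sentences searchable in context.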

docs/deploy_private.md

Lines changed: 22 additions & 2 deletions

```diff
@@ -1,5 +1,25 @@
-
-# Deploying with private access
+<!--
+---
+name: RAG chat with private endpoints
+description: Configure access to a chat app so that it's only accessible from private endpoints.
+languages:
+- python
+- typescript
+- bicep
+- azdeveloper
+products:
+- azure-openai
+- azure-cognitive-search
+- azure-app-service
+- azure
+page_type: sample
+urlFragment: azure-search-openai-demo-private-access
+---
+-->
+
+# RAG chat: Deploying with private access
+
+The [azure-search-openai-demo](/) project can set up a full RAG chat app on Azure AI Search and OpenAI so that you can chat on custom data, like internal enterprise data or domain-specific knowledge sets. For full instructions on setting up the project, consult the [main README](/README.md), and then return here for detailed instructions on configuring private endpoints.
 
 ⚠️ This feature is not yet compatible with Azure Container Apps, so you will need to [deploy to Azure App Service](./azure_app_service.md) instead.
 
```

docs/login_and_acl.md

Lines changed: 22 additions & 1 deletion

```diff
@@ -1,4 +1,25 @@
-# Setting up optional login and document level access control
+<!--
+---
+name: RAG chat with document security
+description: This guide demonstrates how to add an optional login and document level access control system to a RAG chat app for your domain data. This system can be used to restrict access to indexed data to specific users.
+languages:
+- python
+- typescript
+- bicep
+- azdeveloper
+products:
+- azure-openai
+- azure-cognitive-search
+- azure-app-service
+- azure
+page_type: sample
+urlFragment: azure-search-openai-demo-document-security
+---
+-->
+
+# RAG chat: Setting up optional login and document level access control
+
+The [azure-search-openai-demo](/) project can set up a full RAG chat app on Azure AI Search and OpenAI so that you can chat on custom data, like internal enterprise data or domain-specific knowledge sets. For full instructions on setting up the project, consult the [main README](/README.md), and then return here for detailed instructions on configuring login and access control.
 
 ## Table of Contents
 
```
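Document-level access control with Azure AI Search is commonly implemented as security trimming: each indexed chunk stores the user and group ids allowed to see it, and every query carries a filter built from the signed-in user's identity. Below is a hedged sketch of building such a filter; the field names `oids` and `groups` and the exact filter shape are assumptions for illustration, not necessarily what this project's guide configures.

```python
# Illustrative security-trimming filter for Azure AI Search. Assumes each
# document has collection fields "oids" and "groups" listing allowed ids;
# these field names are assumptions made for this sketch.

def build_security_filter(user_oid: str, group_ids: list[str]) -> str:
    # A document matches when the user's oid, or one of the user's groups,
    # appears in the document's allowed-id collections.
    groups = ", ".join(group_ids)
    oid_filter = f"oids/any(g:search.in(g, '{user_oid}'))"
    group_filter = f"groups/any(g:search.in(g, '{groups}'))"
    return f"({oid_filter} or {group_filter})"

flt = build_security_filter("user-123", ["grp-a", "grp-b"])
```

Applying this filter on every search request (rather than trusting the client) is what makes the restriction a server-side access control rather than a UI convenience.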
