Commit 127620a

Update other_samples.md to reflect current features/tech (#1388)
* Update other_samples.md
* Add formats list
* Update docs/data_ingestion.md
* Apply suggestions from code review
* Apply suggestions from code review
1 parent 1ab1984 commit 127620a

2 files changed: +23 -3 lines changed
docs/data_ingestion.md

Lines changed: 17 additions & 0 deletions
@@ -2,6 +2,7 @@

 This guide provides more details for using the `prepdocs` script to index documents for the Chat App.

+- [Supported document formats](#supported-document-formats)
 - [Overview of the manual indexing process](#overview-of-the-manual-indexing-process)
 - [Chunking](#chunking)
 - [Indexing additional documents](#indexing-additional-documents)
@@ -11,6 +12,22 @@ This guide provides more details for using the `prepdocs` script to index docume
 - [Removal of documents](#removal-of-documents)
 - [Scheduled indexing](#scheduled-indexing)

+## Supported document formats
+
+To ingest a document format, we need a tool that can turn it into text. By default, we use Azure Document Intelligence (DI in the table below), but we also have local parsers for several formats. The local parsers are not as sophisticated as Azure Document Intelligence, but they can be used to reduce costs.
+
+| Format                               | Manual indexing                      | Integrated Vectorization |
+| ------------------------------------ | ------------------------------------ | ------------------------ |
+| PDF                                  | Yes (DI or local with PyPDF)         | Yes                      |
+| HTML                                 | Yes (DI or local with BeautifulSoup) | Yes                      |
+| DOCX, PPTX, XLSX                     | Yes (DI)                             | Yes                      |
+| Images (JPG, PNG, BMP, TIFF, HEIF)   | Yes (DI)                             | Yes                      |
+| TXT                                  | Yes (Local)                          | Yes                      |
+| JSON                                 | Yes (Local)                          | Yes                      |
+| CSV                                  | Yes (Local)                          | Yes                      |
+
+The Blob indexer used by the Integrated Vectorization approach also supports a few [additional formats](https://learn.microsoft.com/azure/search/search-howto-indexing-azure-blob-storage#supported-document-formats).
+
 ## Overview of the manual indexing process

 The `scripts/prepdocs.py` script is responsible for both uploading and indexing documents. The typical usage is to call it using `scripts/prepdocs.sh` (Mac/Linux) or `scripts/prepdocs.ps1` (Windows), as these scripts will set up a Python virtual environment and pass in the required parameters based on the current `azd` environment. Whenever `azd up` or `azd provision` is run, the script is called automatically.
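
As a reading aid for the "Manual indexing" column of the format table added in this hunk, the choice of parser can be thought of as a lookup keyed on file extension. The sketch below is only an illustration of that table: `choose_parser`, the `prefer_local` flag, and the string labels are hypothetical names and are not taken from `prepdocs.py`, whose actual parser classes and options do not appear in this diff.

```python
from pathlib import Path

# Hypothetical labels for illustration; not the real prepdocs class names.
DI = "Azure Document Intelligence"
LOCAL = "local parser"

# Formats the table lists as "DI or local", with the local library it names.
DI_OR_LOCAL = {".pdf": "PyPDF", ".html": "BeautifulSoup"}
# Formats the table lists as "DI" only.
DI_ONLY = {".docx", ".pptx", ".xlsx", ".jpg", ".png", ".bmp", ".tiff", ".heif"}
# Formats the table lists as "Local" only.
LOCAL_ONLY = {".txt", ".json", ".csv"}


def choose_parser(filename: str, prefer_local: bool = False) -> str:
    """Return a label for the parser the table implies for this file."""
    ext = Path(filename).suffix.lower()
    if ext in DI_OR_LOCAL:
        # A local parser exists, so it can be picked to avoid DI charges.
        return DI_OR_LOCAL[ext] if prefer_local else DI
    if ext in DI_ONLY:
        # No local alternative is listed, so the preference is ignored.
        return DI
    if ext in LOCAL_ONLY:
        return LOCAL
    raise ValueError(f"Unsupported document format: {ext or filename}")
```

For example, `choose_parser("report.pdf", prefer_local=True)` selects the PyPDF label, while the same call for `slides.pptx` still falls back to Document Intelligence because the table lists no local parser for that format.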

docs/other_samples.md

Lines changed: 6 additions & 3 deletions
@@ -30,14 +30,17 @@ Feature comparison:
 | --- | --- | --- |
 | RAG approach | Multiple approaches | Only via ChatCompletion API data_sources |
 | Vector support | ✅ Yes | ✅ Yes |
-| Data ingestion | ✅ Yes (PDF) | ✅ Yes (PDF, TXT, MD, HTML) |
+| Data ingestion | ✅ Yes ([Many formats](data_ingestion.md#supported-document-formats)) | ✅ Yes ([Many formats](https://learn.microsoft.com/azure/ai-services/openai/concepts/use-your-data?tabs=ai-search#data-formats-and-file-types)) |
 | Persistent chat history | ❌ No (browser tab only) | ✅ Yes, in CosmosDB |
+| User feedback | ❌ No | ✅ Yes |
+| GPT-4-vision | ✅ Yes | ❌ No |
+| Auth + ACL | ✅ Yes | ✅ Yes |

 Technology comparison:

 | Tech | azure-search-openai-demo | sample-app-aoai-chatGPT |
 | --- | --- | --- |
 | Frontend | React | React |
-| Backend | Python (Quart) | Python (Flask) |
-| Vector DB | Azure AI Search | Azure AI Search |
+| Backend | Python (Quart) | Python (Quart) |
+| Vector DB | Azure AI Search | Azure AI Search, CosmosDB Mongo vCore, ElasticSearch, Pinecone, AzureML |
 | Deployment | Azure Developer CLI (azd) | Azure Portal, az, azd |
