Commit b733d20

Clean up vectorization docs and refs
1 parent be98004 commit b733d20

7 files changed: 8 additions & 12 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap
 - Chat (multi-turn) and Q&A (single turn) interfaces
 - Renders citations and thought process for each answer
 - Includes settings directly in the UI to tweak the behavior and experiment with options
-- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud-based data ingestion](/docs/data_ingestion.md#overview-of-cloud-based-vectorization)
+- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud-based data ingestion](/docs/data_ingestion.md#cloud-based-ingestion)
 - Optional usage of [multimodal models](/docs/multimodal.md) to reason over image-heavy documents
 - Optional addition of [speech input/output](/docs/deploy_features.md#enabling-speech-inputoutput) for accessibility
 - Optional automation of [user login and data access](/docs/login_and_acl.md) via Microsoft Entra

app/backend/prepdocslib/cloudingestionstrategy.py

Lines changed: 1 addition & 1 deletion
@@ -266,7 +266,7 @@ async def setup(self) -> None:
         logger.info("Setting up search index and skillset for cloud ingestion")

         if not self.embeddings.azure_endpoint or not self.embeddings.azure_deployment_name:
-            raise ValueError("Integrated vectorization requires Azure OpenAI endpoint and deployment")
+            raise ValueError("Cloud ingestion requires Azure OpenAI endpoint and deployment")

         if not isinstance(self.embeddings, OpenAIEmbeddings):
             raise TypeError("Cloud ingestion requires Azure OpenAI embeddings to configure the search index.")
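
For reference, the two guard clauses in this hunk can be exercised in isolation. The following is a minimal stand-alone sketch, not the repository's code: `OpenAIEmbeddings` here is a hypothetical stand-in dataclass, and `validate_cloud_ingestion_embeddings` is an illustrative helper name that does not appear in the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpenAIEmbeddings:
    """Hypothetical stand-in for the repository's embeddings class."""

    azure_endpoint: Optional[str] = None
    azure_deployment_name: Optional[str] = None


def validate_cloud_ingestion_embeddings(embeddings: object) -> None:
    """Mirror the two guard clauses from the hunk (illustrative helper)."""
    # getattr keeps the sketch safe for objects that lack these attributes.
    endpoint = getattr(embeddings, "azure_endpoint", None)
    deployment = getattr(embeddings, "azure_deployment_name", None)
    if not endpoint or not deployment:
        raise ValueError("Cloud ingestion requires Azure OpenAI endpoint and deployment")
    if not isinstance(embeddings, OpenAIEmbeddings):
        raise TypeError("Cloud ingestion requires Azure OpenAI embeddings to configure the search index.")
```

With this sketch, an `OpenAIEmbeddings()` instance with no endpoint raises `ValueError` carrying the renamed "Cloud ingestion" message, while an instance with both fields set passes both checks.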

app/backend/prepdocslib/searchmanager.py

Lines changed: 0 additions & 3 deletions
@@ -69,9 +69,6 @@ def __init__(
         search_info: SearchInfo,
         search_analyzer_name: Optional[str] = None,
         use_acls: bool = False,
-        # Renamed from use_int_vectorization to use_parent_index_projection to reflect
-        # that this flag controls parent/child index projection (adding parent_id and
-        # enhanced key field settings) rather than any specific vectorization mode.
         use_parent_index_projection: bool = False,
         embeddings: Optional[OpenAIEmbeddings] = None,
         field_name_embedding: Optional[str] = None,

app/functions/document_extractor/function_app.py

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ async def extract_document(req: func.HttpRequest) -> func.HttpResponse:
     Azure Search Custom Skill: Extract document content

     Input format (single record; file data only):
-    # https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-intelligence-layout#skill-inputs
+    # https://learn.microsoft.com/azure/search/cognitive-search-skill-document-intelligence-layout#skill-inputs
     {
         "values": [
             {
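
The docstring above is cut off by the hunk boundary. As a rough illustration of the envelope it begins to describe, the sketch below handles a custom skill request in the general Azure AI Search shape: a `values` array whose records carry a `recordId` and a `data` object, answered by a response that echoes each `recordId` with `data`, `errors`, and `warnings`. The names `describe_record` and `build_skill_response` are hypothetical, and the `file_data` field in the sample request is an assumption based on the "file data only" note, not something confirmed by the visible source.

```python
import json
from typing import Any, Callable


def describe_record(data: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical per-record handler: reports which input fields arrived."""
    if not data:
        raise ValueError("record has no data")
    return {"received_fields": sorted(data)}


def build_skill_response(
    body: str,
    handler: Callable[[dict[str, Any]], dict[str, Any]] = describe_record,
) -> dict[str, Any]:
    """Apply handler to each record and wrap results in the response envelope."""
    payload = json.loads(body)
    results = []
    for record in payload.get("values", []):
        result: dict[str, Any] = {
            "recordId": record.get("recordId"),
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            result["data"] = handler(record.get("data") or {})
        except Exception as exc:
            # A failing record is reported individually; the batch continues.
            result["errors"].append({"message": str(exc)})
        results.append(result)
    return {"values": results}


# Sample request; the file_data shape is assumed for illustration only.
request_body = json.dumps(
    {
        "values": [
            {"recordId": "1", "data": {"file_data": {"$type": "file", "data": "<base64>"}}},
            {"recordId": "2", "data": {}},
        ]
    }
)
response = build_skill_response(request_body)
```

Reporting errors per record rather than failing the whole request lets one bad record fail without discarding the results for the rest of the batch.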

docs/data_ingestion.md

Lines changed: 4 additions & 2 deletions
@@ -36,7 +36,7 @@ In order to ingest a document format, we need a tool that can turn it into text.

 ## Ingestion stages

-The ingestion pipeline consists of three main stages that transform raw documents into searchable content in Azure AI Search. These stages apply to both local ingestion (using `prepdocs.py`) and cloud-based ingestion (using Azure Functions as custom skills).
+The ingestion pipeline consists of three main stages that transform raw documents into searchable content in Azure AI Search. These stages apply to both [local ingestion](#local-ingestion) and [cloud-based ingestion](#cloud-based-ingestion).

 ### Document extraction

@@ -153,6 +153,8 @@ The cloud ingestion pipeline uses four Azure Functions as custom skills within a
 - **Text Processor** (Skill #4): Combines text with enriched figures, chunks content, and generates embeddings
 4. **Azure AI Search Index** receives the final processed chunks with embeddings

+The functions are defined in the `app/functions/` directory, and the custom skillset is configured in the `app/backend/setup_cloud_ingestion.py` script.
+
 #### [Document Extractor Function](app/functions/document_extractor/)

 - Implements the [document extraction](#document-extraction) stage
@@ -163,7 +165,7 @@ The cloud ingestion pipeline uses four Azure Functions as custom skills within a
 - Implements the [figure processing](#figure-processing) stage
 - Emits enriched figure metadata with descriptions, URLs, and embeddings

-#### [Shaper Skill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-shaper)
+#### [Shaper Skill](https://learn.microsoft.com/azure/search/cognitive-search-skill-shaper)

 - Consolidates enrichments from the figure processor back into the main document context
 - Required because Azure AI Search's enrichment tree isolates data by context
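
To make the Shaper Skill's consolidation role more concrete, here is a rough sketch of what such a skill definition could look like when built as a Python dictionary. The `@odata.type` value and the `inputs`/`outputs` layout follow the general Shaper skill schema; the skill name, every enrichment path (`/document/pages`, `/document/figures/*`), and the `consolidated_document` target name are hypothetical and are not taken from this repository's skillset.

```python
import json

# Illustrative Shaper skill definition; all paths and names are assumptions.
shaper_skill = {
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "name": "consolidate-figures",  # hypothetical skill name
    "context": "/document",  # output is attached at the document level
    "inputs": [
        # Simple input: copy an existing node into the shaped object.
        {"name": "pages", "source": "/document/pages"},
        # Nested input: gather per-figure enrichments from their own context.
        {
            "name": "figures",
            "sourceContext": "/document/figures/*",
            "inputs": [
                {"name": "description", "source": "/document/figures/*/description"},
                {"name": "url", "source": "/document/figures/*/url"},
            ],
        },
    ],
    "outputs": [{"name": "output", "targetName": "consolidated_document"}],
}

print(json.dumps(shaper_skill, indent=2))
```

The nested `sourceContext` input is the part that pulls values produced under a per-figure context back into a single object at `/document`, which is the consolidation step the bullets above describe.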

docs/deploy_features.md

Lines changed: 1 addition & 3 deletions
@@ -12,7 +12,6 @@ You should typically enable these features before running `azd up`. Once you've
 * [Enabling persistent chat history with Azure Cosmos DB](#enabling-persistent-chat-history-with-azure-cosmos-db)
 * [Enabling language picker](#enabling-language-picker)
 * [Enabling speech input/output](#enabling-speech-inputoutput)
-* [Enabling Integrated Vectorization](#enabling-integrated-vectorization)
 * [Enabling authentication](#enabling-authentication)
 * [Enabling login and document level access control](#enabling-login-and-document-level-access-control)
 * [Enabling user document upload](#enabling-user-document-upload)
@@ -236,8 +235,7 @@ Learn more in the [multimodal guide](./multimodal.md).

 ## Enabling media description with Azure Content Understanding

-⚠️ This feature is not currently compatible with [integrated vectorization](#enabling-integrated-vectorization).
-It is compatible with the [multimodal feature](./multimodal.md), but this feature enables only a subset of multimodal capabilities,
+⚠️ This feature is compatible with the [multimodal feature](./multimodal.md), but this feature enables only a subset of multimodal capabilities,
 so you may want to enable the multimodal feature instead or as well.

 By default, if your documents contain image-like figures, the data ingestion process will ignore those figures,

docs/multimodal.md

Lines changed: 0 additions & 1 deletion
@@ -112,5 +112,4 @@ and you may still see good results with just text inputs, since the inputs conta

 ## Compatibility

-* This feature is **not** compatible with [integrated vectorization](./deploy_features.md#enabling-integrated-vectorization), as the currently configured built-in skills do not process images or store image embeddings. Azure AI Search does now offer built-in skills for multimodal support, as demonstrated in [azure-ai-search-multimodal-sample](https://github.com/Azure-Samples/azure-ai-search-multimodal-sample), but we have not integrated them in this project. Instead, we are working on making a custom skill based off the data ingestion code in this repository, and hosting that skill on Azure Functions. Stay tuned to the releases to find out when that's available.
 * This feature *is* compatible with the [reasoning models](./reasoning.md) feature, as long as you use a model that [supports image inputs](https://learn.microsoft.com/azure/ai-services/openai/how-to/reasoning?tabs=python-secure%2Cpy#api--feature-support).
