Clean up vectorization docs and refs

pamelafox · pamelafox · commit b733d20eca20 · 2025-11-11T15:01:21.000-08:00
diff --git a/README.md b/README.md
@@ -60,7 +60,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap
 - Chat (multi-turn) and Q&A (single turn) interfaces
 - Renders citations and thought process for each answer
 - Includes settings directly in the UI to tweak the behavior and experiment with options
-- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud-based data ingestion](/docs/data_ingestion.md#overview-of-cloud-based-vectorization)
+- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud-based data ingestion](/docs/data_ingestion.md#cloud-based-ingestion)
 - Optional usage of [multimodal models](/docs/multimodal.md) to reason over image-heavy documents
 - Optional addition of [speech input/output](/docs/deploy_features.md#enabling-speech-inputoutput) for accessibility
 - Optional automation of [user login and data access](/docs/login_and_acl.md) via Microsoft Entra
diff --git a/app/backend/prepdocslib/cloudingestionstrategy.py b/app/backend/prepdocslib/cloudingestionstrategy.py
@@ -266,7 +266,7 @@ async def setup(self) -> None:
         logger.info("Setting up search index and skillset for cloud ingestion")
 
         if not self.embeddings.azure_endpoint or not self.embeddings.azure_deployment_name:
-            raise ValueError("Integrated vectorization requires Azure OpenAI endpoint and deployment")
+            raise ValueError("Cloud ingestion requires Azure OpenAI endpoint and deployment")
 
         if not isinstance(self.embeddings, OpenAIEmbeddings):
             raise TypeError("Cloud ingestion requires Azure OpenAI embeddings to configure the search index.")
diff --git a/app/backend/prepdocslib/searchmanager.py b/app/backend/prepdocslib/searchmanager.py
@@ -69,9 +69,6 @@ def __init__(
         search_info: SearchInfo,
         search_analyzer_name: Optional[str] = None,
         use_acls: bool = False,
-        # Renamed from use_int_vectorization to use_parent_index_projection to reflect
-        # that this flag controls parent/child index projection (adding parent_id and
-        # enhanced key field settings) rather than any specific vectorization mode.
         use_parent_index_projection: bool = False,
         embeddings: Optional[OpenAIEmbeddings] = None,
         field_name_embedding: Optional[str] = None,
diff --git a/app/functions/document_extractor/function_app.py b/app/functions/document_extractor/function_app.py
@@ -68,7 +68,7 @@ async def extract_document(req: func.HttpRequest) -> func.HttpResponse:
     Azure Search Custom Skill: Extract document content
 
     Input format (single record; file data only):
-    # https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-intelligence-layout#skill-inputs
+    # https://learn.microsoft.com/azure/search/cognitive-search-skill-document-intelligence-layout#skill-inputs
     {
         "values": [
             {
diff --git a/docs/data_ingestion.md b/docs/data_ingestion.md
@@ -36,7 +36,7 @@ In order to ingest a document format, we need a tool that can turn it into text.
 
 ## Ingestion stages
 
-The ingestion pipeline consists of three main stages that transform raw documents into searchable content in Azure AI Search. These stages apply to both local ingestion (using `prepdocs.py`) and cloud-based ingestion (using Azure Functions as custom skills).
+The ingestion pipeline consists of three main stages that transform raw documents into searchable content in Azure AI Search. These stages apply to both [local ingestion](#local-ingestion) and [cloud-based ingestion](#cloud-based-ingestion).
 
 ### Document extraction
 
@@ -153,6 +153,8 @@ The cloud ingestion pipeline uses four Azure Functions as custom skills within a
    - **Text Processor** (Skill #4): Combines text with enriched figures, chunks content, and generates embeddings
 4. **Azure AI Search Index** receives the final processed chunks with embeddings
 
+The functions are defined in the `app/functions/` directory, and the custom skillset is configured in the `app/backend/setup_cloud_ingestion.py` script.
+
 #### [Document Extractor Function](app/functions/document_extractor/)
 
 - Implements the [document extraction](#document-extraction) stage
@@ -163,7 +165,7 @@ The cloud ingestion pipeline uses four Azure Functions as custom skills within a
 - Implements the [figure processing](#figure-processing) stage
 - Emits enriched figure metadata with descriptions, URLs, and embeddings
 
-#### [Shaper Skill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-shaper)
+#### [Shaper Skill](https://learn.microsoft.com/azure/search/cognitive-search-skill-shaper)
 
 - Consolidates enrichments from the figure processor back into the main document context
 - Required because Azure AI Search's enrichment tree isolates data by context
diff --git a/docs/deploy_features.md b/docs/deploy_features.md
@@ -12,7 +12,6 @@ You should typically enable these features before running `azd up`. Once you've
 * [Enabling persistent chat history with Azure Cosmos DB](#enabling-persistent-chat-history-with-azure-cosmos-db)
 * [Enabling language picker](#enabling-language-picker)
 * [Enabling speech input/output](#enabling-speech-inputoutput)
-* [Enabling Integrated Vectorization](#enabling-integrated-vectorization)
 * [Enabling authentication](#enabling-authentication)
 * [Enabling login and document level access control](#enabling-login-and-document-level-access-control)
 * [Enabling user document upload](#enabling-user-document-upload)
@@ -236,8 +235,7 @@ Learn more in the [multimodal guide](./multimodal.md).
 
 ## Enabling media description with Azure Content Understanding
 
-⚠️ This feature is not currently compatible with [integrated vectorization](#enabling-integrated-vectorization).
-It is compatible with the [multimodal feature](./multimodal.md), but this feature enables only a subset of multimodal capabilities,
+⚠️ This feature is compatible with the [multimodal feature](./multimodal.md), but this feature enables only a subset of multimodal capabilities,
 so you may want to enable the multimodal feature instead or as well.
 
 By default, if your documents contain image-like figures, the data ingestion process will ignore those figures,
diff --git a/docs/multimodal.md b/docs/multimodal.md
@@ -112,5 +112,4 @@ and you may still see good results with just text inputs, since the inputs conta
 
 ## Compatibility
 
-* This feature is **not** compatible with [integrated vectorization](./deploy_features.md#enabling-integrated-vectorization), as the currently configured built-in skills do not process images or store image embeddings. Azure AI Search does now offer built-in skills for multimodal support, as demonstrated in [azure-ai-search-multimodal-sample](https://github.com/Azure-Samples/azure-ai-search-multimodal-sample), but we have not integrated them in this project. Instead, we are working on making a custom skill based off the data ingestion code in this repository, and hosting that skill on Azure Functions. Stay tuned to the releases to find out when that's available.
 * This feature *is* compatible with the [reasoning models](./reasoning.md) feature, as long as you use a model that [supports image inputs](https://learn.microsoft.com/azure/ai-services/openai/how-to/reasoning?tabs=python-secure%2Cpy#api--feature-support).
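
Note on the `cloudingestionstrategy.py` hunk: the guard whose message this commit rewords can be sketched in isolation as below. This is a minimal, hypothetical sketch, not the repository's code. `EmbeddingsConfig` and `validate_cloud_ingestion_config` are names invented here for illustration. Only the two attribute names (`azure_endpoint`, `azure_deployment_name`) and the new error message are taken from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbeddingsConfig:
    """Hypothetical stand-in for the embeddings object checked in setup()."""

    azure_endpoint: Optional[str] = None
    azure_deployment_name: Optional[str] = None


def validate_cloud_ingestion_config(embeddings: EmbeddingsConfig) -> None:
    """Raise ValueError when either Azure OpenAI setting is missing or empty."""
    # Same condition as the diff: both values must be present and non-empty.
    if not embeddings.azure_endpoint or not embeddings.azure_deployment_name:
        raise ValueError("Cloud ingestion requires Azure OpenAI endpoint and deployment")
```

With the reworded message, this error and the `TypeError` raised a few lines later in the hunk both refer to "Cloud ingestion" rather than mixing it with "Integrated vectorization".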
