|
---
title: Use the content generation capabilities of language models in a content ingestion pipeline
titleSuffix: Azure AI Search
description: Use language models to caption your images and facilitate an image search through your data.
author: amitkalay
ms.author: amitkalay
ms.service: azure-ai-search
ms.topic: how-to
ms.date: 05/05/2025
ms.custom:
  - devx-track-csharp
  - build-2025
---

# Generate captions for images in another language

In this article, learn how to generate image captions by using AI enrichment and a skillset. Images often contain useful information that's relevant in search scenarios. You can [vectorize images](search-get-started-portal-image-search.md) to represent visual content in your search index. Or, you can use [AI enrichment and skillsets](cognitive-search-concept-intro.md) to create and extract searchable *text* from images.

The GenAI Prompt skill (preview) generates a description of each image in your data source, and the indexer pushes that description into a search index. To view the descriptions, you can run a query that includes them in the response.

## Prerequisites

To work with image content in a skillset, you need:

+ A supported data source
+ Files or blobs containing images
+ Read access on the supported data source. This article uses key-based authentication, but indexers can also connect by using the search service identity and Microsoft Entra ID authentication. For role-based access control, assign roles on the data source to allow read access by the service identity. If you're testing on a local development machine, make sure you also have read access on the supported data source.
+ A search indexer, configured for image actions
+ A skillset with the GenAI Prompt skill (preview)
+ A search index with fields to receive the verbalized text output, plus output field mappings in the indexer that establish the association

Optionally, you can define projections to accept image-analyzed output into a [knowledge store](knowledge-store-concept-intro.md) for data mining scenarios.
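
If you go this route, a minimal knowledge store definition in the skillset might look like the following sketch. The connection string and container name are placeholders rather than values from this article, so substitute your own.

```json
"knowledgeStore": {
  "storageConnectionString": "<YOUR-STORAGE-CONNECTION-STRING>",
  "projections": [
    {
      "tables": [],
      "objects": [],
      "files": [
        {
          "storageContainer": "normalized-images",
          "source": "/document/normalized_images/*"
        }
      ]
    }
  ]
}
```

A file projection like this one writes the normalized images to blob storage, where you can pair them with the generated captions for data mining.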

<a name="get-normalized-images"></a>

## Configure indexers for image processing

After the source files are set up, enable image normalization by setting the `imageAction` parameter in the indexer configuration. Image normalization makes images more uniform for downstream processing and includes the following operations:

+ Large images are resized to a maximum height and width to make them uniform.
+ Images that have orientation metadata are rotated so that they load upright.

Enabling `imageAction` (setting this parameter to a value other than `none`) incurs an additional charge for image extraction according to [Azure AI Search pricing](https://azure.microsoft.com/pricing/details/search/).

1. [Create or update an indexer](/rest/api/searchservice/indexers/create-or-update) to set the configuration properties:

    ```json
    {
      "parameters":
      {
        "configuration":
        {
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "default",
          "imageAction": "generateNormalizedImages"
        }
      }
    }
    ```

1. Set `dataToExtract` to `contentAndMetadata` (required).

1. Verify that `parsingMode` is set to *default* (required).

   This parameter determines the granularity of search documents created in the index. The default mode sets up a one-to-one correspondence so that one blob results in one search document. If documents are large, or if skills require smaller chunks of text, you can add the Text Split skill, which subdivides a document into pages of text for processing purposes. But for search scenarios, one blob per document is required if enrichment includes image processing.

1. Set `imageAction` to enable the `normalized_images` node in an enrichment tree (required):

    + `generateNormalizedImages` to generate an array of normalized images as part of document cracking.

    + `generateNormalizedImagePerPage` (applies to PDF only) to generate an array of normalized images where each page in the PDF is rendered to one output image. For non-PDF files, the behavior is the same as if you had set `generateNormalizedImages`. However, setting `generateNormalizedImagePerPage` can make the indexing operation less performant by design (especially for large documents) because several images have to be generated.

1. Optionally, adjust the width or height of the generated normalized images (see the example after these steps):

    + `normalizedImageMaxWidth` in pixels. Default is 2,000. Maximum value is 10,000.

    + `normalizedImageMaxHeight` in pixels. Default is 2,000. Maximum value is 10,000.
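
For example, an indexer configuration that raises both limits to 4,000 pixels might look like the following sketch. The 4,000-pixel values are illustrative, not a recommendation.

```json
{
  "parameters":
  {
    "configuration":
    {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImages",
      "normalizedImageMaxWidth": 4000,
      "normalizedImageMaxHeight": 4000
    }
  }
}
```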

### About normalized images

When `imageAction` is set to a value other than *none*, the new `normalized_images` field contains an array of images. Each image is a complex type that has the following members:

| Image member | Description |
|--------------------|-----------------------------------------|
| data | BASE64 encoded string of the normalized image in JPEG format. |
| width | Width of the normalized image in pixels. |
| height | Height of the normalized image in pixels. |
| originalWidth | The original width of the image before normalization. |
| originalHeight | The original height of the image before normalization. |
| rotationFromOriginal | Counter-clockwise rotation in degrees that occurred to create the normalized image. A value between 0 degrees and 360 degrees. This step reads the metadata from the image that is generated by a camera or scanner. Usually a multiple of 90 degrees. |
| contentOffset | The character offset within the content field where the image was extracted from. This field is only applicable for files with embedded images. The `contentOffset` for images extracted from PDF documents is always at the end of the text on the page it was extracted from in the document. This means images appear after all text on that page, regardless of the original location of the image in the page. |
| pageNumber | If the image was extracted or rendered from a PDF, this field contains the page number in the PDF it was extracted or rendered from, starting from 1. If the image isn't from a PDF, this field is 0. |

Sample value of `normalized_images`:

```json
[
    {
        "data": "BASE64 ENCODED STRING OF A JPEG IMAGE",
        "width": 500,
        "height": 300,
        "originalWidth": 5000,
        "originalHeight": 3000,
        "rotationFromOriginal": 90,
        "contentOffset": 500,
        "pageNumber": 2
    }
]
```

## Define skillsets for image processing

This section supplements the [skill reference](cognitive-search-defining-skillset.md) articles by providing context for working with skill inputs, outputs, and patterns as they relate to image processing.

+ [Create or update a skillset](/rest/api/searchservice/skillsets/create-or-update) to add skills.

After the basic framework of your skillset is created and the connection to your chat completion model is configured, you can focus on each individual image skill, defining inputs and source context, and mapping outputs to fields in either an index or a knowledge store.
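
For orientation, each skill definition shown in the following sections is a single object that belongs in the `skills` array of the skillset payload. A minimal wrapper might look like the following sketch; the skillset name and description are placeholders.

```json
{
  "name": "image-captioning-skillset",
  "description": "Generates image captions by calling a chat completion model.",
  "skills": []
}
```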

> [!NOTE]
> For an example skillset that combines image processing with downstream natural language processing, see [REST Tutorial: Use REST and AI to generate searchable content from Azure blobs](cognitive-search-tutorial-blob.md). It shows how to feed image skill output into entity recognition and key phrase extraction.

### Example inputs for image processing

As noted, images are extracted during document cracking and then normalized as a preliminary step. The normalized images are the inputs to any image processing skill and are represented in the enriched document tree as follows:

+ `/document/normalized_images/*` is for documents that are processed whole. The following skill definition uses this path to generate a short Spanish caption for each image:

```json
  {
    "@odata.type": "#Microsoft.Skills.Custom.ChatCompletionSkill",
    "context": "/document/normalized_images/*",
    "uri": "https://contoso.openai.azure.com/openai/deployments/contoso-gpt-4o/chat/completions?api-version=2025-01-01-preview",
    "timeout": "PT1M",
    "apiKey": "<YOUR-API-KEY here>",
    "inputs": [
      {
        "name": "image",
        "source": "/document/normalized_images/*/data"
      },
      {
        "name": "systemMessage",
        "source": "='You are a useful artificial intelligence assistant that helps people.'"
      },
      {
        "name": "userMessage",
        "source": "='Describe what you see in this image in 20 words or less in Spanish.'"
      }
    ],
    "outputs": [
      {
        "name": "response",
        "targetName": "captionedImage"
      }
    ]
  },
```

### Example using JSON schema responses with text inputs

This example illustrates how you can use structured outputs with language models. This capability is currently supported mainly by OpenAI language models, although that might change in the future.

```json
  {
    "@odata.type": "#Microsoft.Skills.Custom.ChatCompletionSkill",
    "context": "/document/content",
    "uri": "https://contoso.openai.azure.com/openai/deployments/contoso-gpt-4o/chat/completions?api-version=2025-01-01-preview",
    "timeout": "PT1M",
    "apiKey": "<YOUR-API-KEY here>",
    "inputs": [
      {
        "name": "systemMessage",
        "source": "='You are a useful artificial intelligence assistant that helps people.'"
      },
      {
        "name": "userMessage",
        "source": "='How many languages are there in the world and what are they?'"
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "structured_output",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "total": { "type": "number" },
            "languages": {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          },
          "required": ["total", "languages"],
          "additionalProperties": false
        }
      }
    },
    "outputs": [
      {
        "name": "response",
        "targetName": "responseJsonForLanguages"
      }
    ]
  },
```

<a name="output-field-mappings"></a>

## Map outputs to search fields

Output text is represented as nodes in an internal enriched document tree, and each node must be mapped to fields in a search index, or to projections in a knowledge store, to make the content available in your app.

1. [Create or update a search index](/rest/api/searchservice/indexes/create-or-update) to add fields that accept the skill outputs.

   In the following fields collection example, *content* is blob content. *Metadata_storage_name* contains the name of the file (set `retrievable` to *true*). *Metadata_storage_path* is the unique path of the blob and is the default document key.

   *Captioned_image* receives the GenAI Prompt skill output and must be a string type so that it can capture all of the language model output in the search index.

    ```json
    "fields": [
      {
        "name": "content",
        "type": "Edm.String",
        "filterable": false,
        "retrievable": true,
        "searchable": true,
        "sortable": false
      },
      {
        "name": "metadata_storage_name",
        "type": "Edm.String",
        "filterable": true,
        "retrievable": true,
        "searchable": true,
        "sortable": false
      },
      {
        "name": "metadata_storage_path",
        "type": "Edm.String",
        "filterable": false,
        "key": true,
        "retrievable": true,
        "searchable": false,
        "sortable": false
      },
      {
        "name": "captioned_image",
        "type": "Edm.String",
        "filterable": false,
        "retrievable": true,
        "searchable": true,
        "sortable": false
      }
    ]
    ```

1. [Update the indexer](/rest/api/searchservice/indexers/create-or-update) to map skillset output (nodes in an enrichment tree) to index fields.

   Enriched documents are internal. To externalize the nodes in an enriched document tree, set up an output field mapping that specifies which index field receives node content. Your app then accesses the enriched data through an index field. The following example maps the *captionedImage* node in the enriched document to the *captioned_image* field in the search index.

    ```json
    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/normalized_images/*/captionedImage",
        "targetFieldName": "captioned_image"
      }
    ]
    ```

1. Run the indexer to invoke source document retrieval, image captioning by the language model, and indexing.
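
   If you're using REST, running the indexer is a single request. The following sketch uses the same placeholder conventions as the query example later in this article.

    ```http
    POST /indexers/[indexer name]/run?api-version=[api-version]
    ```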

### Verify results

Run a query against the index to check the results of image processing. Use [Search Explorer](search-explorer.md) as a search client, or any tool that sends HTTP requests. The following query selects fields that contain the output of image processing.

```http
POST /indexes/[index name]/docs/search?api-version=[api-version]
{
    "search": "A cat in a picture",
    "select": "metadata_storage_name, captioned_image"
}
```

## Related content
+ [Create indexer (REST)](/rest/api/searchservice/indexers/create)
+ [GenAI Prompt skill](cognitive-search-skill-genai-prompt.md)
+ [How to create a skillset](cognitive-search-defining-skillset.md)
+ [Map enriched output to fields](cognitive-search-output-field-mapping.md)