
Commit 735ceac

Merge pull request #203790 from HeidiSteen/heidist-fresh
minor edits
2 parents 10f3a2d + db86ea4 commit 735ceac

File tree

4 files changed: +27 -16 lines changed

articles/search/cognitive-search-concept-image-scenarios.md

Lines changed: 23 additions & 8 deletions
@@ -35,28 +35,28 @@ Optionally, you can define projections to accept image-analyzed output into a [k

 ## Set up source files
 
-Image processing is indexer-driven, which means that the raw inputs must be a supported file type (as determined by the skills you choose) from a [supported data source](search-indexer-overview.md#supported-data-sources).
+Image processing is indexer-driven, which means that the raw inputs must be in a [supported data source](search-indexer-overview.md#supported-data-sources).
 
 + Image analysis supports JPEG, PNG, GIF, and BMP
 + OCR supports JPEG, PNG, BMP, and TIF

 Images are either standalone binary files or embedded in documents (PDF, RTF, and Microsoft application files). A maximum of 1000 images will be extracted from a given document. If there are more than 1000 images in a document, the first 1000 will be extracted and a warning will be generated.
 
-Azure Blob Storage is the most frequently used storage for image processing in Cognitive Search. There are three main tasks related to retrieving images from the source:
+Azure Blob Storage is the most frequently used storage for image processing in Cognitive Search. There are three main tasks related to retrieving images from a blob container:
 
-+ Access rights on the container. If you're using a full access connection string that includes a key, the key gives you access to the content. Alternatively, you can [authenticate using Azure Active Directory (Azure AD)](search-howto-managed-identities-data-sources.md) or [connect as a trusted service](search-indexer-howto-access-trusted-service-exception.md).
++ Enable access to content in the container. If you're using a full access connection string that includes a key, the key grants access to the content. Alternatively, you can [authenticate using Azure Active Directory (Azure AD)](search-howto-managed-identities-data-sources.md) or [connect as a trusted service](search-indexer-howto-access-trusted-service-exception.md).

 + [Create a data source](search-howto-indexing-azure-blob-storage.md) of type "azureblob" that connects to the blob container storing your files.
 
-+ Optionally, [set file type criteria](search-blob-storage-integration.md#PartsOfBlobToIndex) if the workload targets a specific file type. Blob indexer configuration includes file inclusion and exclusion settings. You can filter out files you don't want.
++ Review [service tier limits](search-limits-quotas-capacity.md) to make sure that your source data is under maximum size and quantity limits for indexers and enrichment.
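Not part of this commit, but for orientation: a minimal sketch of the "azureblob" data source that the second task refers to. The name, connection string, and container values are placeholders, and a managed identity connection can replace the account key.

```json
{
  "name": "my-blob-datasource",
  "type": "azureblob",
  "credentials": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>;"
  },
  "container": {
    "name": "my-image-container"
  }
}
```

You would submit a definition like this through the Create Data Source REST API, an Azure SDK, or the Import data wizard.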

 <a name="get-normalized-images"></a>
 
 ## Configure indexers for image processing
 
-Image extraction is the first step of indexer processing. Extracted images are queued for image processing. Extracted text is queued for text processing, if applicable.
+Extracting images from the source content files is the first step of indexer processing. Extracted images are queued for image processing. Extracted text is queued for text processing, if applicable.
 
-Image processing requires image normalization to make images more uniform for downstream processing. This step occurs automatically and is internal to indexer processing. As a developer, you enable image normalization by setting the `"imageAction"` parameter in indexer configuration.
+Image processing requires image normalization to make images more uniform for downstream processing. This second step occurs automatically and is internal to indexer processing. As a developer, you enable image normalization by setting the `"imageAction"` parameter in indexer configuration.
 
 Image normalization includes the following operations:

@@ -90,7 +90,7 @@ Metadata adjustments are captured in a complex type created for each image. You
 1. Set `"imageAction"` to enable the *normalized_images* node in an enrichment tree (required):
 
 + `"generateNormalizedImages"` to generate an array of normalized images as part of document cracking.
 
 + `"generateNormalizedImagePerPage"` (applies to PDF only) to generate an array of normalized images where each page in the PDF is rendered to one output image. For non-PDF files, the behavior of this parameter is the same as if you had set "generateNormalizedImages".
 
 1. Optionally, adjust the width or height of the generated normalized images:
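Not part of this commit, but as a rough sketch of where these settings live, the two steps above end up in the indexer's configuration parameters along these lines (the values shown are illustrative):

```json
{
  "parameters": {
    "configuration": {
      "imageAction": "generateNormalizedImages",
      "normalizedImageMaxWidth": 2000,
      "normalizedImageMaxHeight": 2000
    }
  }
}
```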
@@ -101,6 +101,19 @@ Metadata adjustments are captured in a complex type created for each image. You

 The default of 2000 pixels for the normalized images maximum width and height is based on the maximum sizes supported by the [OCR skill](cognitive-search-skill-ocr.md) and the [image analysis skill](cognitive-search-skill-image-analysis.md). The [OCR skill](cognitive-search-skill-ocr.md) supports a maximum width and height of 4200 for non-English languages, and 10000 for English. If you increase the maximum limits, processing could fail on larger images depending on your skillset definition and the language of the documents.

++ Optionally, [set file type criteria](search-blob-storage-integration.md#PartsOfBlobToIndex) if the workload targets a specific file type. Blob indexer configuration includes file inclusion and exclusion settings. You can filter out files you don't want.
+
+```json
+{
+    "parameters" : {
+        "configuration" : {
+            "indexedFileNameExtensions" : ".pdf, .docx",
+            "excludedFileNameExtensions" : ".png, .jpeg"
+        }
+    }
+}
+```
+
 ### About normalized images
 
 When "imageAction" is set to a value other than "none", the new *normalized_images* field will contain an array of images. Each image is a complex type that has the following members:
@@ -143,6 +156,8 @@ This section supplements the [skill reference](cognitive-search-predefined-skill

 1. If necessary, [include multi-service key](cognitive-search-attach-cognitive-services.md) in the Cognitive Services property of the skillset. Cognitive Search makes calls to a billable Azure Cognitive Services resource for OCR and image analysis for transactions that exceed the free limit (20 per indexer per day). Cognitive Services must be in the same region as your search service.
 
+1. If original images are embedded in PDF or application files like PPTX or DOCX, you'll need to add a Text Merge skill if you want image output and text output together. Working with embedded images is discussed further on in this article.
+
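As a sketch of where that multi-service key is attached (not part of this commit; the skillset name is hypothetical and the skills array is elided), a skillset skeleton looks roughly like this:

```json
{
  "name": "my-image-skillset",
  "description": "OCR, image analysis, and text merge over blob content",
  "skills": [
    { "...": "OCR, Image Analysis, and Text Merge skill definitions go here" }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "description": "Multi-service resource in the same region as the search service",
    "key": "<multi-service-resource-key>"
  }
}
```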
 Once the basic framework of your skillset is created and Cognitive Services is configured, you can focus on each individual image skill, defining inputs and source context, and mapping outputs to fields in either an index or knowledge store.
 
 > [!NOTE]
@@ -391,7 +406,7 @@ Image analysis output is illustrated in the JSON below (search result). The skil

 When the images you want to process are embedded in other files, such as PDF or DOCX, the enrichment pipeline will extract just the images and then pass them to OCR or image analysis for processing. Separation of image from text content occurs during the document cracking phase, and once the images are separated, they remain separate unless you explicitly merge the processed output back into the source text.
 
-[**Text Merge**](cognitive-search-skill-textmerger.md) is used to put image processing output back into the document. Although Text Merge is not a hard requirement, it's frequently invoked so that image output (OCR text, OCR layoutText, image tags, image captions) can be reintroduced into the document at the same location where the image was found. Essentially, the goal is to replace an embedded binary image with an in-place text equivalent.
+[**Text Merge**](cognitive-search-skill-textmerger.md) is used to put image processing output back into the document. Although Text Merge is not a hard requirement, it's frequently invoked so that image output (OCR text, OCR layoutText, image tags, image captions) can be reintroduced into the document. Depending on the skill, the image output replaces an embedded binary image with an in-place text equivalent. Image Analysis output can be merged at image location. OCR output always appears at the end of each page.
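To make the merge step concrete, here's a rough sketch (not part of this commit) of a Text Merge skill that splices OCR output from the normalized images back into the extracted document text; the `merged_text` target name is illustrative:

```json
{
  "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "description": "Merge OCR text back into the document content",
  "context": "/document",
  "insertPreTag": " ",
  "insertPostTag": " ",
  "inputs": [
    { "name": "text", "source": "/document/content" },
    { "name": "itemsToInsert", "source": "/document/normalized_images/*/text" },
    { "name": "offsets", "source": "/document/normalized_images/*/contentOffset" }
  ],
  "outputs": [
    { "name": "mergedText", "targetName": "merged_text" }
  ]
}
```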

 The following workflow outlines the process of image extraction, analysis, merging, and how to extend the pipeline to push image-processed output into other text-based skills such as Entity Recognition or Text Translation.

articles/search/cognitive-search-concept-intro.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ ms.custom: references_regions

 Because Azure Cognitive Search is a full text search solution, the purpose of AI enrichment is to improve the utility of your content in search-related scenarios:
 
-+ Machine translation and language detection, in support of multi-lingual search
++ Translation and language detection for multi-lingual search
 + Entity recognition extracts people, places, and other entities from large chunks of text
 + Key phrase extraction identifies and then outputs important terms
 + Optical Character Recognition (OCR) recognizes printed and handwritten text in binary files

articles/search/cognitive-search-skill-ocr.md

Lines changed: 1 addition & 3 deletions
@@ -58,9 +58,7 @@ In previous versions, there was a parameter called "textExtractionAlgorithm" to
 | `text` | Plain text extracted from the image. |
 | `layoutText` | Complex type that describes the extracted text and the location where the text was found.|
 
-
-The OCR skill always extracts images at the end of each page. This is by design.
-
+If you call OCR on images embedded in PDFs or other application files, the OCR output will be located at the bottom of the page, after any text that was extracted and processed.
 
 ## Sample definition
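The sample definition itself falls outside this diff; for orientation, a minimal OCR skill definition looks roughly like the following (the language code and `detectOrientation` value are illustrative):

```json
{
  "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
  "context": "/document/normalized_images/*",
  "defaultLanguageCode": "en",
  "detectOrientation": true,
  "inputs": [
    { "name": "image", "source": "/document/normalized_images/*" }
  ],
  "outputs": [
    { "name": "text", "targetName": "text" },
    { "name": "layoutText", "targetName": "layoutText" }
  ]
}
```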

articles/search/search-what-is-azure-search.md

Lines changed: 2 additions & 4 deletions
@@ -15,12 +15,10 @@ ms.custom: contperf-fy21q1

 Azure Cognitive Search ([formerly known as "Azure Search"](whats-new.md#new-service-name)) is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.
 
-Search is foundational to any app that surfaces text content to users, with common scenarios including catalog or document search, online retail, or data exploration over proprietary content.
-
-When you create a search service, you'll work with the following capabilities:
+Search is foundational to any app that surfaces text content to users, with common scenarios including catalog or document search, online retail, or data exploration over proprietary content. When you create a search service, you'll work with the following capabilities:
 
 + A search engine for full text search over a search index containing your user-owned content
-+ Rich indexing, with [text analysis](search-analyzers.md) and [optional AI enrichment](cognitive-search-concept-intro.md) for advanced content extraction and transformation
++ Rich indexing, with [text analysis](search-analyzers.md) and [optional AI enrichment](cognitive-search-concept-intro.md) for content extraction and transformation
 + Rich query syntax that supplements free text search with filters, autocomplete, regex, geo-search and more
 + Programmability through REST APIs and client libraries in Azure SDKs for .NET, Python, Java, and JavaScript
 + Azure integration at the data layer, machine learning layer, and AI (Cognitive Services)
