Commit 72b0bda

Merge pull request #91105 from HeidiSteen/heidist-master
AzS: Use AI with blobs
2 parents e813d9b + 9847793 commit 72b0bda

4 files changed

+150
-23
lines changed


articles/search/TOC.yml

Lines changed: 16 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -169,7 +169,7 @@
169169
href: search-howto-complex-data-types.md
170170
- name: Model relational data
171171
href: index-sql-relational-data.md
172-
- name: Load data
172+
- name: Indexing any data
173173
items:
174174
- name: Data import overview
175175
href: search-what-is-data-import.md
@@ -181,7 +181,21 @@
181181
href: search-howto-large-index.md
182182
- name: Handle concurrent updates
183183
href: search-howto-concurrency.md
184-
- name: Load data with indexers
184+
- name: Indexing Azure Blob data
185+
items:
186+
- name: Use AI with blob data
187+
href: search-blob-ai-integration.md
188+
- name: Add full text search
189+
href: search-blob-storage-integration.md
190+
- name: Set up a blob indexer
191+
href: search-howto-indexing-azure-blob-storage.md
192+
- name: Index one-to-many blobs
193+
href: search-howto-index-one-to-many-blobs.md
194+
- name: Index CSV blobs
195+
href: search-howto-index-csv-blobs.md
196+
- name: Index JSON blobs
197+
href: search-howto-index-json-blobs.md
198+
- name: Indexing with "indexers"
185199
items:
186200
- name: Indexers overview
187201
href: search-indexer-overview.md
@@ -191,16 +205,6 @@
191205
href: search-howto-connecting-azure-sql-database-to-azure-search-using-indexers.md
192206
- name: Azure Cosmos DB indexer
193207
href: search-howto-index-cosmosdb.md
194-
- name: Azure Blob Storage indexer
195-
items:
196-
- name: Set up a blob indexer
197-
href: search-howto-indexing-azure-blob-storage.md
198-
- name: Index one-to-many blobs
199-
href: search-howto-index-one-to-many-blobs.md
200-
- name: Index CSV blobs
201-
href: search-howto-index-csv-blobs.md
202-
- name: Index JSON blobs
203-
href: search-howto-index-json-blobs.md
204208
- name: Schedule indexers
205209
href: search-howto-schedule-indexers.md
206210
- name: Map fields

articles/search/index.yml

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,22 @@ landingContent:
5050
- text: Introduction to Azure Search
5151
url: https://docs.microsoft.com/learn/modules/intro-to-azure-search/
5252

53+
# Card
54+
- title: Use with Blob storage
55+
linkLists:
56+
- linkListType: how-to-guide
57+
links:
58+
- text: Use AI to understand blob data
59+
url: search-blob-ai-integration.md
60+
- text: Add full text search to blob data
61+
url: search-blob-storage-integration.md
62+
- text: Store AI enrichments
63+
url: knowledge-store-create-portal.md
64+
- linkListType: tutorial
65+
links:
66+
- text: Index semi-structured blob data
67+
url: search-semi-structured-data.md
68+
5369
# Card
5470
- title: Index your data
5571
linkLists:

articles/search/search-blob-ai-integration.md

Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,107 @@
1+
---
2+
title: Use AI to understand Blob data
3+
titleSuffix: Azure Search
4+
description: Add semantic, natural language processing and image analysis to Azure blobs using an AI enrichment pipeline in Azure Search.
5+
6+
manager: nitinme
7+
author: HeidiSteen
8+
ms.author: heidist
9+
ms.service: search
10+
ms.topic: conceptual
11+
ms.date: 10/09/2019
12+
---
13+
14+
# Use AI to understand Blob data
15+
16+
Data in Azure Blob storage is often a variety of unstructured content such as images, long text, PDFs, and Office documents. By using the AI capabilities in Azure Search, you can understand and extract valuable information from blobs in a variety of ways. Examples of applying AI to blob content include:
17+
18+
+ Extract text from images using optical character recognition (OCR)
19+
+ Produce a scene description or tags from a photo
20+
+ Detect language and translate text into different languages
21+
+ Process text with named entity recognition (NER) to find references to people, dates, places, or organizations
22+
23+
While you might need just one of these AI capabilities, it’s common to combine several of them in the same pipeline (for example, extracting text from a scanned image and then finding all the dates and places referenced in it).
24+
25+
AI enrichment creates new information, captured as text, stored in fields. Post-enrichment, you can access this information from a search index through full text search, or send enriched documents back to Azure storage to power new application experiences that include exploring data for discovery or analytics scenarios.
26+
27+
In this article, we view AI enrichment through a wide lens so that you can quickly grasp the entire process, from raw data in blobs to queryable information in either a search index or a knowledge store.
28+
29+
## What it means to "enrich" blob data
30+
31+
*AI enrichment* is part of the indexing architecture of Azure Search that integrates built-in AI from Microsoft or custom AI that you provide. It helps you implement end-to-end scenarios where you need to process blobs (both existing ones and new ones as they come in or are updated), crack open all file formats to extract images and text, extract the desired information using various AI capabilities, and index the results in an Azure Search index for fast search, retrieval, and exploration.
32+
33+
Inputs are your blobs, in a single container, in Azure Blob storage. Blobs can be almost any kind of text or image data.
34+
35+
Output is always an Azure Search index, used for fast text search, retrieval, and exploration in client applications. Output can also be a *knowledge store* that projects enriched documents into Azure blobs or Azure tables for downstream analysis in tools like Power BI or in data science workloads.
36+
37+
In between is the pipeline architecture itself. The pipeline is based on the *indexer* feature, to which you can assign a *skillset*, which is composed of one or more *skills* providing the AI. The purpose of the pipeline is to produce *enriched documents* that enter as raw content but pick up additional structure, context, and information while moving through the pipeline. Enriched documents are consumed during indexing to create inverted indexes and other structures used in full text search or exploration and analytics.
38+
39+
## How to get started
40+
41+
You can start directly in your storage account portal page. Click **Add Azure Search** and create a new Azure Search service or select an existing one. If you already have an existing search service in the same subscription, clicking **Add Azure Search** opens the Import data wizard so that you can immediately step through indexing, enrichment, and index definition.
42+
43+
Once you add Azure Search to your storage account, you can follow the standard process to enrich data in any Azure data source. Assuming you already have blob content, you can use the Import data wizard in Azure Search for an easy initial introduction to AI enrichment. This quickstart explains the steps: [Create an AI enrichment pipeline in the portal](cognitive-search-quickstart-blob.md).
44+
45+
In the following sections, we'll explore more components and concepts.
46+
47+
## Use Blob indexers
48+
49+
AI enrichment is an add-on to an indexing pipeline, and in Azure Search, those pipelines are built on top of an *indexer*. An indexer is a data-source-aware subservice equipped with internal logic for sampling data, reading metadata, retrieving data, and serializing data from native formats into JSON documents for subsequent import. Indexers are often used by themselves for import, separate from AI, but if you want to build an AI enrichment pipeline, you will need an indexer and a skillset to go with it. In this section, we'll focus on the indexer itself.
50+
51+
Blobs in Azure Storage are indexed using the [Azure Search Blob storage indexer](search-howto-indexing-azure-blob-storage.md). You invoke this indexer by setting the type, and by providing connection information that includes an Azure Storage account along with a blob container. Unless you've previously organized blobs into a virtual directory, which you can then pass as a parameter, the Blob indexer pulls from the entire container.
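
As a rough illustration of the connection information described above, the following Python sketch builds a data source definition and an indexer definition as plain dictionaries. The property names (`type`, `credentials.connectionString`, `container.name`, `container.query`, `dataSourceName`, `targetIndexName`) follow the Azure Search REST API as we understand it. The helper functions, resource names, and placeholder connection string are hypothetical and are not part of the original article.

```python
import json


def blob_data_source(name, connection_string, container, virtual_directory=None):
    """Build a data source definition for the Blob indexer (illustrative helper)."""
    container_def = {"name": container}
    if virtual_directory:
        # Optional: limit indexing to one virtual directory inside the container.
        container_def["query"] = virtual_directory
    return {
        "name": name,
        "type": "azureblob",  # setting the type is what selects the Blob indexer
        "credentials": {"connectionString": connection_string},
        "container": container_def,
    }


def blob_indexer(name, data_source_name, target_index_name):
    """Build a minimal indexer definition that reads from the data source."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
    }


# Placeholder values only; substitute your own account, container, and index.
data_source = blob_data_source(
    name="blob-datasource",
    connection_string="<your-storage-connection-string>",
    container="my-container",
    virtual_directory="reports/2019",
)
indexer = blob_indexer("blob-indexer", "blob-datasource", "my-target-index")

print(json.dumps(data_source, indent=2))
print(json.dumps(indexer, indent=2))
```

Omitting `virtual_directory` leaves out the `query` property, which corresponds to the default behavior of pulling from the entire container.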
52+
53+
An indexer performs *document cracking*, which is the first step in the pipeline after connecting to the data source. For blob data, this is where PDFs, Office documents, images, and other content types are detected. Document cracking with text extraction is free of charge. Document cracking with image extraction is charged at rates you can find on the Azure Search [pricing page](https://azure.microsoft.com/pricing/details/search/).
54+
55+
Although all documents will be cracked, enrichment only occurs if you explicitly provide the skills to do so. For example, if your pipeline consists exclusively of text analytics, any images in your container or documents will be ignored.
56+
57+
The Blob indexer comes with configuration parameters and supports change tracking if the underlying data provides sufficient information. You can learn more about the core functionality in [Azure Search Blob storage indexer](search-howto-indexing-azure-blob-storage.md).
58+
59+
## Add AI
60+
61+
*Skills* are the individual components of AI processing that you can use standalone or in combination with other skills for sequential processing.
62+
63+
+ Built-in skills are backed by Cognitive Services, with image analysis based on Computer Vision, and natural language processing based on Text Analytics. A few examples are [OCR](cognitive-search-skill-ocr.md), [Entity Recognition](cognitive-search-skill-entity-recognition.md), and [Image Analysis](cognitive-search-skill-image-analysis.md). You can review the full list of built-in skills in [Predefined skills for content enrichment](cognitive-search-predefined-skills.md).
64+
65+
+ Custom skills are custom code, wrapped in an interface definition that allows for integration into the pipeline. In customer solutions, it's common practice to use both built-in and custom skills, with custom skills providing open-source, third-party, or first-party AI modules.
66+
67+
A *skillset* is the collection of skills used in a pipeline, and it's invoked after the document cracking phase makes content available. An indexer can consume exactly one skillset, but that skillset exists independently of an indexer so that you can reuse it in other scenarios.
68+
69+
Custom skills might sound complex but can be simple and straightforward to implement. If you have existing packages that provide pattern matching or classification models, the content you extract from blobs could be passed to these models for processing. Since AI enrichment is Azure-based, your model should also be hosted on Azure. Some common hosting methodologies include using [Azure Functions](cognitive-search-create-custom-skill-example.md) or [Containers](https://github.com/Microsoft/SkillsExtractorCognitiveSearch).
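
To give a sense of how small a custom skill can be, here is a minimal Python sketch of the processing logic only, without any hosting code. It assumes the custom skill interface exchanges JSON bodies with a `values` array whose records carry a `recordId` and a `data` object, which is our understanding of the custom Web API skill contract. The date pattern, the `text` input name, and the `dates` output name are hypothetical choices made for this example.

```python
import re

# Hypothetical pattern-matching "model": finds ISO-style dates such as 2019-10-09.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def run_custom_skill(request_body):
    """Process every record in the request and return one result per record."""
    results = []
    for record in request_body.get("values", []):
        text = record.get("data", {}).get("text", "")
        results.append(
            {
                # Echo the recordId so results can be matched back to inputs.
                "recordId": record["recordId"],
                "data": {"dates": DATE_PATTERN.findall(text)},
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


sample_request = {
    "values": [
        {"recordId": "1", "data": {"text": "Signed on 2019-10-09 in Redmond."}},
        {"recordId": "2", "data": {"text": "No dates here."}},
    ]
}
print(run_custom_skill(sample_request))
```

A function like this could sit behind an HTTP endpoint in Azure Functions or a container; the hosting wrapper is left out here because it depends on the platform you choose.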
70+
71+
Built-in skills backed by Cognitive Services require an [attached Cognitive Services](cognitive-search-attach-cognitive-services.md) all-in-one subscription key that gives you access to the resource. An all-in-one key gives you image analysis, language detection, text translation, and text analytics. Other built-in skills are features of Azure Search and require no additional service or key. Text shaper, splitter, and merger are examples of helper skills that are sometimes necessary when designing the pipeline.
72+
73+
If you use only custom skills and built-in utility skills, there is no dependency on Cognitive Services and no related cost.
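
The relationship between skills, a skillset, and the Cognitive Services key can be sketched as a request body built in Python. This is an illustrative sketch, not a complete definition: the `@odata.type` values and the `cognitiveServices` section reflect the Azure Search REST API as we understand it, while the helper function, the example URI, the field names, and the placeholder key are assumptions made for this example.

```python
import json


def build_skillset(description, skills, cognitive_services_key=None):
    """Assemble a skillset body; attach a Cognitive Services key only if given."""
    skillset = {"description": description, "skills": skills}
    if cognitive_services_key:
        # Needed when the skillset includes built-in skills backed by Cognitive Services.
        skillset["cognitiveServices"] = {
            "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
            "key": cognitive_services_key,
        }
    return skillset


# Built-in skill backed by Cognitive Services (Text Analytics).
entity_skill = {
    "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
    "context": "/document",
    "categories": ["Person", "Location", "Organization"],
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "persons", "targetName": "people"}],
}

# Custom skill: your own code behind an HTTP endpoint (hypothetical URI).
custom_skill = {
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "context": "/document",
    "uri": "https://example.invalid/api/find-dates",
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "dates", "targetName": "dates"}],
}

with_builtin = build_skillset(
    "Entities plus custom dates", [entity_skill, custom_skill], "<your-all-in-one-key>"
)
custom_only = build_skillset("Custom dates only", [custom_skill])

print(json.dumps(with_builtin, indent=2))
print(json.dumps(custom_only, indent=2))
```

The second skillset contains no `cognitiveServices` section, mirroring the case where only custom and utility skills are used.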
74+
75+
## Order of operations
76+
77+
Now that we've covered indexers, content extraction, and skills, we can take a closer look at pipeline mechanisms and order of operations.
78+
79+
A skillset is a composition of one or more skills. When multiple skills are involved, the skillset operates as a sequential pipeline, producing a dependency graph in which output from one skill becomes input to another.
80+
81+
For example, given a large blob of unstructured text, a sample order of operations for text analytics might be as follows:
82+
83+
1. Use Text Splitter to break the blob into smaller parts.
84+
1. Use Language Detection to determine if content is English or another language.
85+
1. Use Text Translator to get all text into a common language.
86+
1. Run Entity Recognition, Key Phrase Extraction, or Sentiment Analysis on chunks of text. In this step, new fields are created and populated. Entities might be locations, people, organizations, or dates. Key phrases are short combinations of words that appear to belong together. Sentiment score is a rating on a continuum from negative (0) to positive (1) sentiment.
87+
1. Use Text Merger to reconstitute the document from the smaller chunks.
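
The idea that skill outputs feed later skill inputs can be modeled as a small dependency graph. The Python sketch below is a conceptual illustration only and does not call Azure Search: the skill names and field names are hypothetical, and the ordering is computed with the standard library's `graphlib` module (Python 3.9 or later), which is our choice for the example rather than a description of how the service schedules skills internally.

```python
from graphlib import TopologicalSorter

# Hypothetical skills: each lists the fields it reads and the fields it writes.
SKILLS = {
    "split": {"inputs": ["content"], "outputs": ["pages"]},
    "detect_language": {"inputs": ["pages"], "outputs": ["language"]},
    "translate": {"inputs": ["pages", "language"], "outputs": ["translated"]},
    "entities": {"inputs": ["translated"], "outputs": ["entities"]},
    "merge": {"inputs": ["translated"], "outputs": ["merged"]},
}


def execution_order(skills):
    """Return an order in which every skill runs after the skills it depends on."""
    # Map each output field to the skill that produces it.
    producer = {
        field: name for name, spec in skills.items() for field in spec["outputs"]
    }
    # For each skill, collect the producers of its inputs. Fields with no
    # producer (such as the raw blob content) are treated as already available.
    graph = {
        name: {producer[field] for field in spec["inputs"] if field in producer}
        for name, spec in skills.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(SKILLS)
print(order)
```

In this model, `entities` and `merge` both depend only on `translate`, so either may come first; the other three skills always run in the order split, detect language, translate.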
88+
89+
90+
## Outputs and use cases
91+
92+
An enriched document at the end of the pipeline differs from its original input version by the presence of additional fields containing new information that was extracted or generated during enrichment. As such, you can work with a combination of original and created values in several ways.
93+
94+
The output formats are a search index in Azure Search or a knowledge store in Azure Storage.
95+
96+
In Azure Search, enriched documents are formatted in JSON and can be indexed in the same way all documents are indexed, with the benefits an indexer provides. Fields from enriched documents are mapped to an index schema. During indexing, the blob indexer refers to configuration parameters and settings to utilize any field mappings or change detection logic that you've specified. Post-indexing, when content is stored on Azure Search, you can build rich queries and filter expressions to understand your content.
97+
98+
In Azure Storage, a knowledge store has two manifestations: a blob container, or tables in Table storage. A blob container captures enriched documents in their entirety, which is useful if you want to feed into other processes. In contrast, Table storage can accommodate physical projections of enriched documents. You can create slices or layers of enriched documents that include or exclude specific parts. For analysis in Power BI, the tables in Azure Table storage become the data source for further visualization and exploration.
99+
100+
## Next steps
101+
102+
There’s a lot more you can do with AI enrichment to get the most out of your data in Azure Storage, including combining Cognitive Services in different ways, and authoring custom skills for cases where there’s no existing Cognitive Service for the scenario. You can learn more by following the links below.
103+
104+
> [!div class="nextstepaction"]
105+
> [AI enrichment overview](cognitive-search-concept-intro.md)
106+
> [Create a skillset](cognitive-search-defining-skillset.md)
107+
> [Map nodes in an annotation tree](cognitive-search-output-field-mapping.md)

articles/search/search-blob-storage-integration.md

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,17 @@
11
---
2-
title: Add full text search to Azure Blob Storage - Azure Search
3-
description: Crawl text content in Azure Blob storage for Azure Search indexing, in code using the HTTP REST API.
4-
services: search
2+
title: Add full text search to Azure Blob Storage
3+
titleSuffix: Azure Search
4+
description: Extract content and add structure to Azure blobs when building a full text search index in Azure Search.
5+
6+
manager: nitinme
7+
author: HeidiSteen
8+
ms.author: heidist
59
ms.service: search
610
ms.topic: conceptual
7-
ms.date: 03/01/2019
8-
author: mgottein
9-
manager: nitinme
10-
ms.author: magottei
11-
ms.custom: seodec2018
11+
ms.date: 10/09/2019
1212
---
1313

14-
# Searching Blob storage with Azure Search
14+
# Add full text search to Azure blob data using Azure Search
1515

1616
Searching across the variety of content types stored in Azure Blob storage can be a difficult problem to solve. However, you can index and search the content of your Blobs in just a few clicks by using Azure Search. Searching over Blob storage requires provisioning an Azure Search service. The various service limits and pricing tiers of Azure Search can be found on the [pricing page](https://aka.ms/azspricing).
1717

@@ -40,12 +40,12 @@ Azure Search can be configured to extract structured content found in blobs that
4040

4141
JSON parsing is not currently configurable through the portal. [Learn more about JSON parsing in Azure Search.](https://aka.ms/azsjsonblobindexing)
4242

43-
## Quick start
43+
## Quickstart
4444
Azure Search can be added to blobs directly from the Blob storage portal page.
4545

4646
![](./media/search-blob-storage-integration/blob-blade.png)
4747

4848
Click **Add Azure Search** to launch a flow where you can select an existing Azure Search service or create a new service. If you create a new service, you are navigated out of your Storage account's portal experience. You can navigate back to the Storage portal page and re-select the **Add Azure Search** option, where you can select the existing service.
4949

50-
## Next Steps
50+
## Next steps
5151
Learn more about the Azure Search Blob Indexer in the full [documentation](https://aka.ms/azsblobindexer).
