Commit 80d3c1d

Merge pull request #66295 from arv100kri/jsonlines
[Azure Search] Expose jsonLines parsing mode in indexer configuration
2 parents e68cbfc + d1c5453 commit 80d3c1d

File tree

4 files changed

+163
-12
lines changed

articles/search/TOC.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -159,6 +159,8 @@
159159
href: search-howto-connecting-azure-sql-database-to-azure-search-using-indexers.md
160160
- name: Cosmos DB indexer
161161
href: search-howto-index-cosmosdb.md
162+
- name: One-to-many blob indexing
163+
href: search-howto-index-one-to-many-blobs.md
162164
- name: CSV blob indexing
163165
href: search-howto-index-csv-blobs.md
164166
- name: JSON blob indexing

articles/search/search-howto-index-csv-blobs.md

Lines changed: 3 additions & 0 deletions
@@ -26,6 +26,9 @@ In this article, you will learn how to parse CSV blobs with an Azure Search blob
2626
> CSV blob indexing is currently in public preview and should not be used in production environments. For more information, see [REST api-version=2017-11-11-Preview](search-api-2017-11-11-preview.md).
2727
>
2828
29+
> [!NOTE]
30+
> Follow the indexer configuration recommendations in [One-to-many indexing](search-howto-index-one-to-many-blobs.md) to output multiple search documents from one Azure blob.
31+
2932
## Setting up CSV indexing
3033
To index CSV blobs, create or update an indexer definition with the `delimitedText` parsing mode:
3134

articles/search/search-howto-index-json-blobs.md

Lines changed: 49 additions & 12 deletions
@@ -19,7 +19,13 @@ This article shows you how to configure an Azure Search blob indexer to extract
1919

2020
You can use the [portal](#json-indexer-portal), [REST APIs](#json-indexer-rest), or [.NET SDK](#json-indexer-dotnet) to index JSON content. Common to all approaches is that JSON documents are located in a blob container in an Azure Storage account. For guidance on pushing JSON documents from other non-Azure platforms, see [Data import in Azure Search](search-what-is-data-import.md).
2121

22-
JSON blobs in Azure Blob storage are typically either a single JSON document or a JSON array. The blob indexer in Azure Search can parse either construction depending on how you set the **parsingMode** parameter on the request.
22+
JSON blobs in Azure Blob storage are typically either a single JSON document or a collection of JSON entities. For JSON collections, the blob could have an **array** of well-formed JSON elements. Blobs could also be composed of multiple individual JSON entities separated by a newline. The blob indexer in Azure Search can parse any such construction, depending on how you set the **parsingMode** parameter on the request.
23+
24+
> [!IMPORTANT]
25+
> `json` and `jsonArray` parsing modes are generally available, but `jsonLines` parsing mode is in public preview and should not be used in production environments. For more information, see [REST api-version=2017-11-11-Preview](search-api-2017-11-11-preview.md).
26+
27+
> [!NOTE]
28+
> Follow the indexer configuration recommendations in [One-to-many indexing](search-howto-index-one-to-many-blobs.md) to output multiple search documents from one Azure blob.
2329
2430
<a name="json-indexer-portal"></a>
2531

@@ -48,11 +54,13 @@ In the **data source** page, the source must be **Azure Blob Storage**, with the
4854

4955
+ **Data to extract** should be *Content and metadata*. Choosing this option allows the wizard to infer an index schema and map the fields for import.
5056

51-
+ **Parsing mode** should be set to *JSON* or *JSON array*.
57+
+ **Parsing mode** should be set to *JSON*, *JSON array* or *JSON lines*.
5258

5359
*JSON* articulates each blob as a single search document, showing up as an independent item in search results.
5460

55-
*JSON array* is for blobs composed of multiple elements, where you want each element to be articulated as a standalone, independent search document. If blobs are complex, and you don't choose *JSON array* the entire blob is ingested as a single document.
61+
*JSON array* is for blobs that contain well-formed JSON in which the content is an array of objects, or has a property that is an array of objects, and you want each element to be articulated as a standalone, independent search document. If blobs are complex, and you don't choose *JSON array*, the entire blob is ingested as a single document.
62+
63+
*JSON lines* is for blobs composed of multiple JSON entities separated by newlines, where you want each entity to be articulated as a standalone, independent search document. If blobs are complex, and you don't choose the *JSON lines* parsing mode, the entire blob is ingested as a single document.
5664

5765
+ **Storage container** must specify your storage account and container, or a connection string that resolves to the container. You can get connection strings on the Blob service portal page.
5866

@@ -113,12 +121,13 @@ For code-based JSON indexing, use [Postman](search-fiddler.md) and the REST API
113121

114122
In contrast with the portal wizard, a code approach requires that you have an index in-place, ready to accept the JSON documents when you send the **Create Indexer** request.
115123

116-
JSON blobs in Azure Blob storage are typically either a single JSON document or a JSON array. The blob indexer in Azure Search can parse either construction, depending on how you set the **parsingMode** parameter on the request.
124+
JSON blobs in Azure Blob storage are typically either a single JSON document or a collection of JSON entities, stored either as a JSON array or as individual entities separated by newlines. The blob indexer in Azure Search can parse any of these constructions, depending on how you set the **parsingMode** parameter on the request.
117125

118126
| JSON document | parsingMode | Description | Availability |
119127
|--------------|-------------|--------------|--------------|
120-
| One per blob | `json` | Parses JSON blobs as a single chunk of text. Each JSON blob becomes a single Azure Search document. | Generally available in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) APIs. |
121-
| Multiple per blob | `jsonArray` | Parses a JSON array in the blob, where each element of the array becomes a separate Azure Search document. | Generally available in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) APIs. |
128+
| One per blob | `json` | Parses JSON blobs as a single chunk of text. Each JSON blob becomes a single Azure Search document. | Generally available in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) API and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) SDK. |
129+
| Multiple per blob | `jsonArray` | Parses a JSON array in the blob, where each element of the array becomes a separate Azure Search document. | Generally available in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) API and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) SDK. |
130+
| Multiple per blob | `jsonLines` | Parses a blob that contains multiple JSON entities separated by newlines, where each entity becomes a separate Azure Search document. | Available in preview in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) API and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) SDK. |
122131

123132
### 1 - Assemble inputs for the request
124133

@@ -205,12 +214,16 @@ Until now, definitions for the data source and index have been parsingMode agnos
205214

206215
+ Set **parsingMode** to `json` to index each blob as a single document.
207216

208-
+ Set **parsingMode** to `jsonArray` if your blobs consist of JSON arrays, and you need each element of the array to become a separate document in Azure Search. You can think of a document as a single item in search results. If you want each element in the array to show up in search results as an independent item, then use the `jsonArray` option.
217+
+ Set **parsingMode** to `jsonArray` if your blobs consist of JSON arrays, and you need each element of the array to become a separate document in Azure Search.
218+
219+
+ Set **parsingMode** to `jsonLines` if your blobs consist of multiple JSON entities separated by newlines, and you need each entity to become a separate document in Azure Search.
209220

210-
For JSON arrays, if the array exists as a lower-level property, you can set a document root indicating where the array is placed within the blob.
221+
You can think of a document as a single item in search results. If you want each array element or newline-separated entity in a blob to show up in search results as an independent item, use the `jsonArray` or `jsonLines` option as appropriate.
222+
223+
Within the indexer definition, you can optionally use [field mappings](search-indexer-field-mappings.md) to choose which properties of the source JSON document are used to populate your target search index. For `jsonArray` parsing mode, if the array exists as a lower-level property, you can set a document root indicating where the array is placed within the blob.
211224

212225
> [!IMPORTANT]
213-
> When you use `json` or `jsonArray` parsing mode, Azure Search assumes that all blobs in your data source contain JSON. If you need to support a mix of JSON and non-JSON blobs in the same data source, let us know on [our UserVoice site](https://feedback.azure.com/forums/263029-azure-search).
226+
> When you use `json`, `jsonArray` or `jsonLines` parsing mode, Azure Search assumes that all blobs in your data source contain JSON. If you need to support a mix of JSON and non-JSON blobs in the same data source, let us know on [our UserVoice site](https://feedback.azure.com/forums/263029-azure-search).
214227
215228

216229
### How to parse single JSON blobs
@@ -229,7 +242,7 @@ The blob indexer parses the JSON document into a single Azure Search document. T
229242

230243
As noted, field mappings are not required. Given an index with "text", "datePublished", and "tags" fields, the blob indexer can infer the correct mapping without a field mapping present in the request.
231244

232-
### How to parse JSON arrays
245+
### How to parse JSON arrays in a well-formed JSON document
233246

234247
Alternatively, you can opt for the JSON array feature. This capability is useful when blobs contain an *array of JSON objects*, and you want each element to become a separate Azure Search document. For example, given the following JSON blob, you can populate your Azure Search index with three separate documents, each with "id" and "text" fields.
235248

@@ -239,7 +252,7 @@ Alternatively, you can opt for the JSON array feature. This capability is useful
239252
{ "id" : "3", "text" : "example 3" }
240253
]
241254

242-
For a JSON array, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the `jsonArray` parser. Specifying the right parser and having the right data input are the only two array-specific requirements for indexing JSON blobs.
255+
For a JSON array, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the `jsonArray` parser. Specifying the right parser and having the right data input are the only two array-specific requirements for indexing JSON blobs.
243256

244257
POST https://[service name].search.windows.net/indexers?api-version=2017-11-11
245258
Content-Type: application/json
@@ -278,7 +291,31 @@ Use this configuration to index the array contained in the `level2` property:
278291
"parameters" : { "configuration" : { "parsingMode" : "jsonArray", "documentRoot" : "/level1/level2" } }
279292
}
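
As a rough illustration of what `documentRoot` selects, the following Python sketch walks a slash-separated path into a parsed blob and returns the array found there. It is not the indexer's implementation; the function name `resolve_document_root` and the sample blob are assumptions made for this example.

```python
import json


def resolve_document_root(blob_text, document_root):
    """Return the JSON array located at a slash-separated path such as
    "/level1/level2". Illustrative only; not the indexer's own code."""
    node = json.loads(blob_text)
    # Skip the empty segment produced by the leading slash.
    for segment in filter(None, document_root.split("/")):
        node = node[segment]
    if not isinstance(node, list):
        raise ValueError(f"{document_root!r} does not point at a JSON array")
    return node


# Hypothetical blob with the array nested two levels down.
sample_blob = """
{
  "level1": {
    "level2": [
      { "id": "1", "text": "first element" },
      { "id": "2", "text": "second element" }
    ]
  }
}
"""

documents = resolve_document_root(sample_blob, "/level1/level2")
for document in documents:
    print(document["id"], document["text"])
```

Each element of the returned list corresponds to what would become one search document under `jsonArray` parsing with that document root.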
280293

281-
### Field mappings
294+
### How to parse blobs with multiple JSON entities separated by newlines
295+
296+
If your blob contains multiple JSON entities separated by newlines, and you want each entity to become a separate Azure Search document, you can opt for the JSON lines feature. For example, given the following blob (where there are three different JSON entities), you can populate your Azure Search index with three separate documents, each with "id" and "text" fields.
297+
298+
{ "id" : "1", "text" : "example 1" }
299+
{ "id" : "2", "text" : "example 2" }
300+
{ "id" : "3", "text" : "example 3" }
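
To make the one-entity-per-line idea concrete, here is a minimal Python sketch that splits such a blob into individual entities. It only mimics the effect of `jsonLines` parsing locally; the helper name `split_json_lines` is an assumption for this example, and the indexer's actual handling of edge cases (blank lines, malformed lines) may differ.

```python
import json


def split_json_lines(blob_text):
    """Parse each non-empty line as its own JSON entity."""
    return [json.loads(line) for line in blob_text.splitlines() if line.strip()]


blob = """{ "id" : "1", "text" : "example 1" }
{ "id" : "2", "text" : "example 2" }
{ "id" : "3", "text" : "example 3" }
"""

documents = split_json_lines(blob)
for document in documents:
    print(document)
```

Note that the blob as a whole is not valid JSON; only each line is, which is why `json` or `jsonArray` parsing would not fit this layout.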
301+
302+
For JSON lines, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the `jsonLines` parser.
303+
304+
POST https://[service name].search.windows.net/indexers?api-version=2017-11-11
305+
Content-Type: application/json
306+
api-key: [admin key]
307+
308+
{
309+
"name" : "my-json-indexer",
310+
"dataSourceName" : "my-blob-datasource",
311+
"targetIndexName" : "my-target-index",
312+
"schedule" : { "interval" : "PT2H" },
313+
"parameters" : { "configuration" : { "parsingMode" : "jsonLines" } }
314+
}
315+
316+
Again, notice that field mappings can be omitted, similar to the `jsonArray` parsing mode.
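
If you assemble indexer definitions in code, a small helper can keep the three JSON parsing modes straight. The Python sketch below only builds the request body shown in the examples above and does not call the service; `build_indexer_definition` is a hypothetical helper, and its rule that `documentRoot` is accepted only with `jsonArray` is this sketch's reading of the guidance above rather than documented service behavior.

```python
import json

# JSON parsing modes named in this article.
JSON_PARSING_MODES = {"json", "jsonArray", "jsonLines"}


def build_indexer_definition(name, data_source, target_index,
                             parsing_mode, document_root=None,
                             interval="PT2H"):
    """Build a Create Indexer request body as a plain dict."""
    if parsing_mode not in JSON_PARSING_MODES:
        raise ValueError(f"unsupported parsingMode: {parsing_mode!r}")
    configuration = {"parsingMode": parsing_mode}
    if document_root is not None:
        # Assumption of this sketch: a document root only makes sense
        # when the blob is parsed as a JSON array.
        if parsing_mode != "jsonArray":
            raise ValueError("documentRoot applies to jsonArray only")
        configuration["documentRoot"] = document_root
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
        "schedule": {"interval": interval},
        "parameters": {"configuration": configuration},
    }


body = build_indexer_definition(
    "my-json-indexer", "my-blob-datasource", "my-target-index", "jsonLines"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the POST request shown above.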
317+
318+
### Using field mappings to build search documents
282319

283320
When source and target fields are not perfectly aligned, you can define a field mapping section in the request body for explicit field-to-field associations.
284321

articles/search/search-howto-index-one-to-many-blobs.md

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
1+
---
2+
title: Index blobs containing multiple search index documents from Azure Blob indexer for full text search - Azure Search
3+
description: Crawl Azure blobs for text content using the Azure Search Blob indexer. Each blob might contain one or more Azure Search index documents.
4+
5+
ms.date: 02/12/2019
6+
author: arv100kri
7+
manager: briansmi
8+
ms.author: arjagann
9+
10+
services: search
11+
ms.service: search
12+
ms.devlang: rest-api
13+
ms.topic: conceptual
14+
ms.custom: seofeb2018
15+
---
16+
17+
# Indexing blobs producing multiple search documents
18+
By default, a blob indexer will treat the contents of a blob as a single search document. Certain **parsingMode** values support scenarios where an individual blob can result in multiple search documents. The different types of **parsingMode** that allow an indexer to extract more than one search document from a blob are:
19+
+ `delimitedText`
20+
+ `jsonArray`
21+
+ `jsonLines`
22+
23+
> [!IMPORTANT]
24+
> The `delimitedText` and `jsonLines` parsing modes are in public preview and should not be used in production environments; `jsonArray` is generally available. For more information, see [REST api-version=2017-11-11-Preview](search-api-2017-11-11-preview.md).
25+
26+
## One-to-many document key
27+
Each document that shows up in an Azure Search index is uniquely identified by a document key.
28+
29+
When no parsing mode is specified, and there is no explicit mapping for the key field in the index, Azure Search automatically [maps](search-indexer-field-mappings.md) the `metadata_storage_path` property as the key. This mapping ensures that each blob appears as a distinct search document.
30+
31+
When using any of the parsing modes listed above, one blob maps to "many" search documents, making a document key solely based on blob metadata unsuitable. To overcome this constraint, Azure Search is capable of generating a "one-to-many" document key for each individual entity extracted from a blob. This property is named `AzureSearch_DocumentKey` and is added to each individual entity extracted from the blob. The value of this property is guaranteed to be unique for each individual entity _across blobs_ and the entities will show up as separate search documents.
32+
33+
By default, when no explicit field mappings for the key index field are specified, the `AzureSearch_DocumentKey` is mapped to it, using the `base64Encode` field-mapping function.
34+
35+
## Example
36+
Assume you have an index definition with the following fields:
37+
+ `id`
38+
+ `temperature`
39+
+ `pressure`
40+
+ `timestamp`
41+
42+
And your blob container has blobs with the following structure:
43+
44+
_Blob1.json_
45+
46+
{ "temperature": 100, "pressure": 100, "timestamp": "2019-02-13T00:00:00Z" }
47+
{ "temperature" : 33, "pressure" : 30, "timestamp": "2019-02-14T00:00:00Z" }
48+
49+
_Blob2.json_
50+
51+
{ "temperature": 1, "pressure": 1, "timestamp": "2018-01-12T00:00:00Z" }
52+
{ "temperature" : 120, "pressure" : 3, "timestamp": "2013-05-11T00:00:00Z" }
53+
54+
When you create an indexer and set the **parsingMode** to `jsonLines` without specifying any explicit field mappings for the key field, the following mapping is applied implicitly:
55+
56+
{
57+
"sourceFieldName" : "AzureSearch_DocumentKey",
58+
"targetFieldName": "id",
59+
"mappingFunction": { "name" : "base64Encode" }
60+
}
61+
62+
This setup results in the Azure Search index containing the following information (base64-encoded id shortened for brevity):
63+
64+
| id | temperature | pressure | timestamp |
65+
|----|-------------|----------|-----------|
66+
| aHR0 ... YjEuanNvbjsx | 100 | 100 | 2019-02-13T00:00:00Z |
67+
| aHR0 ... YjEuanNvbjsy | 33 | 30 | 2019-02-14T00:00:00Z |
68+
| aHR0 ... YjIuanNvbjsx | 1 | 1 | 2018-01-12T00:00:00Z |
69+
| aHR0 ... YjIuanNvbjsy | 120 | 3 | 2013-05-11T00:00:00Z |
70+
71+
## Custom field mapping for index key field
72+
73+
Assuming the same index definition as the previous example, say your blob container has blobs with the following structure:
74+
75+
_Blob1.json_
76+
77+
recordid, temperature, pressure, timestamp
78+
1, 100, 100,"2019-02-13T00:00:00Z"
79+
2, 33, 30,"2019-02-14T00:00:00Z"
80+
81+
_Blob2.json_
82+
83+
recordid, temperature, pressure, timestamp
84+
1, 1, 1,"2018-01-12T00:00:00Z"
85+
2, 120, 3,"2013-05-11T00:00:00Z"
86+
87+
When you create an indexer with `delimitedText` **parsingMode**, it might feel natural to set up a field mapping to the key field as follows:
88+
89+
{
90+
"sourceFieldName" : "recordid",
91+
"targetFieldName": "id"
92+
}
93+
94+
However, this mapping will _not_ result in 4 documents showing up in the index, because the `recordid` field is not unique _across blobs_. Hence, we recommend that you use the implicit field mapping applied from the `AzureSearch_DocumentKey` property to the key index field for "one-to-many" parsing modes.
95+
96+
If you do want to set up an explicit field mapping, make sure that the source field is distinct for each individual entity **across all blobs**.
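
The collision is easy to reproduce locally. The Python sketch below treats the index as a dictionary keyed by document key, so a later document with the same key overwrites an earlier one. It is a simplified model, not the service's behavior in every detail; the blob names are placeholders, and the `blob;position` key is only a stand-in for a key that is unique across blobs, not the actual format of `AzureSearch_DocumentKey`.

```python
import csv
import io

# Sample blobs mirroring the CSV example above (blob names are placeholders).
BLOBS = {
    "Blob1": (
        "recordid,temperature,pressure,timestamp\n"
        '1,100,100,"2019-02-13T00:00:00Z"\n'
        '2,33,30,"2019-02-14T00:00:00Z"\n'
    ),
    "Blob2": (
        "recordid,temperature,pressure,timestamp\n"
        '1,1,1,"2018-01-12T00:00:00Z"\n'
        '2,120,3,"2013-05-11T00:00:00Z"\n'
    ),
}


def index_blobs(blobs, make_key):
    """Model the index as a dict: documents sharing a key overwrite each other."""
    index = {}
    for blob_name, text in blobs.items():
        rows = csv.DictReader(io.StringIO(text))
        for position, row in enumerate(rows, start=1):
            index[make_key(blob_name, position, row)] = row
    return index


# Key taken from recordid: values repeat across blobs, so rows collide.
by_recordid = index_blobs(BLOBS, lambda blob, pos, row: row["recordid"])

# Hypothetical key that is unique across blobs (blob name plus row position).
by_blob_and_position = index_blobs(BLOBS, lambda blob, pos, row: f"{blob};{pos}")

print(len(by_recordid))           # 2
print(len(by_blob_and_position))  # 4
```

With `recordid` as the key, the rows from the second blob replace those from the first, leaving two documents instead of four.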
97+
98+
> [!NOTE]
99+
> The approach used by `AzureSearch_DocumentKey` to ensure uniqueness per extracted entity is subject to change, and therefore you should not rely on its value for your application's needs.
100+
101+
## See also
102+
103+
+ [Indexers in Azure Search](search-indexer-overview.md)
104+
+ [Indexing Azure Blob Storage with Azure Search](search-howto-indexing-azure-blob-storage.md)
105+
+ [Indexing CSV blobs with Azure Search blob indexer](search-howto-index-csv-blobs.md)
106+
+ [Indexing JSON blobs with Azure Search blob indexer](search-howto-index-json-blobs.md)
107+
108+
## <a name="NextSteps"></a>Next steps
109+
* To learn more about Azure Search, see the [Search service page](https://azure.microsoft.com/services/search/).
