articles/search/search-howto-index-csv-blobs.md
Lines changed: 3 additions & 0 deletions
@@ -26,6 +26,9 @@ In this article, you will learn how to parse CSV blobs with an Azure Search blob
26
26
> CSV blob indexing is currently in public preview and should not be used in production environments. For more information, see [REST api-version=2017-11-11-Preview](search-api-2017-11-11-preview.md).
27
27
>
28
28
29
+
> [!NOTE]
30
+
> Follow the indexer configuration recommendations in [One-to-many indexing](search-howto-index-one-to-many-blobs.md) to output multiple search documents from one Azure blob.
31
+
29
32
## Setting up CSV indexing
30
33
To index CSV blobs, create or update an indexer definition with the `delimitedText` parsing mode:
articles/search/search-howto-index-json-blobs.md
Lines changed: 49 additions & 12 deletions
@@ -19,7 +19,13 @@ This article shows you how to configure an Azure Search blob indexer to extract
19
19
20
20
You can use the [portal](#json-indexer-portal), [REST APIs](#json-indexer-rest), or [.NET SDK](#json-indexer-dotnet) to index JSON content. Common to all approaches is that JSON documents are located in a blob container in an Azure Storage account. For guidance on pushing JSON documents from other non-Azure platforms, see [Data import in Azure Search](search-what-is-data-import.md).
21
21
22
-
JSON blobs in Azure Blob storage are typically either a single JSON document or a JSON array. The blob indexer in Azure Search can parse either construction depending on how you set the **parsingMode** parameter on the request.
22
+
JSON blobs in Azure Blob storage are typically either a single JSON document or a collection of JSON entities. For JSON collections, the blob could have an **array** of well-formed JSON elements. Blobs could also be composed of multiple individual JSON entities separated by a newline. The blob indexer in Azure Search can parse any such construction, depending on how you set the **parsingMode** parameter on the request.
23
+
24
+
> [!IMPORTANT]
25
+
> `json` and `jsonArray` parsing modes are generally available, but `jsonLines` parsing mode is in public preview and should not be used in production environments. For more information, see [REST api-version=2017-11-11-Preview](search-api-2017-11-11-preview.md).
26
+
27
+
> [!NOTE]
28
+
> Follow the indexer configuration recommendations in [One-to-many indexing](search-howto-index-one-to-many-blobs.md) to output multiple search documents from one Azure blob.
23
29
24
30
<a name="json-indexer-portal"></a>
25
31
@@ -48,11 +54,13 @@ In the **data source** page, the source must be **Azure Blob Storage**, with the
48
54
49
55
+**Data to extract** should be *Content and metadata*. Choosing this option allows the wizard to infer an index schema and map the fields for import.
50
56
51
-
+**Parsing mode** should be set to *JSON* or *JSON array*.
57
+
+**Parsing mode** should be set to *JSON*, *JSON array*, or *JSON lines*.
52
58
53
59
*JSON* articulates each blob as a single search document, showing up as an independent item in search results.
54
60
55
-
*JSON array* is for blobs composed of multiple elements, where you want each element to be articulated as a standalone, independent search document. If blobs are complex, and you don't choose *JSON array* the entire blob is ingested as a single document.
61
+
*JSON array* is for blobs that contain well-formed JSON, where the JSON is either an array of objects or has a property that is an array of objects, and you want each element to be articulated as a standalone, independent search document. If blobs are complex and you don't choose *JSON array*, the entire blob is ingested as a single document.
62
+
63
+
*JSON lines* is for blobs composed of multiple JSON entities separated by a newline, where you want each entity to be articulated as a standalone, independent search document. If blobs are complex and you don't choose the *JSON lines* parsing mode, the entire blob is ingested as a single document.
56
64
57
65
+**Storage container** must specify your storage account and container, or a connection string that resolves to the container. You can get connection strings on the Blob service portal page.
58
66
@@ -113,12 +121,13 @@ For code-based JSON indexing, use [Postman](search-fiddler.md) and the REST API
113
121
114
122
In contrast with the portal wizard, a code approach requires that you have an index in-place, ready to accept the JSON documents when you send the **Create Indexer** request.
115
123
116
-
JSON blobs in Azure Blob storage are typically either a single JSON document or a JSON array. The blob indexer in Azure Search can parse either construction, depending on how you set the **parsingMode** parameter on the request.
124
+
JSON blobs in Azure Blob storage are typically either a single JSON document or a JSON "array". The blob indexer in Azure Search can parse either construction, depending on how you set the **parsingMode** parameter on the request.
| One per blob |`json`| Parses JSON blobs as a single chunk of text. Each JSON blob becomes a single Azure Search document. | Generally available in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) APIs. |
121
-
| Multiple per blob |`jsonArray`| Parses a JSON array in the blob, where each element of the array becomes a separate Azure Search document. | Generally available in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) APIs. |
128
+
| One per blob |`json`| Parses JSON blobs as a single chunk of text. Each JSON blob becomes a single Azure Search document. | Generally available in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) API and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) SDK. |
129
+
| Multiple per blob |`jsonArray`| Parses a JSON array in the blob, where each element of the array becomes a separate Azure Search document. | Available in preview in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) API and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) SDK. |
130
+
| Multiple per blob |`jsonLines`| Parses a blob which contains multiple JSON entities (an "array") separated by a newline, where each entity becomes a separate Azure Search document. | Available in preview in both [REST](https://docs.microsoft.com/rest/api/searchservice/indexer-operations) API and [.NET](https://docs.microsoft.com/dotnet/api/microsoft.azure.search.models.indexer) SDK. |
122
131
123
132
### 1 - Assemble inputs for the request
124
133
@@ -205,12 +214,16 @@ Until now, definitions for the data source and index have been parsingMode agnos
205
214
206
215
+ Set **parsingMode** to `json` to index each blob as a single document.
207
216
208
-
+ Set **parsingMode** to `jsonArray` if your blobs consist of JSON arrays, and you need each element of the array to become a separate document in Azure Search. You can think of a document as a single item in search results. If you want each element in the array to show up in search results as an independent item, then use the `jsonArray` option.
217
+
+ Set **parsingMode** to `jsonArray` if your blobs consist of JSON arrays, and you need each element of the array to become a separate document in Azure Search.
218
+
219
+
+ Set **parsingMode** to `jsonLines` if your blobs consist of multiple JSON entities that are separated by a newline, and you need each entity to become a separate document in Azure Search.
209
220
210
-
For JSON arrays, if the array exists as a lower-level property, you can set a document root indicating where the array is placed within the blob.
221
+
You can think of a document as a single item in search results. If you want each element in the array to show up in search results as an independent item, then use the `jsonArray` or `jsonLines` option as appropriate.
222
+
223
+
Within the indexer definition, you can optionally use [field mappings](search-indexer-field-mappings.md) to choose which properties of the source JSON document are used to populate your target search index. For `jsonArray` parsing mode, if the array exists as a lower-level property, you can set a document root indicating where the array is placed within the blob.
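The difference between the three parsing modes can be illustrated with a small Python sketch. It is not the indexer's implementation: `split_blob` is a hypothetical helper, and the slash-separated `document_root` path syntax is an assumption made for the example.

```python
import json


def split_blob(text, parsing_mode, document_root=None):
    """Return the entities one blob would contribute as search documents.

    Illustrative only: mimics the behaviour described in this article.
    """
    if parsing_mode == "json":
        # The whole blob is a single document.
        return [json.loads(text)]
    if parsing_mode == "jsonArray":
        data = json.loads(text)
        if document_root:
            # Assumed path syntax: "/level1/level2" walks nested properties.
            for key in (k for k in document_root.split("/") if k):
                data = data[key]
        if not isinstance(data, list):
            raise ValueError("jsonArray expects an array at the document root")
        return data
    if parsing_mode == "jsonLines":
        # One JSON entity per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unknown parsing mode: {parsing_mode}")


array_blob = '[{"id": "1"}, {"id": "2"}, {"id": "3"}]'
nested_blob = '{"level1": {"level2": [{"id": "1"}, {"id": "2"}]}}'
lines_blob = '{"id": "1"}\n{"id": "2"}\n{"id": "3"}\n'

print(len(split_blob(array_blob, "json")))                         # 1
print(len(split_blob(array_blob, "jsonArray")))                    # 3
print(len(split_blob(nested_blob, "jsonArray", "/level1/level2")))  # 2
print(len(split_blob(lines_blob, "jsonLines")))                    # 3
```

The same array blob yields one document under `json` and three under `jsonArray`, which is the behaviour the parsing mode controls.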
211
224
212
225
> [!IMPORTANT]
213
-
> When you use `json` or `jsonArray` parsing mode, Azure Search assumes that all blobs in your data source contain JSON. If you need to support a mix of JSON and non-JSON blobs in the same data source, let us know on [our UserVoice site](https://feedback.azure.com/forums/263029-azure-search).
226
+
> When you use `json`, `jsonArray`, or `jsonLines` parsing mode, Azure Search assumes that all blobs in your data source contain JSON. If you need to support a mix of JSON and non-JSON blobs in the same data source, let us know on [our UserVoice site](https://feedback.azure.com/forums/263029-azure-search).
214
227
215
228
216
229
### How to parse single JSON blobs
@@ -229,7 +242,7 @@ The blob indexer parses the JSON document into a single Azure Search document. T
229
242
230
243
As noted, field mappings are not required. Given an index with "text", "datePublished", and "tags" fields, the blob indexer can infer the correct mapping without a field mapping present in the request.
231
244
232
-
### How to parse JSON arrays
245
+
### How to parse JSON arrays in a well-formed JSON document
233
246
234
247
Alternatively, you can opt for the JSON array feature. This capability is useful when blobs contain an *array of JSON objects*, and you want each element to become a separate Azure Search document. For example, given the following JSON blob, you can populate your Azure Search index with three separate documents, each with "id" and "text" fields.
235
248
@@ -239,7 +252,7 @@ Alternatively, you can opt for the JSON array feature. This capability is useful
239
252
{ "id" : "3", "text" : "example 3" }
240
253
]
241
254
242
-
For a JSON array, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the `jsonArray` parser. Specifying the right parser and having the right data input are the only two array-specific requirements for indexing JSON blobs.
255
+
For a JSON array, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the `jsonArray` parser. Specifying the right parser and having the right data input are the only two array-specific requirements for indexing JSON blobs.
243
256
244
257
POST https://[service name].search.windows.net/indexers?api-version=2017-11-11
245
258
Content-Type: application/json
@@ -278,7 +291,31 @@ Use this configuration to index the array contained in the `level2` property:
### How to parse blobs with multiple JSON entities separated by newlines
295
+
296
+
If your blob contains multiple JSON entities separated by a newline, and you want each element to become a separate Azure Search document, you can opt for the JSON lines feature. For example, given the following blob (where there are three different JSON entities), you can populate your Azure Search index with three separate documents, each with "id" and "text" fields.
297
+
+    { "id" : "1", "text" : "example 1" }
+    { "id" : "2", "text" : "example 2" }
+    { "id" : "3", "text" : "example 3" }
301
+
302
+
For JSON lines, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the `jsonLines` parser.
303
+
304
+
POST https://[service name].search.windows.net/indexers?api-version=2017-11-11
Again, notice that field mappings can be omitted, similar to the `jsonArray` parsing mode.
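As a rough sketch of how such a request body could be assembled in code, the following Python builds an indexer definition with the parsing mode nested under `parameters.configuration`. The helper name `build_indexer_body` and the example indexer, data source, and index names are hypothetical, and the sketch only constructs the JSON payload without sending it.

```python
import json

# Parsing modes mentioned in these articles.
PARSING_MODES = {"json", "jsonArray", "jsonLines", "delimitedText"}


def build_indexer_body(name, data_source_name, target_index_name, parsing_mode):
    """Build an indexer definition as a plain dict (no field mappings)."""
    if parsing_mode not in PARSING_MODES:
        raise ValueError(f"unsupported parsing mode: {parsing_mode}")
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
        "parameters": {"configuration": {"parsingMode": parsing_mode}},
    }


# Hypothetical names, for illustration only.
body = build_indexer_body("my-json-indexer", "my-blob-datasource", "my-target-index", "jsonLines")
print(json.dumps(body, indent=2))
```

Validating the parsing mode up front catches typos such as `jsonlines` before the request is sent.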
317
+
318
+
### Using field mappings to build search documents
282
319
283
320
When source and target fields are not perfectly aligned, you can define a field mapping section in the request body for explicit field-to-field associations.
By default, a blob indexer will treat the contents of a blob as a single search document. Certain **parsingMode** values support scenarios where an individual blob can result in multiple search documents. The different types of **parsingMode** that allow an indexer to extract more than one search document from a blob are:
19
+
+`delimitedText`
20
+
+`jsonArray`
21
+
+`jsonLines`
22
+
23
+
> [!IMPORTANT]
24
+
> These parsing modes are in public preview and should not be used in production environments. For more information, see [REST api-version=2017-11-11-Preview](search-api-2017-11-11-preview.md).
25
+
26
+
## One-to-many document key
27
+
Each document that shows up in an Azure Search index is uniquely identified by a document key.
28
+
29
+
When no parsing mode is specified and there is no explicit mapping for the key field in the index, Azure Search automatically [maps](search-indexer-field-mappings.md) the `metadata_storage_path` property as the key. This mapping ensures that each blob appears as a distinct search document.
30
+
31
+
When using any of the parsing modes listed above, one blob maps to "many" search documents, making a document key solely based on blob metadata unsuitable. To overcome this constraint, Azure Search is capable of generating a "one-to-many" document key for each individual entity extracted from a blob. This property is named `AzureSearch_DocumentKey` and is added to each individual entity extracted from the blob. The value of this property is guaranteed to be unique for each individual entity _across blobs_ and the entities will show up as separate search documents.
32
+
33
+
By default, when no explicit field mappings for the key index field are specified, the `AzureSearch_DocumentKey` is mapped to it, using the `base64Encode` field-mapping function.
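To see why an encoding step is useful here, consider that a raw key value may contain characters such as `/` or `:` that are not valid in a document key. The Python sketch below approximates the idea with standard URL-safe Base64; the exact output of the `base64Encode` mapping function may differ (for example in how padding is handled), and the raw key value shown is hypothetical, since the real format of `AzureSearch_DocumentKey` is not specified.

```python
import base64
import re


def encode_key(raw_key):
    """URL-safe Base64 encode a string so it uses only key-safe characters."""
    return base64.urlsafe_b64encode(raw_key.encode("utf-8")).decode("ascii")


def decode_key(encoded_key):
    """Reverse encode_key."""
    return base64.urlsafe_b64decode(encoded_key.encode("ascii")).decode("utf-8")


# Hypothetical raw key: a blob path plus an entity position.
raw = "https://example.blob.core.windows.net/container/Blob1.json;1"
encoded = encode_key(raw)

# Only letters, digits, underscores, dashes, and equal signs remain.
print(re.fullmatch(r"[A-Za-z0-9_\-=]+", encoded) is not None)  # True
print(decode_key(encoded) == raw)                              # True
```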
34
+
35
+
## Example
36
+
Assume you have an index definition with the following fields:
37
+
+`id`
38
+
+`temperature`
39
+
+`pressure`
40
+
+`timestamp`
41
+
42
+
And your blob container has blobs with the following structure:
When you create an indexer and set the **parsingMode** to `jsonLines` without specifying any explicit field mappings for the key field, the following mapping is applied implicitly:
55
+
+    {
+        "sourceFieldName" : "AzureSearch_DocumentKey",
+        "targetFieldName": "id",
+        "mappingFunction": { "name" : "base64Encode" }
+    }
61
+
62
+
This setup will result in the Azure Search index containing the following information (base64-encoded id shortened for brevity):
Assuming the same index definition as the previous example, say your blob container has blobs with the following structure:
74
+
75
+
_Blob1.json_
76
+
+    recordid, temperature, pressure, timestamp
+    1, 100, 100,"2019-02-13T00:00:00Z"
+    2, 33, 30,"2019-02-14T00:00:00Z"
80
+
81
+
_Blob2.json_
82
+
+    recordid, temperature, pressure, timestamp
+    1, 1, 1,"2018-01-12T00:00:00Z"
+    2, 120, 3,"2013-05-11T00:00:00Z"
86
+
87
+
When you create an indexer with `delimitedText` **parsingMode**, it might feel natural to set up a field-mapping function to the key field as follows:
88
+
+    {
+        "sourceFieldName" : "recordid",
+        "targetFieldName": "id"
+    }
93
+
94
+
However, this mapping will _not_ result in 4 documents showing up in the index, because the `recordid` field is not unique _across blobs_. Hence, we recommend that you use the implicit field mapping applied from the `AzureSearch_DocumentKey` property to the key index field for "one-to-many" parsing modes.
95
+
96
+
If you do want to set up an explicit field mapping, make sure that the _sourceField_ is distinct for each individual entity **across all blobs**.
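The key collision described above can be reproduced with a small simulation. The Python sketch below models an index as a dictionary keyed by document key, where a later document with the same key overwrites an earlier one. The helper `index_documents` and the composite key format are hypothetical and only stand in for any key that is unique across blobs.

```python
def index_documents(blobs, key_for):
    """Simulate indexing: documents sharing a key overwrite each other."""
    index = {}
    for blob_name, rows in blobs.items():
        for row in rows:
            index[key_for(blob_name, row)] = row
    return index


# The two blobs from the example above, already parsed into rows.
blobs = {
    "Blob1.json": [
        {"recordid": 1, "temperature": 100, "pressure": 100, "timestamp": "2019-02-13T00:00:00Z"},
        {"recordid": 2, "temperature": 33, "pressure": 30, "timestamp": "2019-02-14T00:00:00Z"},
    ],
    "Blob2.json": [
        {"recordid": 1, "temperature": 1, "pressure": 1, "timestamp": "2018-01-12T00:00:00Z"},
        {"recordid": 2, "temperature": 120, "pressure": 3, "timestamp": "2013-05-11T00:00:00Z"},
    ],
}

# Key taken from recordid alone: rows from Blob2 replace rows from Blob1.
by_recordid = index_documents(blobs, lambda blob, row: row["recordid"])
print(len(by_recordid))  # 2

# Key that is unique across blobs (hypothetical composite of blob name and recordid).
by_unique_key = index_documents(blobs, lambda blob, row: f"{blob}:{row['recordid']}")
print(len(by_unique_key))  # 4
```

With `recordid` as the key, only two documents survive, which is why a source field used for the key must be distinct across all blobs.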
97
+
98
+
> [!NOTE]
99
+
> The approach used by `AzureSearch_DocumentKey` of ensuring uniqueness per extracted entity is subject to change, and therefore you should not rely on its value for your application's needs.
100
+
101
+
## See also
102
+
103
+
+[Indexers in Azure Search](search-indexer-overview.md)
104
+
+[Indexing Azure Blob Storage with Azure Search](search-howto-index-json-blobs.md)
105
+
+[Indexing CSV blobs with Azure Search blob indexer](search-howto-index-csv-blobs.md)
106
+
+[Indexing JSON blobs with Azure Search blob indexer](search-howto-index-json-blobs.md)
107
+
108
+
## <a name="NextSteps"></a>Next steps
109
+
* To learn more about Azure Search, see the [Search service page](https://azure.microsoft.com/services/search/).