Commit e63c30f

committed
Custom entity lookup skill
1 parent f315623 commit e63c30f

File tree

5 files changed

+319
-13
lines changed

articles/search/TOC.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -389,6 +389,8 @@
389389
href: cognitive-search-predefined-skills.md
390390
- name: Conditional
391391
href: cognitive-search-skill-conditional.md
392+
- name: Custom Entity Lookup
393+
href: cognitive-search-skill-custom-entity-lookup.md
392394
- name: Document Extraction
393395
href: cognitive-search-skill-document-extraction.md
394396
- name: Entity Recognition

articles/search/cognitive-search-predefined-skills.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,7 @@ Several skills are flexible in what they consume or produce. In general, most sk
4040
| [Microsoft.Skills.Util.DocumentExtractionSkill](cognitive-search-skill-document-extraction.md) | Extracts content from a file within the enrichment pipeline. |
4141
| [Microsoft.Skills.Util.ShaperSkill](cognitive-search-skill-shaper.md) | Maps output to a complex type (a multi-part data type, which might be used for a full name, a multi-line address, or a combination of last name and a personal identifier.) |
4242
| [Microsoft.Skills.Custom.WebApiSkill](cognitive-search-custom-skill-web-api.md) | Allows extensibility of an AI enrichment pipeline by making an HTTP call into a custom Web API |
43+
|[Microsoft.Skills.Text.CustomEntityLookupSkill](cognitive-search-skill-custom-entity-lookup.md)| Looks for text from a custom, user-defined list of words and phrases.|
4344

4445

4546
For guidance on creating a [custom skill](cognitive-search-custom-skill-web-api.md), see [How to define a custom interface](cognitive-search-custom-skill-interface.md) and [Example: Creating a custom skill for AI enrichment](cognitive-search-create-custom-skill-example.md).
Lines changed: 299 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,299 @@
1+
---
2+
title: Custom Entity Lookup cognitive search skill
3+
titleSuffix: Azure Cognitive Search
4+
description: Extract different custom entities from text in an Azure Cognitive Search enrichment pipeline. This skill is currently in public preview.
5+
6+
manager: nitinme
7+
author: luiscabrer
8+
ms.author: luisca
9+
ms.service: cognitive-search
10+
ms.topic: conceptual
11+
ms.date: 01/30/2020
12+
---
13+
14+
# Custom Entity Lookup cognitive skill (Preview)
15+
16+
> [!IMPORTANT]
17+
> This skill is currently in public preview. Preview functionality is provided without a service level agreement, and is not recommended for production workloads. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). There is currently no portal or .NET SDK support.
18+
19+
The **Custom Entity Lookup** skill looks for text from a custom, user-defined list of words and phrases. Using this list, it labels all documents with any matching entities. The skill also supports a degree of fuzzy matching that can be applied to find matches that are similar but not quite exact.
20+
21+
This skill is not bound to a Cognitive Services API and can be used free of charge during the preview period. You should still [attach a Cognitive Services resource](https://docs.microsoft.com/azure/search/cognitive-search-attach-cognitive-services), however, to override the daily enrichment limit. The daily limit applies to free access to Cognitive Services when accessed through Azure Cognitive Search.
22+
23+
## @odata.type
24+
Microsoft.Skills.Text.CustomEntityLookupSkill
25+
26+
## Data limits
27+
1. The maximum input record size supported is 256 MB. If you need to break up your data before sending it to the custom entity lookup skill, consider using the [Text Split skill](cognitive-search-skill-textsplit.md).
28+
1. The maximum size of the entity definition file is 10 MB when it is provided through the *entitiesDefinitionUri* parameter.
29+
1. If the entities are defined inline, using the *inlineEntitiesDefinition* parameter, the maximum supported size is 10 KB.
30+
31+
## Skill parameters
32+
33+
Parameters are case-sensitive.
34+
35+
| Parameter name | Description |
36+
|--------------------|-------------|
37+
| entitiesDefinitionUri | Path to a JSON or CSV file containing all the target text to match against. This entity definition is read at the beginning of an indexer run; any updates to this file mid-run aren't picked up until subsequent runs. The file must be accessible over HTTPS. See [Custom Entity Definition Format](#custom-entity-definition-format) below for the expected CSV or JSON schema.|
38+
|inlineEntitiesDefinition | Inline JSON entity definitions. This parameter supersedes the entitiesDefinitionUri parameter if present. No more than 10 KB of configuration may be provided inline. See [Custom Entity Definition Format](#custom-entity-definition-format) below for the expected JSON schema. |
39+
|defaultLanguageCode | (Optional) Language code of the input text used to tokenize and delineate input text. The following languages are supported: `da, de, en, es, fi, fr, it, ko, pt`. The default is English (`en`). If you pass a languagecode-countrycode format, only the languagecode part of the format is used. |
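
The handling of `defaultLanguageCode` described above can be sketched in Python. This is only an illustration of the documented rule (keep the language part of a languagecode-countrycode value, default to `en`); the function name, the lowercasing, and the error raised for unsupported languages are assumptions, not the service's actual behavior.

```python
from typing import Optional

# Languages listed in the parameter table above.
SUPPORTED_LANGUAGES = {"da", "de", "en", "es", "fi", "fr", "it", "ko", "pt"}


def normalize_language_code(code: Optional[str], default: str = "en") -> str:
    """Reduce a 'languagecode-countrycode' value to its language part.

    Hypothetical helper: the lowercasing and the ValueError are assumptions
    made for this sketch, not documented service behavior.
    """
    if not code:
        return default
    language = code.split("-", 1)[0].strip().lower()
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return language
```

For example, `normalize_language_code("pt-BR")` yields `"pt"`, and a missing value falls back to `"en"`.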
40+
41+
42+
## Skill inputs
43+
44+
| Input name | Description |
45+
|---------------|-------------------------------|
46+
| text | The text to analyze. |
47+
| languageCode | Optional. Default is `"en"`. |
48+
49+
50+
## Skill outputs
51+
52+
53+
| Output name | Description |
54+
|---------------|-------------------------------|
55+
| entities | An array of objects that contain information about the matches that were found, and related metadata. Each of the entities identified may contain the following fields: <ul> <li> *name*: The top-level entity identified. The entity represents the "normalized" form. </li> <li> *id*: A unique identifier for the entity as defined by the user in the "Custom Entity Definition Format".</li> <li> *description*: Entity description as defined by the user in the "Custom Entity Definition Format". </li> <li> *type*: Entity type as defined by the user in the "Custom Entity Definition Format".</li> <li> *subtype*: Entity subtype as defined by the user in the "Custom Entity Definition Format".</li> <li> *matches*: Collection that describes each of the matches for that entity in the source text. Each match has the following members: </li> <ul> <li> *text*: The raw text match from the source document. </li> <li> *offset*: The location where the match was found in the text. </li> <li> *length*: The length of the matched text. </li> <li> *matchDistance*: The number of characters by which this match differs from the original entity name or alias. </li> </ul> </ul> |
57+
58+
## Custom Entity Definition Format
59+
60+
There are 3 different ways to provide the list of custom entities to the Custom Entity Lookup skill. You can provide the list in a .CSV file, a .JSON file or as an inline definition as part of the skill definition.
61+
62+
If the definition file is a .CSV or .JSON file, provide its path in the *entitiesDefinitionUri* parameter. In this case, the file is downloaded once at the beginning of each indexer run. The file must remain accessible for as long as the indexer is intended to run.
63+
64+
If the definition is provided inline, supply it as the content of the *inlineEntitiesDefinition* skill parameter.
65+
66+
### CSV format
67+
68+
You can provide the definition of the custom entities to look for in a Comma-Separated Value (CSV) file by setting the *entitiesDefinitionUri* skill parameter to the path of the file. The path should be an HTTPS location. The definition file can be up to 10 MB in size.
69+
70+
The CSV format is simple. Each line represents a unique entity, as shown below:
71+
72+
```
Bill Gates, BillG, William H. Gates
Microsoft, MSFT
Satya Nadella
```
77+
78+
In this case, three entities can be returned (Bill Gates, Microsoft, Satya Nadella). An entity is identified when any of the terms on its line (the entity name or one of its aliases) is matched in the text. For instance, if the string "William H. Gates" is found in a document, a match for the "Bill Gates" entity will be returned.
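
As a rough illustration of how such a file maps terms to entities, the following Python sketch reads the CSV sample into a dictionary keyed by entity name. It assumes the first term on each line is the entity name and the rest are aliases; `parse_entity_csv` and `entity_for_term` are hypothetical helpers, and the lookup is exact-match only (no fuzzy or case-insensitive matching), so it is not how the skill itself is implemented.

```python
import csv
import io


def parse_entity_csv(csv_text):
    """Map each entity name (first term on a line) to its list of aliases."""
    entities = {}
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        terms = [term.strip() for term in row if term.strip()]
        if not terms:
            continue  # skip blank lines
        entities[terms[0]] = terms[1:]
    return entities


def entity_for_term(entities, term):
    """Return the entity whose name or alias equals the term, else None."""
    for name, aliases in entities.items():
        if term == name or term in aliases:
            return name
    return None


SAMPLE = """Bill Gates, BillG, William H. Gates
Microsoft, MSFT
Satya Nadella
"""

entities = parse_entity_csv(SAMPLE)
```

With the sample above, `entity_for_term(entities, "William H. Gates")` resolves to `"Bill Gates"`.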
79+
80+
### JSON format
81+
82+
You can provide the definition of the custom entities to look for in a JSON file as well. The JSON format gives you a bit more flexibility since it allows you to define matching rules per term. For instance, you can specify the fuzzy matching distance (Damerau-Levenshtein distance) for each term or whether the matching should be case-sensitive or not.
83+
84+
Just like with CSV files, you need to set the *entitiesDefinitionUri* skill parameter to the path of the JSON file. The path should be an HTTPS location. The definition file can be up to 10 MB in size.
85+
86+
The most basic JSON custom entity list definition can be a list of entities to match:
87+
88+
```json
[
    {
        "name": "Bill Gates"
    },
    {
        "name": "Microsoft"
    },
    {
        "name": "Satya Nadella"
    }
]
```
101+
102+
A more complex JSON definition can optionally provide the id, description, type, and subtype of each entity, as well as other *aliases*. If an alias term is matched, the entity is returned as well:
103+
104+
```json
[
    {
        "name": "Bill Gates",
        "description": "Microsoft founder.",
        "aliases": [
            { "text": "William H. Gates", "caseSensitive": false },
            { "text": "BillG", "caseSensitive": true }
        ]
    },
    {
        "name": "Xbox One",
        "type": "Hardware",
        "subtype": "Gaming Device",
        "id": "4e36bf9d-5550-4396-8647-8e43d7564a76",
        "description": "The Xbox One product"
    },
    {
        "name": "LinkedIn",
        "description": "The LinkedIn company",
        "id": "differentIdentifyingScheme123",
        "fuzzyEditDistance": 0
    },
    {
        "name": "Microsoft",
        "description": "Microsoft Corporation",
        "id": "differentIdentifyingScheme987",
        "defaultCaseSensitive": false,
        "defaultFuzzyEditDistance": 1,
        "aliases": [
            { "text": "MSFT", "caseSensitive": true }
        ]
    }
]
```
139+
140+
The tables below describe in more detail the different configuration parameters you can set when defining the entities to match:
141+
142+
| Field name | Description |
143+
|--------------|----------------------|
144+
| name | The top-level entity descriptor. Matches in the skill output will be grouped by this name, and it should represent the "normalized" form of the text being found. |
145+
| description | (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field will appear with every match of its entity in the skill output. |
146+
| type | (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field will appear with every match of its entity in the skill output. |
147+
| subtype | (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field will appear with every match of its entity in the skill output. |
148+
| id | (Optional) This field can be used as a passthrough for custom metadata about the matched text(s). The value of this field will appear with every match of its entity in the skill output. |
149+
| caseSensitive | (Optional) Defaults to false. Boolean value denoting whether comparisons with the entity name should be sensitive to character casing. Sample case insensitive matches of "Microsoft" could be: microsoft, microSoft, MICROSOFT |
150+
| fuzzyEditDistance | (Optional) Defaults to 0. Maximum value of 5. Denotes the acceptable number of divergent characters that would still constitute a match with the entity name. The smallest possible fuzziness for any given match is returned. For instance, if the edit distance is set to 3, "Windows 10" would still match "Windows", "Windows10" and "windows 7". <br/> When case sensitivity is set to false, case differences do NOT count towards fuzziness tolerance, but otherwise do. |
151+
| defaultCaseSensitive | (Optional) Changes the default case sensitivity value for this entity. It can be used to change the default value of all aliases' caseSensitive values. |
152+
| defaultFuzzyEditDistance | (Optional) Changes the default fuzzy edit distance value for this entity. It can be used to change the default value of all aliases' fuzzyEditDistance values. |
153+
| aliases | (Optional) An array of complex objects that can be used to specify alternative spellings or synonyms to the root entity name. |
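
One way to read the `defaultCaseSensitive` and `defaultFuzzyEditDistance` fields is as a fallback chain for each alias: use the alias's own value if it has one, otherwise the entity-level default, otherwise the documented defaults (`false` and `0`). The Python sketch below assumes that precedence; `effective_alias_settings` is a hypothetical helper, and the actual resolution order inside the service is not confirmed here.

```python
def effective_alias_settings(entity, alias):
    """Resolve an alias's matching settings (assumed precedence).

    Order assumed for this sketch: the alias's own value, then the
    entity-level default, then the documented defaults (False and 0).
    """
    case_sensitive = alias.get(
        "caseSensitive", entity.get("defaultCaseSensitive", False)
    )
    fuzzy_edit_distance = alias.get(
        "fuzzyEditDistance", entity.get("defaultFuzzyEditDistance", 0)
    )
    return {
        "text": alias["text"],
        "caseSensitive": case_sensitive,
        "fuzzyEditDistance": fuzzy_edit_distance,
    }


# The "Microsoft" entity from the JSON example in this article.
MICROSOFT = {
    "name": "Microsoft",
    "defaultCaseSensitive": False,
    "defaultFuzzyEditDistance": 1,
    "aliases": [{"text": "MSFT", "caseSensitive": True}],
}

settings = effective_alias_settings(MICROSOFT, MICROSOFT["aliases"][0])
```

Under this reading, the `MSFT` alias keeps its own `caseSensitive` value of `true` and inherits a fuzzy edit distance of 1 from the entity.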
154+
155+
| Alias properties | Description |
156+
|------------------|-------------|
157+
| text | The alternative spelling or representation of some target entity name. |
158+
| caseSensitive | (Optional) Acts the same as root entity “caseSensitive” parameter above, but applies to only this one alias. |
159+
| fuzzyEditDistance | (Optional) Acts the same as root entity “fuzzyEditDistance” parameter above, but applies to only this one alias. |
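
To make the fuzzy matching rules more concrete, here is a minimal Python sketch of an edit-distance check. It uses the optimal string alignment variant of the Damerau-Levenshtein distance (insertions, deletions, substitutions, and adjacent transpositions); which variant the service uses is not stated in this article, so treat that choice, and the helper names `osa_distance` and `is_fuzzy_match`, as assumptions.

```python
def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "SF" <-> "FS".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def is_fuzzy_match(candidate, term, fuzzy_edit_distance=0, case_sensitive=False):
    """True if candidate is within the allowed edit distance of term.

    When case_sensitive is False, both strings are lowercased first, so
    case differences do not count towards the distance.
    """
    if not case_sensitive:
        candidate, term = candidate.lower(), term.lower()
    return osa_distance(candidate, term) <= fuzzy_edit_distance
```

With an edit distance of 3, `is_fuzzy_match("windows 7", "Windows 10", 3)` is true, in line with the example in the table; with `case_sensitive=True` and a distance of 0, "microsoft" no longer matches "Microsoft".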
160+
161+
162+
### Inline format
163+
164+
In some cases, it may be more convenient to provide the list of custom entities to match inline directly into the skill definition. In that case you can use a similar JSON format to the one described above, but it is inlined in the skill definition.
165+
Only configurations that are less than 10 KB in size (serialized size) can be defined inline.
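
Before choosing the inline format, it can be useful to estimate the serialized size of the definition. The Python sketch below is one way to do that; it assumes the size is measured on compact UTF-8 JSON and that 10 KB means 10 × 1024 bytes, neither of which is specified in this article, and the function names are hypothetical.

```python
import json

# Assumption for this sketch: 10 KB is taken as 10 * 1024 bytes.
MAX_INLINE_BYTES = 10 * 1024


def inline_definition_size(entities):
    """Size in bytes of the entity list as compact UTF-8 JSON."""
    serialized = json.dumps(entities, separators=(",", ":"), ensure_ascii=False)
    return len(serialized.encode("utf-8"))


def fits_inline(entities):
    """True if the definition is under the inline limit (assumed measure)."""
    return inline_definition_size(entities) < MAX_INLINE_BYTES
```

If `fits_inline` returns false for your list, the file-based `entitiesDefinitionUri` option is the alternative described earlier.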
166+
167+
## Sample definition
168+
169+
A sample skill definition using an inline format is shown below:
170+
171+
```json
{
    "@odata.type": "#Microsoft.Skills.Text.CustomEntityLookupSkill",
    "context": "/document",
    "inlineEntitiesDefinition":
    [
        {
            "name": "Bill Gates",
            "description": "Microsoft founder.",
            "aliases": [
                { "text": "William H. Gates", "caseSensitive": false },
                { "text": "BillG", "caseSensitive": true }
            ]
        },
        {
            "name": "Xbox One",
            "type": "Hardware",
            "subtype": "Gaming Device",
            "id": "4e36bf9d-5550-4396-8647-8e43d7564a76",
            "description": "The Xbox One product"
        }
    ],
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        }
    ],
    "outputs": [
        {
            "name": "entities",
            "targetName": "matchedEntities"
        }
    ]
}
```
207+
Alternatively, if you decide to provide a pointer to the entities definition file, a sample skill definition using the entitiesDefinitionUri format is shown below:
208+
209+
```json
{
    "@odata.type": "#Microsoft.Skills.Text.CustomEntityLookupSkill",
    "context": "/document",
    "entitiesDefinitionUri": "https://myblobhost.net/keyWordsConfig.csv",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        }
    ],
    "outputs": [
        {
            "name": "entities",
            "targetName": "matchedEntities"
        }
    ]
}
```
229+
230+
## Sample input
231+
232+
```json
{
    "values": [
        {
            "recordId": "1",
            "data":
            {
                "text": "The company microsoft was founded by Bill Gates. Microsoft's gaming console is called Xbox",
                "languageCode": "en"
            }
        }
    ]
}
```
246+
247+
## Sample output
248+
249+
```json
{
    "values":
    [
        {
            "recordId": "1",
            "data": {
                "entities": [
                    {
                        "name": "Microsoft",
                        "description": "This document refers to Microsoft the company",
                        "id": "differentIdentifyingScheme987",
                        "matches": [
                            {
                                "text": "microsoft",
                                "offset": 12,
                                "length": 9,
                                "matchDistance": 0
                            },
                            {
                                "text": "Microsoft",
                                "offset": 49,
                                "length": 9,
                                "matchDistance": 0
                            }
                        ]
                    },
                    {
                        "name": "Bill Gates",
                        "description": "William Henry Gates III, founder of Microsoft.",
                        "matches": [
                            {
                                "text": "Bill Gates",
                                "offset": 37,
                                "length": 10,
                                "matchDistance": 0
                            }
                        ]
                    }
                ]
            }
        }
    ]
}
```
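
The `offset` and `length` members can be used to locate each match in the source text. The Python sketch below slices the sample input text with two of the matches from the sample output and checks that the slice equals the reported `text`. It assumes offsets are zero-based character positions, which is consistent with those two matches but is not stated explicitly in this article; `collect_matches` is a hypothetical helper.

```python
def collect_matches(text, record):
    """Return (entity name, text at offset, agrees-with-reported-text) tuples.

    Assumes zero-based character offsets into the input text.
    """
    results = []
    for entity in record["data"]["entities"]:
        for match in entity["matches"]:
            start = match["offset"]
            span = text[start:start + match["length"]]
            results.append((entity["name"], span, span == match["text"]))
    return results


TEXT = (
    "The company microsoft was founded by Bill Gates. "
    "Microsoft's gaming console is called Xbox"
)

# Two matches taken from the sample output above.
RECORD = {
    "recordId": "1",
    "data": {
        "entities": [
            {
                "name": "Bill Gates",
                "matches": [
                    {"text": "Bill Gates", "offset": 37, "length": 10, "matchDistance": 0}
                ],
            },
            {
                "name": "Microsoft",
                "matches": [
                    {"text": "Microsoft", "offset": 49, "length": 9, "matchDistance": 0}
                ],
            },
        ]
    },
}

checked = collect_matches(TEXT, RECORD)
```

Each tuple in `checked` pairs an entity name with the text found at the reported position, which is a quick way to sanity-check offsets before using them for highlighting.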
294+
295+
## See also
296+
297+
+ [Built-in skills](cognitive-search-predefined-skills.md)
298+
+ [How to define a skillset](cognitive-search-defining-skillset.md)
299+
+ [Entity Recognition skill (to search for well known entities)](cognitive-search-skill-entity-recognition.md)

articles/search/search-api-preview.md

Lines changed: 12 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,27 +1,29 @@
11
---
2-
title: REST API version 2019-05-06-Preview
2+
title: Preview features in REST API
33
titleSuffix: Azure Cognitive Search
4-
description: Azure Cognitive Search service REST API Version 2019-05-06-Preview includes experimental features such as knowledge store and indexer caching for incremental enrichment..
4+
description: Azure Cognitive Search service REST API Version 2019-05-06-Preview includes experimental features such as knowledge store and indexer caching for incremental enrichment.
55

66
manager: nitinme
77
author: brjohnstmsft
88
ms.author: brjohnst
99
ms.service: cognitive-search
1010
ms.topic: conceptual
11-
ms.date: 01/15/2020
11+
ms.date: 01/30/2020
1212
---
13-
# Azure Cognitive Search service REST api-version 2019-05-06-Preview
13+
# Preview features in Azure Cognitive Search
1414

15-
This article describes the `api-version=2019-05-06-Preview` version of Search service REST API, offering experimental features not yet generally available.
15+
This article lists features currently in preview. Features that transition from preview to general availability are removed from this list. You can check [Service Updates](https://azure.microsoft.com/updates/?product=search) or [What's New](whats-new.md) for announcements regarding general availability.
1616

17-
> [!NOTE]
18-
> Preview features are available for testing and experimentation with the goal of gathering feedback and are subject to change. We strongly advise against using preview APIs in production applications.
17+
While some preview features might be available in the portal and .NET SDK, the REST API always has preview features. The current preview API version is `2019-05-06-Preview`.
1918

20-
## Features in 2019-05-06-Preview
19+
> [!IMPORTANT]
20+
> Preview functionality is provided without a service level agreement, and is not recommended for production workloads. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
2121
22-
This section lists features having preview status. Most were added in the current 2019-05-06-Preview API, but some like `moreLikeThis` are from earlier preview versions that rolled into the latest preview API.
22+
## Features in public preview
2323

24-
Once a preview feature becomes generally available, it is removed from this list. You can check [Service Updates](https://azure.microsoft.com/updates/?product=search) or [What's New](whats-new.md) for announcements regarding general availability.
24+
+ [Custom Entity Lookup (preview)](cognitive-search-skill-custom-entity-lookup.md) looks for text from a custom, user-defined list of words and phrases. Using this list, it labels all documents with any matching entities. The skill also supports a degree of fuzzy matching that can be applied to find matches that are similar but not quite exact.
25+
26+
+ [PII Detection (preview)](cognitive-search-skill-pii-detection.md) is a cognitive skill used during indexing that extracts personally identifiable information from an input text and gives you the option to mask it from that text in various ways.
2527

2628
+ [Incremental enrichment (preview)](cognitive-search-incremental-indexing-conceptual.md) adds caching to an enrichment pipeline, allowing you to reuse existing output if a targeted modification, such as an update to a skillset or another object, does not change the content. Caching applies only to enriched documents produced by a skillset.
2729
