Commit 2486dbc

authored
Merge pull request #109059 from Careyjmac/taOffsets
[Azure Cognitive Search] Add note on how offsets are computed for TA skills
2 parents 420e56c + af65d58 commit 2486dbc

File tree

2 files changed

+22
-20
lines changed

articles/search/cognitive-search-skill-entity-recognition.md

Lines changed: 12 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ ms.topic: conceptual
1111
ms.date: 11/04/2019
1212
---
1313

14-
# Entity Recognition cognitive skill
14+
# Entity Recognition cognitive skill
1515

1616
The **Entity Recognition** skill extracts entities of different types from text. This skill uses the machine learning models provided by [Text Analytics](https://docs.microsoft.com/azure/cognitive-services/text-analytics/overview) in Cognitive Services.
1717

@@ -31,29 +31,29 @@ The maximum size of a record should be 50,000 characters as measured by [`String
3131

3232
Parameters are case-sensitive and are all optional.
3333

34-
| Parameter name | Description |
34+
| Parameter name | Description |
3535
|--------------------|-------------|
36-
| categories | Array of categories that should be extracted. Possible category types: `"Person"`, `"Location"`, `"Organization"`, `"Quantity"`, `"Datetime"`, `"URL"`, `"Email"`. If no category is provided, all types are returned.|
37-
|defaultLanguageCode | Language code of the input text. The following languages are supported: `ar, cs, da, de, en, es, fi, fr, hu, it, ja, ko, nl, no, pl, pt-BR, pt-PT, ru, sv, tr, zh-hans`. Not all entity categories are supported for all languages; see note below.|
36+
| categories | Array of categories that should be extracted. Possible category types: `"Person"`, `"Location"`, `"Organization"`, `"Quantity"`, `"Datetime"`, `"URL"`, `"Email"`. If no category is provided, all types are returned.|
37+
|defaultLanguageCode | Language code of the input text. The following languages are supported: `ar, cs, da, de, en, es, fi, fr, hu, it, ja, ko, nl, no, pl, pt-BR, pt-PT, ru, sv, tr, zh-hans`. Not all entity categories are supported for all languages; see note below.|
3838
|minimumPrecision | A value between 0 and 1. If the confidence score (in the `namedEntities` output) is lower than this value, the entity is not returned. The default is 0. |
3939
|includeTypelessEntities | Set to `true` if you want to recognize well-known entities that don't fit the current categories. Recognized entities are returned in the `entities` complex output field. For example, "Windows 10" is a well-known entity (a product), but since "Products" is not a supported category, this entity would be included in the entities output field. Default is `false` |
4040

4141

4242
## Skill inputs
4343

44-
| Input name | Description |
44+
| Input name | Description |
4545
|---------------|-------------------------------|
46-
| languageCode | Optional. Default is `"en"`. |
46+
| languageCode | Optional. Default is `"en"`. |
4747
| text | The text to analyze. |
4848

4949
## Skill outputs
5050

5151
> [!NOTE]
5252
> Not all entity categories are supported for all languages. The `"Person"`, `"Location"`, and `"Organization"` entity category types are supported for the full list of languages above. Only _de_, _en_, _es_, _fr_, and _zh-hans_ support extraction of `"Quantity"`, `"Datetime"`, `"URL"`, and `"Email"` types. For more information, see [Language and region support for the Text Analytics API](https://docs.microsoft.com/azure/cognitive-services/text-analytics/language-support).
5353
54-
| Output name | Description |
54+
| Output name | Description |
5555
|---------------|-------------------------------|
56-
| persons | An array of strings where each string represents the name of a person. |
56+
| persons | An array of strings where each string represents the name of a person. |
5757
| locations | An array of strings where each string represents a location. |
5858
| organizations | An array of strings where each string represents an organization. |
5959
| quantities | An array of strings where each string represents a quantity. |
@@ -63,7 +63,7 @@ Parameters are case-sensitive and are all optional.
6363
| namedEntities | An array of complex types that contains the following fields: <ul><li>category</li> <li>value (The actual entity name)</li><li>offset (The location where it was found in the text)</li><li>confidence (Higher value means it's more likely to be a real entity)</li></ul> |
6464
| entities | An array of complex types that contains rich information about the entities extracted from text, with the following fields: <ul><li> name (the actual entity name. This represents a "normalized" form)</li><li> wikipediaId</li><li>wikipediaLanguage</li><li>wikipediaUrl (a link to the Wikipedia page for the entity)</li><li>bingId</li><li>type (the category of the entity recognized)</li><li>subType (available only for certain categories, this gives a more granular view of the entity type)</li><li> matches (a complex collection that contains the following fields)<ul><li>text (the raw text for the entity)</li><li>offset (the location where it was found)</li><li>length (the length of the raw entity text)</li></ul></li></ul> |
6565

66-
## Sample definition
66+
## Sample definition
6767

6868
```json
6969
{
@@ -93,7 +93,7 @@ Parameters are case-sensitive and are all optional.
9393
]
9494
}
9595
```
96-
## Sample input
96+
## Sample input
9797

9898
```json
9999
{
@@ -110,7 +110,7 @@ Parameters are case-sensitive and are all optional.
110110
}
111111
```
112112

113-
## Sample output
113+
## Sample output
114114

115115
```json
116116
{
@@ -183,6 +183,7 @@ Parameters are case-sensitive and are all optional.
183183
}
184184
```
185185

186+
The offsets returned for entities in the output of this skill come directly from the [Text Analytics API](https://docs.microsoft.com/azure/cognitive-services/text-analytics/overview). If you use them to index into the original string, use the [StringInfo](https://docs.microsoft.com/dotnet/api/system.globalization.stringinfo?view=netframework-4.8) class in .NET to extract the correct content. [More details can be found here.](https://docs.microsoft.com/azure/cognitive-services/text-analytics/concepts/text-offsets)
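For readers not working in .NET, the following Python sketch illustrates why indexing by text element differs from plain string indexing. It is an approximation and not part of the skill or the Text Analytics API: it assumes that `offset` and `length` count text elements, it treats a text element as a base character plus any combining marks that follow, and the helper names `text_elements` and `extract_by_offset` and the sample string are hypothetical.

```python
import unicodedata


def text_elements(s):
    """Split s into approximate text elements.

    Assumption: a text element is a base character followed by zero or more
    combining marks. This is a simplification of what StringInfo does.
    """
    elements = []
    for ch in s:
        # Mn, Mc and Me are the Unicode combining-mark categories.
        if elements and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def extract_by_offset(s, offset, length):
    """Return `length` text elements of s starting at text element `offset`."""
    return "".join(text_elements(s)[offset:offset + length])


# Hypothetical input: "e" followed by a combining acute accent (U+0301).
text = "cafe\u0301 in Paris"

# Hypothetical entity, with offset and length counted in text elements.
entity = {"offset": 8, "length": 5}

by_element = extract_by_offset(text, entity["offset"], entity["length"])
by_code_point = text[entity["offset"]:entity["offset"] + entity["length"]]

print(repr(by_element))     # 'Paris'
print(repr(by_code_point))  # ' Pari' (shifted by the combining mark)
```

The two results differ because the accented letter occupies two code points but only one text element, so every offset after it is shifted by one when the string is sliced directly.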
186187

187188
## Error cases
188189
If the language code for the document is unsupported, an error is returned and no entities are extracted.

articles/search/cognitive-search-skill-pii-detection.md

Lines changed: 10 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ ms.topic: conceptual
1111
ms.date: 1/27/2020
1212
---
1313

14-
# PII Detection cognitive skill
14+
# PII Detection cognitive skill
1515

1616
> [!IMPORTANT]
1717
> This skill is currently in public preview. Preview functionality is provided without a service level agreement, and is not recommended for production workloads. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). There is currently no portal or .NET SDK support.
@@ -34,29 +34,29 @@ The maximum size of a record should be 50,000 characters as measured by [`String
3434

3535
Parameters are case-sensitive and all are optional.
3636

37-
| Parameter name | Description |
37+
| Parameter name | Description |
3838
|--------------------|-------------|
39-
| defaultLanguageCode | Language code of the input text. For now, only `en` is supported. |
39+
| defaultLanguageCode | Language code of the input text. For now, only `en` is supported. |
4040
| minimumPrecision | A value between 0.0 and 1.0. If the confidence score (in the `piiEntities` output) is lower than the set `minimumPrecision` value, the entity is not returned or masked. The default is 0.0. |
4141
| maskingMode | A parameter that provides various ways to mask the detected PII in the input text. The following options are supported: <ul><li>`none` (default): This means that no masking will be performed and the `maskedText` output will not be returned. </li><li> `redact`: This option will remove the detected entities from the input text and not replace them with anything. Note that in this case, the offset in the `piiEntities` output will be in relation to the original text, and not the masked text. </li><li> `replace`: This option will replace the detected entities with the character given in the `maskingCharacter` parameter. The character will be repeated to the length of the detected entity so that the offsets will correctly correspond to both the input text as well as the output `maskedText`.</li></ul> |
4242
| maskingCharacter | The character that will be used to mask the text if the `maskingMode` parameter is set to `replace`. The following options are supported: `*` (default), `#`, `X`. This parameter can only be `null` if `maskingMode` is not set to `replace`. |
4343

4444

4545
## Skill inputs
4646

47-
| Input name | Description |
47+
| Input name | Description |
4848
|---------------|-------------------------------|
49-
| languageCode | Optional. Default is `en`. |
49+
| languageCode | Optional. Default is `en`. |
5050
| text | The text to analyze. |
5151

5252
## Skill outputs
5353

54-
| Output name | Description |
54+
| Output name | Description |
5555
|---------------|-------------------------------|
5656
| piiEntities | An array of complex types that contains the following fields: <ul><li>text (The actual PII as extracted)</li> <li>type</li><li>subType</li><li>score (Higher value means it's more likely to be a real entity)</li><li>offset (into the input text)</li><li>length</li></ul> <br/> [Possible types and subTypes can be found here.](https://docs.microsoft.com/azure/cognitive-services/text-analytics/named-entity-types?tabs=personal) |
5757
| maskedText | If `maskingMode` is set to a value other than `none`, this output will be the string result of the masking performed on the input text as described by the selected `maskingMode`. If `maskingMode` is set to `none`, this output will not be present. |
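The relationship between `maskingMode`, `piiEntities` offsets, and `maskedText` can be sketched in Python as follows. This is an illustration of the behavior described in the tables above, not the skill's actual implementation: the function `mask_text`, the sample text, and the sample entity are hypothetical, and the sketch assumes non-overlapping entities and plain ASCII input so that string indexes and offsets coincide.

```python
def mask_text(text, entities, masking_mode="none", masking_character="*"):
    """Approximate the maskedText output for a given maskingMode.

    Assumptions (not taken from the skill itself): entities do not overlap
    and the text is ASCII, so offsets can be used as plain string indexes.
    """
    if masking_mode == "none":
        # With "none", no maskedText output is produced.
        return None

    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["offset"]):
        start = entity["offset"]
        end = start + entity["length"]
        pieces.append(text[cursor:start])
        if masking_mode == "replace":
            # Repeat the masking character so the overall length is unchanged.
            pieces.append(masking_character * entity["length"])
        # For "redact", the entity is dropped and nothing is inserted.
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical input and entity, for illustration only.
text = "Call 555-0100 now"
pii_entities = [{"text": "555-0100", "offset": 5, "length": 8}]

replaced = mask_text(text, pii_entities, "replace", "*")
redacted = mask_text(text, pii_entities, "redact")

print(replaced)  # Call ******** now
print(redacted)  # Call  now
```

With `replace`, the masked string has the same length as the input, so the offsets in `piiEntities` apply to both strings. With `redact`, the masked string is shorter, so the offsets are only meaningful against the original text.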
5858

59-
## Sample definition
59+
## Sample definition
6060

6161
```json
6262
{
@@ -81,7 +81,7 @@ Parameters are case-sensitive and all are optional.
8181
]
8282
}
8383
```
84-
## Sample input
84+
## Sample input
8585

8686
```json
8787
{
@@ -97,7 +97,7 @@ Parameters are case-sensitive and all are optional.
9797
}
9898
```
9999

100-
## Sample output
100+
## Sample output
101101

102102
```json
103103
{
@@ -123,6 +123,7 @@ Parameters are case-sensitive and all are optional.
123123
}
124124
```
125125

126+
The offsets returned for entities in the output of this skill come directly from the [Text Analytics API](https://docs.microsoft.com/azure/cognitive-services/text-analytics/overview). If you use them to index into the original string, use the [StringInfo](https://docs.microsoft.com/dotnet/api/system.globalization.stringinfo?view=netframework-4.8) class in .NET to extract the correct content. [More details can be found here.](https://docs.microsoft.com/azure/cognitive-services/text-analytics/concepts/text-offsets)
126127

127128
## Error and warning cases
128129
If the language code for the document is unsupported, a warning is returned and no entities are extracted.
