Commit a435810

Author: Carey MacDonald
Add documentation for pii detection skill
1 parent 5ec7be9 commit a435810

8 files changed: 146 additions and 4 deletions

articles/search/TOC.yml

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -400,6 +400,8 @@
400400
href: cognitive-search-skill-conditional.md
401401
- name: Document Extraction
402402
href: cognitive-search-skill-document-extraction.md
403+
- name: PII Detection
404+
href: cognitive-search-skill-pii-detection.md
403405
- name: Custom skills
404406
items:
405407
- name: Custom Web API

articles/search/cognitive-search-common-errors-warnings.md

Lines changed: 1 addition & 1 deletion
@@ -245,7 +245,7 @@ If you know that your data set contains multiple languages and thus you need the
245245
```
246246

247247
Here are some references for the currently supported languages for each of the skills that may produce this error message:
248-
* [Text Analytics Supported Languages](https://docs.microsoft.com/azure/cognitive-services/text-analytics/text-analytics-supported-languages) (for the [KeyPhraseExtractionSkill](cognitive-search-skill-keyphrases.md), [EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md), and [SentimentSkill](cognitive-search-skill-sentiment.md))
248+
* [Text Analytics Supported Languages](https://docs.microsoft.com/azure/cognitive-services/text-analytics/text-analytics-supported-languages) (for the [KeyPhraseExtractionSkill](cognitive-search-skill-keyphrases.md), [EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md), [SentimentSkill](cognitive-search-skill-sentiment.md), and [PIIDetectionSkill](cognitive-search-skill-pii-detection.md))
249249
* [Translator Supported Languages](https://docs.microsoft.com/azure/cognitive-services/translator/language-support) (for the [Text TranslationSkill](cognitive-search-skill-text-translation.md))
250250
* [Text SplitSkill](cognitive-search-skill-textsplit.md) Supported Languages: `da, de, en, es, fi, fr, it, ko, pt`
251251

articles/search/cognitive-search-concept-intro.md

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ ms.date: 11/04/2019
1414

1515
AI enrichment is a capability of Azure Cognitive Search indexing used to extract text from images, blobs, and other unstructured data sources - enriching the content to make it more searchable in an index or knowledge store. Extraction and enrichment are implemented through *cognitive skills* attached to an indexing pipeline. Cognitive skills built into the service fall into these categories:
1616

17-
+ **Natural language processing** skills include [entity recognition](cognitive-search-skill-entity-recognition.md), [language detection](cognitive-search-skill-language-detection.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), text manipulation, and [sentiment detection](cognitive-search-skill-sentiment.md). With these skills, unstructured text can assume new forms, mapped as searchable and filterable fields in an index.
17+
+ **Natural language processing** skills include [entity recognition](cognitive-search-skill-entity-recognition.md), [language detection](cognitive-search-skill-language-detection.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), text manipulation, [sentiment detection](cognitive-search-skill-sentiment.md), and [PII detection](cognitive-search-skill-pii-detection.md). With these skills, unstructured text can assume new forms, mapped as searchable and filterable fields in an index.
1818

1919
+ **Image processing** skills include [Optical Character Recognition (OCR)](cognitive-search-skill-ocr.md) and identification of [visual features](cognitive-search-skill-image-analysis.md), such as facial detection, image interpretation, image recognition (famous people and landmarks) or attributes like colors or image orientation. You can create text-representations of image content, searchable using all the query capabilities of Azure Cognitive Search.
2020

@@ -104,7 +104,7 @@ Indexes are generated from an index schema that defines the fields, attributes,
104104
| Cognitive skill | An atomic transformation in an enrichment pipeline. Often, it is a component that extracts or infers structure, and therefore augments your understanding of the input data. Almost always, the output is text-based and the processing is natural language processing or image processing that extracts or generates text from image inputs. Output from a skill can be mapped to a field in an index, or used as an input for a downstream enrichment. A skill is either predefined and provided by Microsoft, or custom: created and deployed by you. | [Built-in cognitive skills](cognitive-search-predefined-skills.md) |
105105
| Data extraction | Covers a broad range of processing, but pertaining to AI enrichment, the entity recognition skill is most typically used to extract data (an entity) from a source that doesn't provide that information natively. | See [Entity Recognition Skill](cognitive-search-skill-entity-recognition.md) and [Document Extraction Skill (preview)](cognitive-search-skill-document-extraction.md)|
106106
| Image processing | Infers text from an image, such as the ability to recognize a landmark, or extracts text from an image. Common examples include OCR for lifting characters from a scanned document (JPEG) file, or recognizing a street name in a photograph containing a street sign. | See [Image Analysis Skill](cognitive-search-skill-image-analysis.md) or [OCR Skill](cognitive-search-skill-ocr.md)
107-
| Natural language processing | Text processing for insights and information about text inputs. Language detection, sentiment analysis, and key phrase extraction are skills that fall under natural language processing. | See [Key Phrase Extraction Skill](cognitive-search-skill-keyphrases.md), [Language Detection Skill](cognitive-search-skill-language-detection.md), [Text Translation Skill (preview)](cognitive-search-skill-text-translation.md), [Sentiment Analysis Skill](cognitive-search-skill-sentiment.md) |
107+
| Natural language processing | Text processing for insights and information about text inputs. Language detection, sentiment analysis, and key phrase extraction are skills that fall under natural language processing. | See [Key Phrase Extraction Skill](cognitive-search-skill-keyphrases.md), [Language Detection Skill](cognitive-search-skill-language-detection.md), [Text Translation Skill (preview)](cognitive-search-skill-text-translation.md), [Sentiment Analysis Skill](cognitive-search-skill-sentiment.md), [PII Detection Skill (preview)](cognitive-search-skill-pii-detection.md) |
108108
| Document cracking | The process of extracting or creating text content from non-text sources during indexing. Optical character recognition (OCR) is an example, but generally it refers to core indexer functionality as the indexer extracts content from application files. The data source providing source file location, and the indexer definition providing field mappings, are both key factors in document cracking. | See [Indexers overview](search-indexer-overview.md) |
109109
| Shaping | Consolidate text fragments into a larger structure, or conversely break down larger text chunks into a manageable size for further downstream processing. | See [Shaper Skill](cognitive-search-skill-shaper.md), [Text Merger Skill](cognitive-search-skill-textmerger.md), [Text Split Skill](cognitive-search-skill-textsplit.md) |
110110
| Enriched documents | A transitory internal structure, generated during processing, with final output reflected in a search index. A skillset determines which enrichments are performed. Field mappings determine which data elements are added to the index. Optionally, you can create a knowledge store to persist and explore enriched documents using tools like Storage Explorer, Power BI, or any other tool that connects to Azure Blob storage. | See [Knowledge store (preview)](knowledge-store-concept-intro.md) |

articles/search/cognitive-search-predefined-skills.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ Several skills are flexible in what they consume or produce. In general, most sk
3030
| [Microsoft.Skills.Text.LanguageDetectionSkill](cognitive-search-skill-language-detection.md) | This skill uses a pretrained model to detect which language is used (one language ID per document). When multiple languages are used within the same text segments, the output is the LCID of the predominantly used language.|
3131
| [Microsoft.Skills.Text.MergeSkill](cognitive-search-skill-textmerger.md) | Consolidates text from a collection of fields into a single field. |
3232
| [Microsoft.Skills.Text.EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md) | This skill uses a pretrained model to establish entities for a fixed set of categories: people, location, organization, emails, URLs, datetime fields. |
33+
| [Microsoft.Skills.Text.PIIDetectionSkill](cognitive-search-skill-pii-detection.md) | This skill uses a pretrained model to extract personally identifiable information from a given text. The skill also gives various options for masking the detected PII entities in the text. |
3334
| [Microsoft.Skills.Text.SentimentSkill](cognitive-search-skill-sentiment.md) | This skill uses a pretrained model to score positive or negative sentiment on a record by record basis. The score is between 0 and 1. Neutral scores occur for both the null case when sentiment cannot be detected, and for text that is considered neutral. |
3435
| [Microsoft.Skills.Text.SplitSkill](cognitive-search-skill-textsplit.md) | Splits text into pages so that you can enrich or augment content incrementally. |
3536
| [Microsoft.Skills.Text.TranslationSkill](cognitive-search-skill-text-translation.md) | This skill uses a pretrained model to translate the input text into a variety of languages for normalization or localization use cases. |

articles/search/cognitive-search-resources-documentation.md

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ The following articles are the complete documentation for AI enrichment.
3838
+ [Microsoft.Skills.Text.LanguageDetectionSkill](cognitive-search-skill-language-detection.md)
3939
+ [Microsoft.Skills.Text.EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md)
4040
+ [Microsoft.Skills.Text.MergeSkill](cognitive-search-skill-textmerger.md)
41+
+ [Microsoft.Skills.Text.PIIDetectionSkill](cognitive-search-skill-pii-detection.md)
4142
+ [Microsoft.Skills.Text.SplitSkill](cognitive-search-skill-textsplit.md)
4243
+ [Microsoft.Skills.Text.SentimentSkill](cognitive-search-skill-sentiment.md)
4344
+ [Microsoft.Skills.Text.TranslationSkill](cognitive-search-skill-text-translation.md)

articles/search/cognitive-search-skill-pii-detection.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
---
2+
title: PII Detection cognitive skill (preview)
3+
titleSuffix: Azure Cognitive Search
4+
description: Extract and mask personally identifiable information from text in an enrichment pipeline in Azure Cognitive Search. This skill is currently in public preview.
5+
6+
manager: nitinme
7+
author: careyjmac
8+
ms.author: chalton
9+
ms.service: cognitive-search
10+
ms.topic: conceptual
11+
ms.date: 1/27/2020
12+
---
13+
14+
# PII Detection cognitive skill
15+
16+
> [!IMPORTANT]
17+
> This skill is currently in public preview. Preview functionality is provided without a service level agreement, and is not recommended for production workloads. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
18+
> The [REST API version 2019-05-06-Preview](search-api-preview.md) provides preview features. There is currently no portal or .NET SDK support.
19+
20+
The **PII Detection** skill extracts personally identifiable information from input text and gives you the option to mask it in that text in several ways. This skill uses the machine learning models provided by [Text Analytics](https://docs.microsoft.com/azure/cognitive-services/text-analytics/overview) in Cognitive Services.
21+
22+
> [!NOTE]
23+
> As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to [attach a billable Cognitive Services resource](cognitive-search-attach-cognitive-services.md). Charges accrue when calling APIs in Cognitive Services, and for image extraction as part of the document-cracking stage in Azure Cognitive Search. There are no charges for text extraction from documents.
24+
>
25+
> Execution of built-in skills is charged at the existing [Cognitive Services pay-as-you-go price](https://azure.microsoft.com/pricing/details/cognitive-services/). Image extraction pricing is described on the [Azure Cognitive Search pricing page](https://go.microsoft.com/fwlink/?linkid=2042400).
26+
27+
28+
## @odata.type
29+
Microsoft.Skills.Text.PIIDetectionSkill
30+
31+
## Data limits
32+
The maximum size of a record should be 50,000 characters as measured by [`String.Length`](https://docs.microsoft.com/dotnet/api/system.string.length). If you need to break up your data before sending it to the skill, consider using the [Text Split skill](cognitive-search-skill-textsplit.md).
33+
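As a rough illustration of the limit above, the following Python sketch splits a long string into pieces of at most 50,000 UTF-16 code units, which is what `String.Length` counts. It is a client-side approximation only: the function names `utf16_length` and `split_record` are hypothetical, and inside a skillset the Text Split skill remains the intended way to page content.

```python
from typing import List

MAX_RECORD_LENGTH = 50_000  # documented limit, measured like .NET String.Length


def utf16_length(text: str) -> int:
    """Count UTF-16 code units, the unit that .NET String.Length reports."""
    return len(text.encode("utf-16-le")) // 2


def split_record(text: str, limit: int = MAX_RECORD_LENGTH) -> List[str]:
    """Split text into chunks whose UTF-16 length does not exceed limit.

    Hypothetical helper: it cuts on character boundaries only and does not
    try to respect sentence or word boundaries.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for ch in text:
        # Characters outside the Basic Multilingual Plane take two code units.
        units = 2 if ord(ch) > 0xFFFF else 1
        if used + units > limit and current:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += units
    if current:
        chunks.append("".join(current))
    return chunks


pieces = split_record("a" * 120_000)
print([utf16_length(p) for p in pieces])
```

Because the split ignores sentence boundaries, an entity that straddles a cut could be missed; treat this as a sizing aid, not a replacement for the Text Split skill.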
34+
## Skill parameters
35+
36+
Parameters are case-sensitive and are all optional.
37+
38+
| Parameter name | Description |
39+
|--------------------|-------------|
40+
| defaultLanguageCode | Language code of the input text. For now, only `en` is supported. |
41+
| minimumPrecision | A value between 0 and 1. If the confidence score (in the `piiEntities` output) is lower than this value, the entity is not returned or masked. The default is 0. |
42+
| maskingMode | A parameter that provides various ways to mask the detected PII in the input text. The following options are supported: <ul><li>`none` (default): This means that no masking will be performed and the `maskedText` output will not be returned. </li><li> `redact`: This option will remove the detected entities from the input text and not replace them with anything. Note that in this case, the offset in the `piiEntities` output will be in relation to the original text, and not the masked text. </li><li> `replace`: This option will replace the detected entities with the character given in the `maskingCharacter` parameter. The character will be repeated to the length of the detected entity so that the offsets will correctly correspond to both the input text and the output `maskedText`.</li></ul> |
43+
| maskingCharacter | The character that will be used to mask the text if the `maskingMode` parameter is set to `replace`. The following options are supported: `*` (default), `#`, `X`. This parameter can only be `null` if `maskingMode` is not set to `replace`. |
44+
45+
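The difference between the `redact` and `replace` masking modes described in the parameter table can be sketched in Python. This is an illustrative re-implementation under stated assumptions, not the service's code: `mask_text` is a hypothetical helper, entities are assumed to be non-overlapping, and offsets are treated as Python string indexes (which matches `String.Length` offsets only for text without surrogate pairs).

```python
from typing import Dict, List, Optional


def mask_text(
    text: str,
    entities: List[Dict],
    mode: str = "none",
    masking_character: str = "*",
) -> Optional[str]:
    """Mimic the documented maskingMode options on already-detected entities."""
    if mode == "none":
        return None  # no maskedText output is returned in this mode
    if mode not in ("redact", "replace"):
        raise ValueError(f"unsupported masking mode: {mode}")
    result = text
    # Process entities from last to first so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["offset"], reverse=True):
        start = entity["offset"]
        end = start + entity["length"]
        # replace keeps the length; redact removes the span entirely.
        filler = masking_character * entity["length"] if mode == "replace" else ""
        result = result[:start] + filler + result[end:]
    return result


text = "Microsoft employee with ssn 859-98-0987 is using our awesome API's."
entities = [{"text": "859-98-0987", "offset": 28, "length": 11}]

replaced = mask_text(text, entities, mode="replace")
redacted = mask_text(text, entities, mode="redact")
print(replaced)
print(redacted)
```

With `replace`, the masked string has the same length as the input, so each entity's offset and length still point at the masked span. With `redact`, the string shrinks, which is why the offsets only make sense against the original text.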
46+
## Skill inputs
47+
48+
| Input name | Description |
49+
|---------------|-------------------------------|
50+
| languageCode | Optional. Default is `en`. |
51+
| text | The text to analyze. |
52+
53+
## Skill outputs
54+
55+
| Output name | Description |
56+
|---------------|-------------------------------|
57+
| piiEntities | An array of complex types that contains the following fields: <ul><li>text (The actual PII as extracted)</li> <li>type</li><li>subType</li><li>score (Higher value means it's more likely to be a real entity)</li><li>offset (into the input text)</li><li>length</li></ul> <br/> [Possible types and subTypes can be found here.](https://docs.microsoft.com/azure/cognitive-services/text-analytics/named-entity-types?tabs=personal) |
58+
| maskedText | If `maskingMode` is set to a value other than `none`, this output will be the string result of the masking performed on the input text as described by the selected `maskingMode`. If `maskingMode` is set to `none`, this output will not be present. |
59+
60+
## Sample definition
61+
62+
```json
63+
{
64+
"@odata.type": "#Microsoft.Skills.Text.PIIDetectionSkill",
65+
"defaultLanguageCode": "en",
66+
"minimumPrecision": 0.5,
67+
"maskingMode": "replace",
68+
"maskingCharacter": "*",
69+
"inputs": [
70+
{
71+
"name": "text",
72+
"source": "/document/content"
73+
}
74+
],
75+
"outputs": [
76+
{
77+
"name": "piiEntities"
78+
},
79+
{
80+
"name": "maskedText"
81+
}
82+
]
83+
}
84+
```
85+
## Sample input
86+
87+
```json
88+
{
89+
"values": [
90+
{
91+
"recordId": "1",
92+
"data":
93+
{
94+
"text": "Microsoft employee with ssn 859-98-0987 is using our awesome API's."
95+
}
96+
}
97+
]
98+
}
99+
```
100+
101+
## Sample output
102+
103+
```json
104+
{
105+
"values": [
106+
{
107+
"recordId": "1",
108+
"data" :
109+
{
110+
"piiEntities":[
111+
{
112+
"text":"859-98-0987",
113+
"type":"U.S. Social Security Number (SSN)",
114+
"subtype":"",
115+
"offset":28,
116+
"length":11,
117+
"score":0.65
118+
}
119+
],
120+
"maskedText": "Microsoft employee with ssn *********** is using our awesome API's."
121+
}
122+
}
123+
]
124+
}
125+
```
126+
127+
128+
## Error cases
129+
If the language code for the document is unsupported, an error is returned and no entities are extracted.
130+
131+
If the skill returns a warning for any reason, the `maskedText` output will be empty, so any later skill that takes that output as its input will not work as intended. Keep this in mind when writing your skillset definition.
132+
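If you post-process the skill's output yourself (for example in a custom skill further down the pipeline), a defensive check along these lines may help. This is a minimal sketch: `masked_or_none` is a hypothetical helper, and the record shape is assumed to follow the sample output above.

```python
from typing import Optional


def masked_or_none(record: dict) -> Optional[str]:
    """Return maskedText for one output record, or None if it is missing or empty."""
    masked = record.get("data", {}).get("maskedText")
    return masked if masked else None


ok_record = {
    "recordId": "1",
    "data": {"piiEntities": [], "maskedText": "ssn *********** on file"},
}
# Assumed shape of a record whose maskedText came back empty after a warning.
warned_record = {"recordId": "2", "data": {"piiEntities": [], "maskedText": ""}}

for record in (ok_record, warned_record):
    masked = masked_or_none(record)
    if masked is None:
        print(f"record {record['recordId']}: no masked text, skipping")
    else:
        print(f"record {record['recordId']}: {masked}")
```

Returning `None` rather than falling back to the original text avoids accidentally passing unmasked PII to later steps.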
133+
## See also
134+
135+
+ [Built-in skills](cognitive-search-predefined-skills.md)
136+
+ [How to define a skillset](cognitive-search-defining-skillset.md)

articles/search/search-api-preview.md

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@ This article describes the `api-version=2019-05-06-Preview` version of Search se
3030

3131
+ [Text Translation (preview)](cognitive-search-skill-text-translation.md) is a cognitive skill used during indexing that evaluates text and, for each record, returns the text translated to the specified target language.
3232

33+
+ [PII Detection (preview)](cognitive-search-skill-pii-detection.md) is a cognitive skill used during indexing that allows you to extract and mask personally identifiable information from text.
34+
3335
+ [Knowledge store](knowledge-store-concept-intro.md) is a new destination of an AI-based enrichment pipeline. The physical data structure exists in Azure Blob storage and Azure Table storage, and it is created and populated when you run an indexer that has an attached cognitive skillset. The definition of a knowledge store itself is specified within a skillset definition. Within the knowledge store definition, you control the physical structures of your data through *projection* elements that determine how data is shaped, whether data is stored in Table storage or Blob storage, and whether there are multiple views.
3436

3537
## Earlier preview features

articles/search/search-limits-quotas-capacity.md

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ For the Storage Optimized tiers, you should expect a lower query throughput and
147147

148148
## Data limits (AI enrichment)
149149

150-
An [AI enrichment pipeline](cognitive-search-concept-intro.md) that makes calls to a Text Analytics resource for [entity recognition](cognitive-search-skill-entity-recognition.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), [sentiment analysis](cognitive-search-skill-sentiment.md), and [language detection](cognitive-search-skill-language-detection.md) is subject to data limits. The maximum size of a record should be 50,000 characters as measured by [`String.Length`](https://docs.microsoft.com/dotnet/api/system.string.length). If you need to break up your data before sending it to the sentiment analyzer, use the [Text Split skill](cognitive-search-skill-textsplit.md).
150+
An [AI enrichment pipeline](cognitive-search-concept-intro.md) that makes calls to a Text Analytics resource for [entity recognition](cognitive-search-skill-entity-recognition.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), [sentiment analysis](cognitive-search-skill-sentiment.md), [language detection](cognitive-search-skill-language-detection.md), and [PII detection](cognitive-search-skill-pii-detection.md) is subject to data limits. The maximum size of a record should be 50,000 characters as measured by [`String.Length`](https://docs.microsoft.com/dotnet/api/system.string.length). If you need to break up your data before sending it to the sentiment analyzer, use the [Text Split skill](cognitive-search-skill-textsplit.md).
151151

152152
## Throttling limits
153153
