Merge pull request #102584 from Careyjmac/piiDetectionSkill

GitHubber17 · web-flow · commit 72e7848031ad · 2020-01-28T20:05:32.000-08:00
Introducing the PII Detection Skill
diff --git a/articles/search/TOC.yml b/articles/search/TOC.yml
@@ -400,6 +400,8 @@
       href: cognitive-search-skill-conditional.md
     - name: Document Extraction
       href: cognitive-search-skill-document-extraction.md
+    - name: PII Detection
+      href: cognitive-search-skill-pii-detection.md  
   - name: Custom skills
     items:
     - name: Custom Web API
diff --git a/articles/search/cognitive-search-common-errors-warnings.md b/articles/search/cognitive-search-common-errors-warnings.md
@@ -245,7 +245,7 @@ If you know that your data set contains multiple languages and thus you need the
 ```
 
 Here are some references for the currently supported languages for each of the skills that may produce this error message:
-* [Text Analytics Supported Languages](https://docs.microsoft.com/azure/cognitive-services/text-analytics/text-analytics-supported-languages) (for the [KeyPhraseExtractionSkill](cognitive-search-skill-keyphrases.md), [EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md), and [SentimentSkill](cognitive-search-skill-sentiment.md))
+* [Text Analytics Supported Languages](https://docs.microsoft.com/azure/cognitive-services/text-analytics/text-analytics-supported-languages) (for the [KeyPhraseExtractionSkill](cognitive-search-skill-keyphrases.md), [EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md), [SentimentSkill](cognitive-search-skill-sentiment.md), and [PIIDetectionSkill](cognitive-search-skill-pii-detection.md))
 * [Translator Supported Languages](https://docs.microsoft.com/azure/cognitive-services/translator/language-support) (for the [Text TranslationSkill](cognitive-search-skill-text-translation.md))
 * [Text SplitSkill](cognitive-search-skill-textsplit.md) Supported Languages: `da, de, en, es, fi, fr, it, ko, pt`
 
diff --git a/articles/search/cognitive-search-concept-intro.md b/articles/search/cognitive-search-concept-intro.md
@@ -14,7 +14,7 @@ ms.date: 11/04/2019
 
 AI enrichment is a capability of Azure Cognitive Search indexing used to extract text from images, blobs, and other unstructured data sources - enriching the content to make it more searchable in an index or knowledge store. Extraction and enrichment are implemented through *cognitive skills* attached to an indexing pipeline. Cognitive skills built into the service fall into these categories: 
 
-+ **Natural language processing** skills include [entity recognition](cognitive-search-skill-entity-recognition.md), [language detection](cognitive-search-skill-language-detection.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), text manipulation, and [sentiment detection](cognitive-search-skill-sentiment.md). With these skills, unstructured text can assume new forms, mapped as searchable and filterable fields in an index.
++ **Natural language processing** skills include [entity recognition](cognitive-search-skill-entity-recognition.md), [language detection](cognitive-search-skill-language-detection.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), text manipulation, [sentiment detection](cognitive-search-skill-sentiment.md), and [PII detection](cognitive-search-skill-pii-detection.md). With these skills, unstructured text can assume new forms, mapped as searchable and filterable fields in an index.
 
 + **Image processing** skills include [Optical Character Recognition (OCR)](cognitive-search-skill-ocr.md) and identification of [visual features](cognitive-search-skill-image-analysis.md), such as facial detection, image interpretation, image recognition (famous people and landmarks) or attributes like colors or image orientation. You can create text-representations of image content, searchable using all the query capabilities of Azure Cognitive Search.
 
@@ -104,7 +104,7 @@ Indexes are generated from an index schema that defines the fields, attributes,
 | Cognitive skill | An atomic transformation in an enrichment pipeline. Often, it is a component that extracts or infers structure, and therefore augments your understanding of the input data. Almost always, the output is text-based and the processing is natural language processing or image processing that extracts or generates text from image inputs. Output from a skill can be mapped to a field in an index, or used as an input for a downstream enrichment. A skill is either predefined and provided by Microsoft, or custom: created and deployed by you. | [Built-in cognitive skills](cognitive-search-predefined-skills.md) |
 | Data extraction | Covers a broad range of processing, but pertaining to AI enrichment, the entity recognition skill is most typically used to extract data (an entity) from a source that doesn't provide that information natively. | See [Entity Recognition Skill](cognitive-search-skill-entity-recognition.md) and [Document Extraction Skill (preview)](cognitive-search-skill-document-extraction.md)| 
 | Image processing | Infers text from an image, such as the ability to recognize a landmark, or extracts text from an image. Common examples include OCR for lifting characters from a scanned document (JPEG) file, or recognizing a street name in a photograph containing a street sign. | See [Image Analysis Skill](cognitive-search-skill-image-analysis.md) or [OCR Skill](cognitive-search-skill-ocr.md)
-| Natural language processing | Text processing for insights and information about text inputs. Language detection, sentiment analysis, and key phrase extraction are skills that fall under natural language processing.  | See [Key Phrase Extraction Skill](cognitive-search-skill-keyphrases.md), [Language Detection Skill](cognitive-search-skill-language-detection.md), [Text Translation Skill](cognitive-search-skill-text-translation.md), [Sentiment Analysis Skill](cognitive-search-skill-sentiment.md) |
+| Natural language processing | Text processing for insights and information about text inputs. Language detection, sentiment analysis, and key phrase extraction are skills that fall under natural language processing.  | See [Key Phrase Extraction Skill](cognitive-search-skill-keyphrases.md), [Language Detection Skill](cognitive-search-skill-language-detection.md), [Text Translation Skill](cognitive-search-skill-text-translation.md), [Sentiment Analysis Skill](cognitive-search-skill-sentiment.md), [PII Detection Skill (preview)](cognitive-search-skill-pii-detection.md) |
 | Document cracking | The process of extracting or creating text content from non-text sources during indexing. Optical character recognition (OCR) is an example, but generally it refers to core indexer functionality as the indexer extracts content from application files. The data source providing source file location, and the indexer definition providing field mappings, are both key factors in document cracking. | See [Indexers overview](search-indexer-overview.md) |
 | Shaping | Consolidate text fragments into a larger structure, or conversely break down larger text chunks into a manageable size for further downstream processing. | See [Shaper Skill](cognitive-search-skill-shaper.md), [Text Merger Skill](cognitive-search-skill-textmerger.md), [Text Split Skill](cognitive-search-skill-textsplit.md) |
 | Enriched documents | A transitory internal structure, generated during processing, with final output reflected in a search index. A skillset determines which enrichments are performed. Field mappings determine which data elements are added to the index. Optionally, you can create a knowledge store to persist and explore enriched documents using tools like Storage Explorer, Power BI, or any other tool that connects to Azure Blob storage. | See [Knowledge store (preview)](knowledge-store-concept-intro.md) |
diff --git a/articles/search/cognitive-search-predefined-skills.md b/articles/search/cognitive-search-predefined-skills.md
@@ -30,6 +30,7 @@ Several skills are flexible in what they consume or produce. In general, most sk
 | [Microsoft.Skills.Text.LanguageDetectionSkill](cognitive-search-skill-language-detection.md)  | This skill uses a pretrained model to detect which language is used (one language ID per document). When multiple languages are used within the same text segments, the output is the LCID of the predominantly used language.|
 | [Microsoft.Skills.Text.MergeSkill](cognitive-search-skill-textmerger.md) | Consolidates text from a collection of fields into a single field.  |
 | [Microsoft.Skills.Text.EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md) | This skill uses a pretrained model to establish entities for a fixed set of categories: people, location, organization, emails, URLs, datetime fields. |
+| [Microsoft.Skills.Text.PIIDetectionSkill](cognitive-search-skill-pii-detection.md)  | This skill uses a pretrained model to extract personally identifiable information from a given text. The skill also gives various options for masking the detected PII entities in the text.  |
 | [Microsoft.Skills.Text.SentimentSkill](cognitive-search-skill-sentiment.md)  | This skill uses a pretrained model to score positive or negative sentiment on a record by record basis. The score is between 0 and 1. Neutral scores occur for both the null case when sentiment cannot be detected, and for text that is considered neutral.  |
 | [Microsoft.Skills.Text.SplitSkill](cognitive-search-skill-textsplit.md) | Splits text into pages so that you can enrich or augment content incrementally. |
 | [Microsoft.Skills.Text.TranslationSkill](cognitive-search-skill-text-translation.md) | This skill uses a pretrained model to translate the input text into a variety of languages for normalization or localization use cases. |
diff --git a/articles/search/cognitive-search-resources-documentation.md b/articles/search/cognitive-search-resources-documentation.md
@@ -38,6 +38,7 @@ The following articles are the complete documentation for AI enrichment.
   + [Microsoft.Skills.Text.LanguageDetectionSkill](cognitive-search-skill-language-detection.md)
   + [Microsoft.Skills.Text.EntityRecognitionSkill](cognitive-search-skill-entity-recognition.md)
   + [Microsoft.Skills.Text.MergeSkill](cognitive-search-skill-textmerger.md)
+  + [Microsoft.Skills.Text.PIIDetectionSkill](cognitive-search-skill-pii-detection.md)
   + [Microsoft.Skills.Text.SplitSkill](cognitive-search-skill-textsplit.md)
   + [Microsoft.Skills.Text.SentimentSkill](cognitive-search-skill-sentiment.md)
   + [Microsoft.Skills.Text.TranslationSkill](cognitive-search-skill-text-translation.md)
diff --git a/articles/search/cognitive-search-skill-pii-detection.md b/articles/search/cognitive-search-skill-pii-detection.md
@@ -0,0 +1,137 @@
+---
+title: PII Detection cognitive skill (preview)
+titleSuffix: Azure Cognitive Search
+description: Extract and mask personally identifiable information from text in an enrichment pipeline in Azure Cognitive Search. This skill is currently in public preview.
+
+manager: nitinme
+author: careyjmac
+ms.author: chalton
+ms.service: cognitive-search
+ms.topic: conceptual
+ms.date: 1/27/2020
+---
+
+#	PII Detection cognitive skill
+
+> [!IMPORTANT] 
+> This skill is currently in public preview. Preview functionality is provided without a service level agreement, and is not recommended for production workloads. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/). There is currently no portal or .NET SDK support.
+
+The **PII Detection** skill extracts personally identifiable information from an input text and gives you the option to mask it from that text in various ways. This skill uses the machine learning models provided by [Text Analytics](https://docs.microsoft.com/azure/cognitive-services/text-analytics/overview) in Cognitive Services.
+
+> [!NOTE]
+> As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to [attach a billable Cognitive Services resource](cognitive-search-attach-cognitive-services.md). Charges accrue when calling APIs in Cognitive Services, and for image extraction as part of the document-cracking stage in Azure Cognitive Search. There are no charges for text extraction from documents.
+>
+> Execution of built-in skills is charged at the existing [Cognitive Services pay-as-you go price](https://azure.microsoft.com/pricing/details/cognitive-services/). Image extraction pricing is described on the [Azure Cognitive Search pricing page](https://go.microsoft.com/fwlink/?linkid=2042400).
+
+
+## @odata.type  
+Microsoft.Skills.Text.PIIDetectionSkill
+
+## Data limits
+The maximum size of a record should be 50,000 characters as measured by [`String.Length`](https://docs.microsoft.com/dotnet/api/system.string.length). If you need to break up your data before sending it to the skill, consider using the [Text Split skill](cognitive-search-skill-textsplit.md).
+
+## Skill parameters
+
+Parameters are case-sensitive and all are optional.
+
+| Parameter name	 | Description |
+|--------------------|-------------|
+| defaultLanguageCode |	Language code of the input text. For now, only `en` is supported. |
+| minimumPrecision | A value between 0.0 and 1.0. If the confidence score (in the `piiEntities` output) is lower than the set `minimumPrecision` value, the entity is not returned or masked. The default is 0.0. |
+| maskingMode | A parameter that provides various ways to mask the detected PII in the input text. The following options are supported: <ul><li>`none` (default): This means that no masking will be performed and the `maskedText` output will not be returned. </li><li> `redact`: This option will remove the detected entities from the input text and not replace them with anything. Note that in this case, the offset in the `piiEntities` output will be in relation to the original text, and not the masked text. </li><li> `replace`: This option will replace the detected entities with the character given in the `maskingCharacter` parameter.  The character will be repeated to the length of the detected entity so that the offsets will correctly correspond to both the input text as well as the output `maskedText`.</li></ul> |
+| maskingCharacter | The character that will be used to masked the text if the `maskingMode` parameter is set to `replace`. The following options are supported: `*` (default), `#`, `X`. This parameter can only be `null` if `maskingMode` is not set to `replace`. |
+
+
+## Skill inputs
+
+| Input name	  | Description                   |
+|---------------|-------------------------------|
+| languageCode	| Optional. Default is `en`.  |
+| text          | The text to analyze.          |
+
+## Skill outputs
+
+| Output name	  | Description                   |
+|---------------|-------------------------------|
+| piiEntities | An array of complex types that contains the following fields: <ul><li>text (The actual PII as extracted)</li> <li>type</li><li>subType</li><li>score (Higher value means it's more likely to be a real entity)</li><li>offset (into the input text)</li><li>length</li></ul> </br> [Possible types and subTypes can be found here.](https://docs.microsoft.com/azure/cognitive-services/text-analytics/named-entity-types?tabs=personal) |
+| maskedText | If `maskingMode` is set to a value other than `none`, this output will be the string result of the masking performed on the input text as described by the selected `maskingMode`.  If `maskingMode` is set to `none`, this output will not be present. |
+
+##	Sample definition
+
+```json
+  {
+    "@odata.type": "#Microsoft.Skills.Text.PIIDetectionSkill",
+    "defaultLanguageCode": "en",
+    "minimumPrecision": 0.5,
+    "maskingMode": "replace",
+    "maskingCharacter": "*",
+    "inputs": [
+      {
+        "name": "text",
+        "source": "/document/content"
+      }
+    ],
+    "outputs": [
+      {
+        "name": "piiEntities"
+      },
+      {
+        "name": "maskedText"
+      }
+    ]
+  }
+```
+##	Sample input
+
+```json
+{
+    "values": [
+      {
+        "recordId": "1",
+        "data":
+           {
+             "text": "Microsoft employee with ssn 859-98-0987 is using our awesome API's."
+           }
+      }
+    ]
+}
+```
+
+##	Sample output
+
+```json
+{
+  "values": [
+    {
+      "recordId": "1",
+      "data" : 
+      {
+        "piiEntities":[ 
+           { 
+              "text":"859-98-0987",
+              "type":"U.S. Social Security Number (SSN)",
+              "subtype":"",
+              "offset":28,
+              "length":11,
+              "score":0.65
+           }
+        ],
+        "maskedText": "Microsoft employee with ssn *********** is using our awesome API's."
+      }
+    }
+  ]
+}
+```
+
+
+## Error and warning cases
+If the language code for the document is unsupported, a warning is returned and no entities are extracted.
+If your text is empty, a warning will be produced.
+If your text is larger than 50,000 characters, only the first 50,000 characters will be analyzed and a warning will be issued.
+
+If the skill returns a warning, the output `maskedText` may be empty.  This means that if you expect that output to exist for input into later skills, it will not work as intended. Keep this in mind when writing your skillset definition.
+
+## See also
+
++ [Built-in skills](cognitive-search-predefined-skills.md)
++ [How to define a skillset](cognitive-search-defining-skillset.md)
diff --git a/articles/search/search-limits-quotas-capacity.md b/articles/search/search-limits-quotas-capacity.md
@@ -147,7 +147,7 @@ For the Storage Optimized tiers,  you should expect a lower query throughput and
 
 ## Data limits (AI enrichment)
 
-An [AI enrichment pipeline](cognitive-search-concept-intro.md) that makes calls to a Text Analytics resource for [entity recognition](cognitive-search-skill-entity-recognition.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), [sentiment analysis](cognitive-search-skill-sentiment.md), and [language detection](cognitive-search-skill-language-detection.md) is subject to data limits. The maximum size of a record should be 50,000 characters as measured by [`String.Length`](https://docs.microsoft.com/dotnet/api/system.string.length). If you need to break up your data before sending it to the sentiment analyzer, use the [Text Split skill](cognitive-search-skill-textsplit.md).
+An [AI enrichment pipeline](cognitive-search-concept-intro.md) that makes calls to a Text Analytics resource for [entity recognition](cognitive-search-skill-entity-recognition.md), [key phrase extraction](cognitive-search-skill-keyphrases.md), [sentiment analysis](cognitive-search-skill-sentiment.md), [language detection](cognitive-search-skill-language-detection.md), and [PII detection](cognitive-search-skill-pii-detection.md) is subject to data limits. The maximum size of a record should be 50,000 characters as measured by [`String.Length`](https://docs.microsoft.com/dotnet/api/system.string.length). If you need to break up your data before sending it to the sentiment analyzer, use the [Text Split skill](cognitive-search-skill-textsplit.md).
 
 ## Throttling limits
 
diff --git a/articles/search/whats-new.md b/articles/search/whats-new.md
@@ -22,6 +22,10 @@ Azure Search is now renamed to **Azure Cognitive Search** to reflect the expande
 
 ## Feature announcements
 
+### February 2020
+
++ [PII Detection](cognitive-search-skill-pii-detection.md) is a cognitive skill used during indexing that extracts personally identifiable information from an input text and gives you the option to mask it from that text in various ways.
+
 ### January 2020
 
 + [Customer-managed encryption keys](search-security-manage-encryption-keys.md) is now generally available. If you are using REST, you can access the feature using `api-version=2019-05-06`. For managed code, the correct package is still [.NET SDK version 8.0-preview](search-dotnet-sdk-migration-version-9.md) even though the feature is out of preview.