Merge pull request #106808 from aahill/ta-graphemes

GitHubber17 · web-flow · commit 09bc3512830d · 2020-03-12T12:06:44.000-07:00
[CogSvcs] Text Analytics graphemes
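
To illustrate the problem the new text-offsets.md article describes, here is a minimal Python sketch (not part of the change itself). It measures the article's two examples in code points, UTF-8 bytes, and UTF-16 code units. The names `describe`, `WORD`, and `THUMB` are hypothetical, and the sketch does not compute grapheme counts, which would need a grapheme splitter library.

```python
import unicodedata

# The Hindi word from text-offsets.md, spelled with escapes so each code point is explicit.
WORD = "\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926"
# Thumbs-up followed by a skin-tone modifier: displayed as a single emoji.
THUMB = "\U0001F44D\U0001F3FD"


def describe(text):
    """Return the length of `text` in several units (illustrative helper)."""
    return {
        # Python's len() counts Unicode code points.
        "code_points": len(text),
        # Size of the UTF-8 payload that would be sent over HTTP.
        "utf8_bytes": len(text.encode("utf-8")),
        # Length as seen by UTF-16 based runtimes such as .NET.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Code points whose general category is a mark (Mn, Mc, Me).
        "marks": sum(1 for ch in text if unicodedata.category(ch).startswith("M")),
    }


if __name__ == "__main__":
    print(describe(WORD))
    print(describe(THUMB))
```

The three units disagree for both strings, so an offset is only meaningful when the client counts in the same unit the service used.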
diff --git a/articles/cognitive-services/text-analytics/concepts/text-offsets.md b/articles/cognitive-services/text-analytics/concepts/text-offsets.md
@@ -0,0 +1,44 @@
+---
+title: Text offsets in the Text Analytics API
+titleSuffix: Azure Cognitive Services
+description: Learn about offsets caused by multilingual and emoji encodings.
+services: cognitive-services
+author: aahill
+manager: nitinme
+ms.service: cognitive-services
+ms.subservice: text-analytics
+ms.topic: article
+ms.date: 03/09/2020
+ms.author: aahi
+ms.reviewer: jdesousa
+---
+
+# Text offsets in the Text Analytics API output
+
+Multilingual and emoji support has led to Unicode encodings that use more than one [code point](https://wikipedia.org/wiki/Code_point) to represent a single displayed character, called a grapheme. For example, emojis like 🌷 and 👍 may use several code points to compose the shape, with additional code points for visual attributes such as skin tone. Similarly, the Hindi word `अनुच्छेद` is encoded as five letters and three combining marks.
+
+Because of the different lengths of possible multilingual and emoji encodings, the Text Analytics API may return offsets in the response.
+
+## Offsets in the API response
+
+Whenever offsets are returned in the API response, such as for [Named Entity Recognition](../how-tos/text-analytics-how-to-entity-linking.md) or [Sentiment Analysis](../how-tos/text-analytics-how-to-sentiment-analysis.md), remember the following:
+
+* Elements in the response may be specific to the endpoint that was called. 
+* HTTP POST/GET payloads are encoded in [UTF-8](https://www.w3schools.com/charsets/ref_html_utf8.asp), which may or may not be the default character encoding on your client-side compiler or operating system.
+* Offsets refer to grapheme counts based on the [Unicode 8.0.0](https://unicode.org/versions/Unicode8.0.0) standard, not character counts.
+
+## Extracting substrings from text with offsets
+
+Offsets can cause problems when using character-based substring methods, for example the .NET [Substring()](https://docs.microsoft.com/dotnet/api/system.string.substring?view=netframework-4.8) method. One problem is that an offset may cause a substring method to end in the middle of a multi-character grapheme encoding instead of at its end.
+
+In .NET, consider using the [StringInfo](https://docs.microsoft.com/dotnet/api/system.globalization.stringinfo?view=netframework-4.8) class, which enables you to work with a string as a series of textual elements, rather than individual character objects. You can also look for grapheme splitter libraries in your preferred software environment. 
+
+The Text Analytics API returns these textual elements as well, for convenience.
+
+## See also
+
+* [Text Analytics overview](../overview.md)
+* [Sentiment analysis](../how-tos/text-analytics-how-to-sentiment-analysis.md)
+* [Entity recognition](../how-tos/text-analytics-how-to-entity-linking.md)
+* [Key phrase extraction](../how-tos/text-analytics-how-to-keyword-extraction.md)
+* [Language detection](../how-tos/text-analytics-how-to-language-detection.md)
diff --git a/articles/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking.md b/articles/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking.md
@@ -178,14 +178,13 @@ The Text Analytics API is stateless. No data is stored in your account, and resu
 
 All POST requests return a JSON formatted response with the IDs and detected entity properties.
 
-Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system, and then import it into an application that allows you to sort, search, and manipulate the data.
-
+Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system, and then import it into an application that allows you to sort, search, and manipulate the data. Due to multilingual and emoji support, the response may contain text offsets. See [how to process text offsets](../concepts/text-offsets.md) for more information.
 
 #### [Version 3.0-preview)](#tab/version-3)
 
 ### Example v3 responses
 
-Version 3 provides separate endpoints for NER and entity linking. The responses for both operations are below.
+Version 3 provides separate endpoints for NER and entity linking. The responses for both operations are below. 
 
 #### Example NER response
 
diff --git a/articles/cognitive-services/text-analytics/how-tos/text-analytics-how-to-sentiment-analysis.md b/articles/cognitive-services/text-analytics/how-tos/text-analytics-how-to-sentiment-analysis.md
@@ -8,7 +8,7 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: text-analytics
 ms.topic: sample
-ms.date: 02/10/2020
+ms.date: 03/09/2020
 ms.author: aahi
 ---
 
@@ -155,7 +155,7 @@ The Text Analytics API is stateless. No data is stored in your account, and resu
 
 The sentiment analyzer classifies text as predominantly positive or negative. It assigns a score in the range of 0 to 1. Values close to 0.5 are neutral or indeterminate. A score of 0.5 indicates neutrality. When a string can't be analyzed for sentiment or has no sentiment, the score is always 0.5 exactly. For example, if you pass in a Spanish string with an English language code, the score is 0.5.
 
-Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system. Then, import the output into an application that you can use to sort, search, and manipulate the data.
+Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system. Then, import the output into an application that you can use to sort, search, and manipulate the data. Due to multilingual and emoji support, the response may contain text offsets. See [how to process text offsets](../concepts/text-offsets.md) for more information.
 
 #### [Version 3.0-preview](#tab/version-3)
 
diff --git a/articles/cognitive-services/text-analytics/toc.yml b/articles/cognitive-services/text-analytics/toc.yml
@@ -52,7 +52,8 @@
     href: named-entity-types.md
   - name: Language and region support
     href: language-support.md
-
+  - name: Text offsets
+    href: concepts/text-offsets.md
 - name: How-to guides
   items:
    - name: Call the Text Analytics API