Commit 09bc351

Merge pull request #106808 from aahill/ta-graphemes
[CogSvcs] Text Analytics graphemes
2 parents cf2bfb7 + 433428e commit 09bc351

4 files changed

+50
-6
lines changed

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
---
2+
title: Text offsets in the Text Analytics API
3+
titleSuffix: Azure Cognitive Services
4+
description: Learn about offsets caused by multilingual and emoji encodings.
5+
services: cognitive-services
6+
author: aahill
7+
manager: nitinme
8+
ms.service: cognitive-services
9+
ms.subservice: text-analytics
10+
ms.topic: article
11+
ms.date: 03/09/2020
12+
ms.author: aahi
13+
ms.reviewer: jdesousa
14+
---
15+
16+
# Text offsets in the Text Analytics API output
17+
18+
Multilingual and emoji support has led to Unicode encodings that use more than one [code point](https://wikipedia.org/wiki/Code_point) to represent a single displayed character, called a grapheme. For example, emojis like 🌷 and 👍 may use several characters to compose the shape with additional characters for visual attributes, such as skin tone. Similarly, the Hindi word `अनुच्छेद` is encoded as five letters and three combining marks.
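As a rough illustration of the gap between a displayed character and its encoded form, the following Python sketch counts the same strings in code points, UTF-16 code units, and UTF-8 bytes. It is not part of the Text Analytics API; the helper name `encoding_lengths` and the sample strings are assumptions chosen for this example.

```python
def encoding_lengths(text: str) -> dict:
    """Return the size of `text` in three common units (illustrative helper)."""
    return {
        "code_points": len(text),  # a Python str is a sequence of code points
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf8_bytes": len(text.encode("utf-8")),
    }

# Sample strings written as escapes so the source stays ASCII-only.
samples = {
    "thumbs_up": "\U0001F44D",                    # thumbs-up emoji
    "thumbs_up_toned": "\U0001F44D\U0001F3FD",    # emoji plus a skin-tone modifier
    "hindi_word": "\u0905\u0928\u0941\u091a\u094d\u091b\u0947\u0926",  # the Hindi word above
}

for name, value in samples.items():
    print(name, encoding_lengths(value))
```

The emoji with a skin-tone modifier is displayed as one character but occupies two code points, and the Hindi word occupies eight code points (five letters and three combining marks), so none of these unit counts matches what a reader sees on screen.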
19+
20+
Because of the different lengths of possible multilingual and emoji encodings, the Text Analytics API may return offsets in the response.
21+
22+
## Offsets in the API response
23+
24+
Whenever offsets are returned in the API response, such as for [Named Entity Recognition](../how-tos/text-analytics-how-to-entity-linking.md) or [Sentiment Analysis](../how-tos/text-analytics-how-to-sentiment-analysis.md), remember the following:
25+
26+
* Elements in the response may be specific to the endpoint that was called.
27+
* HTTP POST/GET payloads are encoded in [UTF-8](https://www.w3schools.com/charsets/ref_html_utf8.asp), which may or may not be the default character encoding on your client-side compiler or operating system.
28+
* Offsets refer to grapheme counts based on the [Unicode 8.0.0](https://unicode.org/versions/Unicode8.0.0) standard, not character counts.
29+
30+
## Extracting substrings from text with offsets
31+
32+
Offsets can cause problems when using character-based substring methods, for example the .NET [substring()](https://docs.microsoft.com/dotnet/api/system.string.substring?view=netframework-4.8) method. One problem is that an offset may cause a substring method to cut through the middle of a grapheme that is encoded as several characters, instead of stopping at its boundary.
33+
34+
In .NET, consider using the [StringInfo](https://docs.microsoft.com/dotnet/api/system.globalization.stringinfo?view=netframework-4.8) class, which enables you to work with a string as a series of textual elements, rather than individual character objects. You can also look for grapheme splitter libraries in your preferred software environment.
35+
36+
The Text Analytics API returns these textual elements as well, for convenience.
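For environments without a `StringInfo` equivalent, the idea can be sketched in Python as follows. This is only an approximation and not the API's own segmentation: it attaches combining marks and emoji skin-tone modifiers to the preceding character and ignores other cases (for example, zero-width-joiner emoji sequences), so a full grapheme splitter library is still the safer choice. The function names `split_graphemes` and `grapheme_substring` are hypothetical, and the offset and length are assumed to be grapheme counts, as described above.

```python
import unicodedata
from typing import List

def split_graphemes(text: str) -> List[str]:
    """Approximate grapheme split (NOT a full Unicode segmentation).

    Combining marks (general category M*) and emoji skin-tone modifiers
    (U+1F3FB..U+1F3FF) are attached to the preceding character.
    """
    clusters: List[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        is_skin_tone = 0x1F3FB <= ord(ch) <= 0x1F3FF
        if clusters and (is_mark or is_skin_tone):
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def grapheme_substring(text: str, offset: int, length: int) -> str:
    """Slice `text` using an offset and length counted in graphemes."""
    return "".join(split_graphemes(text)[offset:offset + length])

# Thumbs-up emoji with a skin-tone modifier, followed by " ok".
text = "\U0001F44D\U0001F3FD ok"
by_code_point = text[2:4]                      # slices by code point
by_grapheme = grapheme_substring(text, 2, 2)   # slices by grapheme
print(repr(by_code_point), repr(by_grapheme))
```

With the same offset and length, the code-point slice returns `' o'` while the grapheme-based slice returns `'ok'`, which is the kind of mismatch the section above warns about.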
37+
38+
## See also
39+
40+
* [Text Analytics overview](../overview.md)
41+
* [Sentiment analysis](../how-tos/text-analytics-how-to-sentiment-analysis.md)
42+
* [Entity recognition](../how-tos/text-analytics-how-to-entity-linking.md)
43+
* [Key phrase extraction](../how-tos/text-analytics-how-to-keyword-extraction.md)
44+
* [Language detection](../how-tos/text-analytics-how-to-language-detection.md)

articles/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking.md

Lines changed: 2 additions & 3 deletions
@@ -178,14 +178,13 @@ The Text Analytics API is stateless. No data is stored in your account, and resu
178178

179179
All POST requests return a JSON formatted response with the IDs and detected entity properties.
180180

181-
Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system, and then import it into an application that allows you to sort, search, and manipulate the data.
182-
181+
Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system, and then import it into an application that allows you to sort, search, and manipulate the data. Due to multilingual and emoji support, the response may contain text offsets. See [how to process text offsets](../concepts/text-offsets.md) for more information.
183182

184183
#### [Version 3.0-preview](#tab/version-3)
185184

186185
### Example v3 responses
187186

188-
Version 3 provides separate endpoints for NER and entity linking. The responses for both operations are below.
187+
Version 3 provides separate endpoints for NER and entity linking. The responses for both operations are below.
189188

190189
#### Example NER response
191190

articles/cognitive-services/text-analytics/how-tos/text-analytics-how-to-sentiment-analysis.md

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ manager: nitinme
88
ms.service: cognitive-services
99
ms.subservice: text-analytics
1010
ms.topic: sample
11-
ms.date: 02/10/2020
11+
ms.date: 03/09/2020
1212
ms.author: aahi
1313
---
1414

@@ -155,7 +155,7 @@ The Text Analytics API is stateless. No data is stored in your account, and resu
155155

156156
The sentiment analyzer classifies text as predominantly positive or negative. It assigns a score in the range of 0 to 1. Values close to 0.5 are neutral or indeterminate. A score of 0.5 indicates neutrality. When a string can't be analyzed for sentiment or has no sentiment, the score is always 0.5 exactly. For example, if you pass in a Spanish string with an English language code, the score is 0.5.
157157

158-
Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system. Then, import the output into an application that you can use to sort, search, and manipulate the data.
158+
Output is returned immediately. You can stream the results to an application that accepts JSON or save the output to a file on the local system. Then, import the output into an application that you can use to sort, search, and manipulate the data. Due to multilingual and emoji support, the response may contain text offsets. See [how to process offsets](../concepts/text-offsets.md) for more information.
159159

160160
#### [Version 3.0-preview](#tab/version-3)
161161

articles/cognitive-services/text-analytics/toc.yml

Lines changed: 2 additions & 1 deletion
@@ -52,7 +52,8 @@
5252
href: named-entity-types.md
5353
- name: Language and region support
5454
href: language-support.md
55-
55+
- name: Text offsets
56+
href: concepts/text-offsets.md
5657
- name: How-to guides
5758
items:
5859
- name: Call the Text Analytics API
