|
| 1 | +--- |
| 2 | +title: Text offsets in the Text Analytics API |
| 3 | +titleSuffix: Azure Cognitive Services |
| 4 | +description: Learn about offsets caused by multilingual and emoji encodings. |
| 5 | +services: cognitive-services |
| 6 | +author: aahill |
| 7 | +manager: nitinme |
| 8 | +ms.service: cognitive-services |
| 9 | +ms.subservice: text-analytics |
| 10 | +ms.topic: article |
| 11 | +ms.date: 03/09/2020 |
| 12 | +ms.author: aahi |
| 13 | +ms.reviewer: jdesousa |
| 14 | +--- |
| 15 | + |
| 16 | +# Text offsets in the Text Analytics API output |
| 17 | + |
| 18 | +Multilingual and emoji support has led to Unicode encodings that use more than one [code point](https://wikipedia.org/wiki/Code_point) to represent a single displayed character, called a grapheme. For example, emojis like 🌷 and 👍 may use several characters to compose the shape with additional characters for visual attributes, such as skin tone. Similarly, the Hindi word `अनुच्छेद` is encoded as five letters and three combining marks. |
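
As a rough illustration of the counts mentioned above, the following Python sketch inspects the Hindi word with the standard `unicodedata` module. The word is spelled out with escape sequences so that the code points are unambiguous; the variable names are illustrative only.

```python
import unicodedata

# The Hindi word from the paragraph above, written code point by code point.
word = "\u0905\u0928\u0941\u091a\u094d\u091b\u0947\u0926"

# General categories starting with "L" are letters;
# categories starting with "M" are combining marks.
categories = [unicodedata.category(ch) for ch in word]
letters = sum(1 for c in categories if c.startswith("L"))
marks = sum(1 for c in categories if c.startswith("M"))

print(len(word), letters, marks)  # 8 5 3
```

The word occupies eight code points, yet a reader sees fewer displayed characters, because each combining mark attaches to the letter before it.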
| 19 | + |
| 20 | +Because multilingual and emoji graphemes can be encoded with different numbers of code points, the Text Analytics API may return offsets in the response. |
| 21 | + |
| 22 | +## Offsets in the API response |
| 23 | + |
| 24 | +Whenever offsets are returned in the API response, such as for [Named Entity Recognition](../how-tos/text-analytics-how-to-entity-linking.md) or [Sentiment Analysis](../how-tos/text-analytics-how-to-sentiment-analysis.md), remember the following: |
| 25 | + |
| 26 | +* Elements in the response may be specific to the endpoint that was called. |
| 27 | +* HTTP POST/GET payloads are encoded in [UTF-8](https://www.w3schools.com/charsets/ref_html_utf8.asp), which may or may not be the default character encoding on your client-side compiler or operating system. |
| 28 | +* Offsets refer to grapheme counts based on the [Unicode 8.0.0](https://unicode.org/versions/Unicode8.0.0) standard, not character counts. |
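
To see why the unit of counting matters, the sketch below measures one short string in three different units. It is only an illustration: the sample text (a thumbs-up emoji followed by a skin tone modifier, typically rendered as a single symbol) and the variable names are assumptions, not part of the API.

```python
# Thumbs-up emoji followed by a skin tone modifier.
text = "\U0001F44D\U0001F3FD"

utf8_bytes = len(text.encode("utf-8"))            # bytes sent in a UTF-8 payload
utf16_units = len(text.encode("utf-16-le")) // 2  # UTF-16 code units, as counted by .NET string lengths
code_points = len(text)                           # Unicode code points, as counted by Python's len()

print(utf8_bytes, utf16_units, code_points)  # 8 4 2
```

None of these three numbers is a grapheme count, so an offset expressed in graphemes cannot be used directly as an index in any of these representations.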
| 29 | + |
| 30 | +## Extracting substrings from text with offsets |
| 31 | + |
| 32 | +Offsets can cause problems when you use character-based substring methods, for example the .NET [Substring()](https://docs.microsoft.com/dotnet/api/system.string.substring?view=netframework-4.8) method. One problem is that an offset may cause a substring method to end in the middle of a multi-character grapheme encoding instead of at its end. |
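
The pitfall can be reproduced in a few lines. This Python sketch slices the Hindi word `अनुच्छेद` by code points, as a character-based substring method would, and then checks whether the slice stopped just before a combining mark. The names used here are illustrative.

```python
import unicodedata

# The Hindi word, written code point by code point.
word = "\u0905\u0928\u0941\u091a\u094d\u091b\u0947\u0926"

# Naive slice: take "two characters" from the start, counting code points.
naive = word[0:2]

# The code point immediately after the slice is a combining mark that
# belongs to the last letter in the slice, so the slice ended mid-grapheme.
next_cp = word[2]
cut_mid_grapheme = unicodedata.category(next_cp).startswith("M")

print(len(naive), unicodedata.name(next_cp), cut_mid_grapheme)
```

The slice keeps the second letter but drops its vowel sign, so the extracted text no longer matches what is displayed in the original string.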
| 33 | + |
| 34 | +In .NET, consider using the [StringInfo](https://docs.microsoft.com/dotnet/api/system.globalization.stringinfo?view=netframework-4.8) class, which enables you to work with a string as a series of textual elements, rather than individual character objects. You can also look for grapheme splitter libraries in your preferred software environment. |
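
Outside .NET, the same idea can be sketched by hand. The Python example below is a deliberately simplified approximation, not an implementation of the Unicode text segmentation rules: it only attaches combining marks and emoji skin tone modifiers to the preceding code point. The helper names `split_graphemes` and `grapheme_substring` are hypothetical, and whether this approximation matches the service's offsets for every input is an assumption. For real workloads, a library that implements Unicode grapheme segmentation is the safer choice.

```python
import unicodedata

def split_graphemes(text):
    """Approximate grapheme split (hypothetical helper, not full Unicode segmentation):
    attach combining marks and emoji skin tone modifiers to the preceding code point."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        is_skin_tone = 0x1F3FB <= ord(ch) <= 0x1F3FF
        if clusters and (is_mark or is_skin_tone):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def grapheme_substring(text, offset, length):
    """Return `length` graphemes starting at grapheme index `offset`."""
    return "".join(split_graphemes(text)[offset:offset + length])

word = "\u0905\u0928\u0941\u091a\u094d\u091b\u0947\u0926"
print(len(split_graphemes(word)))      # 5
print(grapheme_substring(word, 1, 2))  # second and third graphemes, marks included
```

Because the slice is taken over whole clusters, a letter is never separated from its combining marks, which is the behavior that `StringInfo` provides in .NET.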
| 35 | + |
| 36 | +The Text Analytics API returns these textual elements as well, for convenience. |
| 37 | + |
| 38 | +## See also |
| 39 | + |
| 40 | +* [Text Analytics overview](../overview.md) |
| 41 | +* [Sentiment analysis](../how-tos/text-analytics-how-to-sentiment-analysis.md) |
| 42 | +* [Entity recognition](../how-tos/text-analytics-how-to-entity-linking.md) |
| 43 | +* [Key phrase extraction](../how-tos/text-analytics-how-to-keyword-extraction.md) |
| 44 | +* [Language detection](../how-tos/text-analytics-how-to-language-detection.md) |