Commit 3c5d20d

Merge pull request #234692 from MicrosoftDocs/release-cogsvcs-custom-health
Release cogsvcs custom health--scheduled release at 10AM of 4/18
2 parents 385742e + 463aab7 commit 3c5d20d

98 files changed

+4325
-91
lines changed

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
---
title: Custom Text Analytics for health data formats
titleSuffix: Azure Cognitive Services
description: Learn about the data formats accepted by custom text analytics for health.
services: cognitive-services
author: aahill
manager: nitinme
ms.service: cognitive-services
ms.subservice: language-service
ms.topic: conceptual
ms.date: 04/14/2023
ms.author: aahi
ms.custom: language-service-custom-ta4h
---

# Accepted data formats in custom text analytics for health

Use this article to learn how to format your data so that it can be imported into custom text analytics for health.

If you are trying to [import your data](../how-to/create-project.md#import-project) into custom Text Analytics for health, it has to follow a specific format. If you don't have data to import, you can [create your project](../how-to/create-project.md) and use Language Studio to [label your documents](../how-to/label-data.md).

Your labels file must use the following `json` format when you import your labels into a project.

```json
{
    "projectFileVersion": "{API-VERSION}",
    "stringIndexType": "Utf16CodeUnit",
    "metadata": {
        "projectName": "{PROJECT-NAME}",
        "projectKind": "CustomHealthcare",
        "description": "Trying out custom Text Analytics for health",
        "language": "{LANGUAGE-CODE}",
        "multilingual": true,
        "storageInputContainerName": "{CONTAINER-NAME}",
        "settings": {}
    },
    "assets": {
        "projectKind": "CustomHealthcare",
        "entities": [
            {
                "category": "Entity1",
                "compositionSetting": "{COMPOSITION-SETTING}",
                "list": {
                    "sublists": [
                        {
                            "listKey": "One",
                            "synonyms": [
                                {
                                    "language": "en",
                                    "values": [
                                        "EntityNumberOne",
                                        "FirstEntity"
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            {
                "category": "Entity2"
            },
            {
                "category": "MedicationName",
                "list": {
                    "sublists": [
                        {
                            "listKey": "research drugs",
                            "synonyms": [
                                {
                                    "language": "en",
                                    "values": [
                                        "rdrug a",
                                        "rdrug b"
                                    ]
                                }
                            ]
                        }
                    ]
                },
                "prebuilts": "MedicationName"
            }
        ],
        "documents": [
            {
                "location": "{DOCUMENT-NAME}",
                "language": "{LANGUAGE-CODE}",
                "dataset": "{DATASET}",
                "entities": [
                    {
                        "regionOffset": 0,
                        "regionLength": 500,
                        "labels": [
                            {
                                "category": "Entity1",
                                "offset": 25,
                                "length": 10
                            },
                            {
                                "category": "Entity2",
                                "offset": 120,
                                "length": 8
                            }
                        ]
                    }
                ]
            },
            {
                "location": "{DOCUMENT-NAME}",
                "language": "{LANGUAGE-CODE}",
                "dataset": "{DATASET}",
                "entities": [
                    {
                        "regionOffset": 0,
                        "regionLength": 100,
                        "labels": [
                            {
                                "category": "Entity2",
                                "offset": 20,
                                "length": 5
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
```
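
Before importing, it can help to sanity-check a labels file locally. The following Python sketch is not part of the service or any SDK; the function name `check_labels_file` and the trimmed sample are made up for illustration. It only checks two things the format implies: every labeled `category` is declared under `assets.entities`, and `dataset`, when present, is `Train` or `Test`.

```python
import json


def check_labels_file(project: dict) -> list[str]:
    """Return a list of human-readable problems found in a labels file."""
    problems = []
    assets = project.get("assets", {})
    # Entity types declared for the project.
    declared = {entity["category"] for entity in assets.get("entities", [])}

    for doc in assets.get("documents", []):
        name = doc.get("location", "<missing location>")
        dataset = doc.get("dataset")
        # Assumption: dataset may be omitted; flag it only when it has a bad value.
        if dataset is not None and dataset not in ("Train", "Test"):
            problems.append(f"{name}: dataset must be 'Train' or 'Test'")
        for region in doc.get("entities", []):
            for label in region.get("labels", []):
                if label["category"] not in declared:
                    problems.append(
                        f"{name}: label uses undeclared category {label['category']!r}"
                    )
    return problems


# A trimmed, hypothetical labels file used only to exercise the check.
sample = json.loads("""
{
  "assets": {
    "entities": [{"category": "Entity1"}, {"category": "Entity2"}],
    "documents": [
      {
        "location": "doc1.txt",
        "dataset": "Train",
        "entities": [
          {"regionOffset": 0, "regionLength": 500,
           "labels": [{"category": "Entity1", "offset": 25, "length": 10}]}
        ]
      }
    ]
  }
}
""")

print(check_labels_file(sample))
```

An empty list means the sketch found nothing to report; it does not guarantee that the import will succeed.
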

|Key |Placeholder |Value | Example |
|---------|---------|----------|--|
| `multilingual` | `true`| A boolean value that enables you to have documents in multiple languages in your dataset. When your model is deployed, you can query it in any supported language, including languages not present in your training documents. See [language support](../language-support.md#) to learn more about multilingual support. | `true`|
|`projectName`|`{PROJECT-NAME}`|Project name|`myproject`|
| `storageInputContainerName` |`{CONTAINER-NAME}`|Container name|`mycontainer`|
| `entities` | | Array containing all the entity types you have in the project. These are the entity types that will be extracted from your documents.| |
| `category` | | The name of the entity type, which can be user defined for new entity definitions, or predefined for prebuilt entities. For more information, see the entity naming rules below.| |
|`compositionSetting`|`{COMPOSITION-SETTING}`|Rule that defines how to manage multiple components in your entity. Options are `combineComponents` or `separateComponents`. |`combineComponents`|
| `list` | | Array containing all the sublists you have in the project for a specific entity. Lists can be added to prebuilt entities or new entities with learned components.| |
|`sublists`|`[]`|Array containing sublists. Each sublist is a key and its associated values.|`[]`|
| `listKey`| `One` | A normalized value for the list of synonyms to map back to in prediction. | `One` |
|`synonyms`|`[]`|Array containing all the synonyms|synonym|
| `language` | `{LANGUAGE-CODE}` | A string specifying the language code for the synonym in your sublist. If your project is a multilingual project and you want to support your list of synonyms for all the languages in your project, you have to explicitly add your synonyms to each language. See [Language support](../language-support.md) for more information about supported language codes. |`en`|
| `values`| `"EntityNumberOne"`, `"FirstEntity"` | A list of comma-separated strings that will be matched exactly for extraction and map to the list key. | `"EntityNumberOne"`, `"FirstEntity"` |
| `prebuilts` | `MedicationName` | The name of the prebuilt component populating the prebuilt entity. [Prebuilt entities](../../text-analytics-for-health/concepts/health-entity-categories.md) are automatically loaded into your project by default but you can extend them with list components in your labels file. | `MedicationName` |
| `documents` | | Array containing all the documents in your project and the list of entities labeled within each document. | `[]` |
| `location` | `{DOCUMENT-NAME}` | The location of the document in the storage container. Since all the documents are in the root of the container, this should be the document name.|`doc1.txt`|
| `dataset` | `{DATASET}` | The dataset that this file is assigned to when the data is split before training. Learn more about data splitting [here](../how-to/train-model.md#data-splitting). Possible values for this field are `Train` and `Test`. |`Train`|
| `regionOffset` | | The inclusive character position of the start of the text. |`0`|
| `regionLength` | | The length of the bounding box in terms of UTF-16 code units. Training only considers the data in this region. |`500`|
| `category` | | The type of entity associated with the span of text specified. | `Entity1`|
| `offset` | | The start position for the entity text. | `25`|
| `length` | | The length of the entity in terms of UTF-16 code units. | `10`|
| `language` | `{LANGUAGE-CODE}` | A string specifying the language code for the document used in your project. If your project is a multilingual project, choose the language code of the majority of the documents. See [Language support](../language-support.md) for more information about supported language codes. |`en`|
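
Because the file declares `"stringIndexType": "Utf16CodeUnit"`, label positions are counted differently from Python string indexes whenever the text contains characters outside the Basic Multilingual Plane. The sketch below shows one way to compute `offset` and `length` for a label. The helper names `utf16_len` and `make_label` are hypothetical, and it assumes that `offset` is counted in UTF-16 code units from the start of the document.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def make_label(document: str, span: str, category: str) -> dict:
    """Build a label entry for the first occurrence of span in document."""
    start = document.index(span)  # raises ValueError if span is absent
    return {
        "category": category,
        # Convert the Python index into a UTF-16 code unit offset.
        "offset": utf16_len(document[:start]),
        "length": utf16_len(span),
    }


# The emoji occupies one Python character but two UTF-16 code units.
doc = "Patient \U0001F637 was given rdrug a today."
label = make_label(doc, "rdrug a", "MedicationName")
print(doc.index("rdrug a"), label)
```

In this example the Python index of the span is 20 while the UTF-16 offset is 21, which is the kind of drift that misaligns labels if plain string indexes are written to the file.
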

## Entity naming rules

1. [Prebuilt entity names](../../text-analytics-for-health/concepts/health-entity-categories.md) are predefined. Each prebuilt entity must be populated with a prebuilt component, and the component name must match the entity name.
2. New user-defined entities (entities with learned components or labeled text) can't use prebuilt entity names.
3. New user-defined entities can't be populated with prebuilt components, because prebuilt components must match their associated entity names and must have no labeled data assigned to them in the documents array.
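
The three rules above can be expressed as a small check over the `entities` and `documents` arrays. This is an illustrative sketch only: `check_entity_naming` is a made-up helper, and `PREBUILT_NAMES` is a tiny hypothetical subset standing in for the full list of Text Analytics for health entity categories.

```python
# Hypothetical subset; the real list is the set of Text Analytics for health categories.
PREBUILT_NAMES = {"MedicationName", "Dosage"}


def check_entity_naming(entities: list[dict], documents: list[dict]) -> list[str]:
    """Return a list of naming-rule violations."""
    problems = []
    # Every category that has labeled data in the documents array.
    labeled = {
        label["category"]
        for doc in documents
        for region in doc.get("entities", [])
        for label in region.get("labels", [])
    }
    for entity in entities:
        name = entity["category"]
        prebuilt = entity.get("prebuilts")
        if name in PREBUILT_NAMES:
            # Rules 1 and 2: a prebuilt name needs the matching prebuilt component.
            if prebuilt != name:
                problems.append(f"{name}: must be populated with prebuilt {name!r}")
            # Rule 3: prebuilt components have no labeled data.
            if name in labeled:
                problems.append(f"{name}: prebuilt entities can't have labels")
        elif prebuilt is not None:
            # Rule 3: user-defined entities can't use prebuilt components.
            problems.append(f"{name}: user-defined entities can't use prebuilts")
    return problems


entities = [
    {"category": "Entity1"},
    {"category": "MedicationName", "prebuilts": "MedicationName"},
]
documents = [
    {"entities": [{"labels": [{"category": "Entity1", "offset": 25, "length": 10}]}]}
]
print(check_entity_naming(entities, documents))
```
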

## Next steps

* You can import your labeled data into your project directly. Learn how to [import a project](../how-to/create-project.md#import-project).
* See the [how-to article](../how-to/label-data.md) for more information about labeling your data.
* When you're done labeling your data, you can [train your model](../how-to/train-model.md).
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
---
title: Entity components in custom Text Analytics for health
titleSuffix: Azure Cognitive Services
description: Learn how custom Text Analytics for health extracts entities from text
services: cognitive-services
author: aahill
manager: nitinme
ms.service: cognitive-services
ms.subservice: language-service
ms.topic: conceptual
ms.date: 04/14/2023
ms.author: aahi
ms.custom: language-service-custom-ta4h
---

# Entity components in custom text analytics for health

In custom Text Analytics for health, entities are relevant pieces of information that are extracted from your unstructured input text. An entity can be extracted by different methods: it can be learned through context, matched from a list, or detected by a prebuilt recognized entity. Every entity in your project is composed of one or more of these methods, which are defined as your entity's components. When an entity is defined by more than one component, their predictions can overlap. You can determine the behavior of an entity prediction when its components overlap by using a fixed set of options in the **Entity options**.

## Component types

An entity component determines a way you can extract the entity. An entity can contain one component, which is then the only method used to extract the entity, or multiple components to expand the ways in which the entity is defined and extracted.

The [Text Analytics for health entities](../../text-analytics-for-health/concepts/health-entity-categories.md) are automatically loaded into your project as entities with prebuilt components. You can define list components for entities with prebuilt components but you can't add learned components. Similarly, you can create new entities with learned and list components, but you can't populate them with additional prebuilt components.

### Learned component

The learned component uses the entity tags you label your text with to train a machine learned model. The model learns to predict where the entity is, based on the context within the text. Your labels provide examples of where the entity is expected to be present in text, based on the meaning of the words around it and the words that were labeled. This component is only defined if you add labels to your data for the entity. If you do not label any data, the entity will not have a learned component.

The Text Analytics for health entities, which by default have prebuilt components, can't be extended with learned components, meaning they do not require or accept further labeling to function.

:::image type="content" source="../media/learned-component.png" alt-text="A screenshot showing an example of learned components for entities." lightbox="../media/learned-component.png":::

### List component

The list component represents a fixed, closed set of related words along with their synonyms. The component performs an exact text match against the list of values you provide as synonyms. Each synonym belongs to a "list key", which can be used as the normalized, standard value for the synonym that will return in the output if the list component is matched. List keys are **not** used for matching.

In multilingual projects, you can specify a different set of synonyms for each language. While using the prediction API, you can specify the language in the input request, which will only match the synonyms associated with that language.
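
To make the matching behavior concrete, here is a rough Python sketch of how a list component could be emulated locally. It is not the service's implementation: `match_list_component` is a hypothetical helper, it assumes a case-sensitive substring match, and the offsets it reports are plain Python string indexes. The sublist shape follows the labels file format.

```python
def match_list_component(text: str, sublists: list[dict], language: str) -> list[dict]:
    """Find exact occurrences of synonym values for one language."""
    matches = []
    for sublist in sublists:
        for synonyms in sublist["synonyms"]:
            # Only synonyms registered for the request language are considered.
            if synonyms["language"] != language:
                continue
            for value in synonyms["values"]:
                start = text.find(value)
                while start != -1:
                    matches.append(
                        {"text": value, "offset": start, "listKey": sublist["listKey"]}
                    )
                    start = text.find(value, start + 1)
    return sorted(matches, key=lambda m: m["offset"])


sublists = [
    {
        "listKey": "research drugs",
        "synonyms": [{"language": "en", "values": ["rdrug a", "rdrug b"]}],
    }
]

print(match_list_component("Trial of rdrug a versus rdrug b.", sublists, "en"))
print(match_list_component("A study of research drugs.", sublists, "en"))
```

The second call returns an empty list, because the list key `research drugs` is only the normalized value that matches map back to and is never matched itself.
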

:::image type="content" source="../media/list-component.png" alt-text="A screenshot showing an example of list components for entities." lightbox="../media/list-component.png":::

### Prebuilt component

The [Text Analytics for health entities](../../text-analytics-for-health/concepts/health-entity-categories.md) are automatically loaded into your project as entities with prebuilt components. You can define list components for entities with prebuilt components but you cannot add learned components. Similarly, you can create new entities with learned and list components, but you cannot populate them with additional prebuilt components. Entities with prebuilt components are pretrained and can extract information relating to their categories without any labels.

:::image type="content" source="../media/prebuilt-component.png" alt-text="A screenshot showing an example of prebuilt components for entities." lightbox="../media/prebuilt-component.png":::

## Entity options

When multiple components are defined for an entity, their predictions may overlap. When an overlap occurs, each entity's final prediction is determined by one of the following options.

### Combine components

Combine components as one entity when they overlap by taking the union of all the components.

Use this option to combine all components when they overlap. When components are combined, you get all the extra information that’s tied to a list or prebuilt component when it is present.

#### Example

Suppose you have an entity called Software that has a list component, which contains “Proseware OS” as an entry. In your input data, you have “I want to buy Proseware OS 9” with “Proseware OS 9” tagged as Software:

:::image type="content" source="../media/union-overlap-example-1.svg" alt-text="A screenshot showing a learned and list entity overlapped." lightbox="../media/union-overlap-example-1.svg":::

By using combine components, the entity will return with the full context as “Proseware OS 9” along with the key from the list component:

:::image type="content" source="../media/union-overlap-example-1-part-2.svg" alt-text="A screenshot showing the result of a combined component." lightbox="../media/union-overlap-example-1-part-2.svg":::

Suppose you had the same utterance but only “OS 9” was predicted by the learned component:

:::image type="content" source="../media/union-overlap-example-2.svg" alt-text="A screenshot showing an utterance with O S 9 predicted by the learned component." lightbox="../media/union-overlap-example-2.svg":::

With combine components, the entity will still return as “Proseware OS 9” with the key from the list component:

:::image type="content" source="../media/union-overlap-example-2-part-2.svg" alt-text="A screenshot showing the returned software entity." lightbox="../media/union-overlap-example-2-part-2.svg":::
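
The union behavior in the second example can be sketched as a span merge. The code below is only an illustration of the idea, not the service's algorithm: `combine_components` is a hypothetical helper, predictions are modeled as dictionaries with `offset`, `length`, and an optional `listKey`, and using `"Proseware OS"` as the list key is an assumption.

```python
def combine_components(predictions: list[dict]) -> list[dict]:
    """Merge overlapping spans into one span, keeping any list key."""
    merged = []
    for pred in sorted(predictions, key=lambda p: p["offset"]):
        end = pred["offset"] + pred["length"]
        if merged and pred["offset"] < merged[-1]["end"]:
            # Overlap: extend the previous span to cover the union.
            last = merged[-1]
            last["end"] = max(last["end"], end)
            last["listKey"] = last["listKey"] or pred.get("listKey")
        else:
            merged.append(
                {"offset": pred["offset"], "end": end, "listKey": pred.get("listKey")}
            )
    return [
        {"offset": m["offset"], "length": m["end"] - m["offset"], "listKey": m["listKey"]}
        for m in merged
    ]


text = "I want to buy Proseware OS 9"
predictions = [
    # List component match for the entry "Proseware OS".
    {"offset": text.index("Proseware OS"), "length": len("Proseware OS"),
     "listKey": "Proseware OS"},
    # Learned component prediction covering only "OS 9".
    {"offset": text.index("OS 9"), "length": len("OS 9")},
]

combined = combine_components(predictions)
span = combined[0]
print(text[span["offset"]:span["offset"] + span["length"]], span["listKey"])
```

The merged span covers “Proseware OS 9” and keeps the list key, which mirrors the result shown in the screenshots.
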

### Don't combine components

Each overlapping component will return as a separate instance of the entity. Apply your own logic after prediction with this option.

#### Example

Suppose you have an entity called Software that has a list component, which contains “Proseware Desktop” as an entry. In your labeled data, you have “I want to buy Proseware Desktop Pro” with “Proseware Desktop Pro” labeled as Software:

:::image type="content" source="../media/separated-overlap-example-1.svg" alt-text="A screenshot showing an example of a learned and list entity overlapped." lightbox="../media/separated-overlap-example-1.svg":::

When you do not combine components, the entity will return twice:

:::image type="content" source="../media/separated-overlap-example-1-part-2.svg" alt-text="A screenshot showing the entity returned twice." lightbox="../media/separated-overlap-example-1-part-2.svg":::

## How to use components and options

Components give you the flexibility to define your entity in more than one way. When you combine components, you make sure that each component is represented and you reduce the number of entities returned in your predictions.

A common practice is to extend a prebuilt component with a list of values that the prebuilt might not support. For example, if you have a **Medication Name** entity, which has a `MedicationName` prebuilt component added to it, the entity may not predict all the medication names specific to your domain. You can use a list component to extend the values of the Medication Name entity and thereby extend the prebuilt component with your own medication names.

Other times you may be interested in extracting an entity through context, such as a **medical device**. You would label for the learned component of the medical device to learn _where_ a medical device is based on its position within the sentence. You may also have a list of medical devices that you already know beforehand and that you'd like to always extract. Combining both components in one entity allows you to get both options for the entity.

When you do not combine components, you allow every component to act as an independent entity extractor. One way of using this option is to separate the entities extracted from a list from the ones extracted through the learned or prebuilt components, so that you can handle and treat them differently.
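
As a sketch of the "apply your own logic" step, the following Python groups separately returned instances by the component that produced them. The prediction shape is invented for illustration: the `component` field and the helper `split_by_component` are assumptions, not fields or functions of the prediction API.

```python
from collections import defaultdict


def split_by_component(predictions: list[dict]) -> dict[str, list[dict]]:
    """Group entity instances by the component that produced them."""
    groups = defaultdict(list)
    for pred in predictions:
        groups[pred["component"]].append(pred)
    return dict(groups)


# Hypothetical output for "I want to buy Proseware Desktop Pro"
# when components are not combined: the entity is returned twice.
predictions = [
    {"category": "Software", "text": "Proseware Desktop", "component": "list"},
    {"category": "Software", "text": "Proseware Desktop Pro", "component": "learned"},
]

groups = split_by_component(predictions)
for component, items in groups.items():
    print(component, [item["text"] for item in items])
```

With the groups separated, list matches could, for example, be trusted as-is while learned predictions are routed to a review step.
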

## Next steps

* [Entities with prebuilt components](../../text-analytics-for-health/concepts/health-entity-categories.md)
