
Commit fd69291

authored
Merge pull request #194091 from aahill/cner-update
Custom NER update
2 parents 1882512 + 76f0122 commit fd69291

File tree

12 files changed

+75
-39
lines changed

articles/cognitive-services/language-service/custom-classification/includes/quickstarts/rest-api.md

Lines changed: 7 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -129,9 +129,9 @@ For the documents key:
129129

130130
|Key |Value | Example |
131131
|---------|---------|---------|
132-
| `location ` | Document name on the blob store. | doc1.txt |
133-
|`language` | The language of the document. | en-us |
134-
|`dataset` | Optional field to specify the dataset which this document will belong to. | Train or Test |
132+
| `location` | Document name on the blob store. | `doc2.txt` |
133+
|`language` | The language of the document. | `en-us` |
134+
|`dataset` | Optional field to specify the dataset which this document will belong to. | `Train` or `Test` |
135135

136136
This request will return an error if:
137137

@@ -181,11 +181,11 @@ Use the following JSON in your request. The model will be named `MyModel` once t
181181
|Key |Value | Example |
182182
|---------|---------|---------|
183183
|`modelLabel ` | Your Model name. | MyModel |
184-
|`runValidation` | Boolean value to run validation on the test set. | True or False |
185-
|`evaluationOptions` | Specifies evaluation options. | -- |
184+
|`runValidation` | Boolean value to run validation on the test set. | `True` or `False` |
185+
|`evaluationOptions` | Specifies evaluation options. | |
186186
|`type` | Specifies datasplit type. | set or percentage |
187-
|`testingSplitPercentage` | Required integer field if `type` is *percentage*. Specifies testing split. | 30 |
188-
|`trainingSplitPercentage` | Required integer field if `type` is *percentage*. Specifies training split. | 70 |
187+
|`testingSplitPercentage` | Required integer field if `type` is *percentage*. Specifies testing split. | `30` |
188+
|`trainingSplitPercentage` | Required integer field if `type` is *percentage*. Specifies training split. | `70` |
189189

190190
Once you send your API request, you will receive a `202` response indicating success. In the response headers, extract the `location` value. It will be formatted like this:
191191
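As a rough illustration of the request body described in the table above, the following Python sketch assembles the JSON with the new `evaluationOptions` block. The helper name `build_training_body` is hypothetical, and the check that the two percentages add up to 100 is an assumption made here for illustration; it is not stated in this diff. The percentages are emitted as integers, following the table's "Required integer field" wording.

```python
import json


def build_training_body(model_label, run_validation=True,
                        testing_pct=None, training_pct=None):
    """Hypothetical helper: assemble the training request body.

    The key names come from the table in the documentation; everything
    else (function name, validation rules) is an illustrative assumption.
    """
    body = {"modelLabel": model_label, "runValidation": run_validation}

    if testing_pct is not None or training_pct is not None:
        # The table marks both fields as required when type is "percentage".
        if testing_pct is None or training_pct is None:
            raise ValueError("both split percentages are required")
        # Assumption: the two percentages should cover the whole dataset.
        if testing_pct + training_pct != 100:
            raise ValueError("split percentages should add up to 100")
        body["evaluationOptions"] = {
            "type": "percentage",
            "testingSplitPercentage": testing_pct,
            "trainingSplitPercentage": training_pct,
        }
    return body


if __name__ == "__main__":
    print(json.dumps(build_training_body("MyModel", True, 30, 70), indent=2))
```

Omitting both percentages leaves `evaluationOptions` out of the body entirely, which mirrors the request shape shown before this change.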

articles/cognitive-services/language-service/custom-named-entity-recognition/faq.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ manager: nitinme
88
ms.service: cognitive-services
99
ms.subservice: language-service
1010
ms.topic: conceptual
11-
ms.date: 11/16/2021
11+
ms.date: 04/05/2022
1212
ms.author: aahi
1313
ms.custom: language-service-custom-ner, ignite-fall-2021
1414
---
@@ -44,7 +44,7 @@ When you're ready to start [using your model to make predictions](#how-do-i-use-
4444

4545
## What is the recommended CI/CD process?
4646

47-
You can train multiple models on the same dataset within the same project. After you have trained your model successfully, you can [view its evaluation](how-to/view-model-evaluation.md). You can [deploy and test](quickstart.md#deploy-your-model) your model within [Language studio](https://aka.ms/languageStudio). You can add or remove tags from your data and train a **new** model and test it as well. View [service limits](service-limits.md)to learn about maximum number of trained models with the same project. When you train a new model your dataset is [split](how-to/train-model.md#data-split) randomly into training and testing sets, so there is no guarantee that the reflected model evaluation is about the same test set, and the results are not comparable. It's recommended that you develop your own test set and use it to evaluate both models so you can measure improvement.
47+
You can train multiple models on the same dataset within the same project. After you have trained your model successfully, you can [view its evaluation](how-to/view-model-evaluation.md). You can [deploy and test](quickstart.md#deploy-your-model) your model within [Language studio](https://aka.ms/languageStudio). You can add or remove tags from your data, train a **new** model, and test it as well. View [service limits](service-limits.md) to learn about the maximum number of trained models within the same project. When you [train your model](how-to/train-model.md), you can determine how your dataset is split into training and testing sets. You can also have your data split randomly into training and testing sets, in which case there is no guarantee that each model is evaluated on the same test set, so the results are not comparable. It's recommended that you develop your own test set and use it to evaluate both models so you can measure improvement.
4848

4949
## Does a low or high model score guarantee bad or good performance in production?
5050

@@ -67,7 +67,7 @@ See the [data selection and schema design](how-to/design-schema.md) article for
6767

6868
## Why do I get different results when I retrain my model?
6969

70-
* When you train a new model your dataset is [split](how-to/train-model.md#data-split) randomly into train and test sets so there is no guarantee that the reflected model evaluation is on the same test set, so results are not comparable.
70+
* When you [train your model](how-to/train-model.md), you can choose whether your data is split randomly into train and test sets. If you do, there is no guarantee that the reflected model evaluation is on the same test set, so results are not comparable.
7171

7272
* If you're retraining the same model, your test set will be the same but you might notice a slight change in predictions made by the model. This is because the trained model is not robust enough and this is a factor of how representative and distinct your data is and the quality of your tagged data.
7373

articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/improve-model.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ manager: nitinme
88
ms.service: cognitive-services
99
ms.subservice: language-service
1010
ms.topic: how-to
11-
ms.date: 11/02/2021
11+
ms.date: 04/05/2022
1212
ms.author: aahi
1313
ms.custom: language-service-custom-ner, ignite-fall-2021
1414
---
@@ -34,7 +34,7 @@ See the [application development lifecycle](../overview.md#application-developme
3434
After you have reviewed your [model's evaluation](view-model-evaluation.md), you'll have formed an idea on what's wrong with your model's prediction.
3535

3636
> [!NOTE]
37-
> This guide focuses on data from the [validation set](train-model.md#data-split) that was created during training.
37+
> This guide focuses on data from the [validation set](train-model.md) that was created during training.
3838
3939
### Review test set
4040

articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/tag-data.md

Lines changed: 22 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ manager: nitinme
88
ms.service: cognitive-services
99
ms.subservice: language-service
1010
ms.topic: how-to
11-
ms.date: 11/02/2021
11+
ms.date: 04/05/2022
1212
ms.author: aahi
1313
ms.custom: language-service-custom-ner, ignite-fall-2021
1414
---
@@ -44,12 +44,29 @@ The precision, consistency and completeness of your tagged data are key factors
4444

4545
3. You can find a list of all `.txt` files available in your projects to the left. You can select the file you want to start tagging or you can use the **Back** and **Next** button from the bottom of the page to navigate.
4646

47-
4. To start tagging, click **Add entities** in the top-right corner. You can either view all files or only tagged files by changing the view from the **Viewing** drop down.
47+
4. To start tagging, click **Add entities** in the top-right corner. You can either view all files or only tagged files by changing the view from the **Viewing** drop down filter.
48+
49+
:::image type="content" source="../media/tagging-screen.png" alt-text="A screenshot showing the Language Studio screen for tagging data." lightbox="../media/tagging-screen.png":::
50+
51+
In the image above:
52+
53+
* *Section 1*: is where the content of the text file is displayed and tagging takes place. You have [two options for tagging](#tagging-options) your files.
54+
55+
* *Section 2*: includes your project's entities and distribution across your files and tags.
56+
If you click **Distribution**, you can view your tag distribution across:
57+
58+
* Files: View the distribution of files across one single entity.
59+
* Tags: view the distribution of tags across all files.
60+
61+
:::image type="content" source="../media/distribution-ner.png" alt-text="A screenshot showing the distribution section." lightbox="../media/distribution-ner.png":::
62+
63+
64+
* *Section 3*: This is the split project data toggle. You can choose to add a selected text file to your training set or the testing set. By default, the toggle is off, and all text files are added to your training set.
65+
66+
To add a text file to a training or testing set, simply choose from the radio buttons to which set it belongs.
4867

4968
>[!TIP]
50-
> * There is no standard number of tags you will need, Consider starting with 50 tags per entity. The number of tags you'll need depends on how distinct your entities are, and how easily they can be differentiated from each other. It also depends on your tagging, which should be consistent and complete.
51-
52-
:::image type="content" source="../media/tagging-screen.png" alt-text="A screenshot showing the Language Studio screen for tagging data." lightbox="../media/tagging-screen.png":::
69+
>It is recommended to define your testing set.
5370
5471
If you enabled multiple languages for your project, you will find a **Language** dropdown, which lets you select the language of each document.
5572

articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/train-model.md

Lines changed: 8 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -27,14 +27,6 @@ The time to train a model varies on the dataset, and may take up to several hour
2727

2828
See the [application development lifecycle](../overview.md#application-development-lifecycle) for more information.
2929

30-
## Data split
31-
32-
Before starting the training process, files in your dataset are divided into two groups at random:
33-
34-
* The **training set** contains 80% of the files in your dataset. It is the main set that is used to train the model.
35-
36-
* The **test set** contains 20% of the files available in your dataset. This set is used to provide an unbiased [evaluation](../how-to/view-model-evaluation.md) of the model. This set is not introduced to the model during training.
37-
3830
## Train model in Language studio
3931

4032
1. Go to your project page in [Language Studio](https://aka.ms/LanguageStudio).
@@ -45,11 +37,17 @@ Before starting the training process, files in your dataset are divided into two
4537

4638
4. To train a new model, select **Train a new model** and type in the model name in the text box below. You can **overwrite an existing model** by selecting this option and select the model you want from the dropdown below.
4739

48-
:::image type="content" source="../media/train-model.png" alt-text="Create a new model" lightbox="../media/train-model.png":::
40+
:::image type="content" source="../media/train-model.png" alt-text="Create a new training job" lightbox="../media/train-model.png":::
41+
42+
If you have enabled [your project data to be split manually](tag-data.md) when you were tagging your data, you will see two training options:
43+
44+
* **Automatic split the testing**: The data will be randomly split for each class between training and testing sets, according to the percentages you choose. The default value is 80% for training and 20% for testing. To change these values, choose which set you want to change and write the new value.
45+
* **Use a manual split**: Assign each document to either the training or testing set. This requires first adding files to the test dataset.
46+
4947

5048
5. Click on the **Train** button.
5149

52-
6. You can check the status of the training job in the same page. Only successfully completed tasks will generate models.
50+
6. You can check the status of the training job in the same page. Only successfully completed training jobs will generate models.
5351

5452
You can only have one training job running at a time. You cannot create or start other tasks in the same project.
5553
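The automatic split option described in this file (a random split for each class, 80% training and 20% testing by default) can be sketched roughly as follows. This is only an illustration of a per-class random split under stated assumptions: the function name `split_per_class`, the input shape (a dictionary mapping each class to its files), and the rounding rule are all hypothetical, and the service's actual splitting logic is not documented in this diff.

```python
import random


def split_per_class(files_by_class, training_pct=80, seed=None):
    """Hypothetical sketch: randomly split each class's files into
    training and testing sets according to training_pct."""
    rng = random.Random(seed)
    train, test = [], []
    for label, files in files_by_class.items():
        shuffled = list(files)          # copy so the input is not mutated
        rng.shuffle(shuffled)
        # Assumption: round to the nearest whole file per class.
        cut = round(len(shuffled) * training_pct / 100)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


if __name__ == "__main__":
    data = {
        "ClassA": [f"a{i}.txt" for i in range(10)],
        "ClassB": [f"b{i}.txt" for i in range(10)],
    }
    train_files, test_files = split_per_class(data, training_pct=80, seed=0)
    print(len(train_files), len(test_files))
```

Because the shuffle differs between runs unless a seed is fixed, two models trained with an automatic split are generally evaluated on different test sets, which is why the FAQ recommends defining your own test set when comparing models.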

articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/view-model-evaluation.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -8,15 +8,15 @@ manager: nitinme
88
ms.service: cognitive-services
99
ms.subservice: language-service
1010
ms.topic: how-to
11-
ms.date: 11/02/2021
11+
ms.date: 04/05/2022
1212
ms.author: aahi
1313
ms.custom: language-service-custom-ner, ignite-fall-2021
1414
---
1515

1616

1717
# View the model's evaluation and details
1818

19-
After your model has finished training, you can view the model details and see how well does it perform against the test set, which contains 10% of your data at random, which is created during [training](train-model.md#data-split). The test set consists of data that was not introduced to the model during the training process. For the evaluation process to complete there must be at least 10 files in your dataset. You must also have a [custom NER project](../quickstart.md) with a [trained model](train-model.md).
19+
After your model has finished training, you can view the model details and see how well it performs against the test set, which is created during [training](train-model.md) and contains 10% of your data, chosen at random. The test set consists of data that was not introduced to the model during the training process. For the evaluation process to complete, there must be at least 10 files in your dataset. You must also have a [custom NER project](../quickstart.md) with a [trained model](train-model.md).
2020

2121
## Prerequisites
2222

@@ -33,7 +33,7 @@ See the [application development lifecycle](../overview.md#application-developme
3333

3434
2. Select **View model details** from the menu on the left side of the screen.
3535

36-
3. In this page you can only view the sucessfuly trained models. You can click on the model name for more details.
36+
3. In this page you can only view the successfully trained models. You can click on the model name for more details.
3737

3838
4. You can find the **model-level** evaluation metrics under **Overview**, and the **entity-level** evaluation metrics under **Entity performance metrics**. The confusion matrix for the model is located under **Test set confusion matrix**
3939

articles/cognitive-services/language-service/custom-named-entity-recognition/includes/quickstarts/rest-api.md

Lines changed: 23 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -107,6 +107,7 @@ Use the following JSON in your request. Replace the placeholder values below wit
107107
{
108108
"location": "doc1.txt",
109109
"language": "en-us",
110+
"dataset": "Train",
110111
"extractors": [
111112
{
112113
"regionOffset": 0,
@@ -129,6 +130,7 @@ Use the following JSON in your request. Replace the placeholder values below wit
129130
{
130131
"location": "doc2.txt",
131132
"language": "en-us",
133+
"dataset": "Test",
132134
"extractors": [
133135
{
134136
"regionOffset": 0,
@@ -154,6 +156,15 @@ For the metadata key:
154156
| `modelType ` | Your Model type. | Extraction |
155157
|`storageInputContainerName` | The name of your Azure blob storage container. | `myContainer` |
156158

159+
For the documents key:
160+
161+
|Key |Value | Example |
162+
|---------|---------|---------|
163+
| `location` | Document name on the blob store. | `doc2.txt` |
164+
|`language` | The language of the document. | `en-us` |
165+
|`dataset` | Optional field to specify the dataset which this document will belong to. | `Train` or `Test` |
166+
167+
157168
This request will return an error if:
158169

159170
* The selected resource doesn't have proper permission for the storage account.
@@ -191,14 +202,24 @@ Use the following JSON in your request. The model will be named `MyModel` once t
191202
```json
192203
{
193204
"modelLabel": "MyModel",
194-
"runValidation": true
205+
"runValidation": true,
206+
"evaluationOptions":
207+
{
208+
"type":"percentage",
209+
"testingSplitPercentage":"30",
210+
"trainingSplitPercentage":"70"
211+
}
195212
}
196213
```
197214

198215
|Key |Value | Example |
199216
|---------|---------|---------|
200217
|`modelLabel ` | Your Model name. | MyModel |
201-
|`runValidation` | Boolean value to run validation on the test set. | True |
218+
|`runValidation` | Boolean value to run validation on the test set. | `True` or `False` |
219+
|`evaluationOptions` | Specifies evaluation options. | |
220+
|`type` | Specifies datasplit type. | set or percentage |
221+
|`testingSplitPercentage` | Required integer field if `type` is *percentage*. Specifies testing split. | `30` |
222+
|`trainingSplitPercentage` | Required integer field if `type` is *percentage*. Specifies training split. | `70` |
202223

203224
Once you send your API request, you’ll receive a `202` response indicating success. In the response headers, extract the `location` value. It will be formatted like this:
204225
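The optional `dataset` field that this file adds to each entry under the documents key can be illustrated with a small Python sketch. The helper name `make_document_entry` is hypothetical, and rejecting values other than `Train` or `Test` is an assumption based only on the example column of the table; the `extractors` array from the full request is left out for brevity.

```python
import json


def make_document_entry(location, language="en-us", dataset=None):
    """Hypothetical helper: build one entry for the documents key.

    dataset is optional; when given, it assigns the document to the
    training or testing set.
    """
    entry = {"location": location, "language": language}
    if dataset is not None:
        # Assumption: only the two values shown in the table are accepted.
        if dataset not in ("Train", "Test"):
            raise ValueError("dataset should be 'Train' or 'Test'")
        entry["dataset"] = dataset
    return entry


if __name__ == "__main__":
    documents = [
        make_document_entry("doc1.txt", dataset="Train"),
        make_document_entry("doc2.txt", dataset="Test"),
    ]
    print(json.dumps(documents, indent=2))
```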

articles/cognitive-services/language-service/custom-named-entity-recognition/includes/train-model-language-studio.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ manager: nitinme
55
ms.service: cognitive-services
66
ms.subservice: language-service
77
ms.topic: include
8-
ms.date: 02/02/2022
8+
ms.date: 04/05/2022
99
ms.author: aahi
1010
---
1111

@@ -24,5 +24,5 @@ To start training your model:
2424
3. Click on the **Train** button at the bottom of the page.
2525

2626
> [!NOTE]
27-
> * While training, the data will be spilt into 2 sets: 80% for training and 20% for testing. See [how to train a model](../how-to/train-model.md#data-split) for more information.
27+
> * While training, the data will be split into two sets for training and testing the model. See [how to train a model](../how-to/train-model.md) for more information.
2828
> * Training can take up to a few hours.
