Merge pull request #194091 from aahill/cner-update

hmacgregor1 · web-flow · commit fd6929148e99 · 2022-04-05T20:41:15.000-07:00
Custom NER update
diff --git a/articles/cognitive-services/language-service/custom-classification/includes/quickstarts/rest-api.md b/articles/cognitive-services/language-service/custom-classification/includes/quickstarts/rest-api.md
@@ -129,9 +129,9 @@ For the documents key:
 
 |Key  |Value  | Example |
 |---------|---------|---------|
-| `location `    | Document name on the blob store. | doc1.txt |
-|`language`   | The language of the document.   | en-us |
-|`dataset`   |  Optional field to specify the dataset which this document will belong to. | Train or Test |
+| `location`    | Document name on the blob store. | `doc2.txt` |
+|`language`   | The language of the document.   | `en-us` |
+|`dataset`   |  Optional field to specify the dataset which this document will belong to. | `Train` or `Test` | 
 
 This request will return an error if:
 
@@ -181,11 +181,11 @@ Use the following JSON in your request. The model will be named `MyModel` once t
 |Key  |Value  | Example |
 |---------|---------|---------|
 |`modelLabel  `    | Your Model name.   | MyModel |
-|`runValidation`     | Boolean value to run validation on the test set.   | True or False |
-|`evaluationOptions`     | Specifies evaluation options.   | -- |
+|`runValidation`     | Boolean value to run validation on the test set.   | `True` or `False` |
+|`evaluationOptions`     | Specifies evaluation options.   |  |
 |`type`     | Specifies datasplit type.   | set or percentage |
-|`testingSplitPercentage`     | Required integer field if `type`  is *percentage*. Specifies testing split.   | 30 |
-|`trainingSplitPercentage`     | Required integer field if `type`  is *percentage*. Specifies training split.   | 70 |
+|`testingSplitPercentage`     | Required integer field if `type`  is *percentage*. Specifies testing split.   | `30` |
+|`trainingSplitPercentage`     | Required integer field if `type`  is *percentage*. Specifies training split.   | `70` |
 
 Once you send your API request, you will receive a `202` response indicating success. In the response headers, extract the `location` value. It will be formatted like this: 
 
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/faq.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/faq.md
@@ -8,7 +8,7 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: language-service
 ms.topic: conceptual
-ms.date: 11/16/2021
+ms.date: 04/05/2022
 ms.author: aahi
 ms.custom: language-service-custom-ner, ignite-fall-2021
 ---
@@ -44,7 +44,7 @@ When you're ready to start [using your model to make predictions](#how-do-i-use-
 
 ## What is the recommended CI/CD process?
 
-You can train multiple models on the same dataset within the same project. After you have trained your model successfully, you can [view its evaluation](how-to/view-model-evaluation.md). You can [deploy and test](quickstart.md#deploy-your-model) your model within [Language studio](https://aka.ms/languageStudio). You can add or remove tags from your data and train a **new** model and test it as well. View [service limits](service-limits.md)to learn about maximum number of trained models with the same project. When you train a new model your dataset is [split](how-to/train-model.md#data-split) randomly into training and testing sets, so there is no guarantee that the reflected model evaluation is about the same test set, and the results are not comparable. It's recommended that you develop your own test set and use it to evaluate both models so you can measure improvement.
+You can train multiple models on the same dataset within the same project. After you have trained your model successfully, you can [view its evaluation](how-to/view-model-evaluation.md). You can [deploy and test](quickstart.md#deploy-your-model) your model within [Language studio](https://aka.ms/languageStudio). You can add or remove tags from your data and train a **new** model and test it as well. View [service limits](service-limits.md) to learn about the maximum number of trained models within the same project. When you [train your model](how-to/train-model.md), you can determine how your dataset is split into training and testing sets. You can also have your data split randomly into training and testing sets, in which case there is no guarantee that each model is evaluated on the same test set, so the results are not comparable. It's recommended that you develop your own test set and use it to evaluate both models so you can measure improvement.
 
 ## Does a low or high model score guarantee bad or good performance in production?
 
@@ -67,7 +67,7 @@ See the [data selection and schema design](how-to/design-schema.md) article for
 
 ## Why do I get different results when I retrain my model?
 
-* When you train a new model your dataset is [split](how-to/train-model.md#data-split) randomly into train and test sets so there is no guarantee that the reflected model evaluation is on the same test set, so results are not comparable.
+* When you [train your model](how-to/train-model.md), you can choose whether your data is split randomly into train and test sets. If you do, there is no guarantee that each model is evaluated on the same test set, so results are not comparable.
 
 * If you're retraining the same model, your test set will be the same but you might notice a slight change in predictions made by the model. This is because the trained model is not robust enough and this is a factor of how representative and distinct your data is and the quality of your tagged data.
 
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/improve-model.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/improve-model.md
@@ -8,7 +8,7 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: language-service
 ms.topic: how-to
-ms.date: 11/02/2021
+ms.date: 04/05/2022
 ms.author: aahi
 ms.custom: language-service-custom-ner, ignite-fall-2021
 ---
@@ -34,7 +34,7 @@ See the [application development lifecycle](../overview.md#application-developme
 After you have reviewed your [model's evaluation](view-model-evaluation.md), you'll have formed an idea on what's wrong with your model's prediction. 
 
 > [!NOTE]
-> This guide focuses on data from the [validation set](train-model.md#data-split) that was created during training.
+> This guide focuses on data from the [validation set](train-model.md) that was created during training.
 
 ### Review test set
 
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/tag-data.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/tag-data.md
@@ -8,7 +8,7 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: language-service
 ms.topic: how-to
-ms.date: 11/02/2021
+ms.date: 04/05/2022
 ms.author: aahi
 ms.custom: language-service-custom-ner, ignite-fall-2021
 ---
@@ -44,12 +44,29 @@ The precision, consistency and completeness of your tagged data are key factors
 
 3. You can find a list of all `.txt` files available in your projects to the left. You can select the file you want to start tagging or you can use the **Back** and **Next** button from the bottom of the page to navigate.
 
-4. To start tagging, click **Add entities** in the top-right corner. You can either view all files or only tagged files by changing the view from the **Viewing** drop down.
+4. To start tagging, click **Add entities** in the top-right corner. You can either view all files or only tagged files by changing the view from the **Viewing** drop down filter.
+
+    :::image type="content" source="../media/tagging-screen.png" alt-text="A screenshot showing the Language Studio screen for tagging data." lightbox="../media/tagging-screen.png":::
+
+    In the image above:
+    
+    * *Section 1*: Displays the content of the text file and is where tagging takes place. You have [two options for tagging](#tagging-options) your files.
+    
+    * *Section 2*: Includes your project's entities and their distribution across your files and tags.
+    If you click **Distribution**, you can view your tag distribution across:
+        
+        * Files: View the distribution of files across one single entity.
+        * Tags: View the distribution of tags across all files.
+    
+        :::image type="content" source="../media/distribution-ner.png" alt-text="A screenshot showing the distribution section." lightbox="../media/distribution-ner.png":::
+        
+    
+    * *Section 3*: This is the split project data toggle. You can choose to add a selected text file to your training set or the testing set. By default, the toggle is off, and all text files are added to your training set.
+    
+To add a text file to the training or testing set, select the radio button for the set it belongs to.
 
 >[!TIP]
-> * There is no standard number of tags you will need, Consider starting with 50 tags per entity. The number of tags you'll need depends on how distinct your entities are, and how easily they can be differentiated from each other. It also depends on your tagging, which should be consistent and complete.
-
-:::image type="content" source="../media/tagging-screen.png" alt-text="A screenshot showing the Language Studio screen for tagging data." lightbox="../media/tagging-screen.png":::
+>It is recommended to define your testing set.
 
 If you enabled multiple languages for your project, you will find a **Language** dropdown, which lets you select the language of each document.
 
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/train-model.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/train-model.md
@@ -27,14 +27,6 @@ The time to train a model varies on the dataset, and may take up to several hour
 
 See the [application development lifecycle](../overview.md#application-development-lifecycle) for more information.
 
-## Data split
-
-Before starting the training process, files in your dataset are divided into two groups at random:
-
-* The **training set** contains 80% of the files in your dataset. It is the main set that is used to train the model.
-
-* The **test set** contains 20% of the files available in your dataset. This set is used to provide an unbiased [evaluation](../how-to/view-model-evaluation.md) of the model. This set is not introduced to the model during training.
-
 ## Train model in Language studio
 
 1. Go to your project page in [Language Studio](https://aka.ms/LanguageStudio).
@@ -45,11 +37,17 @@ Before starting the training process, files in your dataset are divided into two
 
 4. To train a new model, select **Train a new model** and type in the model name in the text box below. You can **overwrite an existing model** by selecting this option and select the model you want from the dropdown below.
 
-    :::image type="content" source="../media/train-model.png" alt-text="Create a new model" lightbox="../media/train-model.png":::
+    :::image type="content" source="../media/train-model.png" alt-text="Create a new training job" lightbox="../media/train-model.png":::
+    
+If you have enabled [your project data to be split manually](tag-data.md) when you were tagging your data, you will see two training options:
+
+* **Automatic split the testing**: The data will be randomly split for each class between training and testing sets, according to the percentages you choose. The default value is 80% for training and 20% for testing. To change these values, choose which set you want to change and write the new value.
+* **Use a manual split**: Assign each document to either the training or testing set. This requires first adding files to the test dataset.
+
 
 5. Click on the **Train** button.
 
-6. You can check the status of the training job in the same page. Only successfully completed tasks will generate models.
+6. You can check the status of the training job in the same page. Only successfully completed training jobs will generate models.
 
 You can only have one training job running at a time. You cannot create or start other tasks in the same project. 
 
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/view-model-evaluation.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/how-to/view-model-evaluation.md
@@ -8,15 +8,15 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: language-service
 ms.topic: how-to
-ms.date: 11/02/2021
+ms.date: 04/05/2022
 ms.author: aahi
 ms.custom: language-service-custom-ner, ignite-fall-2021
 ---
 
 
 # View the model's evaluation and details
 
-After your model has finished training, you can view the model details and see how well does it perform against the test set, which contains 10% of your data at random, which is created during [training](train-model.md#data-split). The test set consists of data that was not introduced to the model during the training process. For the evaluation process to complete there must be at least 10 files in your dataset. You must also have a [custom NER project](../quickstart.md) with a [trained model](train-model.md).
+After your model has finished training, you can view the model details and see how well it performs against the test set, which contains a random 10% of your data and is created during [training](train-model.md). The test set consists of data that was not introduced to the model during the training process. For the evaluation process to complete, there must be at least 10 files in your dataset. You must also have a [custom NER project](../quickstart.md) with a [trained model](train-model.md).
 
 ## Prerequisites
 
@@ -33,7 +33,7 @@ See the [application development lifecycle](../overview.md#application-developme
 
 2. Select **View model details** from the menu on the left side of the screen.
 
-3. In this page you can only view the sucessfuly trained models. You can click on the model name for more details.
+3. In this page you can only view the successfully trained models. You can click on the model name for more details.
 
 4. You can find the **model-level** evaluation metrics under **Overview**, and the **entity-level** evaluation metrics under **Entity performance metrics**. The confusion matrix for the model is located under **Test set confusion matrix**
     
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/includes/quickstarts/rest-api.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/includes/quickstarts/rest-api.md
@@ -107,6 +107,7 @@ Use the following JSON in your request. Replace the placeholder values below wit
         {
             "location": "doc1.txt",
             "language": "en-us",
+            "dataset": "Train",
             "extractors": [
                 {
                     "regionOffset": 0,
@@ -129,6 +130,7 @@ Use the following JSON in your request. Replace the placeholder values below wit
         {
             "location": "doc2.txt",
             "language": "en-us",
+            "dataset": "Test",
             "extractors": [
                 {
                     "regionOffset": 0,
@@ -154,6 +156,15 @@ For the metadata key:
 | `modelType  `    | Your Model type. | Extraction |
 |`storageInputContainerName`   | The name of your Azure blob storage container.   | `myContainer` |
 
+For the documents key: 
+
+|Key  |Value  | Example |
+|---------|---------|---------|
+| `location`    | Document name on the blob store. | `doc2.txt` |
+|`language`   | The language of the document.   | `en-us` |
+|`dataset`   |  Optional field to specify the dataset which this document will belong to. | `Train` or `Test` | 
+
+
 This request will return an error if:
 
 * The selected resource doesn't have proper permission for the storage account. 
@@ -191,14 +202,24 @@ Use the following JSON in your request. The model will be named `MyModel` once t
 ```json
 {
   "modelLabel": "MyModel",
-  "runValidation": true
+  "runValidation": true,
+  "evaluationOptions":
+    {
+        "type":"percentage",
+        "testingSplitPercentage":"30",
+        "trainingSplitPercentage":"70"
+    }
 }
 ```
 
 |Key  |Value  | Example |
 |---------|---------|---------|
 |`modelLabel  `    | Your Model name.   | MyModel |
-|`runValidation`     | Boolean value to run validation on the test set.   | True |
+|`runValidation`     | Boolean value to run validation on the test set.   | `True` or `False` |
+|`evaluationOptions`     | Specifies evaluation options.   |  |
+|`type`     | Specifies the data split type.   | `set` or `percentage` |
+|`testingSplitPercentage`     | Required integer field if `type`  is *percentage*. Specifies testing split.   | `30` |
+|`trainingSplitPercentage`     | Required integer field if `type`  is *percentage*. Specifies training split.   | `70` |
 
 Once you send your API request, you’ll receive a `202` response indicating success. In the response headers, extract the `location` value. It will be formatted like this: 
 
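As an illustration of the training request body described in the table above, the following Python sketch builds the JSON with a percentage split. The helper name `build_training_body` is hypothetical, and the check that the two percentages sum to 100 is an assumption rather than documented service behavior. The sketch emits the split values as integers, following the table's "Required integer field" description; the sample JSON in the diff shows them as quoted strings.

```python
import json


def build_training_body(model_label, testing_pct=30, training_pct=70,
                        run_validation=True):
    """Build a training request body that uses a percentage data split.

    Hypothetical helper: the key names come from the documented table,
    everything else is an illustrative assumption.
    """
    for name, value in (("testing_pct", testing_pct),
                        ("training_pct", training_pct)):
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{name} must be an integer between 0 and 100")
    # Assumption: the two splits should cover the whole dataset.
    if testing_pct + training_pct != 100:
        raise ValueError("split percentages must sum to 100")
    return {
        "modelLabel": model_label,
        "runValidation": run_validation,
        "evaluationOptions": {
            "type": "percentage",
            "testingSplitPercentage": testing_pct,
            "trainingSplitPercentage": training_pct,
        },
    }


if __name__ == "__main__":
    # Print the body that would be sent with the training request.
    print(json.dumps(build_training_body("MyModel"), indent=2))
```

Validating the split on the client side gives an early, readable error instead of waiting for the service to reject the request.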
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/includes/train-model-language-studio.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/includes/train-model-language-studio.md
@@ -5,7 +5,7 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: language-service
 ms.topic: include
-ms.date: 02/02/2022
+ms.date: 04/05/2022
 ms.author: aahi
 ---
 
@@ -24,5 +24,5 @@ To start training your model:
 3. Click on the **Train** button at the bottom of the page.
 
     > [!NOTE]
-    > * While training, the data will be spilt into 2 sets: 80% for training and 20% for testing. See [how to train a model](../how-to/train-model.md#data-split) for more information. 
+    > * While training, the data will be split into 2 sets for training and testing the model. See [how to train a model](../how-to/train-model.md) for more information. 
     > * Training can take up to a few hours.
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/media/distribution-ner.png b/articles/cognitive-services/language-service/custom-named-entity-recognition/media/distribution-ner.png
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/media/tagging-screen.png b/articles/cognitive-services/language-service/custom-named-entity-recognition/media/tagging-screen.png
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/media/train-model.png b/articles/cognitive-services/language-service/custom-named-entity-recognition/media/train-model.png
diff --git a/articles/cognitive-services/language-service/custom-named-entity-recognition/service-limits.md b/articles/cognitive-services/language-service/custom-named-entity-recognition/service-limits.md
@@ -8,7 +8,7 @@ manager: nitinme
 ms.service: cognitive-services
 ms.subservice: language-service
 ms.topic: conceptual
-ms.date: 11/02/2021
+ms.date: 04/05/2022
 ms.author: aahi
 ms.custom: language-service-custom-ner, references_regions, ignite-fall-2021
 ---
@@ -27,13 +27,13 @@ Use this article to learn about the data and service limits when using Custom NE
 
 * Maximum allowed length for your file is 128,000 characters, which is approximately 28,000 words or 56 pages.
 
-* Your [training dataset](how-to/train-model.md#data-split) should include at least 10 files and not more than 100,000 files.
+* Your [training dataset](how-to/train-model.md) should include at least 10 files and not more than 100,000 files.
 
 ## APIs limits
 
-* When using the Authoring API, there is a maximum of 10 POST requests and 100 GET requests per minute.
+* The Authoring API has a maximum of 10 POST requests and 100 GET requests per minute.
 
-* When using the Analyze API, there is a maximum of 20 GET or POST requests per minute.
+* The Analyze API has a maximum of 20 GET or POST requests per minute.
 
 * The maximum file size per request is 125,000 characters. You can send up to 25 files as long as they collectively do not exceed 125,000 characters.
 
@@ -70,7 +70,7 @@ Custom text classification is only available select Azure regions. When you crea
 
 * Model names have to be unique within the same project.
 
-* Model names must only contain alphnumeric characters,only letters and numbers, no spaces or special characters are allowed). Model name must have a maximum of 50 characters.
+* Model names must contain only alphanumeric characters (letters and numbers); no spaces or special characters are allowed. Model names can have a maximum of 50 characters.
 
 * You cannot rename your model after creation.
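
The model naming rules above can be checked on the client before a training request is sent. This is a minimal sketch: the function name `is_valid_model_name` is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits, which the source does not state explicitly. Uniqueness within a project is not checked here because it depends on the project's existing models.

```python
import re

# Assumption: "alphanumeric" is read as ASCII letters and digits only.
# 1 to 50 characters, no spaces or special characters.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z0-9]{1,50}")


def is_valid_model_name(name: str) -> bool:
    """Return True if the name satisfies the documented character and length rules."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("MyModel", "My Model", "model-v2", "a" * 51):
        print(repr(candidate[:20]), is_valid_model_name(candidate))
```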