-title: Labeling tips and tricks for custom models in the Form Recognizer Studio
+title: Labeling tips for custom models in the Form Recognizer Studio
 titleSuffix: Azure Applied AI Services
 description: Label tips and tricks for Form Recognizer Studio
 author: laujan
@@ -13,25 +13,25 @@ ms.custom: references_regions
 recommendations: false
 ---

-# Tips and tricks for labeling custom model datasets
+# Tips for labeling custom model datasets

-Labeling documents can be time consuming when you have a large number of labels, long documents or documents with varying structure. These tips and tricks should help you label documents more efficiently.
+This article highlights the best methods for labeling custom model datasets in the Form Recognizer Studio. Labeling documents can be time consuming when you have a large number of labels, long documents, or documents with varying structure. These tips should help you label documents more efficiently.

 ## Search

-The Studio now includes a search box for instances when you know you need to find specific words to label, but just don't know where in the document the values are. Simply search for the word or phrase and navigate to the specific section in the document to label the occurrence.
+The Studio now includes a search box for instances when you know you need to find specific words to label, but just don't know where they are located in the document. Simply search for the word or phrase and navigate to the specific section in the document to label the occurrence.

 ## Auto label tables

-Tables can be tedious to label, when they have many rows or dense text. If the layout table extracts the result you need, you should just use that result and skip labeling the table. In instances where the layout table isn't exactly what you need, you can start with generating the table field from the values layout extracts. Start by selecting the table icon on the page and select on the auto label button. You can then edit the values as needed. Auto label currently only supports single page tables and a future update will include support for multi page tables.
+Tables can be challenging to label, when they have many rows or dense text. If the layout table extracts the result you need, you should just use that result and skip the labeling process. In instances where the layout table isn't exactly what you need, you can start with generating the table field from the values layout extracts. Start by selecting the table icon on the page and select on the auto label button. You can then edit the values as needed. Auto label currently only supports single page tables and a future update will include support for multi page tables.

 ## Shift select

-When labeling a large span of text, rather than paint each word in the span, hold down the shift key as you're selecting the words to speed up labeling and ensure you don't miss any words in the span of text
+When labeling a large span of text, rather than mark each word in the span, hold down the shift key as you're selecting the words to speed up labeling and ensure you don't miss any words in the span of text.

 ## Region labeling

-A second option to labeling larger spans of text is to use region labeling to select a region. When region labeling is used, the OCR results are populated in the value at training time. The different between the shift select and region labeling is only in the visual feedback the shift labeling approach provides.
+A second option for labeling larger spans of text is to use region labeling. When region labeling is used, the OCR results are populated in the value at training time. The difference between the shift select and region labeling is only in the visual feedback the shift labeling approach provides.

 ## Field subtypes

@@ -40,3 +40,20 @@ When creating a field, select the right subtype to minimize post processing, for
 ## Batch layout

 When creating a project, select the batch layout option to prepare all documents in your dataset for labeling. This feature ensures that you no longer have to select on each document and wait for the layout results before you can start labeling.
articles/applied-ai-services/form-recognizer/concept-custom-label.md
Lines changed: 44 additions & 35 deletions
@@ -7,59 +7,67 @@ manager: nitinme
 ms.service: applied-ai-services
 ms.subservice: forms-recognizer
 ms.topic: conceptual
-ms.date: 12/01/2022
+ms.date: 12/15/2022
 ms.author: vikurpad
 ms.custom: references_regions
 monikerRange: 'form-recog-3.0.0'
 recommendations: false
 ---

-# Best practices on generating a Form Recognizer labeled dataset
+# Best practices: Generating Form Recognizer labeled dataset

-Custom models, template and neural require a labeled dataset of at least five documents to train a model. The quality of the labeled dataset affects the accuracy of the trained model. This guide helps you learn more about generating a model with high accuracy by assembling a diverse dataset and best practices for labeling your documents.
+Custom models (template and neural) require a labeled dataset of at least five documents to train a model. The quality of the labeled dataset affects the accuracy of the trained model. This guide helps you learn more about generating a model with high accuracy by assembling a diverse dataset and provides best practices for labeling your documents.

-## Understanding the components of the labeled dataset
+## Understand the components of a labeled dataset

-A labeled dataset contains three types of files:
+A labeled dataset consists of several files:

-* A set of sample documents (typically PDFs or images), you need a minimum of five documents to train a model.
-* The labeling process will generate the following files:
-- A `fields.json` file is created when the first field is added. There's one fields.json file for the entire training dataset, the field list contains the field name and associated sub fields and types.
-- The Studio runs each of the documents through the [Layout API](concept-layout.md). The layout response for each of the sample files in the dataset is added as `{file}.ocr.json`. The layout response is used to generate the field labels when a specific span of text is labeled.
-- A `{file}.labels.json` file is created or updated when a field is labeled in a document. The label file contains the spans of text and associated polygons from the layout output for each span of text the user adds as a value for a specific field.
+* You'll provide a set of sample documents (typically PDFs or images). A minimum of five documents is needed to train a model.

-## Creating a balanced dataset
+* Additionally, the labeling process will generate the following files:
+
+* A `fields.json` file is created when the first field is added. There's one `fields.json` file for the entire training dataset, the field list contains the field name and associated sub fields and types.
+
+* The Studio runs each of the documents through the [Layout API](concept-layout.md). The layout response for each of the sample files in the dataset is added as `{file}.ocr.json`. The layout response is used to generate the field labels when a specific span of text is labeled.
+
+* A `{file}.labels.json` file is created or updated when a field is labeled in a document. The label file contains the spans of text and associated polygons from the layout output for each span of text the user adds as a value for a specific field.
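Reviewer aside: the file layout described in the added lines can be checked mechanically before training. The following is a minimal Python sketch, not part of the commit or of any Form Recognizer SDK. The function name `check_labeled_dataset`, the list of document suffixes, and the reading of `{file}` as the full document file name (for example `invoice.pdf.ocr.json`) are assumptions made for illustration; a document that is intentionally left unlabeled would also be reported as missing its labels file.

```python
from pathlib import Path

# Assumption: suffixes treated as sample documents; adjust to your dataset.
DOCUMENT_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
MIN_DOCUMENTS = 5  # minimum number of documents stated in the article


def check_labeled_dataset(folder):
    """Return a list of human-readable problems found in a dataset folder."""
    folder = Path(folder)
    problems = []

    # Collect the sample documents (PDFs or images) in a stable order.
    documents = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in DOCUMENT_SUFFIXES
    )
    if len(documents) < MIN_DOCUMENTS:
        problems.append(
            f"only {len(documents)} sample documents; "
            f"at least {MIN_DOCUMENTS} are needed"
        )

    # One fields.json is expected for the entire training dataset.
    if not (folder / "fields.json").is_file():
        problems.append("fields.json is missing")

    # Assumption: `{file}` is the full document name, e.g. invoice.pdf.ocr.json.
    for doc in documents:
        for companion in (f"{doc.name}.ocr.json", f"{doc.name}.labels.json"):
            if not (folder / companion).is_file():
                problems.append(f"{companion} is missing")

    return problems
```

An empty list means the folder matches the layout described above; anything else is a prompt to reopen the project in the Studio and check the affected documents.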
+
+## Create a balanced dataset

 Before you start labeling, it's a good idea to look at a few different samples of the document to identify which samples you want to use in your labeled dataset. A balanced dataset represents all the typical variations you would expect to see for the document. Creating a balanced dataset will result in a model with the highest possible accuracy. A few examples to consider are:

-* Document formats - If you expect to analyze both digital and scanned documents, add a few examples of each type to the training dataset
+* **Document formats**: If you expect to analyze both digital and scanned documents, add a few examples of each type to the training dataset.

-* Variations (template model) - consider splitting the dataset into folders and train a model for each of the variation. Variations that include either structure or layout variations should be split into different models. You can then compose the individual models into a single [composed model](concept-composed-models.md).
+* **Variations (template model)**: Consider splitting the dataset into folders and training a model for each variation. Variations that include either structure or layout should be split into different models. You can then compose the individual models into a single [composed model](concept-composed-models.md).

-* Variations (Neural models) - When your dataset has a manageable set of variations, about 15 pr fewer, create a single dataset with a few samples of each of the different variations to train a single model. If the number of template variations is larger than 15, you'll train multiple models and [compose](concept-composed-models.md) them together.
+* **Variations (Neural models)**: When your dataset has a manageable set of variations, about 15 or fewer, create a single dataset with a few samples of each of the different variations to train a single model. If the number of template variations is larger than 15, you'll train multiple models and [compose](concept-composed-models.md) them together.

-* Tables - For documents containing tables with a variable number of rows, ensure that the training dataset also represents documents with different number of rows.
+* **Tables**: For documents containing tables with a variable number of rows, ensure that the training dataset also represents documents with different numbers of rows.

-* Multi page tables - When tables span multiple pages, label a single table. Add documents to the training dataset with the expected variations represented, documents with the table on a single page only, documents with the table spanning two or more pages with all the rows labeled.
+* **Multi page tables**: When tables span multiple pages, label a single table. Add documents to the training dataset with the expected variations represented: documents with the table on a single page only and documents with the table spanning two or more pages with all the rows labeled.

-* Optional fields - If your documents contain documents with optional fields, validate that the training dataset has a few documents with the optionality represented.
+* **Optional fields**: If your dataset contains documents with optional fields, validate that the training dataset has a few documents with the options represented.

 ## Start by identifying the fields

-Take the time to identify each of the fields you plan to label in the dataset, paying attention to optional fields. Define the fields with the type that best matches the supported types.
+Take the time to identify each of the fields you plan to label in the dataset. Pay attention to optional fields. Define the fields with the labels that best match the supported types.
+
+Use the following guidelines to define the fields:
+
+* For custom neural models, use semantically relevant names for fields. For example, if the value being extracted is `Effective Date`, name it `effective_date` or `EffectiveDate`, not a generic name like **date1**.

-Use the following guidelines to defining the fields:
+* Ideally, name your fields with Pascal or camel case.

-* For custom neural models, use semantically relevant names for fields. As an example, if the value being extracted is `Effective Date`, name it `effective_date` or `EffectiveDate` not a generic name like `date1`
-* Ideally, name your field Pascal case, camel case.
-* If a value is part of a visually repeating structure and you only need a single value, label it as a table and extract the required value in post processing
-* For tabular fields spanning multiple pages, define and label as a single table
+* If a value is part of a visually repeating structure and you only need a single value, label it as a table and extract the required value during post-processing.

-Custom neural models share the same labeling format and strategy as custom template models. Currently custom neural models only support a subset of the field types supported by custom template models.
+* For tabular fields spanning multiple pages, define and label the fields as a single table.
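Reviewer aside: the naming guidelines above lend themselves to a quick lint pass over a planned field list. This is a rough Python sketch under stated assumptions, not an official rule set: the helper name `review_field_name`, the word list used to detect "generic" names, and the regular expressions are all hypothetical heuristics. Note that `effective_date` is an acceptable name according to the text, so the case suggestion is only a soft preference.

```python
import re

# Assumption: a "generic" name is a common placeholder word followed by digits,
# such as date1 or field_2. The word list is illustrative, not exhaustive.
GENERIC_NAME = re.compile(
    r"^(field|value|text|date|name|number|item)_?\d+$", re.IGNORECASE
)

# Pascal or camel case: letters and digits only, starting with a letter.
PASCAL_OR_CAMEL = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")


def review_field_name(name):
    """Return a list of naming suggestions for one field name (empty if none)."""
    suggestions = []
    if GENERIC_NAME.match(name):
        suggestions.append("use a semantically relevant name")
    if not PASCAL_OR_CAMEL.match(name):
        suggestions.append("prefer Pascal or camel case")
    return suggestions
```

For example, `review_field_name("EffectiveDate")` yields no suggestions, while `review_field_name("date1")` flags the name as generic.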
+
+> [!NOTE]
+> Custom neural models share the same labeling format and strategy as custom template models. Currently, custom neural models only support a subset of the field types supported by custom template models.

 ## Model capabilities

-Custom neural models currently only support key-value pairs, structured fields (tables) and selection marks, future releases will include support for signature.
+Custom neural models currently only support key-value pairs, structured fields (tables), and selection marks. Future releases will include support for signatures.

 | Model type | Form fields | Selection marks | Tabular fields | Signature | Region |
 |--|--|--|--|--|--|
@@ -72,20 +80,21 @@ Custom neural models currently only support key-value pairs, structured fields (

 Tabular fields (tables) are supported with custom neural models starting with API version ```2022-06-30-preview```. Models trained with API version 2022-06-30-preview or later will accept tabular field labels and documents analyzed with the model with API version 2022-06-30-preview or later will produce tabular fields in the output within the ```documents``` section of the result in the ```analyzeResult``` object.

-Tabular fields support **cross page tables** by default. To label a table that spans multiple pages, label each row of the table across the different pages in the single table. As a best practice, ensure that your dataset contains a few samples of the expected variations. For example, include samples where an entire table is on a single page and samples of a table spanning two or more pages.
+Tabular fields support **cross page tables** by default. To label a table that spans multiple pages, label each row of the table across the different pages in the single table. As a best practice, ensure that your dataset contains a few samples of the expected variations. For example, include both samples where an entire table is on a single page and samples of a table spanning two or more pages.

-Tabular field is also useful when extracting repeating information within a document that isn't recognized as a table. For example, a repeating section of work experiences in a resume can be labeled and extracted as a tabular field.
+Tabular fields are also useful when extracting repeating information within a document that isn't recognized as a table. For example, a repeating section of work experiences in a resume can be labeled and extracted as a tabular field.

 ## Labeling guidelines

-* Labeling the value is required; don't include the surrounding text. For example when labeling a checkbox, name the field to indicate the check box selection for example ```selectionYes``` and ```selectionNo``` rather than labeling the yes or no text in the document.
-* Non interleaving values - Value words/region of one field must be either
-- Consecutive sequence in natural reading order without interleaving with other fields or
-- In a region that doesn't cover any other fields
-* Consistent labeling - If a value appears in multiple contexts withing the document, consistently pick the same context across documents to label the value.
-* Tables support visually repeating groups of information not just explicit tables. Explicit tables will be identified in tables section of the analyzed documents as part of the layout output and don't need to be labeled as tables. Only label a table field if the information is visually repeating and not identified as a table as part of the layout response. An example would be the repeating work experience section of a resume.
-* Region labeling (custom template) allows you to define a value when none exists. If the value is optional, ensure that you leave a few sample documents with the region not labeled.
-* When labeling regions, don't include the surrounding text with the label.
+* **Labeling values is required.** Don't include the surrounding text. For example, when labeling a checkbox, name the field to indicate the checkbox selection, such as ```selectionYes``` and ```selectionNo```, rather than labeling the yes or no text in the document.
+
+* **Don't provide interleaving field values.** The value of words and/or regions of one field must be either a consecutive sequence in natural reading order without interleaving with other fields or in a region that doesn't cover any other fields.
+
+* **Consistent labeling**. If a value appears in multiple contexts within the document, consistently pick the same context across documents to label the value.
+
+* **Visually repeating data**. Tables support visually repeating groups of information, not just explicit tables. Explicit tables will be identified in the tables section of the analyzed documents as part of the layout output and don't need to be labeled as tables. Only label a table field if the information is visually repeating and not identified as a table as part of the layout response. An example would be the repeating work experience section of a resume.
+
+* **Region labeling (custom template)**. Labeling specific regions allows you to define a value when none exists. If the value is optional, ensure that you leave a few sample documents with the region not labeled. When labeling regions, don't include the surrounding text with the label.
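Reviewer aside: the non-interleaving guideline can be illustrated with a small check. The Python sketch below covers only the word-level case (a consecutive sequence in natural reading order), not the region-based alternative. It assumes the labels have already been flattened into one entry per word in reading order, with `None` for unlabeled words; that input shape and the function name `find_interleaved_fields` are hypothetical and do not reflect the `labels.json` schema. It treats any gap inside a field's words, labeled or not, as a break in the sequence.

```python
def find_interleaved_fields(word_labels):
    """Return sorted names of fields whose words are not one consecutive run.

    word_labels holds one entry per word in natural reading order: the field
    name the word is labeled with, or None when the word is unlabeled.
    """
    first, last, count = {}, {}, {}
    for index, field in enumerate(word_labels):
        if field is None:
            continue
        first.setdefault(field, index)  # position of the field's first word
        last[field] = index             # position of the field's last word
        count[field] = count.get(field, 0) + 1

    # A consecutive run spans exactly as many positions as it has words.
    return sorted(
        field for field in count
        if last[field] - first[field] + 1 != count[field]
    )
```

For instance, `find_interleaved_fields(["Name", "Date", "Name"])` reports `Name`, because a `Date` word sits between its two words.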