Commit da97fe7

committed
edit PR #221106
1 parent d53a6d0 commit da97fe7

2 files changed

+68
-42
lines changed

Lines changed: 24 additions & 7 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: Labeling tips and tricks for custom models in the Form Recognizer Studio
2+
title: Labeling tips for custom models in the Form Recognizer Studio
33
titleSuffix: Azure Applied AI Services
44
description: Label tips and tricks for Form Recognizer Studio
55
author: laujan
@@ -13,25 +13,25 @@ ms.custom: references_regions
1313
recommendations: false
1414
---
1515

16-
# Tips and tricks for labeling custom model datasets
16+
# Tips for labeling custom model datasets
1717

18-
Labeling documents can be time consuming when you have a large number of labels, long documents or documents with varying structure. These tips and tricks should help you label documents more efficiently.
18+
This article highlights the best methods for labeling custom model datasets in the Form Recognizer Studio. Labeling documents can be time consuming when you have a large number of labels, long documents, or documents with varying structure. These tips should help you label documents more efficiently.
1919

2020
## Search
2121

22-
The Studio now includes a search box for instances when you know you need to find specific words to label, but just don't know where in the document the values are. Simply search for the word or phrase and navigate to the specific section in the document to label the occurrence.
22+
The Studio now includes a search box for instances when you know you need to find specific words to label, but just don't know where they are located in the document. Simply search for the word or phrase and navigate to the specific section in the document to label the occurrence.
2323

2424
## Auto label tables
2525

26-
Tables can be tedious to label, when they have many rows or dense text. If the layout table extracts the result you need, you should just use that result and skip labeling the table. In instances where the layout table isn't exactly what you need, you can start with generating the table field from the values layout extracts. Start by selecting the table icon on the page and select on the auto label button. You can then edit the values as needed. Auto label currently only supports single page tables and a future update will include support for multi page tables.
26+
Tables can be challenging to label when they have many rows or dense text. If the layout table extracts the result you need, use that result and skip the labeling process. When the layout table isn't exactly what you need, you can start by generating the table field from the values that layout extracts. Select the table icon on the page, and then select the auto label button. You can then edit the values as needed. Auto label currently supports only single-page tables; a future update will add support for multi-page tables.
2727

2828
## Shift select
2929

30-
When labeling a large span of text, rather than paint each word in the span, hold down the shift key as you're selecting the words to speed up labeling and ensure you don't miss any words in the span of text
30+
When labeling a large span of text, rather than mark each word in the span, hold down the shift key as you're selecting the words to speed up labeling and ensure you don't miss any words in the span of text.
3131

3232
## Region labeling
3333

34-
A second option to labeling larger spans of text is to use region labeling to select a region. When region labeling is used, the OCR results are populated in the value at training time. The different between the shift select and region labeling is only in the visual feedback the shift labeling approach provides.
34+
A second option for labeling larger spans of text is to use region labeling. When region labeling is used, the OCR results are populated in the value at training time. The only difference between shift select and region labeling is the visual feedback that the shift select approach provides.
3535

3636
## Field subtypes
3737

@@ -40,3 +40,20 @@ When creating a field, select the right subtype to minimize post processing, for
4040
## Batch layout
4141

4242
When creating a project, select the batch layout option to prepare all documents in your dataset for labeling. With this feature, you no longer have to select each document and wait for the layout results before you can start labeling.
43+
44+
## Next steps
45+
46+
* Learn more about custom labeling:
47+
48+
> [!div class="nextstepaction"]
49+
> [Custom labels](concept-custom-label.md)
50+
51+
* Learn more about custom template models:
52+
53+
> [!div class="nextstepaction"]
54+
> [Custom template models](concept-custom-template.md)
55+
56+
* Learn more about custom neural models:
57+
58+
> [!div class="nextstepaction"]
59+
> [Custom neural models](concept-custom-neural.md)

articles/applied-ai-services/form-recognizer/concept-custom-label.md

Lines changed: 44 additions & 35 deletions
@@ -7,59 +7,67 @@ manager: nitinme
77
ms.service: applied-ai-services
88
ms.subservice: forms-recognizer
99
ms.topic: conceptual
10-
ms.date: 12/01/2022
10+
ms.date: 12/15/2022
1111
ms.author: vikurpad
1212
ms.custom: references_regions
1313
monikerRange: 'form-recog-3.0.0'
1414
recommendations: false
1515
---
1616

17-
# Best practices on generating a Form Recognizer labeled dataset
17+
# Best practices: Generating a Form Recognizer labeled dataset
1818

19-
Custom models, template and neural require a labeled dataset of at least five documents to train a model. The quality of the labeled dataset affects the accuracy of the trained model. This guide helps you learn more about generating a model with high accuracy by assembling a diverse dataset and best practices for labeling your documents.
19+
Custom models (template and neural) require a labeled dataset of at least five documents to train a model. The quality of the labeled dataset affects the accuracy of the trained model. This guide helps you learn more about generating a model with high accuracy by assembling a diverse dataset and provides best practices for labeling your documents.
2020

21-
## Understanding the components of the labeled dataset
21+
## Understand the components of a labeled dataset
2222

23-
A labeled dataset contains three types of files:
23+
A labeled dataset consists of several files:
2424

25-
* A set of sample documents (typically PDFs or images), you need a minimum of five documents to train a model.
26-
* The labeling process will generate the following files:
27-
- A `fields.json` file is created when the first field is added. There's one fields.json file for the entire training dataset, the field list contains the field name and associated sub fields and types.
28-
- The Studio runs each of the documents through the [Layout API](concept-layout.md). The layout response for each of the sample files in the dataset is added as `{file}.ocr.json`. The layout response is used to generate the field labels when a specific span of text is labeled.
29-
- A `{file}.labels.json` file is created or updated when a field is labeled in a document. The label file contains the spans of text and associated polygons from the layout output for each span of text the user adds as a value for a specific field.
25+
* You'll provide a set of sample documents (typically PDFs or images). A minimum of five documents is needed to train a model.
3026

31-
## Creating a balanced dataset
27+
* Additionally, the labeling process will generate the following files:
28+
29+
* A `fields.json` file is created when the first field is added. There's one `fields.json` file for the entire training dataset. Its field list contains the field names and their associated subfields and types.
30+
31+
* The Studio runs each of the documents through the [Layout API](concept-layout.md). The layout response for each of the sample files in the dataset is added as `{file}.ocr.json`. The layout response is used to generate the field labels when a specific span of text is labeled.
32+
33+
* A `{file}.labels.json` file is created or updated when a field is labeled in a document. The label file contains the spans of text and associated polygons from the layout output for each span of text the user adds as a value for a specific field.
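The file layout described above can be checked with a short script before training. The following is a minimal sketch, not part of the Studio or any SDK: the function name, the list of document extensions, and the reading of `{file}` as the full document file name (including its extension) are assumptions made for illustration.

```python
from pathlib import Path

MIN_DOCUMENTS = 5  # minimum number of sample documents stated in the article
# Assumption: which extensions count as sample documents.
DOCUMENT_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def check_labeled_dataset(folder):
    """Return a list of issues found in a labeled dataset folder."""
    root = Path(folder)
    issues = []

    # One fields.json is expected for the entire training dataset.
    if not (root / "fields.json").is_file():
        issues.append("missing fields.json")

    documents = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in DOCUMENT_SUFFIXES
    )
    if len(documents) < MIN_DOCUMENTS:
        issues.append(
            f"only {len(documents)} sample documents; need at least {MIN_DOCUMENTS}"
        )

    # Assumption: companion files are named <document name>.ocr.json
    # and <document name>.labels.json.
    for doc in documents:
        for suffix in (".ocr.json", ".labels.json"):
            companion = root / (doc.name + suffix)
            if not companion.is_file():
                issues.append(f"missing {companion.name}")
    return issues
```

A missing `.labels.json` file isn't necessarily an error, because the file is only created once a field is labeled in that document. It does indicate a document that hasn't been labeled yet.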
34+
35+
## Create a balanced dataset
3236

3337
Before you start labeling, it's a good idea to look at a few different samples of the document to identify which samples you want to use in your labeled dataset. A balanced dataset represents all the typical variations you would expect to see for the document. Creating a balanced dataset will result in a model with the highest possible accuracy. A few examples to consider are:
3438

35-
* Document formats - If you expect to analyze both digital and scanned documents, add a few examples of each type to the training dataset
39+
* **Document formats**: If you expect to analyze both digital and scanned documents, add a few examples of each type to the training dataset.
3640

37-
* Variations (template model) - consider splitting the dataset into folders and train a model for each of the variation. Variations that include either structure or layout variations should be split into different models. You can then compose the individual models into a single [composed model](concept-composed-models.md).
41+
* **Variations (template model)**: Consider splitting the dataset into folders and training a model for each variation. Variations in either structure or layout should be split into different models. You can then compose the individual models into a single [composed model](concept-composed-models.md).
3842

39-
* Variations (Neural models) - When your dataset has a manageable set of variations, about 15 pr fewer, create a single dataset with a few samples of each of the different variations to train a single model. If the number of template variations is larger than 15, you'll train multiple models and [compose](concept-composed-models.md) them together.
43+
* **Variations (Neural models)**: When your dataset has a manageable set of variations, about 15 or fewer, create a single dataset with a few samples of each of the different variations to train a single model. If the number of template variations is larger than 15, you'll train multiple models and [compose](concept-composed-models.md) them together.
4044

41-
* Tables - For documents containing tables with a variable number of rows, ensure that the training dataset also represents documents with different number of rows.
45+
* **Tables**: For documents containing tables with a variable number of rows, ensure that the training dataset also represents documents with different numbers of rows.
4246

43-
* Multi page tables - When tables span multiple pages, label a single table. Add documents to the training dataset with the expected variations represented, documents with the table on a single page only, documents with the table spanning two or more pages with all the rows labeled.
47+
* **Multi page tables**: When tables span multiple pages, label a single table. Add documents to the training dataset with the expected variations represented: documents with the table on a single page only, and documents with the table spanning two or more pages with all the rows labeled.
4448

45-
* Optional fields - If your documents contain documents with optional fields, validate that the training dataset has a few documents with the optionality represented.
49+
* **Optional fields**: If your dataset contains documents with optional fields, validate that the training dataset has a few documents with the options represented.
4650

4751
## Start by identifying the fields
4852

49-
Take the time to identify each of the fields you plan to label in the dataset, paying attention to optional fields. Define the fields with the type that best matches the supported types.
53+
Take the time to identify each of the fields you plan to label in the dataset. Pay attention to optional fields. Define the fields with the labels that best match the supported types.
54+
55+
Use the following guidelines to define the fields:
56+
57+
* For custom neural models, use semantically relevant names for fields. For example, if the value being extracted is `Effective Date`, name it `effective_date` or `EffectiveDate`, not a generic name like `date1`.
5058

51-
Use the following guidelines to defining the fields:
59+
* Ideally, name your fields with Pascal or camel case.
5260

53-
* For custom neural models, use semantically relevant names for fields. As an example, if the value being extracted is `Effective Date`, name it `effective_date` or `EffectiveDate` not a generic name like `date1`
54-
* Ideally, name your field Pascal case, camel case.
55-
* If a value is part of a visually repeating structure and you only need a single value, label it as a table and extract the required value in post processing
56-
* For tabular fields spanning multiple pages, define and label as a single table
61+
* If a value is part of a visually repeating structure and you only need a single value, label it as a table and extract the required value during post-processing.
5762

58-
Custom neural models share the same labeling format and strategy as custom template models. Currently custom neural models only support a subset of the field types supported by custom template models.
63+
* For tabular fields spanning multiple pages, define and label the fields as a single table.
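The naming guidelines above can be applied as a quick review pass over a list of planned field names. The following is a minimal sketch under stated assumptions: `review_field_name` is a hypothetical helper, and the pattern used to detect "generic" names is an illustrative heuristic, not a rule from Form Recognizer.

```python
import re

# Pascal or camel case: letters and digits only, starting with a letter.
_CASE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")
# Assumption: a simple heuristic for generic names such as date1 or field2.
_GENERIC_PATTERN = re.compile(r"^(date|field|value|text|item)\d*$", re.IGNORECASE)


def review_field_name(name):
    """Return a list of naming warnings for a single field name."""
    warnings = []
    if not _CASE_PATTERN.match(name):
        warnings.append("not Pascal or camel case")
    if _GENERIC_PATTERN.match(name):
        warnings.append("generic name; prefer a semantically relevant one")
    return warnings


for candidate in ("EffectiveDate", "effective_date", "date1"):
    print(candidate, review_field_name(candidate))
```

In this sketch, `effective_date` only gets the casing warning. That matches the guidelines, which accept the name as semantically relevant while treating Pascal or camel case as the ideal.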
64+
65+
> [!NOTE]
66+
> Custom neural models share the same labeling format and strategy as custom template models. Currently custom neural models only support a subset of the field types supported by custom template models.
5967
6068
## Model capabilities
6169

62-
Custom neural models currently only support key-value pairs, structured fields (tables) and selection marks, future releases will include support for signature.
70+
Custom neural models currently only support key-value pairs, structured fields (tables), and selection marks. Future releases will include support for signatures.
6371

6472
| Model type | Form fields | Selection marks | Tabular fields | Signature | Region |
6573
|--|--|--|--|--|--|
@@ -72,20 +80,21 @@ Custom neural models currently only support key-value pairs, structured fields (
7280

7381
Tabular fields (tables) are supported with custom neural models starting with API version ```2022-06-30-preview```. Models trained with API version ```2022-06-30-preview``` or later accept tabular field labels. Documents analyzed with such a model, using API version ```2022-06-30-preview``` or later, produce tabular fields in the ```documents``` section of the ```analyzeResult``` object.
7482

75-
Tabular fields support **cross page tables** by default. To label a table that spans multiple pages, label each row of the table across the different pages in the single table. As a best practice, ensure that your dataset contains a few samples of the expected variations. For example, include samples where an entire table is on a single page and samples of a table spanning two or more pages.
83+
Tabular fields support **cross page tables** by default. To label a table that spans multiple pages, label each row of the table across the different pages as part of a single table. As a best practice, ensure that your dataset contains a few samples of the expected variations. For example, include both samples where an entire table is on a single page and samples of a table spanning two or more pages.
7684

77-
Tabular field is also useful when extracting repeating information within a document that isn't recognized as a table. For example, a repeating section of work experiences in a resume can be labeled and extracted as a tabular field.
85+
Tabular fields are also useful when extracting repeating information within a document that isn't recognized as a table. For example, a repeating section of work experiences in a resume can be labeled and extracted as a tabular field.
7886

7987
## Labeling guidelines
8088

81-
* Labeling the value is required; don't include the surrounding text. For example when labeling a checkbox, name the field to indicate the check box selection for example ```selectionYes``` and ```selectionNo``` rather than labeling the yes or no text in the document.
82-
* Non interleaving values - Value words/region of one field must be either
83-
- Consecutive sequence in natural reading order without interleaving with other fields or
84-
- In a region that doesn't cover any other fields
85-
* Consistent labeling - If a value appears in multiple contexts withing the document, consistently pick the same context across documents to label the value.
86-
* Tables support visually repeating groups of information not just explicit tables. Explicit tables will be identified in tables section of the analyzed documents as part of the layout output and don't need to be labeled as tables. Only label a table field if the information is visually repeating and not identified as a table as part of the layout response. An example would be the repeating work experience section of a resume.
87-
* Region labeling (custom template) allows you to define a value when none exists. If the value is optional, ensure that you leave a few sample documents with the region not labeled.
88-
* When labeling regions, don't include the surrounding text with the label.
89+
* **Labeling values is required.** Don't include the surrounding text. For example, when labeling a checkbox, name the field to indicate the check box selection, such as ```selectionYes``` and ```selectionNo```, rather than labeling the yes or no text in the document.
90+
91+
* **Don't provide interleaving field values.** The words or regions that make up the value of one field must be either a consecutive sequence in natural reading order that doesn't interleave with other fields, or in a region that doesn't cover any other fields.
92+
93+
* **Consistent labeling**. If a value appears in multiple contexts within the document, consistently pick the same context across documents to label the value.
94+
95+
* **Visually repeating data**. Tables support visually repeating groups of information, not just explicit tables. Explicit tables are identified in the tables section of the analyzed documents as part of the layout output and don't need to be labeled as tables. Only label a table field if the information is visually repeating and not identified as a table as part of the layout response. An example would be the repeating work experience section of a resume.
96+
97+
* **Region labeling (custom template)**. Labeling specific regions allows you to define a value when none exists. If the value is optional, ensure that you leave a few sample documents with the region not labeled. When labeling regions, don't include the surrounding text with the label.
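The interleaving guideline above can be illustrated with a small check. The following is a minimal sketch under a simplifying assumption: each labeled word is represented by its index in natural reading order, and `labels` maps a field name to the indices of its words. This representation and the function name are hypothetical and aren't the `{file}.labels.json` format. The sketch covers only the reading-order condition, not the region-based alternative.

```python
def find_interleaved_fields(labels):
    """Return the fields whose word span contains words labeled for another field.

    labels: dict mapping field name -> list of word indices in reading order
    (a hypothetical, simplified representation).
    """
    # Record which fields claim each word index.
    owners = {}
    for field, indices in labels.items():
        for index in indices:
            owners.setdefault(index, set()).add(field)

    interleaved = set()
    for field, indices in labels.items():
        if not indices:
            continue
        # Any word between the first and last word of this field that
        # belongs to a different field counts as interleaving.
        for index in range(min(indices), max(indices) + 1):
            if owners.get(index, set()) - {field}:
                interleaved.add(field)
                break
    return sorted(interleaved)


print(find_interleaved_fields({"Name": [0, 1], "Date": [2, 3]}))
print(find_interleaved_fields({"Name": [0, 2], "Date": [1]}))
```

The first call reports no fields, because each field's words form their own run. The second call reports `Name`, because a `Date` word sits between its first and last words.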
8998

9099
## Next steps
91100
