Commit 9f0a307

Merge pull request #221667 from laujan/55514-221106-vinod-pr
55514 221106 vinod pr
2 parents 90ea106 + 7fa7d49 commit 9f0a307

17 files changed

+490
-17
lines changed

articles/applied-ai-services/form-recognizer/concept-analyze-document-response.md

Lines changed: 275 additions & 0 deletions
Large diffs are not rendered by default.

articles/applied-ai-services/form-recognizer/concept-composed-models.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ manager: nitinme
77
ms.service: applied-ai-services
88
ms.subservice: forms-recognizer
99
ms.topic: conceptual
10-
ms.date: 10/20/2022
10+
ms.date: 12/15/2022
1111
ms.author: lajanuar
1212
recommendations: false
1313
---
@@ -41,12 +41,12 @@ With composed models, you can assign multiple custom models to a composed model
4141
4242
### Composed model compatibility
4343

44-
|Custom model type|Models trained with v2.1 and v2.0| Custom template models v3.0 |Custom neural models v3.0 |Custom neural models 3.0 (GA)|
44+
|Custom model type|Models trained with v2.1 and v2.0 | Custom template models v3.0 |Custom neural models v3.0 (preview) |Custom neural models 3.0 (GA)|
4545
|--|--|--|--|--|
4646
|**Models trained with version 2.1 and v2.0** |Supported|Supported|Not Supported|Not Supported|
4747
|**Custom template models v3.0** |Supported|Supported|Not Supported|Not Supported|
4848
|**Custom template models v3.0 (GA)** |Not Supported|Not Supported|Supported|Not Supported|
49-
|**Custom neural models v3.0**|Not Supported|Not Supported|Supported|Not Supported|
49+
|**Custom neural models v3.0 (preview)**|Not Supported|Not Supported|Supported|Not Supported|
5050
|**Custom Neural models v3.0 (GA)**|Not Supported|Not Supported|Not Supported|Supported|
5151

5252
* To compose a model trained with a prior version of the API (v2.1 or earlier), train a model with the v3.0 API using the same labeled dataset. That addition will ensure that the v2.1 model can be composed with other models.
Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
---
2+
title: Labeling tips for custom models in the Form Recognizer Studio
3+
titleSuffix: Azure Applied AI Services
4+
description: Label tips and tricks for Form Recognizer Studio
5+
author: laujan
6+
manager: nitinme
7+
ms.service: applied-ai-services
8+
ms.subservice: forms-recognizer
9+
ms.topic: conceptual
10+
ms.date: 12/15/2022
11+
ms.author: vikurpad
12+
ms.custom: references_regions
13+
recommendations: false
14+
---
15+
16+
# Tips for labeling custom model datasets
17+
18+
This article highlights the best methods for labeling custom model datasets in the Form Recognizer Studio. Labeling documents can be time consuming when you have a large number of labels, long documents, or documents with varying structure. These tips should help you label documents more efficiently.
19+
20+
## Search
21+
22+
The Studio now includes a search box for instances when you know you need to find specific words to label, but just don't know where they're located in the document. Simply search for the word or phrase and navigate to the specific section in the document to label the occurrence.
23+
24+
## Auto label tables
25+
26+
Tables can be challenging to label when they have many rows or dense text. If the layout table extracts the result you need, you should just use that result and skip the labeling process. In instances where the layout table isn't exactly what you need, you can start by generating the table field from the values that layout extracts. Select the table icon on the page, and then select the auto label button. You can then edit the values as needed. Auto label currently supports only single-page tables.
27+
28+
## Shift select
29+
30+
When labeling a large span of text, rather than mark each word in the span, hold down the shift key as you're selecting the words to speed up labeling and ensure you don't miss any words in the span of text.
31+
32+
## Region labeling
33+
34+
A second option for labeling larger spans of text is to use region labeling. When region labeling is used, the OCR results are populated in the value at training time. The difference between the shift select and region labeling is only in the visual feedback the shift labeling approach provides.
35+
36+
## Field subtypes
37+
38+
When creating a field, select the right subtype to minimize post-processing. For instance, select the ```dmy``` option for dates to extract the values in a ```dd-mm-yyyy``` format.
39+
40+
## Batch layout
41+
42+
When creating a project, select the batch layout option to prepare all documents in your dataset for labeling. This feature ensures that you no longer have to select each document and wait for the layout results before you can start labeling.
43+
44+
## Next steps
45+
46+
* Learn more about custom labeling:
47+
48+
> [!div class="nextstepaction"]
49+
> [Custom labels](concept-custom-label.md)
50+
51+
* Learn more about custom template models:
52+
53+
> [!div class="nextstepaction"]
54+
> [Custom template models](concept-custom-template.md)
55+
56+
* Learn more about custom neural models:
57+
58+
> [!div class="nextstepaction"]
59+
> [Custom neural models](concept-custom-neural.md)
Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
---
2+
title: Best practices for labeling documents in the Form Recognizer Studio
3+
titleSuffix: Azure Applied AI Services
4+
description: Label documents in the Studio to create a training dataset. Labeling guidelines aimed at training a model with high accuracy
5+
author: laujan
6+
manager: nitinme
7+
ms.service: applied-ai-services
8+
ms.subservice: forms-recognizer
9+
ms.topic: conceptual
10+
ms.date: 12/15/2022
11+
ms.author: vikurpad
12+
ms.custom: references_regions
13+
monikerRange: 'form-recog-3.0.0'
14+
recommendations: false
15+
---
16+
17+
# Best practices: Generating Form Recognizer labeled dataset
18+
19+
Custom models (template and neural) require a labeled dataset of at least five documents to train a model. The quality of the labeled dataset affects the accuracy of the trained model. This guide helps you learn more about generating a model with high accuracy by assembling a diverse dataset and provides best practices for labeling your documents.
20+
21+
## Understand the components of a labeled dataset
22+
23+
A labeled dataset consists of several files:
24+
25+
* You'll provide a set of sample documents (typically PDFs or images). A minimum of five documents is needed to train a model.
26+
27+
* Additionally, the labeling process will generate the following files:
28+
29+
* A `fields.json` file is created when the first field is added. There's one `fields.json` file for the entire training dataset. Its field list contains each field name with the associated sub fields and types.
30+
31+
* The Studio runs each of the documents through the [Layout API](concept-layout.md). The layout response for each of the sample files in the dataset is added as `{file}.ocr.json`. The layout response is used to generate the field labels when a specific span of text is labeled.
32+
33+
* A `{file}.labels.json` file is created or updated when a field is labeled in a document. The label file contains the spans of text and associated polygons from the layout output for each span of text the user adds as a value for a specific field.
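The file layout described above can be checked before training. The following is a minimal sketch, not part of the Studio or any SDK: the function name `check_labeled_dataset`, the set of document extensions, and the assumption that `{file}` stands for the full document file name (for example, `invoice.pdf.ocr.json`) are all illustrative and should be verified against your own dataset.

```python
from pathlib import Path

# Assumed set of sample document extensions; adjust to your dataset.
DOC_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def check_labeled_dataset(folder):
    """Return a list of problems found in a labeled dataset folder.

    Assumes the companion files sit next to each document and are named
    "<document file name>.ocr.json" and "<document file name>.labels.json".
    """
    root = Path(folder)
    problems = []

    # One fields.json is expected for the entire training dataset.
    if not (root / "fields.json").is_file():
        problems.append("missing fields.json")

    docs = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in DOC_SUFFIXES
    )
    # A minimum of five documents is needed to train a model.
    if len(docs) < 5:
        problems.append(
            f"only {len(docs)} sample documents; at least five are needed"
        )

    # Each document should have a layout result and a label file.
    for doc in docs:
        for suffix in (".ocr.json", ".labels.json"):
            if not (root / (doc.name + suffix)).is_file():
                problems.append(f"{doc.name}: missing {doc.name}{suffix}")
    return problems
```

An empty list means every sample document has both companion files and the folder holds at least five documents.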
34+
35+
## Create a balanced dataset
36+
37+
Before you start labeling, it's a good idea to look at a few different samples of the document to identify which samples you want to use in your labeled dataset. A balanced dataset represents all the typical variations you would expect to see for the document. Creating a balanced dataset will result in a model with the highest possible accuracy. A few examples to consider are:
38+
39+
* **Document formats**: If you expect to analyze both digital and scanned documents, add a few examples of each type to the training dataset.
40+
41+
* **Variations (template model)**: Consider splitting the dataset into folders and training a model for each variation. Variations that include either structure or layout should be split into different models. You can then compose the individual models into a single [composed model](concept-composed-models.md).
42+
43+
* **Variations (Neural models)**: When your dataset has a manageable set of variations, about 15 or fewer, create a single dataset with a few samples of each of the different variations to train a single model. If the number of template variations is larger than 15, you'll train multiple models and [compose](concept-composed-models.md) them together.
44+
45+
* **Tables**: For documents containing tables with a variable number of rows, ensure that the training dataset also represents documents with different numbers of rows.
46+
47+
* **Multi page tables**: When tables span multiple pages, label a single table. Add documents to the training dataset with the expected variations represented—documents with the table on a single page only and documents with the table spanning two or more pages with all the rows labeled.
48+
49+
* **Optional fields**: If your dataset contains documents with optional fields, validate that the training dataset has a few documents with the options represented.
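The guidance on variations can be turned into a rough planning helper. This is only a sketch under stated assumptions: the function `plan_training` is hypothetical, the limit of 15 is taken from the "about 15 or fewer" guidance above and is not a hard service limit, and splitting a larger set of variations into groups of at most 15 is one possible way to arrive at multiple neural models, not a documented rule.

```python
# "About 15 or fewer" variations per neural model, per the guidance above.
NEURAL_VARIATION_LIMIT = 15


def plan_training(variations, model_type):
    """Group variation names into datasets, one dataset per model to train.

    - "template": one model per variation, to be composed afterwards.
    - "neural": a single model when there are 15 or fewer variations;
      otherwise groups of at most 15 (an assumed grouping strategy).
    """
    if model_type == "template":
        return [[name] for name in variations]
    if model_type == "neural":
        size = NEURAL_VARIATION_LIMIT
        return [variations[i:i + size] for i in range(0, len(variations), size)]
    raise ValueError(f"unknown model type: {model_type!r}")


# Example: three layout variations of the same form.
layouts = ["2020-layout", "2021-layout", "2022-layout"]
print(plan_training(layouts, "template"))  # three single-variation datasets
print(plan_training(layouts, "neural"))    # one dataset with all variations
```

When the plan contains more than one dataset, the resulting models would then be combined into a composed model.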
50+
51+
## Start by identifying the fields
52+
53+
Take the time to identify each of the fields you plan to label in the dataset. Pay attention to optional fields. Define the fields with the labels that best match the supported types.
54+
55+
Use the following guidelines to define the fields:
56+
57+
* For custom neural models, use semantically relevant names for fields. For example, if the value being extracted is `Effective Date`, name it `effective_date` or `EffectiveDate` not a generic name like **date1**.
58+
59+
* Ideally, name your fields with Pascal or camel case.
60+
61+
* If a value is part of a visually repeating structure and you only need a single value, label it as a table and extract the required value during post-processing.
62+
63+
* For tabular fields spanning multiple pages, define and label the fields as a single table.
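The naming guidelines above can be applied mechanically when field names are derived from the text that appears on the document. The helpers below are a minimal, illustrative sketch; `to_pascal_case` and `to_camel_case` are hypothetical names and aren't part of the Studio or any SDK.

```python
import re


def to_pascal_case(label):
    """Turn a label such as "Effective Date" into "EffectiveDate"."""
    # Split on anything that isn't a letter or digit (spaces, underscores, ...).
    words = re.findall(r"[A-Za-z0-9]+", label)
    # Capitalize the first character of each word and keep the rest as is.
    return "".join(word[:1].upper() + word[1:] for word in words)


def to_camel_case(label):
    """Turn a label such as "Effective Date" into "effectiveDate"."""
    pascal = to_pascal_case(label)
    return pascal[:1].lower() + pascal[1:]


print(to_pascal_case("Effective Date"))  # EffectiveDate
print(to_camel_case("effective_date"))   # effectiveDate
```

Either form keeps the name semantically relevant, which is what the guideline for custom neural models asks for.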
64+
65+
> [!NOTE]
66+
> Custom neural models share the same labeling format and strategy as custom template models. Currently custom neural models only support a subset of the field types supported by custom template models.
67+
68+
## Model capabilities
69+
70+
Custom neural models currently only support key-value pairs, structured fields (tables), and selection marks.
71+
72+
| Model type | Form fields | Selection marks | Tabular fields | Signature | Region |
73+
|--|--|--|--|--|--|
74+
| Custom neural | ✔️Supported | ✔️Supported | ✔️Supported | Unsupported | ✔️Supported<sup>1</sup> |
75+
| Custom template | ✔️Supported| ✔️Supported | ✔️Supported | ✔️Supported | ✔️Supported |
76+
77+
<sup>1</sup> Region labeling implementation differs between template and neural models. For template models, the training process injects synthetic data at training time if no text is found in the region labeled. With neural models, no synthetic text is injected and the recognized text is used as is.
78+
79+
## Tabular fields
80+
81+
Tabular fields (tables) are supported with custom neural models starting with API version ```2022-06-30-preview```. Models trained with API version ```2022-06-30-preview``` or later accept tabular field labels. Documents analyzed with such a model, using API version ```2022-06-30-preview``` or later, produce tabular fields in the ```documents``` section of the ```analyzeResult``` object.
82+
83+
Tabular fields support **cross page tables** by default. To label a table that spans multiple pages, label each row of the table across the different pages in the single table. As a best practice, ensure that your dataset contains a few samples of the expected variations. For example, include both samples where an entire table is on a single page and samples of a table spanning two or more pages.
84+
85+
Tabular fields are also useful when extracting repeating information within a document that isn't recognized as a table. For example, a repeating section of work experiences in a resume can be labeled and extracted as a tabular field.
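As an illustration of where tabular fields land in the output, the sketch below walks the `documents` section of an `analyzeResult` object and flattens one tabular field into rows. The key names used here (`fields`, `type`, `valueArray`, `valueObject`, `content`) are assumptions about the shape of the v3.0 JSON response, and `tabular_rows`, `WorkExperience`, and the sample values are hypothetical; check them against an actual response before relying on them.

```python
def tabular_rows(result, field_name):
    """Collect the rows of one tabular field as a list of dictionaries.

    `result` is the parsed JSON response; the tabular field is assumed to be
    an array whose items are objects keyed by column name.
    """
    rows = []
    for doc in result.get("analyzeResult", {}).get("documents", []):
        field = doc.get("fields", {}).get(field_name)
        if not field or field.get("type") != "array":
            continue
        for item in field.get("valueArray", []):
            cells = item.get("valueObject", {})
            # Keep the recognized text of each cell, keyed by column name.
            rows.append({name: cell.get("content") for name, cell in cells.items()})
    return rows


# Hand-written sample shaped like the assumed response, for illustration only.
sample = {
    "analyzeResult": {
        "documents": [
            {
                "fields": {
                    "WorkExperience": {
                        "type": "array",
                        "valueArray": [
                            {
                                "type": "object",
                                "valueObject": {
                                    "Employer": {"type": "string", "content": "Contoso"},
                                    "Title": {"type": "string", "content": "Engineer"},
                                },
                            }
                        ],
                    }
                }
            }
        ]
    }
}

print(tabular_rows(sample, "WorkExperience"))
```

The same traversal applies to the resume example: each repeating work experience entry becomes one row.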
86+
87+
## Labeling guidelines
88+
89+
* **Labeling values is required.** Don't include the surrounding text. For example, when labeling a checkbox, name the field to indicate the checkbox selection, such as ```selectionYes``` and ```selectionNo```, rather than labeling the yes or no text in the document.
90+
91+
* **Don't provide interleaving field values.** The words and/or regions that make up the value of one field must either form a consecutive sequence in natural reading order, without interleaving with other fields, or lie in a region that doesn't cover any other fields.
92+
93+
* **Consistent labeling**. If a value appears in multiple contexts within the document, consistently pick the same context across documents to label the value.
94+
95+
* **Visually repeating data**. Tables support visually repeating groups of information, not just explicit tables. Explicit tables will be identified in the tables section of the analyzed documents as part of the layout output and don't need to be labeled as tables. Only label a table field if the information is visually repeating and not identified as a table as part of the layout response. An example would be the repeating work experience section of a resume.
96+
97+
* **Region labeling (custom template)**. Labeling specific regions allows you to define a value when none exists. If the value is optional, ensure that you leave a few sample documents with the region not labeled. When labeling regions, don't include the surrounding text with the label.
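The guideline on interleaving field values can be checked with a small script. This is a sketch under simplifying assumptions: labels are represented as a hypothetical mapping from field name to the reading-order indices of the labeled words (not the `labels.json` format), and only the word-sequence case is checked, not the region alternative.

```python
def find_interleaved_fields(labels):
    """Return (field, intruder) pairs where words of `intruder` fall inside
    the reading-order span covered by `field`.

    `labels` maps a field name to the word indices (in natural reading
    order) that were labeled as that field's value.
    """
    conflicts = []
    names = sorted(labels)
    for field in names:
        if not labels[field]:
            continue
        first, last = min(labels[field]), max(labels[field])
        for other in names:
            if other == field:
                continue
            # Any word of another field inside this span means interleaving.
            if any(first <= index <= last for index in labels[other]):
                conflicts.append((field, other))
    return conflicts


# "Date" (word 5) sits between the words labeled as "Name" (3, 4, and 6).
print(find_interleaved_fields({"Name": [3, 4, 6], "Date": [5]}))
# Clean labeling: each field covers its own consecutive run of words.
print(find_interleaved_fields({"Name": [3, 4], "Date": [5, 6]}))
```

The first call reports `("Name", "Date")` as a conflict; the second reports none.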
98+
99+
## Next steps
100+
101+
* Train a custom model:
102+
103+
> [!div class="nextstepaction"]
104+
> [How to train a model](how-to-guides/build-custom-model-v3.md)
105+
106+
* Learn more about custom template models:
107+
108+
> [!div class="nextstepaction"]
109+
> [Custom template models](concept-custom-template.md)
110+
111+
* Learn more about custom neural models:
112+
113+
> [!div class="nextstepaction"]
114+
> [Custom neural models](concept-custom-neural.md)
115+
116+
* View the REST API:
117+
118+
> [!div class="nextstepaction"]
119+
> [Form Recognizer API v3.0](https://westus.dev.cognitive.microsoft.com/docs/services/form-recognizer-api-v3-0-preview-2/operations/AnalyzeDocument)

articles/applied-ai-services/form-recognizer/concept-custom-neural.md

Lines changed: 13 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ manager: nitinme
77
ms.service: applied-ai-services
88
ms.subservice: forms-recognizer
99
ms.topic: conceptual
10-
ms.date: 12/02/2022
10+
ms.date: 12/15/2022
1111
ms.author: lajanuar
1212
ms.custom: references_regions
1313
monikerRange: 'form-recog-3.0.0'
@@ -30,11 +30,13 @@ Custom neural models share the same labeling format and strategy as [custom temp
3030

3131
## Model capabilities
3232

33-
Custom neural models currently only support key-value pairs and selection marks, future releases will include support for structured fields (tables) and signature.
33+
Custom neural models currently only support key-value pairs, selection marks, and structured fields (tables). Future releases will include support for signatures.
3434

3535
| Form fields | Selection marks | Tabular fields | Signature | Region |
3636
|:--:|:--:|:--:|:--:|:--:|
37-
| Supported | Supported | Supported | Unsupported | Unsupported |
37+
| Supported | Supported | Supported | Unsupported | Supported <sup>1</sup> |
38+
39+
<sup>1</sup> Region labels in custom neural models will use the results from the Layout API for the specified region. This feature is different from template models where, if no value is present, text is generated at training time.
3840

3941
### Build mode
4042

@@ -59,24 +61,30 @@ Tabular fields are also useful when extracting repeating information within a do
5961

6062
## Supported regions
6163

62-
As of September 16, 2022, Form Recognizer custom neural model training will only be available in the following Azure regions until further notice:
64+
As of October 18, 2022, Form Recognizer custom neural model training will only be available in the following Azure regions until further notice:
6365

6466
* Australia East
6567
* Brazil South
6668
* Canada Central
6769
* Central India
6870
* Central US
6971
* East Asia
72+
* East US
73+
* East US2
7074
* France Central
7175
* Japan East
7276
* South Central US
7377
* Southeast Asia
7478
* UK South
7579
* West Europe
7680
* West US2
81+
* US Gov Arizona
82+
* US Gov Virginia
83+
84+
7785

7886
> [!TIP]
79-
> You can [copy a model](disaster-recovery.md#copy-api-overview) trained in one of the select regions listed above to **any other region** and use it accordingly.
87+
> You can [copy a model](disaster-recovery.md#copy-api-overview) trained in one of the select regions listed to **any other region** and use it accordingly.
8088
>
8189
> Use the [**REST API**](https://westus.dev.cognitive.microsoft.com/docs/services/form-recognizer-api-2022-08-31/operations/CopyDocumentModelTo) or [**Form Recognizer Studio**](https://formrecognizer.appliedai.azure.com/studio/custommodel/projects) to copy a model to another region.
8290