
Commit 8a42b68

Merge pull request #229853 from laujan/229508-vinod-feb-2023-release
update files for 2-28-2023 release
2 parents f2d0cc7 + 1507e63 commit 8a42b68

24 files changed

+607
-242
lines changed

articles/applied-ai-services/form-recognizer/concept-composed-models.md

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ manager: nitinme
77
ms.service: applied-ai-services
88
ms.subservice: forms-recognizer
99
ms.topic: conceptual
10-
ms.date: 02/28/2023
10+
ms.date: 03/08/2023
1111
ms.author: lajanuar
1212
recommendations: false
1313
---
@@ -38,7 +38,11 @@ With composed models, you can assign multiple custom models to a composed model
3838

3939
* For ```Custom neural``` models, the best practice is to add all the different variations of a single document type into a single training dataset and train a single custom neural model. Model compose is best suited for scenarios where documents of different types are submitted for analysis.
4040

41-
* Pricing is the same whether you're using a composed model or selecting a specific model. One model analyzes each document. With composed models, the system performs a classification to check which of the composed custom models should be invoked and invokes the single best model for the document.
41+
::: moniker-end
42+
43+
::: moniker range="form-recog-3.0.0"
44+
45+
With the introduction of [**custom classifier models**](./concept-custom-classifier.md), you can choose to use [**composed models**](./concept-composed-models.md) or the classifier model as an explicit step before analysis. For a deeper understanding of when to use a classifier or composed model, _see_ [**Custom classifier models**](concept-custom-classifier.md).
4246

4347
## Compose model limits
4448

@@ -57,7 +61,7 @@ With composed models, you can assign multiple custom models to a composed model
5761

5862
* To compose a model trained with a prior version of the API (v2.1 or earlier), train a model with the v3.0 API using the same labeled dataset. That addition ensures that the v2.1 model can be composed with other models.
5963

60-
* Models composed with v2.1 of the API continue to be supported, requiring no updates.
64+
* Models composed with v2.1 of the API continue to be supported, requiring no updates.
6165

6266
* The limit for maximum number of custom models that can be composed is 100.
6367

@@ -90,4 +94,4 @@ Learn to create and compose custom models:
9094

9195
> [!div class="nextstepaction"]
9296
> [**Build a custom model**](how-to-guides/build-a-custom-model.md)
93-
> [**Compose custom models**](how-to-guides/compose-custom-models.md)
97+
> [**Compose custom models**](how-to-guides/compose-custom-models.md)
Lines changed: 135 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,135 @@
1+
---
2+
title: Custom classifier model - Form Recognizer
3+
titleSuffix: Azure Applied AI Services
4+
description: Use the custom classifier model to train a model to identify and split the documents you process within your application.
5+
author: vkurpad
6+
manager: nitinme
7+
ms.service: applied-ai-services
8+
ms.subservice: forms-recognizer
9+
ms.topic: conceptual
10+
ms.date: 03/08/2023
11+
ms.author: lajanuar
12+
ms.custom: references_regions
13+
monikerRange: 'form-recog-3.0.0'
14+
recommendations: false
15+
---
16+
17+
# Custom classifier model
18+
19+
**This article applies to:** ![Form Recognizer v3.0 checkmark](media/yes-icon.png) **Form Recognizer v3.0**.
20+
21+
Custom classifier models are deep-learning-model types that combine layout and language features to accurately detect and identify documents you process within your application. Custom classifier models can classify each page in an input file to identify the document(s) within and can also identify multiple documents or multiple instances of a single document within an input file.
22+
23+
## Model capabilities
24+
25+
Custom classifier models can analyze single- or multi-file documents to identify whether any of the trained document types are contained within an input file. Here are the currently supported scenarios:
26+
27+
* A single file containing one document. For instance, a loan application form.
28+
29+
* A single file containing multiple documents. For instance, a loan application package containing a loan application form, payslip, and bank statement.
30+
31+
* A single file containing multiple instances of the same document. For instance, a collection of scanned invoices.
32+
33+
Training a custom classifier model requires at least two distinct classes and a minimum of five samples per class.
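
The two minimums above (two classes, five samples per class) can be checked locally before a training request is submitted. The following is a minimal sketch, not part of the service or its SDK; the function name `check_training_set` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, List

MIN_CLASSES = 2            # from the article: at least two distinct classes
MIN_SAMPLES_PER_CLASS = 5  # from the article: at least five samples per class


def check_training_set(samples_by_class: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means both minimums are met.

    `samples_by_class` is a hypothetical mapping of class name -> sample file paths.
    """
    problems = []
    if len(samples_by_class) < MIN_CLASSES:
        problems.append(
            f"need at least {MIN_CLASSES} classes, found {len(samples_by_class)}"
        )
    for doc_type, files in sorted(samples_by_class.items()):
        if len(files) < MIN_SAMPLES_PER_CLASS:
            problems.append(
                f"class '{doc_type}' has {len(files)} samples, "
                f"needs at least {MIN_SAMPLES_PER_CLASS}"
            )
    return problems
```

Running this check first gives a readable error list instead of a rejected training request.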
34+
35+
### Compare custom classifier and composed models
36+
37+
A custom classifier model can replace [a composed model](concept-composed-models.md) in some scenarios, but there are a few differences to be aware of:
38+
39+
| Capability | Custom classifier process | Composed model process |
40+
|--|--|--|
41+
|Analyze a single document of unknown type belonging to one of the types trained for extraction model processing.| &#9679; Requires multiple calls. </br> &#9679; Call the classifier models based on the document class. This step allows for a confidence-based check before invoking the extraction model analysis.</br> &#9679; Invoke the extraction model. | &#9679; Requires a single call to a composed model containing the model corresponding to the input document type. |
42+
|Analyze a single document of unknown type belonging to several types trained for extraction model processing.| &#9679; Requires multiple calls.</br> &#9679; Make a call to the classifier that ignores documents not matching a designated type for extraction.</br> &#9679; Invoke the extraction model. | &#9679; Requires a single call to a composed model. The service selects a custom model within the composed model with the highest match.</br> &#9679; A composed model can't ignore documents.|
43+
|Analyze a file containing multiple documents of known or unknown type belonging to one of the types trained for extraction model processing.| &#9679; Requires multiple calls. </br> &#9679; Call the extraction model for each identified document in the input file.</br> &#9679; Invoke the extraction model. | &#9679; Requires a single call to a composed model.</br> &#9679; The composed model invokes the component model once on the first instance of the document. </br> &#9679; The remaining documents are ignored. |
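
The classifier column of the table describes a two-step flow: classify, optionally check confidence, then invoke the matching extraction model. A rough sketch of that control flow follows. It makes no service calls; `classify` and the `extractors` callables are hypothetical stand-ins for whatever client code you use, and the `0.8` threshold is an arbitrary assumption, not a documented default.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical signatures: a classifier returns (doc_type, confidence),
# an extractor returns the extracted fields for one document.
Classify = Callable[[bytes], Tuple[str, float]]
Extract = Callable[[bytes], dict]


def classify_then_extract(
    document: bytes,
    classify: Classify,
    extractors: Dict[str, Extract],
    min_confidence: float = 0.8,  # assumed threshold, tune for your data
) -> Optional[dict]:
    """Call the classifier first, then the extraction model for the detected type.

    Returns None when the confidence is too low or no extraction model is
    registered for the detected type, so the document is skipped.
    """
    doc_type, confidence = classify(document)
    if confidence < min_confidence or doc_type not in extractors:
        return None
    return extractors[doc_type](document)
```

The early return is what a composed model can't do in this comparison: the classifier path can decline to run any extraction model at all.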
44+
45+
## Language support
46+
47+
Classifier models currently only support English language documents.
48+
49+
## Best practices
50+
51+
Custom classifier models require a minimum of five samples per class to train. If the classes are similar, adding extra training samples improves model accuracy.
52+
53+
## Training a model
54+
55+
Custom classifier models are only available in the [v3.0 API](v3-migration-guide.md) starting with API version ```2023-02-28-preview```. [Form Recognizer Studio](https://formrecognizer.appliedai.azure.com/studio) provides a no-code user interface to interactively train a custom classifier.
56+
57+
When using the REST API, if you've organized your documents by folders, you can use the ```azureBlobSource``` property of the request to train a classifier model.
58+
59+
```rest
60+
https://{endpoint}/formrecognizer/documentClassifiers:build?api-version=2023-02-28-preview
61+
62+
{
63+
"classifierId": "demo2.1",
64+
"description": "",
65+
"docTypes": {
66+
"car-maint": {
67+
"azureBlobSource": {
68+
"containerUrl": "SAS URL to container",
69+
"prefix": "sample1/car-maint/"
70+
}
71+
},
72+
"cc-auth": {
73+
"azureBlobSource": {
74+
"containerUrl": "SAS URL to container",
75+
"prefix": "sample1/cc-auth/"
76+
}
77+
},
78+
"deed-of-trust": {
79+
"azureBlobSource": {
80+
"containerUrl": "SAS URL to container",
81+
"prefix": "sample1/deed-of-trust/"
82+
}
83+
}
84+
}
85+
}
86+
87+
```
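
When each class lives in its own folder, the request body above follows a regular pattern and can be generated. This is a minimal sketch that only builds the JSON; sending it (HTTP method, headers, authentication) is left out because the article doesn't show those details. The function name `build_classifier_request` and the `prefix_root` parameter are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable


def build_classifier_request(
    classifier_id: str,
    container_sas_url: str,
    prefix_root: str,
    doc_types: Iterable[str],
) -> Dict:
    """Build a body shaped like the example above: one azureBlobSource per class,
    assuming each class's samples sit under '<prefix_root>/<class name>/'."""
    doc_type_map = {
        name: {
            "azureBlobSource": {
                "containerUrl": container_sas_url,
                "prefix": f"{prefix_root.rstrip('/')}/{name}/",
            }
        }
        for name in doc_types
    }
    return {"classifierId": classifier_id, "description": "", "docTypes": doc_type_map}


body = build_classifier_request(
    "demo2.1",
    "SAS URL to container",  # placeholder, as in the article
    "sample1",
    ["car-maint", "cc-auth", "deed-of-trust"],
)
print(json.dumps(body, indent=2))
```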
88+
89+
Alternatively, if you have a flat list of files or only plan to use a few select files within each folder to train the model, you can use the ```azureBlobFileListSource``` property to train the model. This step requires a ```file list``` in [JSON Lines](https://jsonlines.org/) format. For each class, add a new file with a list of files to be submitted for training.
90+
91+
```rest
92+
{
93+
"classifierId": "demo2",
94+
"description": "",
95+
"docTypes": {
96+
"car-maint": {
97+
"azureBlobFileListSource": {
98+
"containerUrl": "SAS URL to container",
99+
"fileList": "sample1/car-maint.jsonl"
100+
}
101+
},
102+
"cc-auth": {
103+
"azureBlobFileListSource": {
104+
"containerUrl": "SAS URL to container",
105+
"fileList": "sample1/cc-auth.jsonl"
106+
}
107+
},
108+
"deed-of-trust": {
109+
"azureBlobFileListSource": {
110+
"containerUrl": "SAS URL to container",
111+
"fileList": "sample1/deed-of-trust.jsonl"
112+
}
113+
}
114+
}
115+
}
116+
117+
```
118+
119+
File list `car-maint.jsonl` contains the following files.
120+
121+
```json
122+
{"file":"sample1/car-maint/Commercial Motor Vehicle - Adatum.pdf"}
123+
{"file":"sample1/car-maint/Commercial Motor Vehicle - Fincher.pdf"}
124+
{"file":"sample1/car-maint/Commercial Motor Vehicle - Lamna.pdf"}
125+
{"file":"sample1/car-maint/Commercial Motor Vehicle - Liberty.pdf"}
126+
{"file":"sample1/car-maint/Commercial Motor Vehicle - Trey.pdf"}
127+
```
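
A file list in this shape can be produced with the standard `json` module. The sketch below assumes only what the example shows: one JSON object per line with a single `file` key. The helper name `to_file_list_jsonl` is hypothetical.

```python
import json
from typing import Iterable


def to_file_list_jsonl(blob_paths: Iterable[str]) -> str:
    """Serialize blob paths as JSON Lines: one {"file": ...} object per line."""
    return "".join(
        json.dumps({"file": path}, separators=(",", ":")) + "\n"
        for path in blob_paths
    )


lines = to_file_list_jsonl(
    [
        "sample1/car-maint/Commercial Motor Vehicle - Adatum.pdf",
        "sample1/car-maint/Commercial Motor Vehicle - Fincher.pdf",
    ]
)
print(lines, end="")
```

Using `json.dumps` rather than string formatting keeps quoting and escaping correct if a file name contains quotes or backslashes.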
128+
129+
## Next steps
130+
131+
Learn to create custom classifier models:
132+
133+
> [!div class="nextstepaction"]
134+
> [**Build a custom classifier model**](how-to-guides/build-a-custom-classifier.md)
135+
> [**Custom models overview**](concept-custom.md)

articles/applied-ai-services/form-recognizer/concept-custom-label-tips.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ This article highlights the best methods for labeling custom model datasets in t
2121

2222
* The following video is the second of two presentations intended to help you build custom models with higher accuracy (the first presentation explores [How to create a balanced data set](concept-custom-label.md#video-custom-label-tips-and-pointers)).
2323

24-
* Here, we'll examine best practices for labeling your selected documents. With semantically relevant and consistent labeling, you should see an improvement in model performance.</br></br>
24+
* Here, we examine best practices for labeling your selected documents. With semantically relevant and consistent labeling, you should see an improvement in model performance.</br></br>
2525

2626
> [!VIDEO https://www.microsoft.com/en-us/videoplayer/embed/RE5fZKB ]
2727

articles/applied-ai-services/form-recognizer/concept-custom-label.md

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ manager: nitinme
77
ms.service: applied-ai-services
88
ms.subservice: forms-recognizer
99
ms.topic: conceptual
10-
ms.date: 01/30/2023
10+
ms.date: 03/09/2023
1111
ms.author: vikurpad
1212
ms.custom: references_regions
1313
monikerRange: 'form-recog-3.0.0'
@@ -22,9 +22,9 @@ Custom models (template and neural) require a labeled dataset of at least five d
2222

2323
A labeled dataset consists of several files:
2424

25-
* You'll provide a set of sample documents (typically PDFs or images). A minimum of five documents is needed to train a model.
25+
* You provide a set of sample documents (typically PDFs or images). A minimum of five documents is needed to train a model.
2626

27-
* Additionally, the labeling process will generate the following files:
27+
* Additionally, the labeling process generates the following files:
2828

2929
* A `fields.json` file is created when the first field is added. There's one `fields.json` file for the entire training dataset. The field list contains the field name and associated sub fields and types.
3030

@@ -36,19 +36,19 @@ A labeled dataset consists of several files:
3636

3737
* The following video is the first of two presentations intended to help you build custom models with higher accuracy (The second presentation examines [Best practices for labeling documents](concept-custom-label-tips.md#video-custom-labels-best-practices)).
3838

39-
* Here, we'll explore how to create a balanced data set and select the right documents to label. This process will set you on the path to higher quality models.</br></br>
39+
* Here, we explore how to create a balanced data set and select the right documents to label. This process sets you on the path to higher quality models.</br></br>
4040

4141
> [!VIDEO https://www.microsoft.com/en-us/videoplayer/embed/RWWHru]
4242
4343
## Create a balanced dataset
4444

45-
Before you start labeling, it's a good idea to look at a few different samples of the document to identify which samples you want to use in your labeled dataset. A balanced dataset represents all the typical variations you would expect to see for the document. Creating a balanced dataset will result in a model with the highest possible accuracy. A few examples to consider are:
45+
Before you start labeling, it's a good idea to look at a few different samples of the document to identify which samples you want to use in your labeled dataset. A balanced dataset represents all the typical variations you would expect to see for the document. Creating a balanced dataset results in a model with the highest possible accuracy. A few examples to consider are:
4646

4747
* **Document formats**: If you expect to analyze both digital and scanned documents, add a few examples of each type to the training dataset
4848

4949
* **Variations (template model)**: Consider splitting the dataset into folders and training a model for each variation. Any variations that include either structure or layout should be split into different models. You can then compose the individual models into a single [composed model](concept-composed-models.md).
5050

51-
* **Variations (Neural models)**: When your dataset has a manageable set of variations, about 15 or fewer, create a single dataset with a few samples of each of the different variations to train a single model. If the number of template variations is larger than 15, you'll train multiple models and [compose](concept-composed-models.md) them together.
51+
* **Variations (Neural models)**: When your dataset has a manageable set of variations, about 15 or fewer, create a single dataset with a few samples of each of the different variations to train a single model. If the number of template variations is larger than 15, you train multiple models and [compose](concept-composed-models.md) them together.
5252

5353
* **Tables**: For documents containing tables with a variable number of rows, ensure that the training dataset also represents documents with different numbers of rows.
5454

@@ -70,12 +70,12 @@ Use the following guidelines to define the fields:
7070

7171
* For tabular fields spanning multiple pages, define and label the fields as a single table.
7272

73-
. [!NOTE]
73+
> [!NOTE]
7474
> Custom neural models share the same labeling format and strategy as custom template models. Currently custom neural models only support a subset of the field types supported by custom template models.
7575
7676
## Model capabilities
7777

78-
Custom neural models currently only support key-value pairs, structured fields (tables), and selection marks.
78+
Custom neural models currently only support key-value pairs, structured fields (tables), and selection marks.
7979

8080
| Model type | Form fields | Selection marks | Tabular fields | Signature | Region |
8181
|--|--|--|--|--|--|
@@ -100,7 +100,7 @@ Tabular fields are also useful when extracting repeating information within a do
100100

101101
* **Consistent labeling**. If a value appears in multiple contexts within the document, consistently pick the same context across documents to label the value.
102102

103-
* **Visually repeating data**. Tables support visually repeating groups of information not just explicit tables. Explicit tables will be identified in tables section of the analyzed documents as part of the layout output and don't need to be labeled as tables. Only label a table field if the information is visually repeating and not identified as a table as part of the layout response. An example would be the repeating work experience section of a resume.
103+
* **Visually repeating data**. Tables support visually repeating groups of information, not just explicit tables. Explicit tables are identified in the tables section of the analyzed documents as part of the layout output and don't need to be labeled as tables. Only label a table field if the information is visually repeating and not identified as a table as part of the layout response. An example would be the repeating work experience section of a resume.
104104

105105
* **Region labeling (custom template)**. Labeling specific regions allows you to define a value when none exists. If the value is optional, ensure that you leave a few sample documents with the region not labeled. When labeling regions, don't include the surrounding text with the label.
106106

articles/applied-ai-services/form-recognizer/concept-custom-neural.md

Lines changed: 19 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ manager: nitinme
77
ms.service: applied-ai-services
88
ms.subservice: forms-recognizer
99
ms.topic: conceptual
10-
ms.date: 12/15/2022
10+
ms.date: 03/08/2023
1111
ms.author: lajanuar
1212
ms.custom: references_regions
1313
monikerRange: 'form-recog-3.0.0'
@@ -30,19 +30,32 @@ Custom neural models share the same labeling format and strategy as [custom temp
3030

3131
## Model capabilities
3232

33-
Custom neural models currently only support key-value pairs and selection marks and structured fields (tables), future releases will include support for signatures.
33+
Custom neural models currently support only key-value pairs, selection marks, and structured fields (tables). Support for signatures is planned for future releases.
3434

3535
| Form fields | Selection marks | Tabular fields | Signature | Region |
3636
|:--:|:--:|:--:|:--:|:--:|
3737
| Supported | Supported | Supported | Unsupported | Supported <sup>1</sup> |
3838

39-
<sup>1</sup> Region labels in custom neural models will use the results from the Layout API for specified region. This feature is different from template models where, if no value is present, text is generated at training time.
39+
<sup>1</sup> Region labels in custom neural models use the results from the Layout API for the specified region. This feature is different from template models where, if no value is present, text is generated at training time.
4040

4141
### Build mode
4242

4343
The build custom model operation has added support for the *template* and *neural* custom models. Previous versions of the REST API and SDKs only supported a single build mode that is now known as the *template* mode.
4444

45-
Neural models support documents that have the same information, but different page structures. Examples of these documents include United States W2 forms, which share the same information, but may vary in appearance across companies. Neural models currently only support English text. For more information, *see* [Custom model build mode](concept-custom.md#build-mode).
45+
Neural models support documents that have the same information, but different page structures. Examples of these documents include United States W2 forms, which share the same information, but may vary in appearance across companies. For more information, *see* [Custom model build mode](concept-custom.md#build-mode).
46+
47+
## Language support
48+
49+
1. Neural models now support added languages in the ```2023-02-28-preview``` API.
50+
51+
| Languages | API version |
52+
|:--:|:--:|
53+
| English | `2022-08-31` (GA), `2023-02-28-preview`|
54+
| German | `2023-02-28-preview`|
55+
| Italian | `2023-02-28-preview`|
56+
| French | `2023-02-28-preview`|
57+
| Spanish | `2023-02-28-preview`|
58+
| Dutch | `2023-02-28-preview`|
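
If client code needs to choose an API version based on document language, the table can be mirrored as a lookup. This is only a sketch that transcribes the rows above; the names `NEURAL_LANGUAGE_SUPPORT` and `is_supported` are made up for illustration, and the data would need updating as the service changes.

```python
# Transcribed from the table above: language -> API versions that support it.
NEURAL_LANGUAGE_SUPPORT = {
    "English": {"2022-08-31", "2023-02-28-preview"},
    "German": {"2023-02-28-preview"},
    "Italian": {"2023-02-28-preview"},
    "French": {"2023-02-28-preview"},
    "Spanish": {"2023-02-28-preview"},
    "Dutch": {"2023-02-28-preview"},
}


def is_supported(language: str, api_version: str) -> bool:
    """Return True if the table lists `language` for `api_version`."""
    return api_version in NEURAL_LANGUAGE_SUPPORT.get(language, set())
```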
4659

4760
## Tabular fields
4861

@@ -98,7 +111,7 @@ Custom neural models can generalize across different formats of a single documen
98111

99112
### Field naming
100113

101-
When you label the data, labeling the field relevant to the value will improve the accuracy of the key-value pairs extracted. For example, for a field value containing the supplier ID, consider naming the field "supplier_id". Field names should be in the language of the document.
114+
When you label the data, labeling the field relevant to the value improves the accuracy of the key-value pairs extracted. For example, for a field value containing the supplier ID, consider naming the field "supplier_id". Field names should be in the language of the document.
102115

103116
### Labeling contiguous values
104117

@@ -114,7 +127,7 @@ Values in training cases should be diverse and representative. For example, if a
114127
## Current Limitations
115128

116129
* The model doesn't recognize values split across page boundaries.
117-
* Custom neural models are only trained in English and model performance will be lower for documents in other languages.
130+
* Custom neural models are only trained in English. Model performance is lower for documents in other languages.
118131
* If a dataset labeled for custom template models is used to train a custom neural model, the unsupported field types are ignored.
119132
* Custom neural models are limited to 10 build operations per month. Open a support request if you need the limit increased.
120133
