Commit 2680bd6

Support more doc formats with new documentintelligence SDK (#1224)
* Support more doc formats with new documentintelligence SDK
* Location picker for Document Intelligence
* Move comment up
* Add other data types
* Add section on reusing Doc Intelligence
* Rename to Doc Intel everywhere
1 parent f90c660 commit 2680bd6

11 files changed, +81 -57 lines changed

README.md

Lines changed: 10 additions & 0 deletions
@@ -209,6 +209,16 @@ You can also customize the search service (new or existing) for non-English sear
 1. To turn off the spell checker, run `azd env set AZURE_SEARCH_QUERY_SPELLER none`. Consult [this table](https://learn.microsoft.com/rest/api/searchservice/preview-api/search-documents#queryLanguage) to determine if spell checker is supported for your query language.
 1. To configure the name of the analyzer to use for a searchable text field to a value other than "en.microsoft", run `azd env set AZURE_SEARCH_ANALYZER_NAME {Name of analyzer name}`. ([See other possible values](https://learn.microsoft.com/dotnet/api/microsoft.azure.search.models.field.analyzer?view=azure-dotnet-legacy&viewFallbackFrom=azure-dotnet))
 
+#### Existing Azure Document Intelligence resource
+
+In order to support analysis of many document formats, this repository uses a preview version of Azure Document Intelligence (formerly Form Recognizer) that is only available in [limited regions](https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-layout).
+If your existing resource is in one of those regions, then you can re-use it by setting the following environment variables:
+
+1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_SERVICE {Name of existing Azure AI Document Intelligence service}`
+1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_LOCATION {Location of existing service}`
+1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP {Name of resource group with existing service, defaults to main resource group}`
+1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_SKU {SKU of existing service, defaults to S0}`
+
 #### Other existing Azure resources
 
 You can also use existing Azure AI Document Intelligence and Storage Accounts. See `./infra/main.parameters.json` for list of environment variables to pass to `azd env set` to configure those existing resources.
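
The four `azd env set` steps added to the README can be sketched as a small helper that builds the commands and rejects a location outside the `@allowed` list in `infra/main.bicep`. This is only an illustration: the function name `azd_env_commands` and the service name `my-doc-intel` are hypothetical, and the helper prints commands rather than running them.

```python
import shlex

# Regions copied from the @allowed list on
# documentIntelligenceResourceGroupLocation in infra/main.bicep.
ALLOWED_LOCATIONS = ("eastus", "westus2", "westeurope")


def azd_env_commands(service, location, resource_group=None, sku=None):
    """Return the `azd env set` commands for reusing an existing resource.

    Hypothetical helper: optional values left as None are skipped so that
    the defaults described in the README apply.
    """
    if location not in ALLOWED_LOCATIONS:
        raise ValueError(f"{location!r} is not one of {ALLOWED_LOCATIONS}")
    settings = {
        "AZURE_DOCUMENTINTELLIGENCE_SERVICE": service,
        "AZURE_DOCUMENTINTELLIGENCE_LOCATION": location,
        "AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP": resource_group,
        "AZURE_DOCUMENTINTELLIGENCE_SKU": sku,
    }
    return [
        f"azd env set {name} {shlex.quote(value)}"
        for name, value in settings.items()
        if value is not None
    ]


if __name__ == "__main__":
    # "my-doc-intel" is a placeholder service name.
    for command in azd_env_commands("my-doc-intel", "eastus", sku="S0"):
        print(command)
```

Checking the location up front is a design choice of this sketch; it surfaces an unsupported region before a deployment is attempted.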

docs/deploy_lowcost.md

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ However, if your goal is to minimize costs while prototyping your application, f
 4. Use the free tier of Azure Document Intelligence (used in analyzing PDFs):
 
 ```shell
-azd env set AZURE_FORMRECOGNIZER_SKU F0
+azd env set AZURE_DOCUMENTINTELLIGENCE_SKU F0
 ```
 
 Limitation: The free tier will only scan the first two pages of each PDF.

infra/abbreviations.json

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@
     "cdnProfiles": "cdnp-",
     "cdnProfilesEndpoints": "cdne-",
     "cognitiveServicesAccounts": "cog-",
+    "cognitiveServicesDocumentIntelligence": "cog-di-",
     "cognitiveServicesFormRecognizer": "cog-fr-",
     "cognitiveServicesComputerVision": "cog-cv-",
     "cognitiveServicesTextAnalytics": "cog-ta-",

infra/main.bicep

Lines changed: 28 additions & 17 deletions
@@ -64,10 +64,20 @@ param openAiSkuName string = 'S0'
 param openAiApiKey string = ''
 param openAiApiOrganization string = ''
 
-param formRecognizerServiceName string = ''
-param formRecognizerResourceGroupName string = ''
-param formRecognizerResourceGroupLocation string = location
-param formRecognizerSkuName string = 'S0'
+param documentIntelligenceServiceName string = ''
+param documentIntelligenceResourceGroupName string = ''
+// Limited regions for new version:
+// https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-layout
+@description('Location for the Document Intelligence resource group')
+@allowed(['eastus', 'westus2', 'westeurope'])
+@metadata({
+  azd: {
+    type: 'location'
+  }
+})
+param documentIntelligenceResourceGroupLocation string
+
+param documentIntelligenceSkuName string = 'S0'
 
 param computerVisionServiceName string = ''
 param computerVisionResourceGroupName string = ''
@@ -139,8 +149,8 @@ resource openAiResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' exi
   name: !empty(openAiResourceGroupName) ? openAiResourceGroupName : resourceGroup.name
 }
 
-resource formRecognizerResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(formRecognizerResourceGroupName)) {
-  name: !empty(formRecognizerResourceGroupName) ? formRecognizerResourceGroupName : resourceGroup.name
+resource documentIntelligenceResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(documentIntelligenceResourceGroupName)) {
+  name: !empty(documentIntelligenceResourceGroupName) ? documentIntelligenceResourceGroupName : resourceGroup.name
 }
 
 resource computerVisionResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(computerVisionResourceGroupName)) {
@@ -320,16 +330,17 @@ module openAi 'core/ai/cognitiveservices.bicep' = if (openAiHost == 'azure') {
   }
 }
 
-module formRecognizer 'core/ai/cognitiveservices.bicep' = {
-  name: 'formrecognizer'
-  scope: formRecognizerResourceGroup
+// Formerly known as Form Recognizer
+module documentIntelligence 'core/ai/cognitiveservices.bicep' = {
+  name: 'documentintelligence'
+  scope: documentIntelligenceResourceGroup
   params: {
-    name: !empty(formRecognizerServiceName) ? formRecognizerServiceName : '${abbrs.cognitiveServicesFormRecognizer}${resourceToken}'
+    name: !empty(documentIntelligenceServiceName) ? documentIntelligenceServiceName : '${abbrs.cognitiveServicesDocumentIntelligence}${resourceToken}'
     kind: 'FormRecognizer'
-    location: formRecognizerResourceGroupLocation
+    location: documentIntelligenceResourceGroupLocation
     tags: tags
     sku: {
-      name: formRecognizerSkuName
+      name: documentIntelligenceSkuName
     }
   }
 }
@@ -442,9 +453,9 @@ module openAiRoleUser 'core/security/role.bicep' = if (openAiHost == 'azure') {
   }
 }
 
-module formRecognizerRoleUser 'core/security/role.bicep' = {
-  scope: formRecognizerResourceGroup
-  name: 'formrecognizer-role-user'
+module documentIntelligenceRoleUser 'core/security/role.bicep' = {
+  scope: documentIntelligenceResourceGroup
+  name: 'documentintelligence-role-user'
   params: {
     principalId: principalId
     roleDefinitionId: 'a97b65f3-24c7-4388-baec-2e87135dc908'
@@ -595,8 +606,8 @@ output AZURE_VISION_ENDPOINT string = useGPT4V ? computerVision.outputs.endpoint
 output VISION_SECRET_NAME string = useGPT4V ? computerVisionSecretName : ''
 output AZURE_KEY_VAULT_NAME string = useKeyVault ? keyVault.outputs.name : ''
 
-output AZURE_FORMRECOGNIZER_SERVICE string = formRecognizer.outputs.name
-output AZURE_FORMRECOGNIZER_RESOURCE_GROUP string = formRecognizerResourceGroup.name
+output AZURE_DOCUMENTINTELLIGENCE_SERVICE string = documentIntelligence.outputs.name
+output AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP string = documentIntelligenceResourceGroup.name
 
 output AZURE_SEARCH_INDEX string = searchIndexName
 output AZURE_SEARCH_SERVICE string = searchService.outputs.name

infra/main.parameters.json

Lines changed: 9 additions & 6 deletions
@@ -23,14 +23,17 @@
     "openAiSkuName": {
       "value": "S0"
     },
-    "formRecognizerServiceName": {
-      "value": "${AZURE_FORMRECOGNIZER_SERVICE}"
+    "documentIntelligenceServiceName": {
+      "value": "${AZURE_DOCUMENTINTELLIGENCE_SERVICE}"
     },
-    "formRecognizerResourceGroupName": {
-      "value": "${AZURE_FORMRECOGNIZER_RESOURCE_GROUP}"
+    "documentIntelligenceResourceGroupName": {
+      "value": "${AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP}"
     },
-    "formRecognizerSkuName": {
-      "value": "${AZURE_FORMRECOGNIZER_SKU=S0}"
+    "documentIntelligenceSkuName": {
+      "value": "${AZURE_DOCUMENTINTELLIGENCE_SKU=S0}"
+    },
+    "documentIntelligenceResourceGroupLocation": {
+      "value": "${AZURE_DOCUMENTINTELLIGENCE_LOCATION}"
     },
     "searchIndexName": {
       "value": "${AZURE_SEARCH_INDEX=gptkbindex}"

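The `${NAME}` and `${NAME=default}` placeholders in `infra/main.parameters.json` are filled from the azd environment at deploy time. The sketch below approximates that substitution in plain Python to show how `AZURE_DOCUMENTINTELLIGENCE_SKU` falls back to `S0`. It is not azd's implementation: the function name `resolve` is hypothetical, and treating an unset variable with no default as an empty string is an assumption of this sketch.

```python
import re

# Matches ${NAME} and ${NAME=default}.
_PLACEHOLDER = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?:=(?P<default>[^}]*))?\}"
)


def resolve(value, env):
    """Replace placeholders in `value` using the `env` mapping.

    Assumption: an unset variable without a default becomes "".
    """

    def substitute(match):
        default = match.group("default")
        fallback = default if default is not None else ""
        return env.get(match.group("name"), fallback)

    return _PLACEHOLDER.sub(substitute, value)


if __name__ == "__main__":
    print(resolve("${AZURE_DOCUMENTINTELLIGENCE_SKU=S0}", {}))
    print(resolve("${AZURE_DOCUMENTINTELLIGENCE_SKU=S0}",
                  {"AZURE_DOCUMENTINTELLIGENCE_SKU": "F0"}))
```

An empty service name is what lets `infra/main.bicep` fall back to the generated `cog-di-` name through its `!empty(documentIntelligenceServiceName)` check.
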
scripts/prepdocs.ps1

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ $argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
 "--openaihost `"$env:OPENAI_HOST`" --openaimodelname `"$env:AZURE_OPENAI_EMB_MODEL_NAME`" " + `
 "--openaiservice `"$env:AZURE_OPENAI_SERVICE`" --openaideployment `"$env:AZURE_OPENAI_EMB_DEPLOYMENT`" " + `
 "--openaikey `"$env:OPENAI_API_KEY`" --openaiorg `"$env:OPENAI_ORGANIZATION`" " + `
-"--formrecognizerservice $env:AZURE_FORMRECOGNIZER_SERVICE " + `
+"--documentintelligenceservice $env:AZURE_DOCUMENTINTELLIGENCE_SERVICE " + `
 "$searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg " + `
 "$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg " + `
 "$tenantArg $aclArg " + `

scripts/prepdocs.py

Lines changed: 18 additions & 8 deletions
@@ -65,16 +65,18 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
     doc_int_parser: DocumentAnalysisParser
 
     # check if Azure Document Intelligence credentials are provided
-    if args.formrecognizerservice is not None:
-        formrecognizer_creds: Union[AsyncTokenCredential, AzureKeyCredential] = (
-            credential if is_key_empty(args.formrecognizerkey) else AzureKeyCredential(args.formrecognizerkey)
+    if args.documentintelligenceservice is not None:
+        documentintelligence_creds: Union[AsyncTokenCredential, AzureKeyCredential] = (
+            credential
+            if is_key_empty(args.documentintelligencekey)
+            else AzureKeyCredential(args.documentintelligencekey)
         )
         doc_int_parser = DocumentAnalysisParser(
-            endpoint=f"https://{args.formrecognizerservice}.cognitiveservices.azure.com/",
-            credential=formrecognizer_creds,
+            endpoint=f"https://{args.documentintelligenceservice}.cognitiveservices.azure.com/",
+            credential=documentintelligence_creds,
             verbose=args.verbose,
         )
-    if args.localpdfparser or args.formrecognizerservice is None:
+    if args.localpdfparser or args.documentintelligenceservice is None:
         pdf_parser = LocalPdfParser()
     else:
         pdf_parser = doc_int_parser
@@ -83,6 +85,14 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
         ".pdf": FileProcessor(pdf_parser, sentence_text_splitter),
         ".json": FileProcessor(JsonParser(), SimpleTextSplitter()),
         ".docx": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".pptx": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".xlsx": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".png": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".jpg": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".jpeg": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".tiff": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".bmp": FileProcessor(doc_int_parser, sentence_text_splitter),
+        ".heic": FileProcessor(doc_int_parser, sentence_text_splitter),
     }
     use_vectors = not args.novectors
     embeddings: Optional[OpenAIEmbeddings] = None
@@ -355,12 +365,12 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
         help="Use PyPdf local PDF parser (supports only digital PDFs) instead of Azure Document Intelligence service to extract text, tables and layout from the documents",
     )
     parser.add_argument(
-        "--formrecognizerservice",
+        "--documentintelligenceservice",
         required=False,
         help="Optional. Name of the Azure Document Intelligence service which will be used to extract text, tables and layout from the documents (must exist already)",
     )
     parser.add_argument(
-        "--formrecognizerkey",
+        "--documentintelligencekey",
         required=False,
         help="Optional. Use this Azure Document Intelligence account key instead of the current user identity to login (use az login to set current user for Azure)",
     )
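
The extension table in `setup_file_strategy` routes each file type to a parser: PDFs go to the local parser or to Document Intelligence, and the newly added Office and image formats always go to Document Intelligence. The sketch below models that routing with plain strings standing in for the `FileProcessor` objects. The names `build_parser_table` and `parser_for` are hypothetical. Two behaviours are choices of this sketch rather than something the diff shows: the Document Intelligence extensions are registered only when a service is configured, and extensions are lowercased before lookup.

```python
import os

# Extensions the diff routes to Document Intelligence.
DOC_INTEL_EXTENSIONS = (
    ".docx", ".pptx", ".xlsx", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic",
)


def build_parser_table(doc_intel_available, use_local_pdf_parser=False):
    """Map file extensions to a parser label (stand-in for FileProcessor)."""
    table = {".json": "json"}
    # Mirrors: if args.localpdfparser or the service name is None.
    if use_local_pdf_parser or not doc_intel_available:
        table[".pdf"] = "local_pdf"
    else:
        table[".pdf"] = "doc_intel"
    # Sketch-only guard: skip these formats when no service is configured.
    if doc_intel_available:
        for extension in DOC_INTEL_EXTENSIONS:
            table[extension] = "doc_intel"
    return table


def parser_for(filename, table):
    """Return the parser label for a file, or None if unsupported."""
    extension = os.path.splitext(filename)[1].lower()
    return table.get(extension)


if __name__ == "__main__":
    table = build_parser_table(doc_intel_available=True)
    for name in ("report.pdf", "Slides.PPTX", "scan.heic", "notes.txt"):
        print(name, parser_for(name, table))
```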

scripts/prepdocs.sh

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ $searchAnalyzerNameArg $searchSecretNameArg \
 --openaihost "$OPENAI_HOST" --openaimodelname "$AZURE_OPENAI_EMB_MODEL_NAME" \
 --openaiservice "$AZURE_OPENAI_SERVICE" --openaideployment "$AZURE_OPENAI_EMB_DEPLOYMENT" \
 --openaikey "$OPENAI_API_KEY" --openaiorg "$OPENAI_ORGANIZATION" \
---formrecognizerservice "$AZURE_FORMRECOGNIZER_SERVICE" \
+--documentintelligenceservice "$AZURE_DOCUMENTINTELLIGENCE_SERVICE" \
 $searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg \
 $adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg \
 $tenantArg $aclArg \

scripts/prepdocslib/pdfparser.py

Lines changed: 8 additions & 7 deletions
@@ -1,15 +1,14 @@
 import html
 from typing import IO, AsyncGenerator, Union
 
-from azure.ai.formrecognizer import DocumentTable
-from azure.ai.formrecognizer.aio import DocumentAnalysisClient
+from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
+from azure.ai.documentintelligence.models import DocumentTable
 from azure.core.credentials import AzureKeyCredential
 from azure.core.credentials_async import AsyncTokenCredential
 from pypdf import PdfReader
 
 from .page import Page
 from .parser import Parser
-from .strategy import USER_AGENT
 
 
 class LocalPdfParser(Parser):
@@ -50,10 +49,12 @@ async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
         if self.verbose:
             print(f"Extracting text from '{content.name}' using Azure Document Intelligence")
 
-        async with DocumentAnalysisClient(
-            endpoint=self.endpoint, credential=self.credential, headers={"x-ms-useragent": USER_AGENT}
-        ) as form_recognizer_client:
-            poller = await form_recognizer_client.begin_analyze_document(model_id=self.model_id, document=content)
+        async with DocumentIntelligenceClient(
+            endpoint=self.endpoint, credential=self.credential
+        ) as document_intelligence_client:
+            poller = await document_intelligence_client.begin_analyze_document(
+                model_id=self.model_id, analyze_request=content, content_type="application/octet-stream"
+            )
             form_recognizer_results = await poller.result()
 
             offset = 0
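
The main call-site change in `pdfparser.py` is that the stream is now passed as `analyze_request` with an explicit `content_type`, where the old client took `document=content`. The sketch below isolates that call shape so it runs without an Azure resource. `FakeClient` and `FakePoller` are hypothetical stand-ins for the SDK objects, the model ID `prebuilt-layout` is an assumption (the diff only shows `self.model_id`), and the keyword names are copied from this diff, so other SDK releases may differ.

```python
import asyncio
import io


async def analyze(client, model_id, content):
    """Submit a binary stream for analysis and wait for the result."""
    poller = await client.begin_analyze_document(
        model_id=model_id,
        analyze_request=content,
        content_type="application/octet-stream",
    )
    return await poller.result()


class FakePoller:
    """Hypothetical stand-in for the SDK's async poller."""

    def __init__(self, result):
        self._result = result

    async def result(self):
        return self._result


class FakeClient:
    """Hypothetical stand-in for DocumentIntelligenceClient; records calls."""

    def __init__(self):
        self.calls = []

    async def begin_analyze_document(self, model_id, analyze_request, content_type):
        self.calls.append((model_id, content_type, analyze_request.read()))
        return FakePoller({"pages": 1})


if __name__ == "__main__":
    fake = FakeClient()
    # "prebuilt-layout" is an assumed model ID used only for illustration.
    print(asyncio.run(analyze(fake, "prebuilt-layout", io.BytesIO(b"%PDF-"))))
```

Keeping the call behind a small function that accepts any client object is a choice of this sketch; it makes the request shape easy to check in isolation.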

scripts/requirements.in

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ pypdf
 aiohttp
 azure-identity
 azure-search-documents==11.6.0b1
-azure-ai-formrecognizer
+azure-ai-documentintelligence
 azure-storage-blob
 azure-storage-file-datalake
 openai[datalib]>=1.3.5
