
Commit a0fca85

Author: Andrew Desousa (committed)
update scripts and data ingestion documentation
1 parent 32876a3 commit a0fca85

File tree

3 files changed: +1 −52 lines

README.md

Lines changed: 0 additions & 7 deletions
@@ -105,13 +105,6 @@ Note: settings starting with `AZURE_SEARCH` are only needed when using Azure Ope
 |UI_FAVICON|| Defaults to Contoso favicon. Configure the URL to your favicon to modify.
 |UI_SHOW_SHARE_BUTTON|True|Share button (right-top)
 |SANITIZE_ANSWER|False|Whether to sanitize the answer from Azure OpenAI. Set to True to remove any HTML tags from the response.|
-|USE_PROMPTFLOW|False|Use an existing deployed Promptflow endpoint. If set to `True`, both `PROMPTFLOW_ENDPOINT` and `PROMPTFLOW_API_KEY` also need to be set.|
-|PROMPTFLOW_ENDPOINT||URL of the deployed Promptflow endpoint, e.g. https://pf-deployment-name.region.inference.ml.azure.com/score|
-|PROMPTFLOW_API_KEY||Auth key for the deployed Promptflow endpoint. Note: only key-based authentication is supported.|
-|PROMPTFLOW_RESPONSE_TIMEOUT|120|Timeout value in seconds for the Promptflow endpoint to respond.|
-|PROMPTFLOW_REQUEST_FIELD_NAME|query|Default field name used to construct the Promptflow request. Note: chat_history is auto constructed based on the interaction; if your API expects another mandatory field, change the request parameters in the `promptflow_request` function.|
-|PROMPTFLOW_RESPONSE_FIELD_NAME|reply|Default field name to process the response from the Promptflow request.|
-|PROMPTFLOW_CITATIONS_FIELD_NAME|documents|Default field name to process the citations output from the Promptflow request.|
 
 ### Local deployment
 Review the local deployment [README](./docs/README_LOCAL.md).
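For orientation, below is a minimal sketch of how the `PROMPTFLOW_*` settings in the removed rows could drive a call to a deployed Promptflow endpoint. The helper name `call_promptflow` and the Bearer key-auth header are assumptions for illustration, not the app's confirmed `promptflow_request` implementation.

```python
import os
import requests

def call_promptflow(user_query: str, chat_history: list[dict]) -> dict:
    # Hedged sketch only: shows how the removed PROMPTFLOW_* settings fit together.
    endpoint = os.environ["PROMPTFLOW_ENDPOINT"]    # e.g. https://pf-deployment-name.region.inference.ml.azure.com/score
    api_key = os.environ["PROMPTFLOW_API_KEY"]      # key-based auth only
    timeout = int(os.environ.get("PROMPTFLOW_RESPONSE_TIMEOUT", "120"))
    request_field = os.environ.get("PROMPTFLOW_REQUEST_FIELD_NAME", "query")
    response_field = os.environ.get("PROMPTFLOW_RESPONSE_FIELD_NAME", "reply")
    citations_field = os.environ.get("PROMPTFLOW_CITATIONS_FIELD_NAME", "documents")

    # chat_history is passed alongside the configured request field.
    payload = {request_field: user_query, "chat_history": chat_history}
    resp = requests.post(
        endpoint,
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed key-auth scheme
        timeout=timeout,
    )
    resp.raise_for_status()
    body = resp.json()
    return {"answer": body.get(response_field), "citations": body.get(citations_field)}
```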

scripts/data_utils.py

Lines changed: 1 addition & 2 deletions
@@ -833,7 +833,6 @@ def chunk_content(
     Returns:
         List[Document]: List of chunked documents.
     """
-
     try:
         if file_name is None or (cracked_pdf and not use_layout):
             file_format = "text"
@@ -1084,6 +1083,7 @@ def process_file(
         captioning_model_endpoint=captioning_model_endpoint,
         captioning_model_key=captioning_model_key
     )
+
    for chunk_idx, chunk_doc in enumerate(result.chunks):
        chunk_doc.filepath = rel_file_path
        chunk_doc.metadata = json.dumps({"chunk_id": str(chunk_idx)})
@@ -1183,7 +1183,6 @@ def chunk_directory(
    files_to_process = [file_path for file_path in all_files_directory if os.path.isfile(file_path)]
    print(f"Total files to process={len(files_to_process)} out of total directory size={len(all_files_directory)}")

-
    if njobs==1:
        print("Single process to chunk and parse the files. --njobs > 1 can help performance.")
        for file_path in tqdm(files_to_process):
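The context lines above show `process_file` stamping each chunk with its source path and a JSON-encoded `chunk_id`. A self-contained sketch of that bookkeeping follows, using a simplified stand-in `Document` class; the real class in `data_utils.py` carries more fields.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    # Simplified stand-in for the chunk documents handled in process_file().
    content: str
    filepath: Optional[str] = None
    metadata: Optional[str] = None

def tag_chunks(chunks: list[Document], rel_file_path: str) -> list[Document]:
    # Stamp each chunk with its source path and position so citations can
    # point back to the original file, mirroring the diff context above.
    for chunk_idx, chunk_doc in enumerate(chunks):
        chunk_doc.filepath = rel_file_path
        chunk_doc.metadata = json.dumps({"chunk_id": str(chunk_idx)})
    return chunks
```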

scripts/readme.md

Lines changed: 0 additions & 43 deletions
@@ -146,46 +146,3 @@ This will use the Form Recognizer Read model by default.
 If your documents have a lot of tables and relevant layout information, you can use the Form Recognizer Layout model, which is more costly and slower to run but will preserve table information with better quality. The Layout model will also help preserve some of the formatting information in your document, such as titles and sub-headings, which will make the citations more readable. To use the Layout model instead of the default Read model, pass in the argument `--form-rec-use-layout`.
 
 `python data_preparation.py --config config.json --njobs=4 --form-rec-resource <form-rec-resource-name> --form-rec-key <form-rec-key> --form-rec-use-layout`
-
-# Use AML to Prepare Data
-## Setup
-- Install the [Azure ML CLI v2](https://learn.microsoft.com/en-us/azure/machine-learning/concept-v2?view=azureml-api-2)
-
-## Prerequisites
-- Azure Machine Learning (AML) Workspace with associated Keyvault
-- Azure Cognitive Search (ACS) resource
-- (Optional, if processing PDF) [Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.1.0) resource
-- (Optional, if adding embeddings for vector search) Azure OpenAI resource with an Ada (text-embedding-ada-002) deployment
-- (Optional) Azure Blob Storage account
-
-## Configure
-- Create secrets in the AML keyvault for the Azure Cognitive Search resource admin key, the Document Intelligence access key, and the Azure OpenAI API key (if using)
-- Create a config file like `aml_config.json`. The format can be a single JSON object or a list of them, with each object specifying a configuration of Keyvault secrets, chunking settings, and index configuration.
-```
-{
-    "chunk_size": 1024,
-    "token_overlap": 128,
-    "keyvault_url": "https://<keyvault name>.vault.azure.net/",
-    "document_intelligence_secret_name": "myDocIntelligenceKey",
-    "document_intelligence_endpoint": "https://<document intelligence resource name>.cognitiveservices.azure.com/",
-    "embedding_key_secret_name": "myAzureOpenAIKey",
-    "embedding_endpoint": "https://<azure openai resource name>.openai.azure.com/openai/deployments/<Ada deployment name>/embeddings?api-version=2023-06-01-preview",
-    "index_name": "<new index name>",
-    "search_service_name": "<search service name>",
-    "search_key_secret_name": "mySearchServiceKey"
-}
-```
-
-## Optional: Create an AML Datastore
-If your data is in Azure Blob Storage, you can first create an AML Datastore that will be used to connect to your data during ingestion. Update `datastore.yml` with your storage account information, including the account key. Then run this command, using the resource group and workspace name of your AML workspace:
-
-```
-az ml datastore create --resource-group <workspace resource group> --workspace-name <workspace name> --file datastore.yml
-```
-
-## Submit the data processing pipeline job
-In `pipeline.yml`, update the inputs to point to your config file name and the datastore you created. If you're using data stored locally, comment out the datastore path and uncomment the local path, updating it to point to your local data location. Then submit the pipeline job to your AML workspace:
-
-```
-az ml job create --resource-group <workspace resource group> --workspace-name <workspace name> --file pipeline.yml
-```
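The removed `aml_config.json` example refers to Key Vault secrets by name rather than embedding keys directly. Below is a hedged sketch of how an ingestion step might resolve those secrets with `azure-identity` and `azure-keyvault-secrets`; the function name and return shape are illustrative only, not the pipeline's actual code.

```python
import json

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def resolve_config_secrets(config_path: str = "aml_config.json") -> list[dict]:
    # Load the config, which may be a single JSON object or a list of them.
    with open(config_path) as f:
        config = json.load(f)
    entries = config if isinstance(config, list) else [config]

    credential = DefaultAzureCredential()
    resolved = []
    for entry in entries:
        # Each entry names its own Key Vault and the secrets to pull from it.
        client = SecretClient(vault_url=entry["keyvault_url"], credential=credential)
        resolved.append({
            **entry,
            "search_key": client.get_secret(entry["search_key_secret_name"]).value,
            "document_intelligence_key": client.get_secret(entry["document_intelligence_secret_name"]).value,
            "embedding_key": client.get_secret(entry["embedding_key_secret_name"]).value,
        })
    return resolved
```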
