
Commit a0fca85

Author: Andrew Desousa (committed)
update scripts and data ingestion documentation
1 parent 32876a3 commit a0fca85

File tree

3 files changed: +1 −52 lines

README.md

Lines changed: 0 additions & 7 deletions
@@ -105,13 +105,6 @@ Note: settings starting with `AZURE_SEARCH` are only needed when using Azure Ope
 |UI_FAVICON|| Defaults to Contoso favicon. Configure the URL to your favicon to modify.
 |UI_SHOW_SHARE_BUTTON|True|Share button (right-top)
 |SANITIZE_ANSWER|False|Whether to sanitize the answer from Azure OpenAI. Set to True to remove any HTML tags from the response.|
-|USE_PROMPTFLOW|False|Use an existing deployed Promptflow endpoint. If set to `True`, both `PROMPTFLOW_ENDPOINT` and `PROMPTFLOW_API_KEY` also need to be set.|
-|PROMPTFLOW_ENDPOINT||URL of the deployed Promptflow endpoint, e.g. https://pf-deployment-name.region.inference.ml.azure.com/score|
-|PROMPTFLOW_API_KEY||Auth key for the deployed Promptflow endpoint. Note: only key-based authentication is supported.|
-|PROMPTFLOW_RESPONSE_TIMEOUT|120|Timeout value in seconds for the Promptflow endpoint to respond.|
-|PROMPTFLOW_REQUEST_FIELD_NAME|query|Default field name used to construct the Promptflow request. Note: chat_history is auto constructed based on the interaction; if your API expects another mandatory field, change the request parameters in the `promptflow_request` function.|
-|PROMPTFLOW_RESPONSE_FIELD_NAME|reply|Default field name to process the response from the Promptflow request.|
-|PROMPTFLOW_CITATIONS_FIELD_NAME|documents|Default field name to process the citations output from the Promptflow request.|
 
 ### Local deployment
 Review the local deployment [README](./docs/README_LOCAL.md).
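For orientation, below is a minimal sketch of how the `PROMPTFLOW_*` settings in the removed rows could drive a call to a deployed Promptflow endpoint. The helper name `call_promptflow` and the Bearer key-auth header are assumptions for illustration, not the app's confirmed `promptflow_request` implementation.

```python
import os
import requests

def call_promptflow(user_query: str, chat_history: list[dict]) -> dict:
    # Hedged sketch only: shows how the removed PROMPTFLOW_* settings fit together.
    endpoint = os.environ["PROMPTFLOW_ENDPOINT"]    # e.g. https://pf-deployment-name.region.inference.ml.azure.com/score
    api_key = os.environ["PROMPTFLOW_API_KEY"]      # key-based auth only
    timeout = int(os.environ.get("PROMPTFLOW_RESPONSE_TIMEOUT", "120"))
    request_field = os.environ.get("PROMPTFLOW_REQUEST_FIELD_NAME", "query")
    response_field = os.environ.get("PROMPTFLOW_RESPONSE_FIELD_NAME", "reply")
    citations_field = os.environ.get("PROMPTFLOW_CITATIONS_FIELD_NAME", "documents")

    # chat_history is passed alongside the configured request field.
    payload = {request_field: user_query, "chat_history": chat_history}
    resp = requests.post(
        endpoint,
        json=payload,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed key-auth scheme
        timeout=timeout,
    )
    resp.raise_for_status()
    body = resp.json()
    return {"answer": body.get(response_field), "citations": body.get(citations_field)}
```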

scripts/data_utils.py

Lines changed: 1 addition & 2 deletions
@@ -833,7 +833,6 @@ def chunk_content(
     Returns:
         List[Document]: List of chunked documents.
     """
-
     try:
         if file_name is None or (cracked_pdf and not use_layout):
             file_format = "text"
@@ -1084,6 +1083,7 @@ def process_file(
         captioning_model_endpoint=captioning_model_endpoint,
         captioning_model_key=captioning_model_key
     )
+
    for chunk_idx, chunk_doc in enumerate(result.chunks):
        chunk_doc.filepath = rel_file_path
        chunk_doc.metadata = json.dumps({"chunk_id": str(chunk_idx)})
@@ -1183,7 +1183,6 @@ def chunk_directory(
    files_to_process = [file_path for file_path in all_files_directory if os.path.isfile(file_path)]
    print(f"Total files to process={len(files_to_process)} out of total directory size={len(all_files_directory)}")

-
    if njobs==1:
        print("Single process to chunk and parse the files. --njobs > 1 can help performance.")
        for file_path in tqdm(files_to_process):
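The context lines above show `process_file` stamping each chunk with its source path and a JSON-encoded `chunk_id`. A self-contained sketch of that bookkeeping follows, using a simplified stand-in `Document` class; the real class in `data_utils.py` carries more fields.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    # Simplified stand-in for the chunk documents handled in process_file().
    content: str
    filepath: Optional[str] = None
    metadata: Optional[str] = None

def tag_chunks(chunks: list[Document], rel_file_path: str) -> list[Document]:
    # Stamp each chunk with its source path and position so citations can
    # point back to the original file, mirroring the diff context above.
    for chunk_idx, chunk_doc in enumerate(chunks):
        chunk_doc.filepath = rel_file_path
        chunk_doc.metadata = json.dumps({"chunk_id": str(chunk_idx)})
    return chunks
```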

scripts/readme.md

Lines changed: 0 additions & 43 deletions
@@ -146,46 +146,3 @@ This will use the Form Recognizer Read model by default.
 If your documents have a lot of tables and relevant layout information, you can use the Form Recognizer Layout model, which is more costly and slower to run but will preserve table information with better quality. The Layout model will also help preserve some of the formatting information in your document, such as titles and sub-headings, which will make the citations more readable. To use the Layout model instead of the default Read model, pass in the argument `--form-rec-use-layout`.
 
 `python data_preparation.py --config config.json --njobs=4 --form-rec-resource <form-rec-resource-name> --form-rec-key <form-rec-key> --form-rec-use-layout`
-
-# Use AML to Prepare Data
-## Setup
-- Install the [Azure ML CLI v2](https://learn.microsoft.com/en-us/azure/machine-learning/concept-v2?view=azureml-api-2)
-
-## Prerequisites
-- Azure Machine Learning (AML) Workspace with associated Keyvault
-- Azure Cognitive Search (ACS) resource
-- (Optional, if processing PDF) [Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.1.0) resource
-- (Optional, if adding embeddings for vector search) Azure OpenAI resource with an Ada (text-embedding-ada-002) deployment
-- (Optional) Azure Blob Storage account
-
-## Configure
-- Create secrets in the AML keyvault for the Azure Cognitive Search resource admin key, the Document Intelligence access key, and the Azure OpenAI API key (if using)
-- Create a config file like `aml_config.json`. The format can be a single JSON object or a list of them, with each object specifying a configuration of Keyvault secrets, chunking settings, and index configuration.
-```
-{
-    "chunk_size": 1024,
-    "token_overlap": 128,
-    "keyvault_url": "https://<keyvault name>.vault.azure.net/",
-    "document_intelligence_secret_name": "myDocIntelligenceKey",
-    "document_intelligence_endpoint": "https://<document intelligence resource name>.cognitiveservices.azure.com/",
-    "embedding_key_secret_name": "myAzureOpenAIKey",
-    "embedding_endpoint": "https://<azure openai resource name>.openai.azure.com/openai/deployments/<Ada deployment name>/embeddings?api-version=2023-06-01-preview",
-    "index_name": "<new index name>",
-    "search_service_name": "<search service name>",
-    "search_key_secret_name": "mySearchServiceKey"
-}
-```
-
-## Optional: Create an AML Datastore
-If your data is in Azure Blob Storage, you can first create an AML Datastore that will be used to connect to your data during ingestion. Update `datastore.yml` with your storage account information, including the account key. Then run this command, using the resource group and workspace name of your AML workspace:
-
-```
-az ml datastore create --resource-group <workspace resource group> --workspace-name <workspace name> --file datastore.yml
-```
-
-## Submit the data processing pipeline job
-In `pipeline.yml`, update the inputs to point to your config file name and the datastore you created. If you're using data stored locally, comment out the datastore path and uncomment the local path, updating it to point to your local data location. Then submit the pipeline job to your AML workspace:
-
-```
-az ml job create --resource-group <workspace resource group> --workspace-name <workspace name> --file pipeline.yml
-```
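The removed `aml_config.json` example refers to Key Vault secrets by name rather than embedding keys directly. Below is a hedged sketch of how an ingestion step might resolve those secrets with `azure-identity` and `azure-keyvault-secrets`; the function name and return shape are illustrative only, not the pipeline's actual code.

```python
import json

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def resolve_config_secrets(config_path: str = "aml_config.json") -> list[dict]:
    # Load the config, which may be a single JSON object or a list of them.
    with open(config_path) as f:
        config = json.load(f)
    entries = config if isinstance(config, list) else [config]

    credential = DefaultAzureCredential()
    resolved = []
    for entry in entries:
        # Each entry names its own Key Vault and the secrets to pull from it.
        client = SecretClient(vault_url=entry["keyvault_url"], credential=credential)
        resolved.append({
            **entry,
            "search_key": client.get_secret(entry["search_key_secret_name"]).value,
            "document_intelligence_key": client.get_secret(entry["document_intelligence_secret_name"]).value,
            "embedding_key": client.get_secret(entry["embedding_key_secret_name"]).value,
        })
    return resolved
```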
