add article for troubleshooting prompt flow deployments

likebupt · likebupt · commit 4d383214e065 · 2024-04-01T16:10:01.000+08:00
diff --git a/articles/machine-learning/prompt-flow/how-to-deploy-for-real-time-inference.md b/articles/machine-learning/prompt-flow/how-to-deploy-for-real-time-inference.md
@@ -315,6 +315,12 @@ Select **Metrics** tab in the left navigation. Select **promptflow standard metr
 
 ## Troubleshoot endpoints deployed from prompt flow
 
+### Lack authorization to perform action "Microsoft.MachineLearningServices/workspaces/datastores/read"
+
+If your flow contains the Index Look Up tool, the deployed endpoint needs access to the workspace datastore to read the MLIndex YAML file or the FAISS folder that contains the chunks and embeddings. You therefore need to manually grant the endpoint identity permission to do so.
+
+You can grant the endpoint identity either the **AzureML Data Scientist** role on the workspace scope or a custom role that contains the "Microsoft.MachineLearningServices/workspaces/datastores/read" action.
+
 ### MissingDriverProgram Error
 
 If you deploy your flow with custom environment and encounter the following error, it might be because you didn't specify the `inference_config` in your custom environment definition.
@@ -335,6 +341,13 @@ If you deploy your flow with custom environment and encounter the following erro
 
 There are 2 ways to fix this error.
 
+1. (Recommended) Find the container image URI on your custom environment detail page, and set it as the flow base image in the `flow.dag.yaml` file. When you deploy the flow in the UI, select **Use environment of current flow definition**. The backend service then creates the customized environment for your deployment based on this base image and `requirements.txt`. Learn more about [the environment specified in the flow definition](#use-environment-of-current-flow-definition).
+
+    :::image type="content" source="./media/how-to-deploy-for-real-time-inference/custom-environment-image-uri.png" alt-text="Screenshot of custom environment detail page. " lightbox = "./media/how-to-deploy-for-real-time-inference/custom-environment-image-uri.png":::
+
+    :::image type="content" source="./media/how-to-deploy-for-real-time-inference/flow-environment-image.png" alt-text="Screenshot of specifying base image in raw yaml file of the flow. " lightbox = "./media/how-to-deploy-for-real-time-inference/flow-environment-image.png":::
+
+
 1. You can fix this error by adding `inference_config` in your custom environment definition. Learn more about [how to use customized environment](#use-customized-environment).
 
     Following is an example of customized environment definition.
@@ -358,12 +371,6 @@ inference_config:
     path: /score
 ```
 
-2. You can find the container image uri in your custom environment detail page, and set it as the flow base image in the flow.dag.yaml file. When you deploy the flow in UI, you just select **Use environment of current flow definition**, and the backend service will create the customized environment based on this base image and `requirement.txt` for your deployment. Learn more about [the environment specified in the flow definition](#use-environment-of-current-flow-definition). 
-
-    :::image type="content" source="./media/how-to-deploy-for-real-time-inference/custom-environment-image-uri.png" alt-text="Screenshot of custom environment detail page. " lightbox = "./media/how-to-deploy-for-real-time-inference/custom-environment-image-uri.png":::
-
-    :::image type="content" source="./media/how-to-deploy-for-real-time-inference/flow-environment-image.png" alt-text="Screenshot of specifying base image in raw yaml file of the flow. " lightbox = "./media/how-to-deploy-for-real-time-inference/flow-environment-image.png":::
-
 ### Model response taking too long
 
 Sometimes, you might notice that the deployment is taking too long to respond. There are several potential factors for this to occur. 
@@ -398,3 +405,4 @@ If you aren't going use the endpoint after completing this tutorial, you should
 
 - [Iterate and optimize your flow by tuning prompts using variants](how-to-tune-prompts-using-variants.md)
 - [View costs for an Azure Machine Learning managed online endpoint](../how-to-view-online-endpoints-costs.md)
+- [Troubleshoot prompt flow deployments](how-to-troubleshoot-prompt-flow-deployment.md)
diff --git a/articles/machine-learning/prompt-flow/how-to-deploy-to-code.md b/articles/machine-learning/prompt-flow/how-to-deploy-to-code.md
@@ -462,5 +462,6 @@ request_settings:
 - Learn more about [managed online endpoint schema](../reference-yaml-endpoint-online.md) and [managed online deployment schema](../reference-yaml-deployment-managed-online.md).
 - Learn more about how to [test the endpoint in UI](./how-to-deploy-for-real-time-inference.md#test-the-endpoint-with-sample-data) and [monitor the endpoint](./how-to-deploy-for-real-time-inference.md#view-managed-online-endpoints-common-metrics-using-azure-monitor-optional).
 - Learn more about how to [troubleshoot managed online endpoints](../how-to-troubleshoot-online-endpoints.md).
+- [Troubleshoot prompt flow deployments](how-to-troubleshoot-prompt-flow-deployment.md)
 - Once you improve your flow, and would like to deploy the improved version with safe rollout strategy, see [Safe rollout for online endpoints](../how-to-safely-rollout-online-endpoints.md).
 - Learn more about [deploy flows to other platforms, such as a local development service, Docker container, Azure APP service, etc.](https://microsoft.github.io/promptflow/how-to-guides/deploy-a-flow/index.html)
diff --git a/articles/machine-learning/prompt-flow/how-to-troubleshoot-prompt-flow-deployment.md b/articles/machine-learning/prompt-flow/how-to-troubleshoot-prompt-flow-deployment.md
@@ -0,0 +1,131 @@
+---
+title: Troubleshoot prompt flow deployments
+titleSuffix: Azure Machine Learning
+description: This article provides instructions on how to troubleshoot your prompt flow deployments.
+manager: scottpolly
+ms.service: machine-learning
+ms.topic: how-to
+ms.date: 04/01/2024
+ms.reviewer: lagayhar
+ms.author: keli19
+author: likebupt
+---
+
+# Troubleshoot prompt flow deployments
+
+This article describes how to troubleshoot common issues with endpoints deployed from prompt flow.
+
+## Lack authorization to perform action "Microsoft.MachineLearningServices/workspaces/datastores/read"
+
+If your flow contains the Index Look Up tool, the deployed endpoint needs access to the workspace datastore to read the MLIndex YAML file or the FAISS folder that contains the chunks and embeddings. You therefore need to manually grant the endpoint identity permission to do so.
+
+You can grant the endpoint identity either the **AzureML Data Scientist** role on the workspace scope or a custom role that contains the "Microsoft.MachineLearningServices/workspaces/datastores/read" action.
+
+## Upstream request timeout issue when consuming the endpoint
+
+If you use the CLI or SDK to deploy the flow, you might encounter a timeout error. By default, `request_timeout_ms` is 5000. You can set it to a maximum of 5 minutes, which is 300000 ms. The following example shows how to specify the request timeout in the deployment YAML file. To learn more, see the [managed online deployment schema](../reference-yaml-deployment-managed-online.md).
+
+```yaml
+request_settings:
+  request_timeout_ms: 300000
+```
+
+## OpenAI API hits Authentication Error
+
+If you regenerate your Azure OpenAI key and manually update the connection used in prompt flow, you might encounter errors like "Unauthorized. Access token is missing, invalid, audience is incorrect or have expired." when you invoke an existing endpoint that was created before the key was regenerated.
+
+This happens because the connections used in endpoints and deployments aren't updated automatically. Any change to keys or secrets in a deployment must be made manually, which prevents an unintentional offline operation from affecting an online production deployment.
+
+- If the endpoint was deployed in the studio UI, you can redeploy the flow to the existing endpoint using the same deployment name.
+- If the endpoint was deployed using the SDK or CLI, you need to modify the deployment definition, for example by adding a dummy environment variable, and then use `az ml online-deployment update` to update your deployment.
+
+
+## Vulnerability issues in prompt flow deployments
+
+For vulnerabilities related to the prompt flow runtime, the following approaches can help mitigate them:
+
+- Update the dependency packages in the `requirements.txt` file in your flow folder.
+- If you use a customized base image for your flow, update the prompt flow runtime to the latest version, rebuild your base image, and then redeploy the flow.
+
+For any other vulnerabilities of managed online deployments, Azure Machine Learning fixes the issues on a monthly basis.
+
+## "MissingDriverProgram Error" or "Could not find driver program in the request"
+
+If you deploy your flow and encounter the following error, it might be related to the deployment environment.
+
+```text
+'error': 
+{
+    'code': 'BadRequest', 
+    'message': 'The request is invalid.', 
+    'details': 
+         {'code': 'MissingDriverProgram', 
+          'message': 'Could not find driver program in the request.', 
+          'details': [], 
+          'additionalInfo': []
+         }
+}
+```
+
+```text
+Could not find driver program in the request
+```
+
+There are 2 ways to fix this error.
+
+1. (Recommended) Find the container image URI on your custom environment detail page, and set it as the flow base image in the `flow.dag.yaml` file. When you deploy the flow in the UI, select **Use environment of current flow definition**. The backend service then creates the customized environment for your deployment based on this base image and `requirements.txt`. Learn more about [the environment specified in the flow definition](how-to-deploy-for-real-time-inference.md#use-environment-of-current-flow-definition).
+
+    :::image type="content" source="./media/how-to-deploy-for-real-time-inference/custom-environment-image-uri.png" alt-text="Screenshot of custom environment detail page. " lightbox = "./media/how-to-deploy-for-real-time-inference/custom-environment-image-uri.png":::
+
+    :::image type="content" source="./media/how-to-deploy-for-real-time-inference/flow-environment-image.png" alt-text="Screenshot of specifying base image in raw yaml file of the flow. " lightbox = "./media/how-to-deploy-for-real-time-inference/flow-environment-image.png":::
+
+1. You can fix this error by adding `inference_config` in your custom environment definition.
+
+    The following is an example of a customized environment definition.
+
+```yaml
+$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
+name: pf-customized-test
+build:
+  path: ./image_build
+  dockerfile_path: Dockerfile
+description: promptflow customized runtime
+inference_config:
+  liveness_route:
+    port: 8080
+    path: /health
+  readiness_route:
+    port: 8080
+    path: /health
+  scoring_route:
+    port: 8080
+    path: /score
+```
+
+## Model response taking too long
+
+Sometimes, you might notice that the deployment takes too long to respond. Several factors can cause this:
+
+- The model isn't powerful enough (for example, use GPT over text-ada).
+- The index query isn't optimized and takes too long.
+- The flow has many steps to process.
+
+Consider optimizing the endpoint with these factors in mind to improve the performance of the model.
+
+## Unable to fetch deployment schema
+
+After you deploy the endpoint, if the **Test** tab on the endpoint detail page shows **Unable to fetch deployment schema**, as in the following screenshot, try the following two methods to mitigate the issue:
+
+:::image type="content" source="./media/how-to-deploy-for-real-time-inference/unable-to-fetch-deployment-schema.png" alt-text="Screenshot of the error unable to fetch deployment schema in Test tab in endpoint detail page. " lightbox = "./media/how-to-deploy-for-real-time-inference/unable-to-fetch-deployment-schema.png":::
+
+- Make sure you have granted the correct permission to the endpoint identity. Learn more about [how to grant permission to the endpoint identity](how-to-deploy-for-real-time-inference.md#grant-permissions-to-the-endpoint).
+- You might have run your flow in an old version of the runtime and then deployed the flow, in which case the deployment also used the old runtime's environment. Update the runtime by following [this guidance](./how-to-create-manage-runtime.md#update-a-runtime-on-the-ui), rerun the flow in the latest runtime, and then deploy the flow again.
+
+## Access denied to list workspace secret
+
+If you encounter an error like "Access denied to list workspace secret", check whether you have granted the correct permission to the endpoint identity. Learn more about [how to grant permission to the endpoint identity](how-to-deploy-for-real-time-inference.md#grant-permissions-to-the-endpoint).
+
+## Next steps
+
+- Learn more about [managed online endpoint schema](../reference-yaml-endpoint-online.md) and [managed online deployment schema](../reference-yaml-deployment-managed-online.md).
+- Learn more about how to [troubleshoot managed online endpoints](../how-to-troubleshoot-online-endpoints.md).
diff --git a/articles/machine-learning/toc.yml b/articles/machine-learning/toc.yml
@@ -676,6 +676,8 @@
               href: ./prompt-flow/how-to-custom-tool-package-creation-and-usage.md
         - name: Monitor generative AI applications in production
           href: ./prompt-flow/how-to-monitor-generative-ai-applications.md
+        - name: Troubleshoot prompt flow deployments
+          href: ./prompt-flow/how-to-troubleshoot-prompt-flow-deployment.md
         - name: Transparency note
           href: ./prompt-flow/transparency-note.md
         - name: Tools Reference