articles/machine-learning/prompt-flow/how-to-deploy-for-real-time-inference.md
---
ms.topic: how-to
author: likebupt
ms.author: keli19
ms.reviewer: lagayhar
ms.date: 09/12/2023
---
We'll use the sample flow **Web Classification** as an example to show how to deploy a flow. This sample flow is a standard flow. Deploying chat flows is similar; evaluation flows don't support deployment.
## Create an online endpoint
Now that you have built a flow and tested it properly, it's time to create your online endpoint for real-time inference.
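This article walks through the studio wizard. For reference, the same endpoint can be described in an Azure Machine Learning managed online endpoint YAML definition; this is a minimal sketch, and the endpoint name below is a hypothetical placeholder:

```yaml
# Hypothetical endpoint definition (endpoint.yaml) -- "my-pf-endpoint" is a placeholder name.
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-pf-endpoint
auth_mode: key   # key-based authentication; use aml_token for Azure Machine Learning token auth
```

With the Azure CLI `ml` extension installed, such a definition would typically be applied with `az ml online-endpoint create --file endpoint.yaml`.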
See detailed guidance about how to grant permissions to the endpoint identity in [Grant permissions to the endpoint](#grant-permissions-to-the-endpoint).
### Deployment

In this step, you can specify the following properties:

|Property| Description |
|---|-----|
|Deployment name| - Within the same endpoint, the deployment name should be unique. <br> - If you select an existing endpoint in the previous step and input an existing deployment name, that deployment will be overwritten with the new configuration. |
|Inference data collection| If you enable this, the flow inputs and outputs will be automatically collected in an Azure Machine Learning data asset, and can be used for later monitoring. To learn more, see [model monitoring](how-to-monitor-generative-ai-applications.md).|
|Application Insights diagnostics| If you enable this, system metrics during inference time (such as token count, flow latency, and flow request count) will be collected into the workspace default Application Insights. To learn more, see [prompt flow serving metrics](#view-prompt-flow-endpoints-specific-metrics-optional).|

:::image type="content" source="./media/how-to-deploy-for-real-time-inference/deploy-wizard-deployment.png" alt-text="Screenshot of the deployment step in the deploy wizard in the studio UI." lightbox = "./media/how-to-deploy-for-real-time-inference/deploy-wizard-deployment.png":::
### Outputs
On the endpoint detail page, switch to the **Consume** tab. There you can find the REST endpoint and the key/token to consume your endpoint, along with sample code for consuming the endpoint in different languages.
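As a minimal sketch of consuming the endpoint from Python with only the standard library (the scoring URL, key, and input field below are placeholders; copy the real values from the **Consume** tab, and note that the `url` input matches the Web Classification sample flow):

```python
import json
import urllib.request


def build_request(scoring_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated scoring request for a managed online endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Key-based auth: the endpoint key (or token) goes in a Bearer header.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(scoring_url, data=body, headers=headers, method="POST")


def score(scoring_url: str, api_key: str, payload: dict) -> dict:
    """POST the payload to the endpoint and return the parsed JSON response."""
    req = build_request(scoring_url, api_key, payload)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Placeholder values -- replace with the REST endpoint and key from the Consume tab:
# result = score("https://<endpoint>.<region>.inference.ml.azure.com/score",
#                "<endpoint-key>", {"url": "https://www.microsoft.com"})
```

The request/response shape follows the flow's input and output schema, so the payload keys depend on how your flow's inputs are named.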
## View endpoint metrics
### View managed online endpoints common metrics using Azure Monitor (optional)
You can view various metrics (request numbers, request latency, network bytes, CPU/GPU/disk/memory utilization, and more) for an online endpoint and its deployments by following links from the endpoint's **Details** page in the studio. Following these links takes you to the exact metrics page in the Azure portal for the endpoint or deployment.
For more information on how to view online endpoint metrics, see [Monitor online endpoints](../how-to-monitor-online-endpoints.md#metrics).
### View prompt flow endpoints specific metrics (optional)
If you enable **Application Insights diagnostics** in the UI deploy wizard, or set `app_insights_enabled=true` in the deployment definition using code, the following prompt flow specific metrics will be collected in the workspace default Application Insights.
| Metrics Name | Type | Dimensions | Description |
|---|---|---|---|
| flow_streaming_response_duration | histogram | flow | Streaming response sending cost, from sending the first byte to sending the last byte. |
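The code-first path mentioned above (setting `app_insights_enabled=true` in the deployment definition) can be sketched as a managed online deployment YAML; the names, instance settings, and file name here are hypothetical placeholders:

```yaml
# Hypothetical deployment definition (deployment.yaml) -- adjust names and
# instance settings to your own endpoint and flow.
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-pf-endpoint
app_insights_enabled: true   # enables collection of the prompt flow serving metrics above
instance_type: Standard_E16s_v3
instance_count: 1
```

A definition like this would typically be applied with `az ml online-deployment create --file deployment.yaml`.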
You can find the workspace default Application Insights on your workspace page in the Azure portal.
:::image type="content" source="./media/how-to-deploy-for-real-time-inference/workspace-default-app-insights.png" alt-text="Screenshot of the workspace default Application Insights." lightbox = "./media/how-to-deploy-for-real-time-inference/workspace-default-app-insights.png":::
Open Application Insights, and select **Usage and estimated costs** from the left navigation. Select **Custom metrics (Preview)**, select **With dimensions**, and save the change.

Select the **Metrics** tab in the left navigation. Select **promptflow standard metrics** from the **Metric Namespace**, and you can explore the metrics from the **Metric** dropdown list with different aggregation methods.