
Commit 44bad1e: review PM edits
1 parent: bdf84f9

File tree

2 files changed: +30, -30 lines


articles/machine-learning/how-to-collect-production-data.md

Lines changed: 26 additions & 26 deletions
@@ -132,34 +132,34 @@ def predict(input_df):
 return output_df
 ```
 
-#### Update your scoring script to log custom unique IDs
+### Update your scoring script to log custom unique IDs
 
-In addition to logging pandas DataFrames directly within your scoring script, you can log data with unique IDs of your choice. These IDs can come from your application, an external system, or can be generated by you. If you do not provide a custom ID, as detailed in this section, the Data collector will autogenerate a unique `correlationid` to help you correlate your model's inputs and outputs later. If you supply a custom ID, the `correlationid` field in the logged data will contain the value of your supplied custom ID.
+In addition to logging pandas DataFrames directly within your scoring script, you can log data with unique IDs of your choice. These IDs can come from your application, an external system, or you can generate them. If you don't provide a custom ID, as detailed in this section, the Data collector will autogenerate a unique `correlationid` to help you correlate your model's inputs and outputs later. If you supply a custom ID, the `correlationid` field in the logged data will contain the value of your supplied custom ID.
 
-1. In addition to the steps above, import the `azureml.ai.monitoring.context` package by adding the following line to your scoring script:
+1. First complete the steps in the previous section, then import the `azureml.ai.monitoring.context` package by adding the following line to your scoring script:
 
-```python
-from azureml.ai.monitoring.context import BasicCorrelationContext
-```
+```python
+from azureml.ai.monitoring.context import BasicCorrelationContext
+```
 
-1. In your scoring script, instantiate a `BasicCorrelationContext` object and pass in the `id` you wish to log for that row. We recommend that this `id` be a unique ID from your system, so you can uniquely identify each logged row from your Blob storage. Pass this object into your `collect()` API call as a parameter:
+1. In your scoring script, instantiate a `BasicCorrelationContext` object and pass in the `id` you wish to log for that row. We recommend that this `id` be a unique ID from your system, so that you can uniquely identify each logged row from your Blob Storage. Pass this object into your `collect()` API call as a parameter:
 
-```python
-# create a context with a custom unique id
-artificial_context = BasicCorrelationContext(id='test')
-
-# collect inputs data, store correlation_context
-context = inputs_collector.collect(input_df, artificial_context)
-```
+```python
+# create a context with a custom unique id
+artificial_context = BasicCorrelationContext(id='test')
+
+# collect inputs data, store correlation_context
+context = inputs_collector.collect(input_df, artificial_context)
+```
 
-1. Ensure that you pass into the context into your `outputs_collector` so that your model inputs and outputs have the same unique ID logged with them, and they can be easily correlated later:
+1. Ensure that you pass the context into your `outputs_collector` so that your model inputs and outputs have the same unique ID logged with them and can be easily correlated later:
 
-```python
-# collect outputs data, pass in context so inputs and outputs data can be correlated later
-outputs_collector.collect(output_df, context)
-```
+```python
+# collect outputs data, pass in context so inputs and outputs data can be correlated later
+outputs_collector.collect(output_df, context)
+```
 
-A comprehensive example is detailed below:
+The following code is an example of a full scoring script (`score.py`) that logs custom unique IDs:
 
 ```python
 import pandas as pd
@@ -206,15 +206,15 @@ def predict(input_df):
 return output_df
 ```
 
-#### Collect data for model performance monitoring
+### Collect data for model performance monitoring
 
-If you are interested in using your collected data for model performance monitoring, it is important that each logged row has a unique `correlationid` which can be used to correlate the data with ground truth data, when it becomes available. The data collector will autogenerate a unique `correlationid` for each logged row, and include it in the `correlationid` field in the JSON object. Please see [store collected data in a blob](#store-collected-data-in-a-blob) for comprehensive details on the JSON schema.
+If you want to use your collected data for model performance monitoring, it's important that each logged row has a unique `correlationid` that can be used to correlate the data with ground truth data, when such data becomes available. The data collector will autogenerate a unique `correlationid` for each logged row and include this autogenerated ID in the `correlationid` field in the JSON object. For more information on the JSON schema, see [store collected data in a blob](#store-collected-data-in-a-blob).
 
-If you are interested in using your own unique ID to log with your production data, it is recommended that you log it as a separate column in your `pandas DataFrame`. The reason for this is because [the data collector will batch requests](#data-collector-batching) which fall within close proximity of one another. If you need the `correlationid` to be readily available downstream for integration with ground truth data, having it logged as a separate column is recommended.
+If you want to use your own unique ID for logging with your production data, we recommend that you log this ID as a separate column in your pandas DataFrame, since the [data collector batches requests](#data-collector-batching) that are sent within short time intervals of one another. By logging the `correlationid` as a separate column, you make it readily available downstream for integration with ground truth data.
 
 ### Update your dependencies
 
-Before you can create your deployment with the updated scoring script, you need to create your environment with the base image `mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04` and the appropriate conda dependencies. Thereafter, you can build the environment using the specification in the following YAML.
+Before you can create your deployment with the updated scoring script, you need to create your environment with the base image `mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04` and the appropriate conda dependencies. Thereafter, you can build the environment using the specification in the following YAML:
 
 ```yml
 channels:
@@ -383,9 +383,9 @@ With collected binary data, we show the raw file directly, with `instance_id` as
 
 #### Data collector batching
 
-The data collector will batch requests together into the same JSON object if they are sent within a short duration of each other. For example, if you run a script to send sample data to your endpoint, and the deployment has data collection enabled, some of the requests may get batched together, depending on the interval of time between them. If you are using data collection to use with [Azure Machine Learning model monitoring](#concept-model-monitoring.md) this behavior is handled appropriately and each request is handled as independent by the model monitoring service. However, if you expect each logged row of data have its own unique `correlationid`, you can include the `correlationid` as a column in the `pandas DataFrame` you are logging with the data collector. Information on how to do this can be found in [data collection for model performance monitoring][#collect-data-for-model-performance-monitoring].
+If requests are sent within short time intervals of one another, the data collector batches them together into the same JSON object. For example, if you run a script to send sample data to your endpoint, and the deployment has data collection enabled, some of the requests can get batched together, depending on the time interval between them. If you're using data collection with [Azure Machine Learning model monitoring](concept-model-monitoring.md), the model monitoring service handles each request independently. However, if you expect each logged row of data to have its own unique `correlationid`, you can include the `correlationid` as a column in the pandas DataFrame you're logging with the data collector. For more information, see [Collect data for model performance monitoring](#collect-data-for-model-performance-monitoring).
 
-Here is an example of two logged requests being batched together:
+Here is an example of two logged requests that are batched together:
 
 ```json
 {"specversion":"1.0",
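
The recommendation in this file to log your own unique ID as a separate DataFrame column can be sketched as follows. This is a minimal illustration, not part of the Azure Machine Learning SDK; the feature columns and the use of `uuid` are our assumptions:

```python
import uuid

import pandas as pd

# Hypothetical model inputs; in a real scoring script these come from the request payload.
input_df = pd.DataFrame({"feature_a": [0.1, 0.5, 0.9], "feature_b": [1, 2, 3]})

# Log a unique ID in its own column so it survives data collector batching and is
# readily available downstream for joining with ground truth data.
input_df["correlationid"] = [str(uuid.uuid4()) for _ in range(len(input_df))]
```

Because the ID is stored as a regular column, it is logged alongside the features in every `collect()` call and can later serve as the join key against ground truth data.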

articles/machine-learning/how-to-monitor-model-performance.md

Lines changed: 4 additions & 4 deletions
@@ -459,13 +459,13 @@ You must satisfy the following requirements for you to configure your model perf
 
 * (Optional) Have a pre-joined tabular dataset with model outputs and ground truth data already joined together.
 
-### Monitoring model performance requirements when using data collector
+### Monitor model performance requirements when using data collector
 
-If you use the [Azure Machine Learning data collector](concept-data-collection.md) to collect production inference data and do not supply your own unique ID for each row as a separate column, a `correlationid` will be autogenerated for you and included in the logged JSON object. However, the data collector will [batch rows](how-to-collect-production-data.md#data-collector-batching) which are sent within close proximity to each other. Batched rows will fall within the same JSON object and will thus have the same `correlationid`.
+If you use the [Azure Machine Learning data collector](concept-data-collection.md) to collect production inference data without supplying your own unique ID for each row as a separate column, a `correlationid` will be autogenerated for you and included in the logged JSON object. However, the data collector will [batch rows](how-to-collect-production-data.md#data-collector-batching) that are sent within short time intervals of each other. Batched rows will fall within the same JSON object and will thus have the same `correlationid`.
 
-In order to differentiate between the rows in the same JSON object, Azure Machine Learning model performance monitoring uses indexing to determine the first, second, third, and so on, row in the JSON object. For example, if three rows are batched together, and the `correlationid` is `test`, row 1 will have an id of `test_0`, row 2 will have an id of `test_1`, and row 3 will have an id `test_2`. To ensure that your ground truth dataset contains unique IDs which match to the collected production inference model outputs, ensure that you index each `correlationid` appropriately. If your logged JSON object only has one row, then the `correlationid` would be `correlationid_0`.
+In order to differentiate between the rows in the same JSON object, Azure Machine Learning model performance monitoring uses indexing to determine the order of the rows in the JSON object. For example, if three rows are batched together, and the `correlationid` is `test`, row one will have an ID of `test_0`, row two will have an ID of `test_1`, and row three will have an ID of `test_2`. To ensure that your ground truth dataset contains unique IDs that match the collected production inference model outputs, ensure that you index each `correlationid` appropriately. If your logged JSON object has only one row, the `correlationid` would be `correlationid_0`.
 
-To avoid using this indexing, we recommend that you log your unique ID in its own column within the `pandas DataFrame`, using the [Azure Machine Learning data collector](how-to-collect-production-data.md). Then, in your model monitoring configuration, you specify the name of this column to join your model output data with your ground data. As long as the IDs for each row in both datasets are the same, Azure Machine Learning model monitoring can perform model performance monitoring.
+To avoid using this indexing, we recommend that you log your unique ID in its own column within the pandas DataFrame that you're logging with the [Azure Machine Learning data collector](how-to-collect-production-data.md). Then, in your model monitoring configuration, specify the name of this column to join your model output data with your ground truth data. As long as the IDs for each row in both datasets are the same, Azure Machine Learning model monitoring can perform model performance monitoring.
 
 ### Example workflow for monitoring model performance
 
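
The indexing behavior described in this diff (row one gets `test_0`, row two gets `test_1`, and so on) can be sketched with a small helper. The function name is ours for illustration and is not part of any Azure Machine Learning SDK:

```python
def indexed_correlation_ids(correlationid: str, num_rows: int) -> list[str]:
    # Batched rows share a single correlationid; model performance monitoring
    # distinguishes them by appending the zero-based row index as a suffix.
    return [f"{correlationid}_{i}" for i in range(num_rows)]

# Three batched rows with correlationid "test":
print(indexed_correlation_ids("test", 3))  # ['test_0', 'test_1', 'test_2']
```

A ground truth dataset keyed on these indexed IDs joins cleanly with the collected model outputs.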
0 commit comments
