
Commit 67f77b8

Merge pull request #97111 from vmagelo/datadrift-work

Datadrift small changes.

2 parents 239dd09 + a81222c

2 files changed: +18 -18 lines

articles/machine-learning/service/how-to-monitor-data-drift.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -26,7 +26,7 @@ In the context of machine learning, data drift is the change in model input data
 With Azure Machine Learning, you can monitor the inputs to a model deployed on AKS and compare this data to the training dataset for the model. At regular intervals, the inference data is [snapshot and profiled](how-to-explore-prepare-data.md), then computed against the baseline dataset to produce a data drift analysis that:
 
 + Measures the magnitude of data drift, called the drift coefficient.
-+ Measures the data drift contribution by feature, informing which features caused data drift.
++ Measures the data drift contribution by feature, indicating which features caused data drift.
 + Measures distance metrics. Currently Wasserstein and Energy Distance are computed.
 + Measures distributions of features. Currently kernel density estimation and histograms.
 + Send alerts to data drift by email.
````
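The Wasserstein distance mentioned in the list above can be illustrated with a small self-contained sketch. This is not the service's internal implementation; it only shows the idea: for two equal-sized 1-D samples, the empirical Wasserstein-1 distance reduces to the mean absolute difference between the sorted samples.

```python
def wasserstein_1d(baseline, target):
    """Empirical 1-D Wasserstein-1 distance for two equal-sized samples:
    the mean absolute difference between the sorted samples."""
    if len(baseline) != len(target):
        raise ValueError("samples must be the same size")
    pairs = zip(sorted(baseline), sorted(target))
    return sum(abs(b - t) for b, t in pairs) / len(baseline)

baseline = [1.0, 2.0, 3.0, 4.0]
serving = [3.0, 4.0, 5.0, 6.0]  # same distribution shape, shifted by +2

print(wasserstein_1d(baseline, serving))  # → 2.0
```

A pure shift of the serving data shows up directly as the size of the shift, which is why this family of metrics is useful as a drift signal.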
````diff
@@ -85,7 +85,7 @@ from azureml.datadrift import DataDriftDetector, AlertConfiguration
 # if email address is specified, setup AlertConfiguration
 alert_config = AlertConfiguration('[email protected]')
 
-# create a new DatadriftDetector object
+# create a new DataDriftDetector object
 datadrift = DataDriftDetector.create(ws, model.name, model.version, services, frequency="Day", alert_config=alert_config)
 
 print('Details of Datadrift Object:\n{}'.format(datadrift))
````
````diff
@@ -107,7 +107,7 @@ run = datadrift.run(target_date, services, feature_list=feature_list, compute_ta
 
 # show details of the data drift run
 exp = Experiment(ws, datadrift._id)
-dd_run = Run(experiment=exp, run_id=run)
+dd_run = Run(experiment=exp, run_id=run.id)
 RunDetails(dd_run).show()
 ```
 
````
````diff
@@ -137,7 +137,7 @@ The following Python example demonstrates how to plot relevant data drift metric
 # start and end are datetime objects
 drift_metrics = datadrift.get_output(start_time=start, end_time=end)
 
-# Show all data drift result figures, one per serivice.
+# Show all data drift result figures, one per service.
 # If setting with_details is False (by default), only the data drift magnitude will be shown; if it's True, all details will be shown.
 drift_figures = datadrift.show(with_details=True)
 ```
````

articles/machine-learning/service/how-to-monitor-datasets.md

Lines changed: 14 additions & 14 deletions
````diff
@@ -61,12 +61,12 @@ Conceptually, there are three primary scenarios for setting up dataset monitors
 Scenario | Description
 ---|---
 Monitoring a model's serving data for drift from the model's training data | Results from this scenario can be interpreted as monitoring a proxy for the model's accuracy, given that model accuracy degrades if the serving data drifts from the training data.
-Monitoring a time series dataset for drift from a previous time period. | This scenario is more general, and can be used to monitor datasets involved upstream or downstream of model building. The target dataset must have a timestamp column, while the baseline dataset can be any tabular dataset which has features in common with the target dataset.
-Performing analysis on past data. | This can be used to understand historical data and inform decisions in settings for dataset monitors.
+Monitoring a time series dataset for drift from a previous time period. | This scenario is more general, and can be used to monitor datasets involved upstream or downstream of model building. The target dataset must have a timestamp column, while the baseline dataset can be any tabular dataset that has features in common with the target dataset.
+Performing analysis on past data. | This scenario can be used to understand historical data and inform decisions in settings for dataset monitors.
 
 ## How dataset can monitor data
 
-Using Azure Machine Learning, data drift is monitored through datasets. To monitor for data drift, a baseline dataset - usually the training dataset for a model - is specified. A target dataset - usually model input data - is compared over time to your baseline dataset. This means that your target dataset must have a timestamp column specified.
+Using Azure Machine Learning, data drift is monitored through datasets. To monitor for data drift, a baseline dataset - usually the training dataset for a model - is specified. A target dataset - usually model input data - is compared over time to your baseline dataset. This comparison means that your target dataset must have a timestamp column specified.
 
 ### Set the `timeseries` trait in the target dataset
````
````diff
@@ -130,19 +130,19 @@ This table contains basic settings used for the dataset monitor.
 | ------- | ----------- | ---- | ------- |
 | Name | Name of the dataset monitor. | | No |
 | Baseline dataset | Tabular dataset that will be used as the baseline for comparison of the target dataset over time. | The baseline dataset must have features in common with the target dataset. Generally, the baseline should be set to a model's training dataset or a slice of the target dataset. | No |
-| Target dataset | Tabular dataset with timestamp column specified which will be analyzed for data drift | The target dataset must have features in common with the baseline dataset, and should be a `timeseries` dataset which new data is appended to. Historical data in the target dataset can be analyzed, or new data can be monitored. | No |
-| Frequency | This is the frequency which will be used to schedule the pipeline job and analyze historical data if running a backfill. Options include daily, weekly, or monthly. | Adjust this setting to include a comparable size of data to the baseline. | No |
-| Features | List of features which will be analyzed for data drift over time | Set to a model's output feature(s) to measure concept drift. Do not include features that naturally drift over time (month, year, index, etc.). You can backfill and existing data drift monitor after adjusting the list of features. | Yes |
+| Target dataset | Tabular dataset with timestamp column specified which will be analyzed for data drift. | The target dataset must have features in common with the baseline dataset, and should be a `timeseries` dataset, which new data is appended to. Historical data in the target dataset can be analyzed, or new data can be monitored. | No |
+| Frequency | The frequency that will be used to schedule the pipeline job and analyze historical data if running a backfill. Options include daily, weekly, or monthly. | Adjust this setting to include a comparable size of data to the baseline. | No |
+| Features | List of features that will be analyzed for data drift over time. | Set to a model's output feature(s) to measure concept drift. Do not include features that naturally drift over time (month, year, index, etc.). You can backfill an existing data drift monitor after adjusting the list of features. | Yes |
 | Compute target | Azure Machine Learning compute target to run the dataset monitor jobs. | | Yes |
 
 ### Monitor settings
 
-These settings are for the scheduled dataset monitor pipeline which will be created.
+These settings are for the scheduled dataset monitor pipeline, which will be created.
 
 | Setting | Description | Tips | Mutable |
 | ------- | ----------- | ---- | ------- |
-| Enable | Enable or disable the schedule on the dataset monitor pipeline | Disable this to analyze historical data with the backfill setting. It can be enabled after the dataset monitor is created. | Yes |
-| Latency | Time, in hours, it takes for data to arrive in the dataset. For instance, if it takes three days for data to arrive in the SQL DB my dataset encapsulates, set the latency to 72. | Cannot be changed after the dataset monitor is created | No |
+| Enable | Enable or disable the schedule on the dataset monitor pipeline | Disable the schedule to analyze historical data with the backfill setting. It can be enabled after the dataset monitor is created. | Yes |
+| Latency | Time, in hours, it takes for data to arrive in the dataset. For instance, if it takes three days for data to arrive in the SQL DB the dataset encapsulates, set the latency to 72. | Cannot be changed after the dataset monitor is created | No |
 | Email addresses | Email addresses for alerting based on breach of the data drift percentage threshold. | Emails are sent through Azure Monitor. | Yes |
 | Threshold | Data drift percentage threshold for email alerting. | Further alerts and events can be set on many other metrics in the workspace's associated Application Insights resource. | Yes |
 
````
````diff
@@ -153,7 +153,7 @@ These settings are for running a backfill on past data for data drift metrics.
 | Setting | Description | Tips |
 | ------- | ----------- | ---- |
 | Start date | Start date of the backfill job. | |
-| End date | End date of the backfill job. | This cannot be more than 31*frequency units of time from the start date. On an existing dataset monitor, metrics can be backfilled to analyze historical data or replace metrics with updated settings. |
+| End date | End date of the backfill job. | The end date cannot be more than 31*frequency units of time from the start date. On an existing dataset monitor, metrics can be backfilled to analyze historical data or replace metrics with updated settings. |
 
 ## Create dataset monitors
 
````
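The "31*frequency units" limit on the backfill end date in the hunk above can be sketched with a short calculation. The `max_backfill_end` helper and the 30-day approximation for "monthly" are assumptions for illustration, not part of the service's API:

```python
from datetime import datetime, timedelta

# Hypothetical helper illustrating the "31 * frequency units" limit;
# "monthly" is approximated as 30 days for this sketch.
FREQUENCY_UNIT = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}

def max_backfill_end(start, frequency):
    """Latest end date allowed for a backfill starting at `start`."""
    return start + 31 * FREQUENCY_UNIT[frequency]

print(max_backfill_end(datetime(2019, 1, 1), "daily"))  # → 2019-02-01 00:00:00
```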
````diff
@@ -178,7 +178,7 @@ The resulting dataset monitor will appear in the list. Select it to go to that m
 
 See the [Python SDK reference documentation on data drift](/python/api/azureml-datadrift/azureml.datadrift) for full details.
 
-The following is an example of creation of a dataset monitor using the Python SDK
+The following example shows how to create a dataset monitor using the Python SDK.
 
 ```python
 from azureml.core import Workspace, Dataset
````
````diff
@@ -249,7 +249,7 @@ The following image is an example of charts seen in the **Drift overview** resu
 
 The **Feature details** section contains feature-level insights into the change in the selected feature's distribution, as well as other statistics, over time.
 
-The target dataset is also profiled over time. The statistical distance between the baseline distribution of each feature is compared with the target dataset's over time, which is conceptually similar to the data drift magnitude with the exception that this is for an individual feature. Min, max, and mean are also available.
+The target dataset is also profiled over time. The statistical distance between the baseline distribution of each feature is compared with the target dataset's over time, which is conceptually similar to the data drift magnitude with the exception that this statistical distance is for an individual feature. Min, max, and mean are also available.
 
 In the Azure Machine Learning studio, if you click on a data point in the graph the distribution of the feature being shown will adjust accordingly. By default, it shows the baseline dataset's distribution and the most recent run's distribution of the same feature.
 
````
````diff
@@ -292,7 +292,7 @@ Select Logs (Analytics) under Monitoring on the left pane:
 
 ![Application insights overview](media/how-to-monitor-datasets/ai-overview.png)
 
-The dataset monitor metrics are stored as `customMetrics`. You can write and run a simple query after setting up a dataset monitor to view them:
+The dataset monitor metrics are stored as `customMetrics`. You can write and run a query after setting up a dataset monitor to view them:
 
 [![Log analytics query](media/how-to-monitor-datasets/simple-query.png)](media/how-to-monitor-datasets/simple-query-expanded.png)
 
````
````diff
@@ -318,7 +318,7 @@ Columns, or features, in the dataset are classified as categorical or numeric ba
 | Feature type | Data type | Condition | Limitations |
 | ------------ | --------- | --------- | ----------- |
 | Categorical | string, bool, int, float | The number of unique values in the feature is less than 100 and less than 5% of the number of rows. | Null is treated as its own category. |
-| Numerical | int, float | Of a numerical data type and does not meet conditions for a categorical feature. | Feature dropped if >15% of values are null. |
+| Numerical | int, float | The values in the feature are of a numerical data type and do not meet the condition for a categorical feature. | Feature dropped if >15% of values are null. |
 
 ## Next steps
 
````
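The classification rule in the table above can be sketched in plain Python. This is an approximation for illustration only, not the service's exact implementation; `None` stands in for a null value:

```python
def classify_feature(values):
    """Approximate the categorical-vs-numerical rule described above."""
    n_rows = len(values)
    non_null = [v for v in values if v is not None]
    # Null is treated as its own category when counting unique values.
    n_unique = len(set(non_null)) + (1 if len(non_null) < n_rows else 0)

    if n_unique < 100 and n_unique < 0.05 * n_rows:
        return "categorical"
    if all(isinstance(v, (int, float)) for v in non_null):
        null_fraction = (n_rows - len(non_null)) / n_rows
        # Numerical features are dropped if more than 15% of values are null.
        return "dropped" if null_fraction > 0.15 else "numerical"
    return "unclassified"

print(classify_feature(["a", "b"] * 500))   # → categorical (2 unique values in 1000 rows)
print(classify_feature(list(range(1000))))  # → numerical
```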