-Additional advanced preprocessing and featurization are also available, such as missing values imputation, encoding, and transforms. [Learn more about what featurization is included](how-to-create-portal-experiments.md#preprocess). Enable this setting with:
+Additional advanced preprocessing and featurization are also available, such as data guardrails, encoding, and transforms. [Learn more about what featurization is included](how-to-create-portal-experiments.md#preprocess). Enable this setting with:

 + Azure Machine Learning studio: Selecting the **View featurization settings** in the **Configuration Run** section [with these steps](how-to-create-portal-experiments.md).
@@ -160,7 +160,7 @@ Learn more and see an example of [automated machine learning for time series for
 * holiday detection and featurization
 * time-series and DNN learners (Auto-ARIMA, Prophet, ForecastTCN)
articles/machine-learning/how-to-create-portal-experiments.md
16 additions & 1 deletion
@@ -146,11 +146,12 @@ Variance| Measure of how far spread out this column's data is from its average v
 Skewness| Measure of how different this column's data is from a normal distribution.
 Kurtosis| Measure of how heavily tailed this column's data is compared to a normal distribution.

+
 <a name="preprocess"></a>

 ## Advanced preprocessing options

-When configuring your experiments, you can enable the advanced setting `Preprocess`. Doing so means that the following data preprocessing and featurization steps are performed automatically.
+When configuring your experiments, you can enable the advanced setting `Preprocess`. Doing so means that as part of preprocessing the following data guardrails and featurization steps are performed automatically.

 |Preprocessing steps| Description |
 | ------------- | ------------- |
@@ -164,6 +165,20 @@ When configuring your experiments, you can enable the advanced setting `Preproce
 |Weight of Evidence (WoE)|Calculates WoE as a measure of correlation of categorical columns to the target column. It is calculated as the log of the ratio of in-class vs out-of-class probabilities. This step outputs one numerical feature column per class and removes the need to explicitly impute missing values and outlier treatment.|
 |Cluster Distance|Trains a k-means clustering model on all numerical columns. Outputs k new features, one new numerical feature per cluster, containing the distance of each sample to the centroid of each cluster.|

+### Data guardrails
+
+Automated machine learning offers data guardrails to help you identify potential issues with your data (e.g., missing values, class imbalance) and help take corrective actions for improved results. There are many best practices that are available and can be applied to achieve reliable results.
+
+The following table describes the currently supported data guardrails, and the associated statuses that users may come across when submitting their experiment.
+
+Guardrail|Status|Condition for trigger
+---|---|---
+Missing values imputation |**Passed** <br> <br> **Fixed**| No missing value in any of the input columns <br> <br> Some columns have missing values
+Cross validation|**Done**|If no explicit validation set is provided
+High cardinality feature detection| **Passed** <br> <br>**Done**| No high cardinality features were detected <br><br> High cardinality input columns were detected
+Class balance detection |**Passed** <br><br><br>**Alerted** |Classes are balanced in the training data; A dataset is considered balanced if each class has good representation in the dataset, as measured by number and ratio of samples <br> <br> Classes in the training data are imbalanced
+Time-series data consistency|**Passed** <br><br><br><br> **Fixed** |<br> The selected {horizon, lag, rolling window} value(s) were analyzed, and no potential out-of-memory issues were detected. <br> <br>The selected {horizon, lag, rolling window} values were analyzed and will potentially cause your experiment to run out of memory. The lag or rolling window has been turned off.
+
 ## Run experiment and view results

 Select **Start** to run your experiment. The experiment preparing process can take up to 10 minutes. Training jobs can take an additional 2-3 minutes for each pipeline to finish running.
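
The guardrail table added in this change names statuses and trigger conditions but gives no thresholds. As a rough illustration only, the missing-values and class-balance rows could be sketched as below. The function names and the `min_count` and `min_ratio` thresholds are hypothetical, chosen for the example, and are not taken from Azure Machine Learning.

```python
from collections import Counter


def missing_values_status(rows):
    """Return "Fixed" if any cell is None (imputation would apply), else "Passed"."""
    has_missing = any(value is None for row in rows for value in row)
    return "Fixed" if has_missing else "Passed"


def class_balance_status(labels, min_count=10, min_ratio=0.2):
    """Return "Alerted" if any class is under-represented, else "Passed".

    Assumption: a class is under-represented when it has fewer than
    `min_count` samples, or fewer than `min_ratio` times the samples of
    the largest class. Both thresholds are illustrative only.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    largest = max(counts.values())
    for count in counts.values():
        if count < min_count or count / largest < min_ratio:
            return "Alerted"
    return "Passed"
```

Both checks mirror the table's wording: the class-balance check looks at the number and ratio of samples per class, and the missing-values check reports **Fixed** only when at least one input cell is missing.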