Merge pull request #209400 from ssalgadodev/autoMLNLPUpdates

prmerger-automator[bot] · web-flow · commit aa03c269dea7 · 2022-09-26T20:04:28.000Z
AutoML | NLP doc removed and added features
diff --git a/articles/machine-learning/how-to-auto-train-nlp-models.md b/articles/machine-learning/how-to-auto-train-nlp-models.md
@@ -79,7 +79,11 @@ Task |AutoML job syntax| Description
 ----|----|---
 Multi-class text classification | CLI v2: `text_classification`  <br> SDK v2 (preview): `text_classification()`| There are multiple possible classes and each sample can be classified as exactly one class. The task is to predict the correct class for each sample. <br> <br> For example, classifying a movie script as "Comedy" or "Romantic". 
 Multi-label text classification |  CLI v2: `text_classification_multilabel`  <br> SDK v2 (preview): `text_classification_multilabel()`| There are multiple possible classes and each sample can be assigned any number of classes. The task is to predict all the classes for each sample<br> <br> For example, classifying a movie script as "Comedy", or "Romantic", or "Comedy and Romantic". 
-Named Entity Recognition (NER)|  CLI v2:`text_ner` <br> SDK v2 (preview): `text_ner()`| There are multiple possible tags for tokens in sequences. The task is to predict the tags for all the tokens for each sequence. <br> <br> For example, extracting domain-specific entities from unstructured text, such as contracts or financial documents
+Named Entity Recognition (NER)|  CLI v2:`text_ner` <br> SDK v2 (preview): `text_ner()`| There are multiple possible tags for tokens in sequences. The task is to predict the tags for all the tokens for each sequence. <br> <br> For example, extracting domain-specific entities from unstructured text, such as contracts or financial documents.
+
+## Thresholding
+
+Thresholding is a multi-label feature that lets users pick the threshold above which a predicted probability leads to a positive label. Lower values allow more labels, which is better when users care more about recall, but this option could lead to more false positives. Higher values allow fewer labels and are hence better for users who care about precision, but this option could lead to more false negatives.
 
 ## Preparing data
 
@@ -178,9 +182,8 @@ Automated ML's NLP capability is triggered through task specific `automl` type j
 However, there are key differences: 
 * You can ignore `primary_metric`, as it is only for reporting purposes. Currently, automated ML only trains one model per run for NLP and there is no model selection.
 * The `label_column_name` parameter is only required for multi-class and multi-label text classification tasks.
-* If the majority of the samples in your dataset contain more than 128 words, it's considered long range. By default, automated ML considers all samples long range text. To disable this feature, include the `enable_long_range_text=False`  parameter  in your `AutoMLConfig`.
-   * If you enable long range text, then a GPU with higher memory is required such as, [NCv3](../virtual-machines/ncv3-series.md) series  or  [ND](../virtual-machines/nd-series.md)  series.
-   * The `enable_long_range_text` parameter is only available for multi-class classification tasks.
+* If more than 10% of the samples in your dataset contain more than 128 tokens, the dataset is considered long range.
+   * To use the long range text feature, use an NC6 or better GPU SKU, such as the [NCv3](../virtual-machines/ncv3-series.md) series or the [ND](../virtual-machines/nd-series.md) series.
 
 # [Azure CLI](#tab/cli)
 
@@ -279,6 +282,8 @@ max_concurrent_iterations = number_of_vms
 enable_distributed_dnn_training = True
 ```
 
+In AutoML NLP, only hold-out validation is supported, and it requires a validation dataset.
+
 ---
 
 ## Submit the AutoML job
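
To illustrate the Thresholding paragraph added in the first hunk, here is a minimal sketch of how a threshold turns per-label probabilities into positive labels. It is not taken from the AutoML implementation: the helper `labels_above_threshold`, the label names, and the probability values are all hypothetical and chosen only to show the recall/precision trade-off the paragraph describes.

```python
from typing import Dict, List


def labels_above_threshold(probs: Dict[str, float], threshold: float) -> List[str]:
    """Return the labels whose predicted probability is strictly above `threshold`.

    Hypothetical helper for illustration; not part of any Azure ML API.
    """
    return sorted(label for label, p in probs.items() if p > threshold)


# Made-up per-label probabilities for a single movie script.
probs = {"Comedy": 0.72, "Romantic": 0.41, "Horror": 0.08}

# A lower threshold admits more labels (favors recall, risks false positives).
low = labels_above_threshold(probs, 0.3)

# A higher threshold admits fewer labels (favors precision, risks false negatives).
high = labels_above_threshold(probs, 0.6)

print("threshold=0.3:", low)
print("threshold=0.6:", high)
```

With these made-up numbers, the lower threshold keeps both "Comedy" and "Romantic", while the higher one keeps only "Comedy".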