|
| 1 | +--- |
| 2 | +title: 'CLI (v2) Automated ML NLP text classification multilabel job YAML schema' |
| 3 | +titleSuffix: Azure Machine Learning |
| 4 | +description: Reference documentation for the CLI (v2) automated ML NLP text classification multilabel job YAML schema. |
| 5 | +services: machine-learning |
| 6 | +ms.service: machine-learning |
| 7 | +ms.subservice: |
| 8 | +ms.topic: reference |
| 9 | +ms.custom: cliv2, event-tier1-ignite-2022 |
| 10 | + |
| 11 | +ms.author: xiaoxiaoli |
| 12 | +author: xiaoxiaoli |
| 13 | +ms.date: 12/22/2022 |
| 14 | +ms.reviewer: ssalgado |
| 15 | +--- |
| 16 | + |
| 17 | +# CLI (v2) Automated ML text classification multilabel job YAML schema |
| 18 | + |
| 19 | +[!INCLUDE [cli v2](../../includes/machine-learning-cli-v2.md)] |
| 20 | + |
| 21 | +[!INCLUDE [schema note](../../includes/machine-learning-preview-old-json-schema-note.md)] |
| 22 | + |
| 23 | +Every Azure Machine Learning entity has a schematized YAML representation. You can create a new entity from a YAML configuration file with a `.yml` or `.yaml` extension. |
| 24 | + |
| 25 | +This article provides a reference for some syntax concepts you will encounter while configuring these YAML files for NLP text classification multilabel jobs. |
| 26 | + |
| 27 | +The source JSON schema can be found at |
| 28 | +https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLNLPTextClassificationMultilabelJob.schema.json |
| 29 | + |
| 30 | +## YAML syntax |
| 31 | + |
| 32 | +| Key | Type | Description | Allowed values | Default value | |
| 33 | +| --- | ---- | ----------- | -------------- | ------------- | |
| 34 | +| `$schema` | string | Represents the location/url to load the YAML schema. If the user uses the Azure Machine Learning VS Code extension to author the YAML file, including `$schema` at the top of the file enables the user to invoke schema and resource completions. | | | |
| 35 | +| `type` | const | **Required.** The type of job. | `automl` | `automl` | |
| 36 | +| `task` | const | **Required.** The type of AutoML task. <br> Task description for multilabel classification: <br> There are multiple possible classes and each sample can be assigned any number of classes. The task is to predict all the classes for each sample. For example, classifying a movie script as "Comedy", or "Romantic", or "Comedy and Romantic".| `text_classification_multilabel` | | |
| 37 | +| `name` | string | Name of the job. Must be unique across all jobs in the workspace. If omitted, Azure ML will autogenerate a GUID for the name. | | | |
| 38 | +| `display_name` | string | Display name of the job in the studio UI. Can be non-unique within the workspace. If omitted, Azure ML will autogenerate a human-readable adjective-noun identifier for the display name. | | | |
| 39 | +| `experiment_name` | string | Experiment name to organize the job under. Each job's run record will be organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, Azure ML will default it to the name of the working directory where the job was created. | | | |
| 40 | +| `description` | string | Description of the job. | | | |
| 41 | +| `tags` | object | Dictionary of tags for the job. | | | |
| 42 | +| `compute` | string | Name of the compute target to execute the job on. To reference an existing compute in the workspace, use the syntax `azureml:<compute_name>`. | | | |
| 43 | +| `log_verbosity` | number | Different levels of log verbosity. |`not_set`, `debug`, `info`, `warning`, `error`, `critical` | `info` | |
| 44 | +| `primary_metric` | string | The metric that AutoML will optimize for model selection. |`accuracy` | `accuracy` | |
| 45 | +| `target_column_name` | string | **Required.** The name of the column to target for predictions. It must always be specified. This parameter is applicable to `training_data` and `validation_data`. | | | |
| 46 | +| `training_data` | object | **Required.** The data to be used within the job. See [multi label](./how-to-auto-train-nlp-models.md?tabs=cli#multi-label) section for more detail. | | | |
| 47 | +| `validation_data` | object | **Required.** The validation data to be used within the job. It should be consistent with the training data in terms of the set of columns, the data type of each column, and the order of columns from left to right, and it should contain at least two unique labels. <br> *Note*: the column names within each dataset should be unique. See [data validation](./how-to-auto-train-nlp-models.md?tabs=cli#data-validation) section for more information.| | | |
| 48 | +| `limits` | object | Dictionary of limit configurations of the job. Parameters in this section: `max_concurrent_trials`, `max_nodes`, `max_trials`, `timeout_minutes`, `trial_timeout_minutes`. See [limits](#limits) for detail.| | | |
| 49 | +| `training_parameters` | object | Dictionary containing training parameters for the job. <br> See [supported hyperparameters](#supported-hyperparameters) for detail. <br> *Note*: Hyperparameters set in the `training_parameters` are fixed across all sweeping runs and thus don't need to be included in the search space. | | | |
| 50 | +| `sweep` | object | Dictionary containing sweep parameters for the job. It has two keys - `sampling_algorithm` (**required**) and `early_termination`. For more information, see [model sweeping and hyperparameter tuning](./how-to-auto-train-nlp-models.md?tabs=cli#model-sweeping-and-hyperparameter-tuning-preview) sections. | | | |
| 51 | +| `search_space` | object | Dictionary of the hyperparameter search space. The key is the name of the hyperparameter and the value is the parameter expression. All parameters that are fixable via `training_parameters` are supported here (to be instead swept over). See [supported hyperparameters](#supported-hyperparameters) for more detail. <br> There are two types of hyperparameters: <br> - **Discrete Hyperparameters**: Discrete hyperparameters are specified as a [`choice`](./reference-yaml-job-sweep.md#choice) among discrete values. `choice` can be one or more comma-separated values, a `range` object, or any arbitrary `list` object. Advanced discrete hyperparameters can also be specified using a distribution - [`randint`](./reference-yaml-job-sweep.md#randint), [`qlognormal`, `qnormal`](./reference-yaml-job-sweep.md#qlognormal-qnormal), [`qloguniform`, `quniform`](./reference-yaml-job-sweep.md#qloguniform-quniform). For more information, see this [section](./how-to-tune-hyperparameters.md#discrete-hyperparameters). <br> - **Continuous hyperparameters**: Continuous hyperparameters are specified as a distribution over a continuous range of values. Currently supported distributions are - [`lognormal`, `normal`](./reference-yaml-job-sweep.md#lognormal-normal), [`loguniform`](./reference-yaml-job-sweep.md#loguniform), [`uniform`](./reference-yaml-job-sweep.md#uniform). For more information, see this [section](./how-to-tune-hyperparameters.md#continuous-hyperparameters). <br> <br> See [parameter expressions](./reference-yaml-job-sweep.md#parameter-expressions) section for the set of possible expressions to use. | | | |
| 52 | +| `outputs` | object | Dictionary of output configurations of the job. The key is a name for the output within the context of the job and the value is the output configuration. | | | |
| 53 | +| `outputs.best_model` | object | Dictionary of output configurations for best model. For more information, see [Best model output configuration](#best-model-output-configuration). | | | |
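|  | + |
|  | +As a minimal sketch of how the top-level keys in the table above fit together, a job file might look like the following. The experiment name, compute name, target column, and folder paths are hypothetical placeholders, not values defined by this reference. |
|  | + |
|  | +```yaml |
|  | +# Sketch only: all names and paths below are illustrative assumptions. |
|  | +$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLNLPTextClassificationMultilabelJob.schema.json |
|  | +type: automl |
|  | +task: text_classification_multilabel |
|  | + |
|  | +experiment_name: my-multilabel-experiment   # hypothetical |
|  | +description: Multilabel text classification with AutoML NLP |
|  | +compute: azureml:gpu-cluster                # hypothetical compute name |
|  | + |
|  | +primary_metric: accuracy |
|  | +log_verbosity: info |
|  | + |
|  | +target_column_name: labels                  # hypothetical column name |
|  | +training_data: |
|  | +  type: mltable |
|  | +  path: ./training-mltable-folder           # hypothetical local folder |
|  | +validation_data: |
|  | +  type: mltable |
|  | +  path: ./validation-mltable-folder         # hypothetical local folder |
|  | +``` |
|  | + |
|  | +Because the two `path` values have no URI scheme, they would be treated as local references and uploaded when the job is created, as described under [Training or validation data](#training-or-validation-data). |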
| 54 | + |
| 55 | +Other syntax used in configurations: |
| 56 | + |
| 57 | +### Limits |
| 58 | + |
| 59 | +| Key | Type | Description | Allowed values | Default value | |
| 60 | +| --- | ---- | ----------- | -------------- | ------------- | |
| 61 | +| `max_concurrent_trials` | integer | Represents the maximum number of trials (child jobs) that are executed in parallel. | | `1` | |
| 62 | +| `max_trials` | integer | Represents the maximum number of trials an AutoML NLP job can run, each trying the training algorithm with a different combination of hyperparameters. | | `1` | |
| 63 | +| `timeout_minutes` | integer | Represents the maximum amount of time in minutes that the submitted AutoML job can take to run. After this, the job is terminated. The default timeout in AutoML NLP jobs is 7 days. | | `10080` | |
| 64 | +| `trial_timeout_minutes` | integer | Represents the maximum amount of time in minutes that each trial (child job) in the submitted AutoML job can take to run. After this, the child job is terminated. | | | |
| 65 | +|`max_nodes`| integer | The maximum number of nodes from the backing compute cluster to leverage for the job.| | `1` | |
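|  | + |
|  | +The limits above could be combined in a job file roughly as follows. The numbers are arbitrary illustrative values, chosen only so that the per-trial timeout is shorter than the overall job timeout. |
|  | + |
|  | +```yaml |
|  | +# Sketch only: values are illustrative, not recommendations. |
|  | +limits: |
|  | +  timeout_minutes: 120         # whole job is terminated after 2 hours |
|  | +  trial_timeout_minutes: 60    # each trial (child job) is terminated after 1 hour |
|  | +  max_trials: 4                # at most 4 hyperparameter combinations are tried |
|  | +  max_concurrent_trials: 2     # up to 2 trials run in parallel |
|  | +  max_nodes: 4                 # use at most 4 nodes of the backing compute cluster |
|  | +``` |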
| 66 | + |
| 67 | +### Supported hyperparameters |
| 68 | + |
| 69 | +The following table describes the hyperparameters that AutoML NLP supports. |
| 70 | + |
| 71 | +| Parameter name | Description | Syntax | |
| 72 | +|-------|---------|---------| |
| 73 | +| gradient_accumulation_steps | The number of backward operations whose gradients are to be summed up before performing one step of gradient descent by calling the optimizer’s step function. <br><br> This is leveraged to use an effective batch size which is gradient_accumulation_steps times larger than the maximum size that fits the GPU. | Must be a positive integer. | |
| 74 | +| learning_rate | Initial learning rate. | Must be a float in the range (0, 1). | |
| 75 | +| learning_rate_scheduler |Type of learning rate scheduler. | Must choose from `linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup`. | |
| 76 | +| model_name | Name of one of the supported models. | Must choose from `bert_base_cased, bert_base_uncased, bert_base_multilingual_cased, bert_base_german_cased, bert_large_cased, bert_large_uncased, distilbert_base_cased, distilbert_base_uncased, roberta_base, roberta_large, distilroberta_base, xlm_roberta_base, xlm_roberta_large, xlnet_base_cased, xlnet_large_cased`. | |
| 77 | +| number_of_epochs | Number of training epochs. | Must be a positive integer. | |
| 78 | +| training_batch_size | Training batch size. | Must be a positive integer. | |
| 79 | +| validation_batch_size | Validation batch size. | Must be a positive integer. | |
| 80 | +| warmup_ratio | Ratio of total training steps used for a linear warmup from 0 to learning_rate. | Must be a float in the range [0, 1]. | |
| 81 | +| weight_decay | Value of weight decay when optimizer is sgd, adam, or adamw. | Must be a float in the range [0, 1]. | |
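|  | + |
|  | +To illustrate the difference between fixing a hyperparameter and sweeping over it, the sketch below fixes the scheduler and warmup ratio in `training_parameters` and sweeps the model, epoch count, and learning rate in `search_space`. All values are illustrative. Writing `search_space` as a single-entry list of hyperparameter dictionaries and the bandit policy settings are assumptions here; check the schema and the sweep job reference for the exact shape. |
|  | + |
|  | +```yaml |
|  | +# Sketch only: values are illustrative assumptions. |
|  | +training_parameters:           # fixed across all sweeping runs |
|  | +  learning_rate_scheduler: linear |
|  | +  warmup_ratio: 0.1            # must be in [0, 1] |
|  | + |
|  | +sweep: |
|  | +  sampling_algorithm: random   # required key of `sweep` |
|  | +  early_termination:           # optional; bandit settings assumed for illustration |
|  | +    type: bandit |
|  | +    evaluation_interval: 2 |
|  | +    slack_amount: 0.05 |
|  | +    delay_evaluation: 6 |
|  | + |
|  | +search_space:                  # list form is an assumption; see lead-in |
|  | +  - model_name:                # discrete hyperparameter |
|  | +      type: choice |
|  | +      values: [bert_base_cased, roberta_base] |
|  | +    number_of_epochs:          # discrete hyperparameter |
|  | +      type: choice |
|  | +      values: [3, 4] |
|  | +    learning_rate:             # continuous hyperparameter, must stay within (0, 1) |
|  | +      type: uniform |
|  | +      min_value: 0.000005 |
|  | +      max_value: 0.00005 |
|  | +``` |
|  | + |
|  | +No hyperparameter appears in both sections: anything set in `training_parameters` is held constant, so it is left out of the search space. |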
| 82 | + |
| 83 | +### Training or validation data |
| 84 | + |
| 85 | +| Key | Type | Description | Allowed values | Default value | |
| 86 | +| --- | ---- | ----------- | -------------- | ------------- | |
| 87 | +| `description` | string | The detailed information that describes this input data. | | | |
| 88 | +| `path` | string | The path (URI) from where data should be loaded. Path can be a `file` path, `folder` path, or `pattern` for paths. `pattern` specifies a search pattern to allow globbing (`*` and `**`) of files and folders containing data. Supported URI types are `azureml`, `https`, `wasbs`, `abfss`, and `adl`. For more information on how to use the `azureml://` URI format, see [core yaml syntax](./reference-yaml-core-syntax.md). If this URI doesn't have a scheme (for example, http:, azureml: etc.), then it's considered a local reference and the file it points to is uploaded to the default workspace blob-storage as the entity is created. | | | |
| 89 | +| `mode` | string | Dataset delivery mechanism. | `direct` | `direct` | |
| 90 | +| `type` | const | To generate NLP models, the user needs to bring training data in the form of an MLTable. For more information, see [preparing data](./how-to-auto-train-nlp-models.md#preparing-data). | `mltable` | `mltable` | |
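|  | + |
|  | +Since `path` has to point at an MLTable, the referenced folder is expected to contain an `MLTable` definition file next to the data. A rough sketch of such a file for a CSV dataset is shown below. The CSV file name is hypothetical, and the transformation settings are assumptions that should be checked against the MLTable documentation. |
|  | + |
|  | +```yaml |
|  | +# Sketch of an MLTable file placed inside the folder given in `path`. |
|  | +# The file name and settings are illustrative assumptions. |
|  | +paths: |
|  | +  - file: ./train.csv                 # hypothetical CSV with the text and label columns |
|  | +transformations: |
|  | +  - read_delimited: |
|  | +      delimiter: ',' |
|  | +      encoding: 'utf8' |
|  | +      header: all_files_same_headers  # first row holds the column names |
|  | +``` |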
| 91 | + |
| 92 | +### Best model output configuration |
| 93 | + |
| 94 | +| Key | Type | Description | Allowed values |Default value | |
| 95 | +| --- | ---- | ----------- | -------------- | ------------ | |
| 96 | +| `type` | string | **Required.** Type of best model. AutoML allows only mlflow models. | `mlflow_model` | `mlflow_model` | |
| 97 | +| `path` | string | **Required.** URI of the location where the model-artifact file(s) are stored. If this URI doesn't have a scheme (for example, http:, azureml: etc.), then it's considered a local reference and the file it points to is uploaded to the default workspace blob-storage as the entity is created. | | | |
| 98 | +| `storage_uri` | string | The HTTP URL of the Model. Use this URL with `az storage copy -s THIS_URL -d DESTINATION_PATH --recursive` to download the data. | | | |
| 99 | + |
| 100 | +## Remarks |
| 101 | + |
| 102 | +The `az ml job` command can be used for managing Azure Machine Learning jobs. |
| 103 | + |
| 104 | +## Examples |
| 105 | + |
| 106 | +Examples are available in the [examples GitHub repository](https://github.com/Azure/azureml-examples/tree/main/cli/jobs). Examples relevant to NLP text classification multilabel jobs are linked below. |
| 107 | + |
| 108 | +## YAML: AutoML text classification multilabel job |
| 109 | + |
| 110 | +:::code language="yaml" source="~/azureml-examples-main/cli/jobs/automl-standalone-jobs/cli-automl-text-classification-multilabel-paper-cat/cli-automl-text-classification-multilabel-paper-cat.yml"::: |
| 111 | + |
| 112 | +## YAML: AutoML text classification multilabel pipeline job |
| 113 | + |
| 114 | +:::code language="yaml" source="~/azureml-examples-main/cli/jobs/pipelines/automl/cli-automl-text-classification-multilabel-paper-categorization-pipeline/pipeline.yml"::: |
| 115 | + |
| 116 | +## Next steps |
| 117 | + |
| 118 | +- [Install and use the CLI (v2)](how-to-configure-cli.md) |