group_relative_policy_optimization
Component to fine-tune a policy model with Group Relative Policy Optimization (GRPO) reinforcement learning. Supports PyTorch distributed training with DeepSpeed optimizations.
Version: 0.0.1
View in Studio: https://ml.azure.com/registries/azureml/components/group_relative_policy_optimization/version/0.0.1
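
The component can be pulled from the shared azureml registry and wired into a pipeline job with the azure-ai-ml SDK v2. The sketch below is illustrative rather than official usage: workspace identifiers, the compute target, and data paths are placeholders, and only a subset of the inputs listed in the table below is shown.

```python
# A minimal sketch, assuming the azure-ai-ml SDK v2: resolve the component from
# the shared "azureml" registry and submit it inside a pipeline job.
from azure.ai.ml import Input, MLClient
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Client scoped to the azureml registry, used only to resolve the component.
registry_client = MLClient(credential=credential, registry_name="azureml")

# Client scoped to your workspace, used to submit the pipeline job.
ml_client = MLClient(
    credential=credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

grpo_component = registry_client.components.get(
    name="group_relative_policy_optimization", version="0.0.1"
)

@pipeline()
def grpo_pipeline(train_data: Input, model_folder: Input, ds_config: Input):
    grpo_step = grpo_component(
        dataset_train_split=train_data,
        model_name_or_path=model_folder,
        deepspeed_config=ds_config,
        dataset_prompt_column="problem",
        num_generations=4,
        per_device_train_batch_size=8,
    )
    grpo_step.compute = "gpu-cluster"  # placeholder compute target
    return {"trained_model": grpo_step.outputs.mlflow_model_folder}

job = grpo_pipeline(
    train_data=Input(type="uri_file", path="<path-to-train.jsonl>"),
    model_folder=Input(type="uri_folder", path="<model-import-output-folder>"),
    ds_config=Input(type="uri_file", path="<path-to-deepspeed-config.json>"),
)
ml_client.jobs.create_or_update(job, experiment_name="grpo-finetune")
```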
Inputs

Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
dataset_train_split | Path to the training dataset in JSONL format | uri_file | | True | |
dataset_validation_split | Path to the validation dataset in JSONL format | uri_file | | True | |
model_name_or_path | Output folder of the model import component containing model artifacts and a metadata file. | uri_folder | | False | |
beta | The beta parameter controls the strength of the KL divergence penalty in the objective function | number | 0.0 | True | |
dataset_name | Name of the Hugging Face dataset to pull in | string | | True | |
dataset_prompt_column | Column in the dataset containing the prompt for the chat completion template | string | problem | False | |
deepspeed_config | Path to a custom DeepSpeed configuration file in JSON format | uri_file | | False | |
epsilon | Epsilon value for clipping the policy probability ratio | number | 0.5 | True | |
eval_steps | Number of steps between evaluations | integer | 1 | True | |
eval_strategy | Evaluation strategy to use during training. Options are "disable", "steps", or "epoch". | string | disable | True | ['disable', 'steps', 'epoch'] |
gradient_accumulation_steps | Number of steps before performing a backward/update pass to accumulate gradients. | integer | 1 | True | |
learning_rate | Learning rate for training. | number | 3e-06 | True | |
logging_steps | Number of steps between logging updates. | number | 5 | True | |
lr_scheduler_type | The scheduler type to use for learning rate scheduling. | string | cosine | True | ['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup', 'inverse_sqrt', 'reduce_lr_on_plateau'] |
max_completion_length | Maximum length of the generated completion. | integer | 256 | True | |
max_grad_norm | Maximum gradient norm for gradient clipping. | number | 1.0 | True | |
max_prompt_length | Maximum length of the prompt. If the prompt is longer than this value, it will be truncated from the left. | integer | 512 | True | |
max_steps | If set to a positive number, this will override num_train_epochs and train for exactly this many steps. Set to -1 to disable (default). | integer | -1 | True | |
num_generations | Number of generations to sample. The effective batch size (num_processes * per_device_batch_size * gradient_accumulation_steps) must be evenly divisible by this value; see the sketch after this table. | integer | 4 | True | |
num_iterations | Number of iterations per batch (denoted as μ in the algorithm). | integer | 3 | True | |
num_train_epochs | Number of training epochs. | number | 4 | True | |
optim | The optimizer to use. | string | adamw_torch | True | ['adamw_torch', 'adamw_torch_fused', 'adafactor', 'ademamix', 'sgd', 'adagrad', 'rmsprop', 'galore_adamw', 'lomo', 'adalomo', 'grokadamw', 'schedule_free_sgd'] |
per_device_eval_batch_size | Per device batch size used for evaluation. | integer | 8 | True | |
per_device_train_batch_size | Per device batch size used for training | integer | 8 | True | |
save_steps | Number of steps between saving checkpoints. | integer | 100 | True | |
save_total_limit | Maximum number of checkpoints to keep. | integer | 20 | True | |
shuffle_dataset | Whether to shuffle the training dataset. | boolean | True | True | |
temperature | Temperature for sampling. The higher the temperature, the more random the completions. | number | 1.0 | True | |
top_p | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1.0 to consider all tokens. | number | 1.0 | True | |
use_liger_kernel | Whether to use the Liger kernel. | boolean | False | True | |
vllm_gpu_memory_utilization | Control the GPU memory utilization for vLLM. | number | 0.3 | True | |
vllm_tensor_parallel_size | Control the tensor parallel size for vLLM. | integer | 1 | True | |
warmup_ratio | Ratio of total training steps used for a linear warmup from 0 to learning_rate. | number | 0.1 | True | |
final_model_save_pat
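
The divisibility constraint on num_generations and the roles of beta, epsilon, and num_iterations can be made concrete with a small sketch. It assumes the component wraps TRL's GRPOTrainer/GRPOConfig (the "μ" wording above mirrors TRL's documentation); the mapping is illustrative and may not match the component's internal wiring exactly. Values echo the defaults from the table.

```python
# A minimal sketch of the hyperparameter semantics, assuming TRL's GRPOConfig.
from trl import GRPOConfig

num_processes = 8                     # GPUs participating in training
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_generations = 4

# The effective batch size must be evenly divisible by num_generations,
# because each prompt's group of sampled completions is scored together.
effective_batch_size = (
    num_processes * per_device_train_batch_size * gradient_accumulation_steps
)
assert effective_batch_size % num_generations == 0, (
    f"effective batch size {effective_batch_size} is not divisible "
    f"by num_generations={num_generations}"
)

config = GRPOConfig(
    output_dir="./grpo_output",
    beta=0.0,                         # strength of the KL-divergence penalty
    epsilon=0.5,                      # clipping range for the policy ratio
    num_generations=num_generations,  # completions sampled per prompt
    num_iterations=3,                 # μ: optimization passes per generation batch
    max_prompt_length=512,            # longer prompts are truncated from the left
    max_completion_length=256,
    temperature=1.0,
    top_p=1.0,
    learning_rate=3e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
)
```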
Outputs

Name | Description | Type |
---|---|---|
mlflow_model_folder | Output folder containing the best model as defined by metric_for_best_model. Along with the best model, the output folder contains the checkpoints saved after every evaluation, as defined by eval_strategy. Each checkpoint contains the model weight(s), config, tokenizer, optimizer, scheduler, and random number states. | mlflow_model |
output_model_path | Path to the output model folder containing the checkpoints | uri_folder |
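
Once the pipeline job finishes, the mlflow_model_folder output can be registered as a model asset in the workspace. A minimal sketch, assuming the pipeline exposed that output under the name trained_model (as in the earlier sketch); job and asset names are placeholders.

```python
# A minimal sketch: register the GRPO component's MLflow model output.
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

registered = ml_client.models.create_or_update(
    Model(
        # Named output of the finished pipeline job.
        path="azureml://jobs/<pipeline-job-name>/outputs/trained_model",
        type=AssetTypes.MLFLOW_MODEL,
        name="grpo-finetuned-model",
        description="Best checkpoint produced by group_relative_policy_optimization",
    )
)
```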
Environment

azureml://registries/azureml/environments/acft-group-relative-policy-optimization/versions/3