


Group Relative Policy Optimization

group_relative_policy_optimization

Overview

Component that optimizes a policy model with Group Relative Policy Optimization (GRPO) reinforcement learning. Supports PyTorch distributed training with DeepSpeed optimizations.
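
For orientation, a common form of the group-relative objective from the GRPO literature is sketched below; the component's exact loss may differ, but it shows where the beta and epsilon inputs listed under Inputs enter:

$$
J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
$$

Here $G$ corresponds to num_generations, the advantages $\hat{A}_i$ are normalized within each group of completions sampled for the same prompt $q$, epsilon bounds the policy ratio, and beta scales the KL penalty against the reference policy.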

Version: 0.0.1

View in Studio: https://ml.azure.com/registries/azureml/components/group_relative_policy_optimization/version/0.0.1
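
A minimal sketch of pulling this component from the azureml registry and wiring it into a pipeline with the azure-ai-ml Python SDK (v2). The workspace details, compute name, dataset paths, and DeepSpeed config path below are placeholders, not values from this page:

```python
from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

# Client scoped to the azureml registry that hosts the component.
registry_client = MLClient(credential=DefaultAzureCredential(), registry_name="azureml")
grpo = registry_client.components.get(
    name="group_relative_policy_optimization", version="0.0.1"
)

# Client scoped to the workspace that will run the pipeline (placeholder IDs).
ws_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

@dsl.pipeline(default_compute="<gpu-cluster>")
def grpo_pipeline():
    # Only a subset of inputs is shown; see the Inputs table below for the full list.
    step = grpo(
        model_name_or_path=Input(type="uri_folder", path="<model-import-output>"),
        dataset_train_split=Input(type="uri_file", path="<train.jsonl>"),
        dataset_validation_split=Input(type="uri_file", path="<validation.jsonl>"),
        dataset_prompt_column="problem",
        deepspeed_config=Input(type="uri_file", path="<deepspeed_config.json>"),
        num_generations=4,
        per_device_train_batch_size=8,
    )
    return {"trained_model": step.outputs.mlflow_model_folder}

job = ws_client.jobs.create_or_update(grpo_pipeline(), experiment_name="grpo-finetune")
```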

Inputs

| Name | Description | Type | Default | Optional | Enum |
| ---- | ----------- | ---- | ------- | -------- | ---- |
| dataset_train_split | Path to the training dataset in JSONL format. | uri_file | | True | |
| dataset_validation_split | Path to the validation dataset in JSONL format. | uri_file | | True | |
| model_name_or_path | Output folder of the model import component, containing the model artifacts and a metadata file. | uri_folder | | False | |
| beta | The beta parameter controls the strength of the KL divergence penalty in the objective function. | number | 0.0 | True | |
| dataset_name | Name of the Hugging Face dataset to pull in. | string | | True | |
| dataset_prompt_column | Column in the dataset containing the prompt for the chat completion template. | string | problem | False | |
| deepspeed_config | Path to a custom DeepSpeed configuration file in JSON format. | uri_file | | False | |
| epsilon | Epsilon value for clipping. | number | 0.5 | True | |
| eval_steps | Number of steps between evaluations. | integer | 1 | True | |
| eval_strategy | Evaluation strategy to use during training. | string | disable | True | ['disable', 'steps', 'epoch'] |
| gradient_accumulation_steps | Number of steps over which to accumulate gradients before performing a backward/update pass. | integer | 1 | True | |
| learning_rate | Learning rate for training. | number | 3e-06 | True | |
| logging_steps | Number of steps between logging updates. | number | 5 | True | |
| lr_scheduler_type | The scheduler type to use for learning rate scheduling. | string | cosine | True | ['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup', 'inverse_sqrt', 'reduce_lr_on_plateau'] |
| max_completion_length | Maximum length of the generated completion. | integer | 256 | True | |
| max_grad_norm | Maximum gradient norm for gradient clipping. | number | 1.0 | True | |
| max_prompt_length | Maximum length of the prompt. If the prompt is longer than this value, it is truncated from the left. | integer | 512 | True | |
| max_steps | If set to a positive number, overrides num_train_epochs and trains for exactly this many steps. Set to -1 to disable (default). | integer | -1 | True | |
| num_generations | Number of generations to sample per prompt. The effective batch size (num_processes * per_device_batch_size * gradient_accumulation_steps) must be evenly divisible by this value; see the check sketched after this table. | integer | 4 | True | |
| num_iterations | Number of iterations per batch (denoted as μ in the algorithm). | integer | 3 | True | |
| num_train_epochs | Number of training epochs. | number | 4 | True | |
| optim | The optimizer to use. | string | adamw_torch | True | ['adamw_torch', 'adamw_torch_fused', 'adafactor', 'ademamix', 'sgd', 'adagrad', 'rmsprop', 'galore_adamw', 'lomo', 'adalomo', 'grokadamw', 'schedule_free_sgd'] |
| per_device_eval_batch_size | Per-device batch size used for evaluation. | integer | 8 | True | |
| per_device_train_batch_size | Per-device batch size used for training. | integer | 8 | True | |
| save_steps | Number of steps between saving checkpoints. | integer | 100 | True | |
| save_total_limit | Maximum number of checkpoints to keep. | integer | 20 | True | |
| shuffle_dataset | Whether to shuffle the training dataset. | boolean | True | True | |
| temperature | Temperature for sampling. The higher the temperature, the more random the completions. | number | 1.0 | True | |
| top_p | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1.0 to consider all tokens. | number | 1.0 | True | |
| use_liger_kernel | Whether to use the Liger kernel. | boolean | False | True | |
| vllm_gpu_memory_utilization | Controls the GPU memory utilization for vLLM. | number | 0.3 | True | |
| vllm_tensor_parallel_size | Controls the tensor parallel size for vLLM. | integer | 1 | True | |
| warmup_ratio | Ratio of total training steps used for a linear warmup from 0 to learning_rate. | number | 0.1 | True | |
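
As noted for num_generations, the effective batch size must be evenly divisible by the number of generations. A small illustrative check, using the defaults from the table; num_processes here stands for the number of distributed training processes (typically one per GPU) and is an assumption for illustration:

```python
def check_num_generations(num_processes: int,
                          per_device_train_batch_size: int = 8,
                          gradient_accumulation_steps: int = 1,
                          num_generations: int = 4) -> int:
    """Return the effective batch size, raising if it cannot be split into generation groups."""
    effective_batch_size = (
        num_processes * per_device_train_batch_size * gradient_accumulation_steps
    )
    if effective_batch_size % num_generations != 0:
        raise ValueError(
            f"Effective batch size {effective_batch_size} is not divisible "
            f"by num_generations={num_generations}"
        )
    return effective_batch_size

# Example: 4 processes x 8 per-device x 1 accumulation = 32, divisible by 4 generations.
print(check_num_generations(num_processes=4))  # -> 32
```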

Outputs


| Name | Description | Type |
| ---- | ----------- | ---- |
| mlflow_model_folder | Output folder containing the best model as defined by metric_for_best_model. Along with the best model, the folder contains checkpoints saved after every evaluation, as defined by evaluation_strategy. Each checkpoint contains the model weights, config, tokenizer, optimizer, scheduler, and random-number states. | mlflow_model |
| output_model_path | Path to the output model folder containing the checkpoints. | uri_folder |
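
Once the job finishes, the mlflow_model_folder output can be downloaded and loaded like any MLflow model. A sketch assuming the folder has already been downloaded locally; the local path is a placeholder, and which flavors are available depends on how the component packages the model:

```python
import mlflow

# Local copy of the mlflow_model_folder output (e.g., downloaded with
# `az ml job download` or the SDK); this path is a placeholder.
model_path = "./downloaded_outputs/mlflow_model_folder"

model = mlflow.pyfunc.load_model(model_path)
print(model.metadata.flavors)  # inspect which flavors the component registered
```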

Environment

azureml://registries/azureml/environments/acft-group-relative-policy-optimization/versions/3
