


Group Relative Policy Optimization

group_relative_policy_optimization

Overview

Component that optimizes a policy model with Group Relative Policy Optimization (GRPO) reinforcement learning. Supports PyTorch distributed training with DeepSpeed optimizations.
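
For orientation, a common form of the group-relative objective from the GRPO literature is sketched below; the component's exact loss may differ, but it shows where the beta and epsilon inputs listed under Inputs enter:

$$
J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
$$

Here $G$ corresponds to num_generations, the advantages $\hat{A}_i$ are normalized within each group of completions sampled for the same prompt $q$, epsilon bounds the policy ratio, and beta scales the KL penalty against the reference policy.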

Version: 0.0.1

View in Studio: https://ml.azure.com/registries/azureml/components/group_relative_policy_optimization/version/0.0.1
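
A minimal sketch of pulling this component from the azureml registry and wiring it into a pipeline with the azure-ai-ml Python SDK (v2). The workspace details, compute name, dataset paths, and DeepSpeed config path below are placeholders, not values from this page:

```python
from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

# Client scoped to the azureml registry that hosts the component.
registry_client = MLClient(credential=DefaultAzureCredential(), registry_name="azureml")
grpo = registry_client.components.get(
    name="group_relative_policy_optimization", version="0.0.1"
)

# Client scoped to the workspace that will run the pipeline (placeholder IDs).
ws_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

@dsl.pipeline(default_compute="<gpu-cluster>")
def grpo_pipeline():
    # Only a subset of inputs is shown; see the Inputs table below for the full list.
    step = grpo(
        model_name_or_path=Input(type="uri_folder", path="<model-import-output>"),
        dataset_train_split=Input(type="uri_file", path="<train.jsonl>"),
        dataset_validation_split=Input(type="uri_file", path="<validation.jsonl>"),
        dataset_prompt_column="problem",
        deepspeed_config=Input(type="uri_file", path="<deepspeed_config.json>"),
        num_generations=4,
        per_device_train_batch_size=8,
    )
    return {"trained_model": step.outputs.mlflow_model_folder}

job = ws_client.jobs.create_or_update(grpo_pipeline(), experiment_name="grpo-finetune")
```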

Inputs

| Name | Description | Type | Default | Optional | Enum |
| ---- | ----------- | ---- | ------- | -------- | ---- |
| dataset_train_split | Path to the training dataset in JSONL format. | uri_file | | True | |
| dataset_validation_split | Path to the validation dataset in JSONL format. | uri_file | | True | |
| model_name_or_path | Output folder of the model import component, containing the model artifacts and a metadata file. | uri_folder | | False | |
| beta | The beta parameter controls the strength of the KL divergence penalty in the objective function. | number | 0.0 | True | |
| dataset_name | Name of the Hugging Face dataset to pull in. | string | | True | |
| dataset_prompt_column | Column in the dataset containing the prompt for the chat completion template. | string | problem | False | |
| deepspeed_config | Path to a custom DeepSpeed configuration file in JSON format. | uri_file | | False | |
| epsilon | Epsilon value for clipping. | number | 0.5 | True | |
| eval_steps | Number of steps between evaluations. | integer | 1 | True | |
| eval_strategy | Evaluation strategy to use during training. | string | disable | True | ['disable', 'steps', 'epoch'] |
| gradient_accumulation_steps | Number of steps over which to accumulate gradients before performing a backward/update pass. | integer | 1 | True | |
| learning_rate | Learning rate for training. | number | 3e-06 | True | |
| logging_steps | Number of steps between logging updates. | number | 5 | True | |
| lr_scheduler_type | The scheduler type to use for learning rate scheduling. | string | cosine | True | ['linear', 'cosine', 'cosine_with_restarts', 'polynomial', 'constant', 'constant_with_warmup', 'inverse_sqrt', 'reduce_lr_on_plateau'] |
| max_completion_length | Maximum length of the generated completion. | integer | 256 | True | |
| max_grad_norm | Maximum gradient norm for gradient clipping. | number | 1.0 | True | |
| max_prompt_length | Maximum length of the prompt. If the prompt is longer than this value, it is truncated from the left. | integer | 512 | True | |
| max_steps | If set to a positive number, overrides num_train_epochs and trains for exactly this many steps. Set to -1 to disable (default). | integer | -1 | True | |
| num_generations | Number of generations to sample per prompt. The effective batch size (num_processes * per_device_batch_size * gradient_accumulation_steps) must be evenly divisible by this value; see the check sketched after this table. | integer | 4 | True | |
| num_iterations | Number of iterations per batch (denoted as μ in the algorithm). | integer | 3 | True | |
| num_train_epochs | Number of training epochs. | number | 4 | True | |
| optim | The optimizer to use. | string | adamw_torch | True | ['adamw_torch', 'adamw_torch_fused', 'adafactor', 'ademamix', 'sgd', 'adagrad', 'rmsprop', 'galore_adamw', 'lomo', 'adalomo', 'grokadamw', 'schedule_free_sgd'] |
| per_device_eval_batch_size | Per-device batch size used for evaluation. | integer | 8 | True | |
| per_device_train_batch_size | Per-device batch size used for training. | integer | 8 | True | |
| save_steps | Number of steps between saving checkpoints. | integer | 100 | True | |
| save_total_limit | Maximum number of checkpoints to keep. | integer | 20 | True | |
| shuffle_dataset | Whether to shuffle the training dataset. | boolean | True | True | |
| temperature | Temperature for sampling. The higher the temperature, the more random the completions. | number | 1.0 | True | |
| top_p | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1.0 to consider all tokens. | number | 1.0 | True | |
| use_liger_kernel | Whether to use the Liger kernel. | boolean | False | True | |
| vllm_gpu_memory_utilization | Controls the GPU memory utilization for vLLM. | number | 0.3 | True | |
| vllm_tensor_parallel_size | Controls the tensor parallel size for vLLM. | integer | 1 | True | |
| warmup_ratio | Ratio of total training steps used for a linear warmup from 0 to learning_rate. | number | 0.1 | True | |
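
As noted for num_generations, the effective batch size must be evenly divisible by the number of generations. A small illustrative check, using the defaults from the table; num_processes here stands for the number of distributed training processes (typically one per GPU) and is an assumption for illustration:

```python
def check_num_generations(num_processes: int,
                          per_device_train_batch_size: int = 8,
                          gradient_accumulation_steps: int = 1,
                          num_generations: int = 4) -> int:
    """Return the effective batch size, raising if it cannot be split into generation groups."""
    effective_batch_size = (
        num_processes * per_device_train_batch_size * gradient_accumulation_steps
    )
    if effective_batch_size % num_generations != 0:
        raise ValueError(
            f"Effective batch size {effective_batch_size} is not divisible "
            f"by num_generations={num_generations}"
        )
    return effective_batch_size

# Example: 4 processes x 8 per-device x 1 accumulation = 32, divisible by 4 generations.
print(check_num_generations(num_processes=4))  # -> 32
```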

Outputs


| Name | Description | Type |
| ---- | ----------- | ---- |
| mlflow_model_folder | Output folder containing the best model as defined by metric_for_best_model. Along with the best model, the folder contains checkpoints saved after every evaluation, as defined by evaluation_strategy. Each checkpoint contains the model weights, config, tokenizer, optimizer, scheduler, and random-number states. | mlflow_model |
| output_model_path | Path to the output model folder containing the checkpoints. | uri_folder |
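
Once the job finishes, the mlflow_model_folder output can be downloaded and loaded like any MLflow model. A sketch assuming the folder has already been downloaded locally; the local path is a placeholder, and which flavors are available depends on how the component packages the model:

```python
import mlflow

# Local copy of the mlflow_model_folder output (e.g., downloaded with
# `az ml job download` or the SDK); this path is a placeholder.
model_path = "./downloaded_outputs/mlflow_model_folder"

model = mlflow.pyfunc.load_model(model_path)
print(model.metadata.flavors)  # inspect which flavors the component registered
```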

Environment

azureml://registries/azureml/environments/acft-group-relative-policy-optimization/versions/3
