Training and Evaluation #7
Conversation
Pull request overview
This PR adds a comprehensive training and evaluation pipeline for FactualDPO++, introducing baseline DPO training, modified Factual-DPO++ training with Δ-margin, automated GPT-4o-mini evaluation, and centralized YAML-based configuration. The changes include TRL-derived trainer files that are intentionally copied without formatting.
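For context on the Δ-margin mentioned above: margin-style variants of DPO typically subtract a margin term from the preference logits before the log-sigmoid. The sketch below is a generic illustration under that assumption, not the PR's actual Factual-DPO++ objective; the name `delta`, the fixed-constant margin, and the argument names are all placeholders.

```python
import torch.nn.functional as F

def dpo_margin_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta: float = 0.1, delta: float = 0.0):
    """DPO loss with an additive margin; `delta` stands in for the Δ-margin.

    In Factual-DPO++ the margin may be computed per example (e.g. from
    factuality scores) rather than being a fixed constant as shown here.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # A positive delta widens the preference gap the model must achieve.
    logits = beta * (chosen_ratio - rejected_ratio) - delta
    return -F.logsigmoid(logits).mean()
```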
Key Changes
- Added training scripts for original DPO and modified Factual-DPO++ approaches
- Implemented evaluation framework with async GPT-judge scoring and batch processing (see the sketch after this list)
- Integrated TRL utilities and modified DPO trainer for Factual-DPO patch
- Added centralized configuration system and comprehensive README documentation
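The async judge pattern referenced above generally looks like the following. This is a minimal sketch, not the PR's eval_core_utilities.py: the rubric prompt, the function names, and the fixed concurrency cap are all assumptions.

```python
import asyncio
from openai import AsyncOpenAI  # requires openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def judge_one(sem: asyncio.Semaphore, question: str, answer: str) -> str:
    """Score one answer with the GPT judge, bounded by the semaphore."""
    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate the answer's factuality from 1 to 5."},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        return resp.choices[0].message.content

async def judge_batch(pairs: list[tuple[str, str]], concurrency: int = 8) -> list[str]:
    """Fan out judge calls with a concurrency cap; results keep input order."""
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(judge_one(sem, q, a) for q, a in pairs))

# Example: scores = asyncio.run(judge_batch([("Q1", "A1"), ("Q2", "A2")]))
```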
Reviewed changes
Copilot reviewed 88 out of 122 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/aixpert/training/training/trl/trainer/dpo_config.py` | Configuration dataclass for DPO training parameters, copied from TRL |
| `src/aixpert/training/training/trl/trainer/cpo_trainer.py` | CPO trainer implementation copied from TRL for preference optimization |
| `src/aixpert/training/training/trl/trainer/cpo_config.py` | Configuration dataclass for CPO training parameters |
| `src/aixpert/training/training/trl/trainer/callbacks.py` | Training callbacks including BEMA, weight sync, and logging utilities |
| `src/aixpert/training/training/trl/trainer/bco_trainer.py` | Deprecated BCO trainer wrapper redirecting to the experimental module |
| `src/aixpert/training/training/trl/trainer/bco_config.py` | Deprecated BCO config wrapper with a deprecation warning |
| `src/aixpert/training/training/trl/trainer/base_trainer.py` | Base trainer class with model card generation |
| `src/aixpert/training/training/trl/trainer/__init__.py` | Lazy module initialization for trainer components |
| `src/aixpert/training/training/trl/templates/rm_model_card.md` | Template for reward model cards |
| `src/aixpert/training/training/trl/templates/lm_model_card.md` | Template for language model cards |
| `src/aixpert/training/training/trl/scripts/vllm_serve.py` | vLLM server script with weight synchronization support |
| `src/aixpert/training/training/trl/scripts/utils.py` | Utility functions for dataset loading and argument parsing |
| `src/aixpert/training/training/trl/scripts/sft.py` | Supervised fine-tuning script |
| `src/aixpert/training/training/trl/scripts/rloo.py` | RLOO training script with reward functions |
| `src/aixpert/training/training/trl/scripts/reward.py` | Reward model training script |
| `src/aixpert/training/training/trl/scripts/kto.py` | KTO training script |
| `src/aixpert/training/training/trl/scripts/grpo.py` | GRPO training script |
| `src/aixpert/training/training/trl/scripts/env.py` | Environment information printing utility |
| `src/aixpert/training/training/trl/scripts/dpo.py` | DPO training script |
| `src/aixpert/training/training/trl/scripts/__init__.py` | Lazy module initialization for scripts |
| `src/aixpert/training/training/trl/rewards/other_rewards.py` | Soft overlong punishment reward function |
| `src/aixpert/training/training/trl/rewards/format_rewards.py` | Format-checking reward functions |
| `src/aixpert/training/training/trl/rewards/accuracy_rewards.py` | Accuracy-based reward functions |
| `src/aixpert/training/training/trl/rewards/__init__.py` | Reward functions module initialization |
| `src/aixpert/training/training/trl/models/utils.py` | Model utilities including chat template setup and PEFT preparation |
| `src/aixpert/training/training/trl/models/modeling_value_head.py` | Value head models for causal and seq2seq LMs |
| `src/aixpert/training/training/trl/models/__init__.py` | Models module initialization |
| `src/aixpert/training/training/trl/extras/__init__.py` | Extras module placeholder |
| `src/aixpert/training/training/trl/experimental/xpo/__init__.py` | XPO experimental trainer exports |
Copilot reviewed 88 out of 122 changed files in this pull request and generated no new comments.
```yaml
  logging_steps: 20
  seed: 3407

modified_dpo:
```
The seed is defined for original_dpo but missing here in modified_dpo.
I have added a seed in modified_dpo as well.
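For reference, the fixed block presumably mirrors the baseline's seed; the exact values and surrounding keys in the committed config may differ:

```yaml
modified_dpo:
  logging_steps: 20
  seed: 3407
```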
The script assumes attribute-style config access (e.g., paths.output_root, hp.max_seq_length). Please confirm that load_config() guarantees this, or update to explicit dict access to avoid runtime errors.
The training logic is hard-coded to original_dpo. If this script is intended to generalize, consider parameterizing the DPO mode. Also, no explicit seed appears to be set in the training script.
These are small changes, but important for correctness and experimental reliability.
Thank you for flagging this. load_config() returns a plain Python dictionary via yaml.safe_load, so attribute-style access is not guaranteed. We have updated the code to use explicit dictionary access throughout (e.g., paths["output_root"], hp["max_seq_length"]) to avoid any runtime errors and ensure predictable behavior.
This script is intentionally scoped to run the Original-DPO baseline and therefore explicitly references the original_dpo configuration block. Factual-DPO training is handled by a separate runner that consumes the factual_dpo configuration. This separation keeps baseline and modified objectives isolated and avoids accidental configuration mixing.
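A minimal sketch of the pattern described above, assuming PyYAML; load_config and the key names shown (paths, output_root, original_dpo, max_seq_length) follow this thread, but the committed code may differ:

```python
import yaml

def load_config(path: str) -> dict:
    """Load the YAML config as a plain dict.

    yaml.safe_load returns dicts/lists, so attribute-style access
    (cfg.paths) is not available; callers must use key lookups.
    """
    with open(path) as f:
        return yaml.safe_load(f)

cfg = load_config("config.yaml")  # hypothetical path
paths = cfg["paths"]              # explicit dict access, not cfg.paths
hp = cfg["original_dpo"]          # this runner is scoped to the baseline block

output_root = paths["output_root"]
max_seq_length = hp["max_seq_length"]
```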
I have addressed all your comments. Please verify these modifications.
All comments have been addressed; the structure is cleaner and more consistent now.
Okay, as per my discussion with Shaina, merging this PR.
PR Type
Feature
Short Description
Add a complete training and evaluation pipeline for FactualDPO++, including:
Details
- Baseline Original-DPO training runner (run_dpo_training.py)
- Modified Factual-DPO++ training runner (run_modified_training.py)
- GPT-judge evaluation prompt template (eval_template.py)
- Core evaluation utilities (eval_core_utilities.py)
- Batch evaluation driver (run_all_evaluations.py)
- TRL-derived files (training/trl/* and the modified dpo_trainer.py), intentionally copied without template formatting

Notes