
Conversation

@Sindhuja217
Collaborator

PR Type

Feature

Short Description

Add a complete training and evaluation pipeline for FactualDPO++, including:

  • Original DPO baseline training
  • Modified Factual-DPO++ training using a Δ-margin (see the objective sketch after this list)
  • Automated factuality evaluation using GPT-4o-mini
  • Centralized, config-driven setup for training and evaluation separately
  • README documentation
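
For reference, a minimal sketch of how a per-pair Δ-margin can enter the standard DPO objective. This is an illustration only; the exact FactualDPO++ formulation is whatever modified_dpo_trainer implements and may differ.

```latex
% Standard DPO objective (Rafailov et al., 2023)
\mathcal{L}_{\mathrm{DPO}} =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(\beta\,(r_\theta(x, y_w) - r_\theta(x, y_l))\big)\Big],
\qquad
r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% Margin variant: a per-pair margin \Delta shifts the boundary so that y_w must beat
% y_l by at least \Delta in implicit reward
\mathcal{L}_{\Delta} =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(\beta\,(r_\theta(x, y_w) - r_\theta(x, y_l)) - \Delta(x, y_w, y_l)\big)\Big]
```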

Details

  • Added full training scripts:
    • run_dpo_training.py
    • run_modified_training.py
  • Added utility modules for dataset loading, Unsloth model setup, LoRA config, and TRL DPO trainer construction (see the construction sketch after this list)
  • Integrated the FactualDPO patch via modified_dpo_trainer and TRL utilities (copied from the TRL repo)
  • Implemented evaluation framework:
    • Async GPT-judge scoring (eval_template.py)
    • Model generation utilities (eval_core_utilities.py)
    • Batch evaluation across all models and deltas (run_all_evaluations.py)
  • Added central YAML config and config_loader
  • Added README describing architecture, workflow, and usage
  • Marked TRL-derived files (training/trl/* and modifieddpo_trainer.py) as intentionally copied without template formatting
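
For context on the TRL DPO trainer construction mentioned above, here is a minimal sketch of how such a trainer is typically wired up with TRL. The model name, dataset, and hyperparameter values are placeholders, not the ones used by run_dpo_training.py, and the actual scripts (Unsloth model setup plus LoRA) may differ.

```python
# Hedged sketch: typical TRL DPOTrainer construction with placeholder values.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="outputs/original_dpo",
    beta=0.1,                        # placeholder hyperparameters
    per_device_train_batch_size=2,
    logging_steps=20,
    seed=3407,
)

trainer = DPOTrainer(
    model=model,                     # ref_model defaults to a frozen copy of model
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,      # older TRL releases use tokenizer=tokenizer
)
trainer.train()
```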

Notes

  • TRL-related trainer files are committed directly and excluded from formatting because they are unmodified code copied directly from Hugging Face TRL.

@Sindhuja217 changed the title from "Training" to "Training and Evaluation" on Dec 19, 2025

Copilot AI left a comment


Pull request overview

This PR adds a comprehensive training and evaluation pipeline for FactualDPO++, introducing baseline DPO training, modified Factual-DPO++ training with Δ-margin, automated GPT-4o-mini evaluation, and centralized YAML-based configuration. The changes include TRL-derived trainer files that are intentionally copied without formatting.

Key Changes

  • Added training scripts for original DPO and modified Factual-DPO++ approaches
  • Implemented evaluation framework with async GPT-judge scoring and batch processing
  • Integrated TRL utilities and modified DPO trainer for Factual-DPO patch
  • Added centralized configuration system and comprehensive README documentation

Reviewed changes

Copilot reviewed 88 out of 122 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| src/aixpert/training/training/trl/trainer/dpo_config.py | Configuration dataclass for DPO training parameters copied from TRL |
| src/aixpert/training/training/trl/trainer/cpo_trainer.py | CPO trainer implementation copied from TRL for preference optimization |
| src/aixpert/training/training/trl/trainer/cpo_config.py | Configuration dataclass for CPO training parameters |
| src/aixpert/training/training/trl/trainer/callbacks.py | Training callbacks including BEMA, weight sync, and logging utilities |
| src/aixpert/training/training/trl/trainer/bco_trainer.py | Deprecated BCO trainer wrapper redirecting to experimental module |
| src/aixpert/training/training/trl/trainer/bco_config.py | Deprecated BCO config wrapper with deprecation warning |
| src/aixpert/training/training/trl/trainer/base_trainer.py | Base trainer class with model card generation |
| src/aixpert/training/training/trl/trainer/__init__.py | Lazy module initialization for trainer components |
| src/aixpert/training/training/trl/templates/rm_model_card.md | Template for reward model cards |
| src/aixpert/training/training/trl/templates/lm_model_card.md | Template for language model cards |
| src/aixpert/training/training/trl/scripts/vllm_serve.py | vLLM server script with weight synchronization support |
| src/aixpert/training/training/trl/scripts/utils.py | Utility functions for dataset loading and argument parsing |
| src/aixpert/training/training/trl/scripts/sft.py | Supervised fine-tuning script |
| src/aixpert/training/training/trl/scripts/rloo.py | RLOO training script with reward functions |
| src/aixpert/training/training/trl/scripts/reward.py | Reward model training script |
| src/aixpert/training/training/trl/scripts/kto.py | KTO training script |
| src/aixpert/training/training/trl/scripts/grpo.py | GRPO training script |
| src/aixpert/training/training/trl/scripts/env.py | Environment information printing utility |
| src/aixpert/training/training/trl/scripts/dpo.py | DPO training script |
| src/aixpert/training/training/trl/scripts/__init__.py | Lazy module initialization for scripts |
| src/aixpert/training/training/trl/rewards/other_rewards.py | Soft overlong punishment reward function |
| src/aixpert/training/training/trl/rewards/format_rewards.py | Format-checking reward functions |
| src/aixpert/training/training/trl/rewards/accuracy_rewards.py | Accuracy-based reward functions |
| src/aixpert/training/training/trl/rewards/__init__.py | Reward functions module initialization |
| src/aixpert/training/training/trl/models/utils.py | Model utilities including chat template setup and PEFT preparation |
| src/aixpert/training/training/trl/models/modeling_value_head.py | Value head models for causal and seq2seq LMs |
| src/aixpert/training/training/trl/models/__init__.py | Models module initialization |
| src/aixpert/training/training/trl/extras/__init__.py | Extras module placeholder |
| src/aixpert/training/training/trl/experimental/xpo/__init__.py | XPO experimental trainer exports |


@shainarazavi requested a review from Copilot on December 22, 2025 at 17:31

Copilot AI left a comment


Pull request overview

Copilot reviewed 88 out of 122 changed files in this pull request and generated no new comments.



Review comment on the central YAML config:

```yaml
logging_steps: 20
seed: 3407

modified_dpo:
```
Collaborator

The seed is defined for original_dpo but is missing here in modified_dpo.

Collaborator Author

I have added the seed in modified_dpo as well.

Collaborator

The script assumes attribute-style config access (e.g., paths.output_root, hp.max_seq_length). Please confirm that load_config() guarantees this, or update to explicit dict access to avoid runtime errors.
The training logic is also hard-coded to original_dpo. If this script is intended to generalize, consider parameterizing the DPO mode. Also, no explicit seed seems to be set in the training script?
These are small changes, but they are important for correctness and experimental reliability.

Collaborator Author

Thank you for flagging this. load_config() returns a plain Python dictionary via yaml.safe_load, so attribute-style access is not guaranteed. We have updated the code to use explicit dictionary access throughout (e.g., paths["output_root"], hp["max_seq_length"]) to avoid any runtime errors and ensure predictable behavior.

This script is intentionally scoped to run the Original-DPO baseline and therefore explicitly references the original_dpo configuration block. Factual-DPO training is handled by a separate runner that consumes the factual_dpo configuration. This separation keeps baseline and modified objectives isolated and avoids accidental configuration mixing.
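
To make the resolution concrete, a minimal sketch of the pattern described above. The section and key names (original_dpo, paths, output_root, max_seq_length) come from this thread; the real config_loader may be organized differently.

```python
# Minimal sketch: load_config() returns a plain dict via yaml.safe_load,
# so values are read with explicit dictionary access, not attribute access.
from pathlib import Path

import yaml


def load_config(path: str = "config.yaml") -> dict:
    """Load the central YAML config as a plain Python dict."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


config = load_config()
dpo_cfg = config["original_dpo"]            # this runner is scoped to the baseline
paths = dpo_cfg["paths"]                    # assumed nesting
hp = dpo_cfg["hyperparameters"]             # assumed nesting

output_root = Path(paths["output_root"])
max_seq_length = hp["max_seq_length"]
```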

@AhmedRadwan02
Collaborator


name: FactualDPO Feature Request
about: Suggest a new feature or enhancement
title: '[FEATURE] Training Pipeline Improvements and Clarifications'
labels: 'enhancement'

Problem Statement

Several aspects of the training and evaluation pipeline need clarification and improvement for better usability and consistency.

Proposed Solution

  1. Documentation: Add visual figures comparing DPO vs Modified DPO equations in README
  2. Naming Consistency: Standardize terminology (ModifiedDPO or ModifiedDPO++ or FactualDPO++) across all files and comments based on final paper naming
  3. Configuration: Add missing seed hyperparameter to modified_dpo config
  4. LLM Judge Config: Rename the model config section to llm-as-judge for clarity, and set temperature to 0.0 for consistent evaluation
  5. Attribution: Add TRL repository link and reference in README

Alternative Solutions

Could maintain current naming but ensure consistency within the chosen convention.

Use Cases

  • Clearer onboarding for new users through visual documentation
  • Reproducible experiments via seed configuration
  • Consistent evaluation results with deterministic judge settings

Implementation Ideas

  • Add equation comparison figure to training README
  • Find-and-replace naming standardization
  • Update YAML config with seed parameter
  • Modify eval config: rename the section and set temperature: 0.0 (see the judge-scoring sketch after this list)
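
To illustrate the deterministic judge setting in the last item, a hedged sketch of async GPT-4o-mini judge scoring with temperature 0.0. The prompt and function names are placeholders; the actual eval_template.py and run_all_evaluations.py may be organized differently.

```python
# Hedged sketch: async GPT-judge scoring with temperature 0.0 for repeatable results.
# Assumes the official openai client (>= 1.x) and OPENAI_API_KEY in the environment.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

JUDGE_PROMPT = (
    "Rate the factual accuracy of the answer on a 1-5 scale. "
    "Reply with a single integer.\n\nQuestion: {question}\n\nAnswer: {answer}"
)


async def score_one(question: str, answer: str) -> int:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,  # deterministic judging, per this feature request
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())


async def score_batch(pairs: list[tuple[str, str]]) -> list[int]:
    # Score many (question, answer) pairs concurrently.
    return await asyncio.gather(*(score_one(q, a) for q, a in pairs))


# Example: asyncio.run(score_batch([("Who wrote Hamlet?", "William Shakespeare")]))
```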

Component Impact

  • Core functionality (seed parameter)
  • Documentation (README figures, TRL attribution)
  • API
  • Docker/Infrastructure

Additional Context

Question: Does the environment require updates for data/train/evaluation stages, or can the existing requirements be reused?

Priority

  • Nice to have
  • Would be helpful
  • Important for my use case
  • Critical/blocking

@shainarazavi self-requested a review on December 24, 2025 at 14:10
@Sindhuja217
Collaborator Author

@AhmedRadwan02

I have addressed all your comments:

  1. Updated the documentation with equations in README.md
  2. Followed naming consistency across all files, so it is now factualdpo everywhere
  3. Added the missing seed hyperparameter
  4. Changed the model key to llm-as-judge in the config.yaml file
  5. Cited the TRL repository and its link in README.md

Please verify all these modifications.

@AhmedRadwan02
Collaborator

All comments have been addressed; the structure is cleaner and more consistent now.

@Sindhuja217
Collaborator Author

Okay, as per my discussion with Shaina, merging this PR.

@Sindhuja217 merged commit c9024be into main on Jan 2, 2026
4 checks passed