Training and Evaluation #7
Conversation
Pull request overview
This PR adds a comprehensive training and evaluation pipeline for FactualDPO++, introducing baseline DPO training, modified Factual-DPO++ training with Δ-margin, automated GPT-4o-mini evaluation, and centralized YAML-based configuration. The changes include TRL-derived trainer files that are intentionally copied without formatting.
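For context on the Δ-margin mentioned above: margin-style variants of DPO typically subtract a margin term from the preference logits before the log-sigmoid. The sketch below is a generic illustration under that assumption, not the PR's actual Factual-DPO++ objective; the name `delta`, the fixed-constant margin, and the argument names are all placeholders.

```python
import torch.nn.functional as F

def dpo_margin_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta: float = 0.1, delta: float = 0.0):
    """DPO loss with an additive margin; `delta` stands in for the Δ-margin.

    In Factual-DPO++ the margin may be computed per example (e.g. from
    factuality scores) rather than being a fixed constant as shown here.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # A positive delta widens the preference gap the model must achieve.
    logits = beta * (chosen_ratio - rejected_ratio) - delta
    return -F.logsigmoid(logits).mean()
```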
Key Changes
- Added training scripts for original DPO and modified Factual-DPO++ approaches
- Implemented evaluation framework with async GPT-judge scoring and batch processing (see the sketch after this list)
- Integrated TRL utilities and modified DPO trainer for Factual-DPO patch
- Added centralized configuration system and comprehensive README documentation
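The async judge pattern referenced above generally looks like the following. This is a minimal sketch, not the PR's eval_core_utilities.py: the rubric prompt, the function names, and the fixed concurrency cap are all assumptions.

```python
import asyncio
from openai import AsyncOpenAI  # requires openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def judge_one(sem: asyncio.Semaphore, question: str, answer: str) -> str:
    """Score one answer with the GPT judge, bounded by the semaphore."""
    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate the answer's factuality from 1 to 5."},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        return resp.choices[0].message.content

async def judge_batch(pairs: list[tuple[str, str]], concurrency: int = 8) -> list[str]:
    """Fan out judge calls with a concurrency cap; results keep input order."""
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(judge_one(sem, q, a) for q, a in pairs))

# Example: scores = asyncio.run(judge_batch([("Q1", "A1"), ("Q2", "A2")]))
```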
Reviewed changes
Copilot reviewed 88 out of 122 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/aixpert/training/training/trl/trainer/dpo_config.py` | Configuration dataclass for DPO training parameters, copied from TRL |
| `src/aixpert/training/training/trl/trainer/cpo_trainer.py` | CPO trainer implementation copied from TRL for preference optimization |
| `src/aixpert/training/training/trl/trainer/cpo_config.py` | Configuration dataclass for CPO training parameters |
| `src/aixpert/training/training/trl/trainer/callbacks.py` | Training callbacks including BEMA, weight sync, and logging utilities |
| `src/aixpert/training/training/trl/trainer/bco_trainer.py` | Deprecated BCO trainer wrapper redirecting to the experimental module |
| `src/aixpert/training/training/trl/trainer/bco_config.py` | Deprecated BCO config wrapper with a deprecation warning |
| `src/aixpert/training/training/trl/trainer/base_trainer.py` | Base trainer class with model card generation |
| `src/aixpert/training/training/trl/trainer/__init__.py` | Lazy module initialization for trainer components |
| `src/aixpert/training/training/trl/templates/rm_model_card.md` | Template for reward model cards |
| `src/aixpert/training/training/trl/templates/lm_model_card.md` | Template for language model cards |
| `src/aixpert/training/training/trl/scripts/vllm_serve.py` | vLLM server script with weight synchronization support |
| `src/aixpert/training/training/trl/scripts/utils.py` | Utility functions for dataset loading and argument parsing |
| `src/aixpert/training/training/trl/scripts/sft.py` | Supervised fine-tuning script |
| `src/aixpert/training/training/trl/scripts/rloo.py` | RLOO training script with reward functions |
| `src/aixpert/training/training/trl/scripts/reward.py` | Reward model training script |
| `src/aixpert/training/training/trl/scripts/kto.py` | KTO training script |
| `src/aixpert/training/training/trl/scripts/grpo.py` | GRPO training script |
| `src/aixpert/training/training/trl/scripts/env.py` | Environment information printing utility |
| `src/aixpert/training/training/trl/scripts/dpo.py` | DPO training script |
| `src/aixpert/training/training/trl/scripts/__init__.py` | Lazy module initialization for scripts |
| `src/aixpert/training/training/trl/rewards/other_rewards.py` | Soft overlong punishment reward function |
| `src/aixpert/training/training/trl/rewards/format_rewards.py` | Format-checking reward functions |
| `src/aixpert/training/training/trl/rewards/accuracy_rewards.py` | Accuracy-based reward functions |
| `src/aixpert/training/training/trl/rewards/__init__.py` | Reward functions module initialization |
| `src/aixpert/training/training/trl/models/utils.py` | Model utilities including chat template setup and PEFT preparation |
| `src/aixpert/training/training/trl/models/modeling_value_head.py` | Value head models for causal and seq2seq LMs |
| `src/aixpert/training/training/trl/models/__init__.py` | Models module initialization |
| `src/aixpert/training/training/trl/extras/__init__.py` | Extras module placeholder |
| `src/aixpert/training/training/trl/experimental/xpo/__init__.py` | XPO experimental trainer exports |
Copilot reviewed 88 out of 122 changed files in this pull request and generated no new comments.
```yaml
  logging_steps: 20
  seed: 3407

modified_dpo:
```
The seed is defined for original_dpo but missing here in modified_dpo.
I have added a seed in modified_dpo as well.
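For reference, the fixed block presumably mirrors the baseline's seed; the exact values and surrounding keys in the committed config may differ:

```yaml
modified_dpo:
  logging_steps: 20
  seed: 3407
```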
The script assumes attribute-style config access (e.g., paths.output_root, hp.max_seq_length). Please confirm that load_config() guarantees this, or update to explicit dict access to avoid runtime errors.
The training logic is hard-coded to original_dpo. If this script is intended to generalize, consider parameterizing the DPO mode. Also, no explicit seed appears to be set in the training script.
These are small changes, but important for correctness and experimental reliability.
Thank you for flagging this. load_config() returns a plain Python dictionary via yaml.safe_load, so attribute-style access is not guaranteed. We have updated the code to use explicit dictionary access throughout (e.g., paths["output_root"], hp["max_seq_length"]) to avoid any runtime errors and ensure predictable behavior.
This script is intentionally scoped to run the Original-DPO baseline and therefore explicitly references the original_dpo configuration block. Factual-DPO training is handled by a separate runner that consumes the factual_dpo configuration. This separation keeps baseline and modified objectives isolated and avoids accidental configuration mixing.
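A minimal sketch of the pattern described above, assuming PyYAML; load_config and the key names shown (paths, output_root, original_dpo, max_seq_length) follow this thread, but the committed code may differ:

```python
import yaml

def load_config(path: str) -> dict:
    """Load the YAML config as a plain dict.

    yaml.safe_load returns dicts/lists, so attribute-style access
    (cfg.paths) is not available; callers must use key lookups.
    """
    with open(path) as f:
        return yaml.safe_load(f)

cfg = load_config("config.yaml")  # hypothetical path
paths = cfg["paths"]              # explicit dict access, not cfg.paths
hp = cfg["original_dpo"]          # this runner is scoped to the baseline block

output_root = paths["output_root"]
max_seq_length = hp["max_seq_length"]
```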
I have addressed all your comments. Please verify these modifications.
All comments have been addressed; the structure is cleaner and more consistent now.
Okay, as per my discussion with Shaina, merging this PR.
PR Type
Feature
Short Description
Add a complete training and evaluation pipeline for FactualDPO++, including:
Details
- Baseline Original-DPO training runner (run_dpo_training.py)
- Modified Factual-DPO++ training runner (run_modified_training.py)
- GPT-judge evaluation prompt template (eval_template.py)
- Core evaluation utilities (eval_core_utilities.py)
- Batch evaluation driver (run_all_evaluations.py)
- TRL-derived files (training/trl/* and the modified dpo_trainer.py), intentionally copied without template formatting

Notes