Add example: Paper Reproduction, Replacing the Energy-Based Reward Model (EBRM) in the Paper with QBM #95

YuzeHao2023 · 2026-01-06T11:56:53Z

[Paper Reproduction: Replacing the Energy-Based Reward Model (EBRM) in the Paper with QBM]

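Task description: Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences, yet they often struggle to capture complex human preferences and to generalize to unseen data. This task replaces the Energy Score module of the energy-based reward model (EBRM) described in the paper "Energy-Based Reward Models for Robust Language Model Alignment" with a QBM, and compares the results on the datasets used in the paper.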

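In the paper's EBRM, the energy score algorithm takes two inputs, the feature vector 'embedding' produced by the RM and the RM's score 'r', and outputs 'r*' as the refined score. We replace the energy score with a QBM and keep the same inputs and output for the score refinement.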
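The interface described above (inputs 'embedding' and 'r', output 'r*') can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in this PR: `refine_reward`, `toy_energy`, the step count, and the learning rate are all hypothetical, and the refinement shown is plain gradient descent on the energy with respect to the score. Any scorer with the same `(embedding, r)` call signature, such as a QBM-based one, could be passed in place of `toy_energy`.

```python
from typing import Callable, Sequence


def refine_reward(
    embedding: Sequence[float],
    r: float,
    energy_fn: Callable[[Sequence[float], float], float],
    steps: int = 50,
    lr: float = 0.1,
    eps: float = 1e-4,
) -> float:
    """Return a refined score r* starting from the RM score r.

    Hypothetical helper: lowers energy_fn(embedding, r) by gradient
    descent on r, using a central finite difference for the gradient.
    """
    r_star = float(r)
    for _ in range(steps):
        grad = (
            energy_fn(embedding, r_star + eps)
            - energy_fn(embedding, r_star - eps)
        ) / (2.0 * eps)
        r_star -= lr * grad
    return r_star


def toy_energy(embedding: Sequence[float], r: float) -> float:
    """Stand-in energy (an assumption, not the paper's or the PR's model).

    Energy is lowest when r equals the mean of the embedding.
    """
    target = sum(embedding) / len(embedding)
    return (r - target) ** 2


# Usage: the raw score 1.5 is pulled toward the low-energy value 0.4.
r_star = refine_reward([0.2, 0.4, 0.6], r=1.5, energy_fn=toy_energy)
```

Keeping the scorer behind a single callable is what lets the energy score be swapped without touching the code that consumes 'r*'.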

For training, we trained on both datasets for 5 epochs each and visualized the results (see the example/qbm_ebrm_results/imgs/ folder or example/qbm_ebrm_results/README.md).
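The file list includes a pairwise dataset and a pairwise training plot. A common way to compare reward models on pairwise preference data is pairwise accuracy, the fraction of pairs in which the chosen response scores strictly higher than the rejected one. The sketch below assumes that metric; the PR text does not state which metrics were tracked, and `pairwise_accuracy` is a hypothetical helper rather than a function from this PR.

```python
from typing import Sequence


def pairwise_accuracy(
    chosen_scores: Sequence[float],
    rejected_scores: Sequence[float],
) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one.

    Ties count as incorrect, because a tie does not express a preference.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("at least one pair is required")
    correct = sum(
        1 for c, r in zip(chosen_scores, rejected_scores) if c > r
    )
    return correct / len(chosen_scores)


# Usage with made-up scores: pairs 1 and 3 are ranked correctly,
# pair 2 is reversed, and pair 4 is a tie.
acc = pairwise_accuracy([1.0, 0.2, 0.9, 0.5], [0.5, 0.4, 0.1, 0.5])
```

Computing this once on the raw scores 'r' and once on the refined scores 'r*' would give a direct before/after comparison on the same pairs.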

File changes:

├── README.md
├── imgs
│   ├── pairwire-training_plots.png
│   └── training_plots.png
├── model_final.pth
├── prepare_rmb_dataset.py
├── rmb_dataset.pt
├── rmb_dataset2.pt
├── rmb_dataset_pairwise.pt
├── rmb_dataset_train.pt
├── rmb_dataset_val.pt
├── run_diagnostic_train.py
├── run_qbm_ebm.py # smoke test
├── run_train_qbm_ebm.py
├── save_and_plot_results.py
└── training_metrics.npz

close #78

References:

@article{lochab2025energy,
  title={Energy-Based Reward Models for Robust Language Model Alignment},
  author={Lochab, Anamika and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2504.13134},
  year={2025}
}

@inproceedings{lambert2025rewardbench,
  title={Rewardbench: Evaluating reward models for language modeling},
  author={Lambert, Nathan and Pyatkin, Valentina and Morrison, Jacob and Miranda, Lester James Validad and Lin, Bill Yuchen and Chandu, Khyathi and Dziri, Nouha and Kumar, Sachin and Zick, Tom and Choi, Yejin and others},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
  pages={1755--1797},
  year={2025}
}

yzhao112 and others added 10 commits January 6, 2026 08:11

Initial commit

2886369

Merge pull request #1 from yzhao112/main

52c341b

Base architecture and initial training

Second training run

555227e

Update README.md

4ee6630

Merge pull request #2 from yzhao112/main

d76c35e

Second training run

updata readme

26a29a9

updata readme

55427de

updata readme

d4f4f74

Clean up miscellaneous items

babe297

updata readme

acc5274
