127 changes: 127 additions & 0 deletions example/qbm_ebrm_results/README.md
# Paper Reproduction: Replacing the Energy Score of the Energy-Based Reward Model (EBRM) with a QBM

## Background

Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences, yet they often struggle to capture complex human preferences and to generalize to unseen data. This task replaces the Energy Score module of the Energy-Based Reward Model (EBRM) from the paper *Energy-Based Reward Models for Robust Language Model Alignment* with a QBM, and validates the replacement by comparing results on the dataset mentioned in the paper.

See [issue #78](https://github.com/qboson/kaiwu-pytorch-plugin/issues/78) for details.

Project files:

```
├── README.md
├── imgs
│ ├── pairwire-training_plots.png
│ └── training_plots.png
├── model_final.pth
├── prepare_rmb_dataset.py
├── rmb_dataset.pt
├── rmb_dataset2.pt
├── rmb_dataset_pairwise.pt
├── rmb_dataset_train.pt
├── rmb_dataset_val.pt
├── run_diagnostic_train.py
├── run_qbm_ebm.py # smoke test
├── run_train_qbm_ebm.py
├── save_and_plot_results.py
└── training_metrics.npz
```

We test on the [RMB](https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark) dataset; if this dataset is helpful to you, please cite:

```
@inproceedings{lambert2025rewardbench,
title={Rewardbench: Evaluating reward models for language modeling},
author={Lambert, Nathan and Pyatkin, Valentina and Morrison, Jacob and Miranda, Lester James Validad and Lin, Bill Yuchen and Chandu, Khyathi and Dziri, Nouha and Kumar, Sachin and Zick, Tom and Choi, Yejin and others},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
pages={1755--1797},
year={2025}
}
```

## Reproduction Steps

### Building the data pipeline

We first built [prepare_rmb_dataset.py](example/qbm_ebrm_results/prepare_rmb_dataset.py) to convert the JSON-format dataset into `.pt` files that the training DataLoader can consume.
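
Each converted `.pt` file is a list of dicts, one entry per answer, holding a 512-dimensional `embedding` tensor and a scalar `reward` (1.0 for the preferred/best answer, 0.0 for rejected ones). A minimal sketch of how to inspect the output (the file name below is just the script's default output):

```python
import torch

# load the converted dataset produced by prepare_rmb_dataset.py
examples = torch.load("example/qbm_ebrm_results/rmb_dataset2.pt")

print(len(examples))                   # number of converted answers
print(examples[0]["embedding"].shape)  # torch.Size([512])
print(examples[0]["reward"])           # 1.0 for chosen/best, 0.0 for rejected
```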

### Building the algorithm

We then replace the energy score algorithm in EBRM with the QBM algorithm. In the original paper, the energy score takes two inputs, the feature embedding produced by the conventional RM (`embedding`) and the RM's score (`r`), and returns a refined score (`r*`). The QBM that replaces it keeps exactly the same inputs and output, as sketched below.
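
A minimal sketch of this drop-in call, mirroring the smoke test [run_qbm_ebm.py](example/qbm_ebrm_results/run_qbm_ebm.py) (the constructor arguments follow that script and are illustrative rather than the only valid configuration):

```python
import torch
from kaiwu.torch_plugin.qbm_adapter import QBMModel

# same interface as the original energy score: (embedding, r) -> r*
model = QBMModel(embedding_size=512, num_nodes=64, num_visible=16,
                 device=torch.device("cpu"))
model.eval()

embedding = torch.randn(4, 512)  # features from the conventional RM
r = torch.randn(4)               # raw RM scores

with torch.no_grad():
    r_star = model(embedding, r)  # refined scores returned by the QBM
```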

### Training and validation results

We train for five epochs each on the BoN_set and the Pairwise_set from the [RMB-Reward-Model-Benchmark](https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark) dataset. The results are visualized below:

![BoN_set](https://github.com/YuzeHao2023/kaiwu-pytorch-plugin/blob/main/example/qbm_ebrm_results/imgs/training_plots.png "Training results on BoN_set")

Training results on BoN_set

![Pairwise_set](https://github.com/YuzeHao2023/kaiwu-pytorch-plugin/blob/main/example/qbm_ebrm_results/imgs/pairwire-training_plots.png "Training results on Pairwise_set")

Training results on Pairwise_set

## Quick Start

### Installation

First fork the kaiwu-pytorch-plugin project, then run the following command on your machine to clone it:

```bash
git clone https://github.com/qboson/kaiwu-pytorch-plugin.git
```

Install KPP:

```bash
cd kaiwu-pytorch-plugin
pip3 install -r requirements/requirements.txt
pip3 install .
```
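
Optionally, run the smoke test to verify that the QBM adapter loads and a single forward pass works (it uses random embeddings, so the outputs themselves are not meaningful):

```bash
python3 example/qbm_ebrm_results/run_qbm_ebm.py
```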

### Dataset conversion

Download the dataset with:

```bash
git clone https://github.com/Zhou-Zoey/RMB-Reward-Model-Benchmark.git
```

Edit the default source path in [prepare_rmb_dataset.py](example/qbm_ebrm_results/prepare_rmb_dataset.py) to point at the JSON file you want to convert:

```python
src = os.path.join(repo_root, 'RMB-Reward-Model-Benchmark', 'RMB_dataset', 'BoN_set', 'Harmlessness', 'S2.json')
```
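
Alternatively, the source JSON and the output `.pt` path can be passed as command-line arguments (the script reads `sys.argv[1]` and `sys.argv[2]`), so editing the file is not strictly required:

```bash
python3 example/qbm_ebrm_results/prepare_rmb_dataset.py path/to/source.json path/to/output.pt
```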

Then run:

```bash
python3 example/qbm_ebrm_results/prepare_rmb_dataset.py
```

This generates the corresponding `.pt` dataset file.

### Training and visualization

Train with wandb in offline mode, then plot the results:

```bash
export WANDB_MODE=offline
python3 example/qbm_ebrm_results/run_train_qbm_ebm.py
python3 example/qbm_ebrm_results/save_and_plot_results.py
```
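
To compare against the plain EBM baseline, [run_diagnostic_train.py](example/qbm_ebrm_results/run_diagnostic_train.py) honors a `USE_QBM` environment variable and falls back to the original `EBM_DNN` energy model when it is disabled, for example:

```bash
# QBM-backed diagnostic run (default, USE_QBM=1)
python3 example/qbm_ebrm_results/run_diagnostic_train.py

# baseline EBM_DNN run for comparison
USE_QBM=0 python3 example/qbm_ebrm_results/run_diagnostic_train.py
```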

The resulting plots are shown in the Training and validation results section above.

---

If this paper is helpful to your research, please cite:

```
@article{lochab2025energy,
title={Energy-Based Reward Models for Robust Language Model Alignment},
author={Lochab, Anamika and Zhang, Ruqi},
journal={arXiv preprint arXiv:2504.13134},
year={2025}
}
```

Binary file added example/qbm_ebrm_results/imgs/pairwire-training_plots.png
Binary file added example/qbm_ebrm_results/imgs/training_plots.png
Binary file added example/qbm_ebrm_results/model_final.pth
Binary file not shown.
68 changes: 68 additions & 0 deletions example/qbm_ebrm_results/prepare_rmb_dataset.py
import json
import torch
import os
import hashlib
import numpy as np


def text_to_embedding(text, dim=512):
# deterministic pseudo-embedding from text using hash seed
h = hashlib.sha256(text.encode('utf-8')).digest()
# RandomState expects a 32-bit seed
seed = int.from_bytes(h[:8], 'big') % (2 ** 32)
rng = np.random.RandomState(seed)
vec = rng.normal(size=(dim,)).astype('float32')
return torch.from_numpy(vec)


def prepare_from_json(json_path, out_pt_path):
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
examples = []
for item in data:
# Format 1: BoN style with 'bon_best' and 'loser_list'
if 'bon_best' in item or 'loser_list' in item:
best = item.get('bon_best', {})
best_ans = best.get('answer', None)
if best_ans:
emb = text_to_embedding(best_ans)
examples.append({'embedding': emb, 'reward': 1.0})
losers = item.get('loser_list', [])
            for loser in losers:
                ans = loser.get('answer', None)
if ans:
emb = text_to_embedding(ans)
examples.append({'embedding': emb, 'reward': 0.0})
# Format 2: Pairwise with 'chosen' and 'reject'
elif 'chosen' in item and 'reject' in item:
chosen = item.get('chosen', {})
reject = item.get('reject', {})
c_ans = chosen.get('answer', None)
r_ans = reject.get('answer', None)
if c_ans:
examples.append({'embedding': text_to_embedding(c_ans), 'reward': 1.0})
if r_ans:
examples.append({'embedding': text_to_embedding(r_ans), 'reward': 0.0})
# Fallback: try common field names
else:
# try 'answer' at top-level
ans = item.get('answer') if isinstance(item, dict) else None
if ans:
examples.append({'embedding': text_to_embedding(ans), 'reward': 0.0})

# save as list of dicts compatible with RewardEmbeddingDataset
torch.save(examples, out_pt_path)
print(f"Saved {len(examples)} examples to {out_pt_path}")


if __name__ == '__main__':
import sys
here = os.path.dirname(__file__)
repo_root = os.path.abspath(os.path.join(here, '..', '..'))
# default source JSON path
src = os.path.join(repo_root, 'RMB-Reward-Model-Benchmark', 'RMB_dataset', 'BoN_set', 'Harmlessness', 'S2.json')
out = os.path.join(here, 'rmb_dataset2.pt')
if len(sys.argv) > 1:
src = sys.argv[1]
if len(sys.argv) > 2:
out = sys.argv[2]
prepare_from_json(src, out)
Binary file added example/qbm_ebrm_results/rmb_dataset.pt
Binary file not shown.
Binary file added example/qbm_ebrm_results/rmb_dataset2.pt
Binary file not shown.
Binary file added example/qbm_ebrm_results/rmb_dataset_pairwise.pt
Binary file not shown.
Binary file added example/qbm_ebrm_results/rmb_dataset_train.pt
Binary file not shown.
Binary file added example/qbm_ebrm_results/rmb_dataset_val.pt
Binary file not shown.
71 changes: 71 additions & 0 deletions example/qbm_ebrm_results/run_diagnostic_train.py
import os
import sys
import traceback
from pathlib import Path

import torch

repo_root = Path(__file__).resolve().parents[2]
sys.path.insert(0, str(repo_root))

# provide a lightweight wandb shim if wandb is not installed
try:
import wandb
except Exception:
class _DummyWandb:
def init(self, *args, **kwargs):
return None
def log(self, *args, **kwargs):
return None
wandb = _DummyWandb()
sys.modules['wandb'] = wandb

# provide a minimal transformers.trainer_utils shim if transformers is missing
try:
import transformers
except Exception:
import types
transformers = types.SimpleNamespace()
trainer_utils = types.SimpleNamespace(EvalPrediction=object)
transformers.trainer_utils = trainer_utils
    sys.modules['transformers'] = transformers
    # register the submodule as well so `from transformers.trainer_utils import ...` resolves
    sys.modules['transformers.trainer_utils'] = trainer_utils

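# the EBRM code is expected to be available at <repo_root>/EBRM so that this
# import resolves via the sys.path entry added above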
from EBRM.src.reward_modeling.ebm_training import ebm_nce_plus as ebm


def main():
data_path = Path(__file__).parent / 'rmb_dataset2.pt'
val_path = data_path # reuse as val for diagnostic
if not data_path.exists():
print('Dataset not found:', data_path)
return

dataset = ebm.RewardEmbeddingDataset(str(data_path))
val_dataset = ebm.RewardEmbeddingDataset(str(val_path))

# build model (USE_QBM env honored inside)
use_qbm = os.environ.get('USE_QBM', '1')
if use_qbm.lower() in ('1','true','yes'):
try:
from kaiwu.torch_plugin.qbm_adapter import QBMModel
model = QBMModel(embedding_size=512, num_nodes=64, num_visible=16, device=torch.device('cpu'))
print('Using QBMModel')
except Exception as e:
print('QBM import failed, falling back to EBM_DNN', e)
model = ebm.EBM_DNN(embedding_size=512)
else:
model = ebm.EBM_DNN(embedding_size=512)

model.to(torch.device('cpu'))

# diagnostics: smaller M, lower lr, stronger regularization
try:
torch.autograd.set_detect_anomaly(True)
acc = ebm.train_ebm(model, dataset, val_dataset, beta=0.1, batch_size=32, epochs=3, learning_rate=1e-5, weight_decay=1e-4, M=64, std_devs=[0.5], lambda_reg=0.1, letter='diag')
print('Training finished. val acc:', acc)
except Exception as e:
print('Training raised exception:')
traceback.print_exc()


if __name__ == '__main__':
main()
40 changes: 40 additions & 0 deletions example/qbm_ebrm_results/run_qbm_ebm.py
"""Smoke test runner for QBM + EBRM integration.

This script imports the QBM adapter and runs a single forward
on random embeddings and rewards to verify the module loads.
"""
import os
import sys
import torch

# ensure the repo `src/` is on PYTHONPATH so `kaiwu` package resolves
ROOT_SRC = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..', 'src'))
sys.path.insert(0, ROOT_SRC)

os.environ.setdefault('USE_QBM', '1')

def main():
try:
from kaiwu.torch_plugin.qbm_adapter import QBMModel
except Exception as e:
print('QBMModel import failed:', e)
return

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = QBMModel(embedding_size=512, num_nodes=64, num_visible=16, device=device)
model.to(device)
model.eval()

# small random batch
B = 4
embedding = torch.randn(B, 512, device=device)
reward = torch.randn(B, device=device)

with torch.no_grad():
out = model(embedding, reward)

print('QBMModel forward output shape:', out.shape)
print('Sample outputs:', out.cpu().numpy())

if __name__ == '__main__':
main()