This repository hosts the code for the paper "Preference Learning with Lie Detectors can Induce Honesty or Evasion".
An example of a setup and a basic experimental run is given in `run.sh`. Different run configurations can be selected by setting flags such as `DO_DPO` to `true` or `false`. The codebase has been tested on the `pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel` Docker image.
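
As a rough illustration of how a true/false flag such as `DO_DPO` can switch a stage on or off in a shell script, here is a minimal sketch. It is not the actual contents of `run.sh`: the helper `run_stage_if_enabled` and the printed messages are hypothetical, and only the flag name `DO_DPO` comes from this README. Check `run.sh` itself for the real flags and how they are read.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: shows a true/false flag gating one stage of a run.
# This is NOT the actual run.sh; only the DO_DPO name comes from the README.

run_stage_if_enabled() {
  local flag_value="$1"   # expected to be "true" or "false"
  local stage_name="$2"   # label used only in the printed message
  if [ "$flag_value" = "true" ]; then
    echo "running ${stage_name}"
  else
    echo "skipping ${stage_name}"
  fi
}

# Toggle the stage by changing the flag value.
DO_DPO=true
run_stage_if_enabled "$DO_DPO" "DPO"
```

Setting `DO_DPO=false` in this sketch would make the same call report that the stage is skipped, which mirrors the on/off behaviour described above.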