This repository contains the code for the data processing, web application, model training, experiments, and analyses described in the paper "Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities".
All code and data are intended for non-commercial research use only; please refer to the accompanying license/copyright notice for details.
HF4ATS (Human Feedback For Automatic Text Simplification) is a German-language collection of news sentence simplifications and simplification preferences suitable for automatic simplification preference alignment and/or fine-tuning. It is composed of:
- [HF4ATS-DPO] Automatic text simplification (ATS) preference pairs annotated by both text simplification experts and persons with cognitive impairments. This dataset is suitable for preference alignment.
- [HF4ATS-SFT] Complex-simple manual text simplification pairs. This dataset is suitable for supervised fine-tuning. This data is sourced from DEplain-APA data, available at Zenodo.
Both are available at SWISSUbase under reference number 21058.
Our SFT and DPO model checkpoints are available at SWISSUbase.
Requirements:
- Python == 3.9
pip install -r requirements.txt
Clone this repository for the necessary data and directory structure.
The train, development, and test splits of HF4ATS-SFT are available on SWISSUbase. Alternatively, they can be reproduced from the raw DEplain-APA data.
To do so, first download the DEplain-APA data and place all.csv inside data/deplain_sentences/.
Next, run the following script:
python pre_sft.py
At this point, the complex-simple SFT data is ready for use under data/sft_train.jsonl, data/sft_dev.jsonl, and data/sft_holdout.jsonl.
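If you want to verify the regenerated splits, a minimal inspection sketch is shown below. The exact record schema produced by pre_sft.py is not documented here, so the key listing is only a way to discover it; nothing in this snippet is part of the repository itself.

```python
import json

# Quick sanity check of the generated SFT splits (schema discovery only;
# inspect your own files for the exact fields written by pre_sft.py).
for split in ["sft_train", "sft_dev", "sft_holdout"]:
    path = f"data/{split}.jsonl"
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    print(f"{path}: {len(records)} complex-simple pairs")
    print("example record keys:", sorted(records[0].keys()))
```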
Our SFT model checkpoints are available at SWISSUbase.
Should you wish to perform SFT, you must first ensure you have access (if required) to the following four models on HuggingFace:
- https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
- https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
- https://huggingface.co/DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1
- https://huggingface.co/LeoLM/leo-mistral-hessianai-7b-chat
Next, input your HuggingFace token in place of the text <YOUR HUGGINGFACE TOKEN HERE> inside utils/base_dependencies.py.
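If you are unsure whether your account has been granted access to the gated models, an optional check like the following (not part of this repository) confirms each repository is visible to your token before you start training:

```python
from huggingface_hub import model_info

# Hypothetical access check: confirms your token can see the four base models.
# Use the same token you placed inside utils/base_dependencies.py.
TOKEN = "<YOUR HUGGINGFACE TOKEN HERE>"
MODELS = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1",
    "LeoLM/leo-mistral-hessianai-7b-chat",
]

for repo_id in MODELS:
    try:
        model_info(repo_id, token=TOKEN)
        print(f"OK: access to {repo_id}")
    except Exception as exc:  # gated or missing access surfaces here
        print(f"NO ACCESS: {repo_id} ({exc})")
```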
Finally, run the following script:
python sft_wrapper.py "<model>" bs16ga"<ga>"dv"<gpu>"lr"<lr>"
<model> must be from the following set: [disco_llama8b, llama8b, leolm_mistral7b, mistral7b]. <ga>, <gpu>, and <lr> refer to gradient accumulation steps, GPU count, and learning rate, respectively.
The post-grid-search parameter set argument we used for SFT was bs16ga1dv1lr1e-4.
An example SFT train call:
python sft_wrapper.py "disco_llama8b" bs16ga1dv1lr1e-4
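To make the run tag easier to read, here is a small hypothetical parser illustrating how the bs/ga/dv/lr components of the argument string break down. sft_wrapper.py does its own parsing internally; this sketch is a reading aid only.

```python
import re

# Illustrative parser for the run-tag convention (bs = batch size,
# ga = gradient accumulation steps, dv = GPU count, lr = learning rate).
def parse_run_tag(tag: str) -> dict:
    match = re.fullmatch(r"bs(\d+)ga(\d+)dv(\d+)lr(.+)", tag)
    if match is None:
        raise ValueError(f"unrecognized run tag: {tag}")
    bs, ga, dv, lr = match.groups()
    return {"batch_size": int(bs), "grad_accum": int(ga), "gpus": int(dv), "lr": float(lr)}

print(parse_run_tag("bs16ga1dv1lr1e-4"))
# {'batch_size': 16, 'grad_accum': 1, 'gpus': 1, 'lr': 0.0001}
```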
Our DPO model checkpoints are available at SWISSUbase.
Should you wish to perform DPO, you must first download the HF4ATS-DPO preference data from Zenodo.
The HF4ATS-DPO Zenodo record contains all raw data obtained during annotation and final evaluation, as well as several subsets of this data. To run DPO, store these json files inside data/preferences/.
You must also have the necessary SFT checkpoints saved with the proper name and at the proper location. The rules are as follows:
- Let <model> be a string from the following set: [disco_llama8b, llama8b, leolm_mistral7b, mistral7b]
- Let #### inside checkpoint#### refer to the number of training instances seen during SFT (e.g. 2800 for our disco_llama8b checkpoint).
- Save the SFT checkpoint being post-trained at outputs/models/bs16ga1dv1lr1e-4/sft_<model>/checkpoint####/.
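As a quick sanity check before launching DPO, a sketch like the following (assuming the naming rules above, with illustrative values for the model name and checkpoint number) verifies that the checkpoint directory is in place:

```python
import os

# Illustrative check: confirm the SFT checkpoint sits where dpo_wrapper.py expects it.
model = "disco_llama8b"   # one of: disco_llama8b, llama8b, leolm_mistral7b, mistral7b
checkpoint = 2800         # training instances seen during SFT for this checkpoint

ckpt_dir = f"outputs/models/bs16ga1dv1lr1e-4/sft_{model}/checkpoint{checkpoint}/"
if os.path.isdir(ckpt_dir):
    print("found SFT checkpoint:", ckpt_dir)
else:
    print("missing SFT checkpoint, expected at:", ckpt_dir)
```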
Next, run the following script to perform DPO:
python dpo_wrapper.py "bs16ga1dv1lr1e-4/sft_<model>_checkpoint####" "<variant>" "<user_set>" "train"
<user_set> can be either "ta" (target group annotations) or "ea" (expert group annotations). <variant> can be one of the following:
- "all": all preferences (see SWISSUbase record for information about raw preference data deduplication and exclusion).
- "equality": all preferences on ATS pairs which had equal levels of information, according to pair creators.
- "model": all preferences on ATS pairs generated by the SFT checkpoint being post-trained (assumes this preference data is being used to post-train our specific SFT checkpoints).
- "intraAA": all preferences indicated by the four (for target group) or two (expert group) annotators with the highest intra-annotation agreement.
- "interAA": all preferences indicated by the four (for target group) or two (expert group) annotators with the highest inter-annotation agreement.
- "groupX": uses "all" expert-group preferences when training and "intraAA" target-group preferences during evaluation (must set <user_set> to 'ea' when using this preference subset!)
An example DPO post-train call:
python dpo_wrapper.py "bs16ga1dv1lr1e-4/sft_disco_llama8b_checkpoint2800" "intraAA" "ta" "train"
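If you want to post-train one checkpoint on several preference subsets in sequence, a simple hypothetical launcher could look like the following; the argument order matches the call above, and the checkpoint name and subset list are only examples to adjust to your setup.

```python
import subprocess

# Illustrative batch launcher: run DPO for one SFT checkpoint over several
# preference subsets. dpo_wrapper.py remains the actual entry point.
sft_checkpoint = "bs16ga1dv1lr1e-4/sft_disco_llama8b_checkpoint2800"
runs = [("all", "ta"), ("intraAA", "ta"), ("all", "ea"), ("groupX", "ea")]

for variant, user_set in runs:
    cmd = ["python", "dpo_wrapper.py", sft_checkpoint, variant, user_set, "train"]
    print("launching:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```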
To perform evaluation, you need to have the SFT or DPO checkpoint inside outputs/models/bs16ga1dv1lr1e-4/. The rules for SFT checkpoint name & location are provided above. The rules for DPO checkpoint name & location are as follows:
- Let <sft_model> be a string from the following set: [sft_disco_llama8b_checkpoint2800, sft_llama8b_checkpoint2400, sft_leolm_mistral7b_checkpoint1600] (our SFT checkpoints).
- dpo_wrapper.py will save the DPO checkpoints inside outputs/models/bs16ga1dv1lr1e-4/ as dpo_<sft_model>_data<variant>_<user_set>/, where <variant> and <user_set> are defined as in the Direct Preference Optimization section above.
Run the following script to evaluate an SFT or DPO checkpoint:
python eval_wrapper.py "<full_model>" "<metric>" "bs16ga1dv1lr1e-4" "<stage>"
where <stage> is either "dev" or "test" depending on which HF4ATS-SFT dataset should be used for evaluation, <metric> is from the following set: [all, bert, bleu, ease, empt, equl, lens, lang, leng, sari, wstf], and <full_model> is the name of the DPO or SFT checkpoint to be evaluated.
eval_wrapper.py is set up for offline evaluation. It will first generate inferences, placed inside outputs/generations/, before calculating the selected evaluation metric.
An example evaluation call for an SFT checkpoint:
python eval_wrapper.py "sft_leolm_mistral7b_checkpoint1600" "all" "bs16ga1dv1lr1e-4" "test"
An example evaluation call for a DPO checkpoint:
python eval_wrapper.py "dpo_sft_leolm_mistral7b_checkpoint1600_dataall_ea" "all" "bs16ga1dv1lr1e-4" "test"
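To evaluate several checkpoints in one go, a small hypothetical loop over eval_wrapper.py calls (same arguments as the examples above) can be used; the checkpoint list below is illustrative.

```python
import subprocess

# Illustrative evaluation loop over several checkpoints on the test split.
# Checkpoint names follow the SFT/DPO naming rules described above.
checkpoints = [
    "sft_leolm_mistral7b_checkpoint1600",
    "dpo_sft_leolm_mistral7b_checkpoint1600_dataall_ea",
]

for ckpt in checkpoints:
    cmd = ["python", "eval_wrapper.py", ckpt, "all", "bs16ga1dv1lr1e-4", "test"]
    print("evaluating:", ckpt)
    subprocess.run(cmd, check=True)
```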
All annotation data is available on SWISSUbase.
To replicate HF4ATS-DPO from the raw annotation files, first download the individual preference annotation files and place the contents of the unzipped folder in data/annotations/.
Next, run the following script:
python pre_dpo.py
At this point, the HF4ATS-DPO data is ready for use under data/preferences_<subset>.jsonl, where <subset> is from the following set: [all, ie, iter, itra, sft_disco_llama8b_checkpoint2800, sft_leolm_mistral7b_checkpoint1600, sft_llama8b_checkpoint2400].
The file preferences_raw.jsonl is also created to allow inspection of all annotations, not only those included in HF4ATS-DPO (see the SWISSUbase description for further explanation).
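For a quick look at the regenerated subsets, the following sketch simply counts records per file without assuming anything about the schema written by pre_dpo.py:

```python
import json

# Count preference records per regenerated subset (schema-agnostic).
subsets = ["all", "ie", "iter", "itra",
           "sft_disco_llama8b_checkpoint2800",
           "sft_leolm_mistral7b_checkpoint1600",
           "sft_llama8b_checkpoint2400"]

for subset in subsets:
    path = f"data/preferences_{subset}.jsonl"
    with open(path, encoding="utf-8") as f:
        n = sum(1 for _ in f)
    print(f"{path}: {n} preference pairs")
```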
If you use the dataset, please cite the following paper:
@article{gao2025evaluating,
title={{Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities}},
author={Gao, Yingqiang and Johnson, Kaede and Froehlich, David and Carrer, Luisa and Ebling, Sarah},
journal={arXiv preprint arXiv:2507.01479},
year={2025}
}