HF4ATS: Human Feedback for Automatic Text Simplification

[Pipeline overview figure]

This repository contains the code for the data processing, web application, model training, experiments, and analyses described in the paper "Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities".

All code and data are intended for non-commercial research use only; please refer to the accompanying license/copyright notice for details.

HF4ATS Data

HF4ATS (Human Feedback For Automatic Text Simplification) is a German-language collection of news sentence simplifications and simplification preferences suitable for automatic simplification preference alignment and/or fine-tuning. It is composed of:

  1. [HF4ATS-DPO] Automatic text simplification (ATS) preference pairs annotated by both text simplification experts and persons with cognitive impairments. This dataset is suitable for preference alignment.
  2. [HF4ATS-SFT] Complex-simple manual text simplification pairs. This dataset is suitable for supervised fine-tuning. This data is sourced from DEplain-APA data, available at Zenodo.

Both are available at SWISSUbase under reference number 21058.

HF4ATS Checkpoints

Our SFT and DPO model checkpoints are available at SWISSUbase.

Usage

Requirements:

  • Python == 3.9
  • pip install -r requirements.txt

Clone this repository for the necessary data and directory structure.

Preparing DEplain-APA data for Supervised Fine-Tuning

The train, development, and test HF4ATS-SFT splits are available on SWISSUbase. Alternatively, they can be reproduced from the raw DEplain-APA data.

To do so, first download the DEplain-APA data and place all.csv inside data/deplain_sentences/.

Next, run the following script:

python pre_sft.py

At this point, the complex-simple SFT data is ready for use under data/sft_train.jsonl, data/sft_dev.jsonl, and data/sft_holdout.jsonl.
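
For a quick sanity check, the generated splits can be inspected with a minimal Python sketch like the one below (the exact field names are defined in pre_sft.py and are not documented here):

import json

# Load one of the generated SFT splits and report its size and schema.
with open("data/sft_train.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), "complex-simple pairs")
print(sorted(records[0].keys()))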

Supervised Fine-Tuning

Our SFT model checkpoints are available at SWISSUbase.

Should you wish to perform SFT, you must first ensure you have access (if required) to the following four models on HuggingFace:

  1. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
  2. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
  3. https://huggingface.co/DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1
  4. https://huggingface.co/LeoLM/leo-mistral-hessianai-7b-chat

Next, replace the placeholder <YOUR HUGGINGFACE TOKEN HERE> inside utils/base_dependencies.py with your HuggingFace token.
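
For illustration, the token is stored as a plain string; a hypothetical before/after (the actual variable name in utils/base_dependencies.py may differ):

# Before (placeholder as shipped in utils/base_dependencies.py):
hf_token = "<YOUR HUGGINGFACE TOKEN HERE>"
# After (your personal HuggingFace access token):
hf_token = "hf_XXXXXXXXXXXXXXXXXXXXXXXX"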

Finally, run the following script:

python sft_wrapper.py "<model>" bs16ga"<ga>"dv"<gpu>"lr"<lr>"

<model> must be from the following set: [disco_llama8b, llama8b, leolm_mistral7b, mistral7b]

<ga>, <gpu>, and <lr> refer to gradient accumulation steps, GPU count, and learning rate respectively.

The post-grid-search parameter set argument we used for SFT was bs16ga1dv1lr1e-4.
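
The second argument is a concatenation of the hyperparameters; a minimal sketch of how it is composed (batch size is fixed at 16 in this format):

# Compose the hyperparameter argument string for sft_wrapper.py.
ga, gpus, lr = 1, 1, "1e-4"        # gradient accumulation steps, GPU count, learning rate
hparam_arg = f"bs16ga{ga}dv{gpus}lr{lr}"
print(hparam_arg)                  # -> bs16ga1dv1lr1e-4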

An example SFT train call:

python sft_wrapper.py "disco_llama8b" bs16ga1dv1lr1e-4

Direct Preference Optimization

Our DPO model checkpoints are available at SWISSUbase.

Should you wish to perform DPO, you must first download the HF4ATS-DPO preference data at Zenodo.

The HF4ATS-DPO Zenodo record contains all raw data obtained during annotation and final evaluation, as well as several subsets of this data. To run DPO, store these JSON files inside data/preferences/.

You must also have the necessary SFT checkpoints saved with the proper name and at the proper location. The rules are as follows:

  1. Let <model> be a string from the following set: [disco_llama8b, llama8b, leolm_mistral7b, mistral7b]
  2. Let #### inside checkpoint#### refer to the number of training instances seen during SFT (e.g. 2800 for our disco_llama8b checkpoint).
  3. Save the SFT checkpoint being post-trained at outputs/models/bs16ga1dv1lr1e-4/sft_<model>/checkpoint####/.
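
A minimal sketch for verifying that an SFT checkpoint sits where dpo_wrapper.py expects it, following the rules above (model name and checkpoint number are examples):

from pathlib import Path

# Expected SFT checkpoint location before running DPO.
model, ckpt = "disco_llama8b", 2800
path = Path(f"outputs/models/bs16ga1dv1lr1e-4/sft_{model}/checkpoint{ckpt}")
print(path, "found" if path.is_dir() else "missing")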

Next, run the following script to perform DPO:

python dpo_wrapper.py "bs16ga1dv1lr1e-4/sft_<model>_checkpoint####" "<variant>" "<user_set>" "train"

<user_set> can be either "ta" (target group annotations) or "ea" (expert group annotations).

<variant> can be one of the following:

  1. "all": all preferences (see SWISSUbase record for information about raw preference data deduplication and exclusion).
  2. "equality": all preferences on ATS pairs which had equal levels of information, according to pair creators.
  3. "model": all preferences on ATS pairs generated by the SFT checkpoint being post-trained (assumes this preference data is being used to post-train our specific SFT checkpoints).
  4. "intraAA": all preferences indicated by the four (for target group) or two (expert group) annotators with the highest intra-annotation agreement.
  5. "interAA": all preferences indicated by the four (for target group) or two (expert group) annotators with the highest inter-annotation agreement.
  6. "groupX": uses "all" expert-group preferences when training and "intraAA" target-group preferences during evaluation (must set <user_set> to 'ea' when using this preference subset!)

An example DPO post-train call:

python dpo_wrapper.py "bs16ga1dv1lr1e-4/sft_disco_llama8b_checkpoint2800" "intraAA" "ta" "train"

Automatic Evaluation for an SFT or DPO Checkpoint

To perform evaluation, you need to have the SFT or DPO checkpoint inside outputs/models/bs16ga1dv1lr1e-4/. The rules for SFT checkpoint name & location are provided above. The rules for DPO checkpoint name & location are as follows:

  1. Let <sft_model> be a string from the following set: [sft_disco_llama8b_checkpoint2800, sft_llama8b_checkpoint2400, sft_leolm_mistral7b_checkpoint1600] (our SFT checkpoints)
  2. dpo_wrapper.py will save the DPO checkpoints inside outputs/models/bs16ga1dv1lr1e-4/ as dpo_<sft_model>_data<variant>_<user_set>/, where <variant> and <user_set> are defined as in the Direct Preference Optimization section above.
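
As a minimal sketch, the resulting DPO checkpoint directory can be reconstructed from these pieces (values are examples):

# Reconstruct the directory under which dpo_wrapper.py saves a DPO checkpoint.
sft_model = "sft_leolm_mistral7b_checkpoint1600"
variant, user_set = "all", "ea"
print(f"outputs/models/bs16ga1dv1lr1e-4/dpo_{sft_model}_data{variant}_{user_set}/")

The final path component (without the outputs/models/bs16ga1dv1lr1e-4/ prefix) matches the <full_model> string passed to eval_wrapper.py in the examples below.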

Run the following script to evaluate an SFT or DPO checkpoint:

python eval_wrapper.py "<full_model>" "<metric>" "bs16ga1dv1lr1e-4" "<stage>"

where <stage> is either "dev" or "test" depending on which HF4ATS-SFT dataset should be used for evaluation, <metric> is among the following set: [all, bert, bleu, ease, empt, equl, lens, lang, leng, sari, wstf], and <full_model> is the name of the DPO or SFT checkpoint to be evaluated.

eval_wrapper.py is set up for offline evaluation. It will first generate inferences, placed inside outputs/generations/, before calculating the selected evaluation metric.

An example evaluation call for an SFT checkpoint:

python eval_wrapper.py "sft_leolm_mistral7b_checkpoint1600" "all" "bs16ga1dv1lr1e-4" "test"

An example evaluation call for a DPO checkpoint:

python eval_wrapper.py "dpo_sft_leolm_mistral7b_checkpoint1600_dataall_ea" "all" "bs16ga1dv1lr1e-4" "test"

HF4ATS-DPO Recreation

All annotation data is available on SWISSUbase.

To recreate HF4ATS-DPO from the raw annotation files, first download the individual preference annotation files and place the contents of the unzipped folder in data/annotations/.

Next, run the following script:

python pre_dpo.py

At this point, HF4ATS-DPO data is ready for use under data/preferences_<subset>.jsonl, where <subset> is from the following set: [all, ie, iter, itra, sft_disco_llama8b_checkpoint2800, sft_leolm_mistral7b_checkpoint1600, sft_llama8b_checkpoint2400]

The file preferences_raw.jsonl is also created to allow inspection of all annotations, not only those included in HF4ATS-DPO (see the SWISSUbase description for further explanation).
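
A minimal sketch for checking the size of each recreated subset (assumes pre_dpo.py has been run and produced the files listed above):

# Count the preference records in each HF4ATS-DPO subset.
subsets = ["all", "ie", "iter", "itra",
           "sft_disco_llama8b_checkpoint2800",
           "sft_leolm_mistral7b_checkpoint1600",
           "sft_llama8b_checkpoint2400"]
for s in subsets:
    with open(f"data/preferences_{s}.jsonl", encoding="utf-8") as f:
        print(s, sum(1 for _ in f), "preference records")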

Citation

If you use the dataset, please cite the following paper:

@article{gao2025evaluating,
  title={{Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities}},
  author={Gao, Yingqiang and Johnson, Kaede and Froehlich, David and Carrer, Luisa and Ebling, Sarah},
  journal={arXiv preprint arXiv:2507.01479},
  year={2025}
}
