
p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models

Framework Overview

This repository contains the official code and data for the AACL 2025 paper, "p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models."

1. 🛠️ Installation

```
git clone https://github.com/Table-R1/Table-R1.git
cd Table-R1
pip install -r requirements.txt

```

2. 🧩 Project Structure

```
Table-R1/
├── p2tqa/
│   ├── code/
│   │   ├── prompts.py
│   │   ├── state_sample_dpo.py
│   │   ├── state_selection_score.py
│   │   └── utils.py
│   └── data/
│       ├── datasets/
│       └── generated_data/
├── README.md
└── requirements.txt
```
  • p2tqa/code/: Contains the core Python scripts for generating step DPO data.

    • prompts.py: Prompts used in the project.
    • state_selection_score.py: Stage 1 generation, which samples reasoning chains and calculates state values.
    • state_sample_dpo.py: Stage 2 generation, which collects step DPO pairs based on state values.
    • utils.py: Contains miscellaneous utility functions.
  • p2tqa/data/: Data used and sampled in the project.

    • datasets/: Source datasets from which DPO data is created (e.g., WTQ, TabFact, HiTab).
    • generated_data/: Generated data. We include data for rejection-sampling fine-tuning, full-chain DPO, and our method from two models (Qwen-2.5-7B and Llama-3.1-8B), along with example outputs from stages 1 and 2. Data for fine-tuning a TQA model is provided in tqa_model_finetune.json.
  • README.md: The main documentation file for the repository.

  • requirements.txt: Lists all necessary Python dependencies required to run the project.
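
The source datasets under p2tqa/data/datasets/ are JSONL files (e.g., wtq_dev.jsonl). Below is a minimal loading sketch; the exact per-record fields depend on the dataset, and the example question is made up for illustration:

```python
import io
import json

def load_jsonl(fp):
    """Read one JSON object per line from an open file-like object."""
    return [json.loads(line) for line in fp if line.strip()]

# Demo with an in-memory buffer; for the real files you would use, e.g.:
#   with open("p2tqa/data/datasets/wtq_dev.jsonl", encoding="utf-8") as f:
#       records = load_jsonl(f)
sample = io.StringIO('{"question": "how many medals were won in 2008?"}\n')
records = load_jsonl(sample)
print(records[0]["question"])
```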

3. ⚙️ Usage

This repo contains code for collecting stepwise DPO data. For fine-tuning, we use LlamaFactory with the parameters specified in the paper. To collect data, follow the steps below:

  1. SFT: Fine-tune a TQA model with step separation. We provide the fine-tuning data in data/generated_data/tqa_model_finetune.json.

  2. Calculate state values:

    • Deploy an LLM judge model with vLLM. (You can skip this step if you only want to use Monte Carlo sampling to estimate state values.)

```
vllm serve MODEL_PATH --api-key token-abc123 --max_model_len 14000 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
```
    • Run the following command to start stage 1 collection. This stage involves (1) sampling multiple reasoning chains and (2) calculating state values for the sampled chains.

```
python state_selection_score.py \
    --model_name qwen2.5-7b \
    --dataset_name wtq \
    --judge_model_name Qwen2.5-72B-Instruct \
    --model_path PATH_TO_THE_FINETUNED_TQA_MODEL \
    --judge_model_path PATH_TO_THE_JUDGE_MODEL \
    --dataset_path datasets/wtq_dev.jsonl \
    --base_url BASE_URL_TO_QUERY_JUDGE_MODEL \
    --state_evaluator correctness \
    --add_sep
```

    (--base_url does not need to be specified if you are not using the judge model.)
    • An example of the stage 1 output is provided in data/generated_data/p2_tqa_stage1_output_wtq_val.json:
      • sft: set to True if all reasoning chains lead to correct answers. These instances are not considered in stage 2 sampling.
      • tree: a list of tree nodes collected from each reasoning chain if sft is False. Each tree node is structured as {'state content': {'MC value': float, 'number of reasoning steps': [], 'state expansion history': [], 'llm judge evaluation': {}}}.
  3. Sample stepwise DPO data:

    • Deploy an LLM judge model with vLLM. (You can skip this step if you only want to use Monte Carlo sampling to estimate state values.)

```
vllm serve MODEL_PATH --api-key token-abc123 --max_model_len 14000 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
```
    • Run the following command to start stage 2 collection. This stage samples stepwise DPO data.

```
python state_sample_dpo.py \
    --model_name qwen2.5-7b \
    --dataset_name wtq \
    --judge_model_name Qwen2.5-72B-Instruct \
    --model_path PATH_TO_THE_FINETUNED_TQA_MODEL \
    --judge_model_path PATH_TO_THE_JUDGE_MODEL \
    --dataset_path datasets/wtq_dev.jsonl \
    --base_url BASE_URL_TO_QUERY_JUDGE_MODEL \
    --s1_input PATH_TO_THE_OUTPUT_FROM_STAGE1 \
    --state_evaluator correctness
```

    (--base_url does not need to be specified if you are not using the judge model.)
    • An example of the stage 2 output is provided in data/generated_data/p2_tqa_stage_2_output_wtq_mix_example.json:
      • q_idx: question index from the original source dataset.
      • bad_step: a sampled rejected step.
      • good_step: a sampled preferred step.
      • history: previous steps.
      • score: [bad step, good step] aggregated scores based on the value function.
      • original_score: [bad step (llm_evaluation score, mc estimation), good step (...)] scores from the LLM judge and from Monte Carlo sampling.
  4. DPO: Fine-tune a TQA model with the sampled DPO data. We provide pre-sampled DPO data in data/generated_data/p2_tqa_step_dpo_correct_0.9_qwen_8k.json and data/generated_data/p2_tqa_step_dpo_correct_0.9_llama_7k.json.
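
For reference, here is a minimal sketch of how the stage 1 output described in step 2 can be consumed. The field names (sft, tree, 'MC value') follow the format documented above, while the helper name, threshold parameter, and example contents are hypothetical:

```python
# Sketch of reading stage 1 output (e.g. p2_tqa_stage1_output_wtq_val.json).
# Instances with sft == True are skipped; each remaining instance carries a
# list of tree nodes mapping a state's content to its statistics.
def collect_states(instances, mc_threshold=0.0):
    """Return (state_content, mc_value) pairs with an MC value above threshold."""
    states = []
    for inst in instances:
        if inst.get("sft"):  # all chains were correct; nothing to sample
            continue
        for node in inst.get("tree", []):
            for content, stats in node.items():
                if stats["MC value"] > mc_threshold:
                    states.append((content, stats["MC value"]))
    return states

# Example shaped like the documented stage 1 structure (contents made up).
instances = [
    {"sft": True, "tree": []},
    {"sft": False, "tree": [{"Step 1: locate the row": {"MC value": 0.75}}]},
]
print(collect_states(instances))
```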
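
The stage 2 records described in step 3 can be flattened into preference pairs before DPO fine-tuning (step 4). The sketch below is only illustrative: the prompt/chosen/rejected layout is an assumption mirroring common DPO trainers, not necessarily the exact format LlamaFactory expects, and the example record is made up:

```python
import json

def to_dpo_pair(record):
    """Turn one stage 2 record into a prompt/chosen/rejected preference pair."""
    return {
        "prompt": "".join(record["history"]),
        "chosen": record["good_step"],
        "rejected": record["bad_step"],
    }

# Example record shaped like the documented stage 2 output fields.
record = {
    "q_idx": 12,
    "history": ["Step 1: read the table.\n", "Step 2: filter rows by year.\n"],
    "good_step": "Step 3: sum the filtered values.",
    "bad_step": "Step 3: take the maximum value.",
    "score": [0.1, 0.9],
}
print(json.dumps(to_dpo_pair(record), indent=2))
```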

📝 Citation

If you find our work useful, please cite our paper:

@article{zhou2025p2tqaprocessbasedpreferencelearning,
  title={p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models},
  author={Wei Zhou and Mohsen Mesgar and Heike Adel and Annemarie Friedrich},
  journal={arXiv preprint arXiv:2505.17565},
  year={2025}
}
