p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models
This repository contains the official code and data for the AACL 2025 paper, "p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models." To get started, clone the repository and install the dependencies:
```
git clone https://github.com/Table-R1/Table-R1.git
cd Table-R1
pip install -r requirements.txt
```
```
Table-R1/
├── p2tqa/
│   ├── code/
│   │   ├── prompts.py
│   │   ├── state_sample_dpo.py
│   │   ├── state_selection_score.py
│   │   └── utils.py
│   └── data/
│       ├── datasets/
│       └── generated_data/
├── README.md
└── requirements.txt
```
- `p2tqa/code/`: Core Python scripts for generating step DPO data.
  - `prompts.py`: Prompts used in the project.
  - `state_selection_score.py`: Stage 1 generation, which samples reasoning chains and calculates state values.
  - `state_sample_dpo.py`: Stage 2 generation, which collects step DPO pairs based on state values.
  - `utils.py`: Miscellaneous utility functions.
- `p2tqa/data/`: Data used and sampled in the project.
  - `datasets/`: Source datasets from which DPO data is created (e.g., WTQ, TabFact, HiTab).
  - `generated_data/`: Generated data. We include data for rejection-sampling fine-tuning, full-chain DPO, and our method for two models (Qwen-2.5-7B and Llama-3.1-8B), as well as example outputs from stages 1 and 2. Data for fine-tuning a TQA model is provided in `tqa_model_finetune.json`.
- `README.md`: The main documentation file for the repository.
- `requirements.txt`: Lists the Python dependencies required to run the project.
This repo contains code for collecting stepwise DPO data. For fine-tuning, we use LLaMA-Factory with the parameters specified in the paper. To collect data, follow the steps below:
- SFT: Fine-tune a TQA model with step separation. We provide the fine-tuning data in `data/generated_data/tqa_model_finetune.json`.
- Calculate state values:
  - Deploy an LLM judge model with vLLM. (You can skip this step if you only want to use Monte Carlo sampling to estimate state values.)

    ```
    vllm serve MODEL_PATH --api-key token-abc123 --max_model_len 14000 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
    ```

  - Run the following command to start stage 1 collection. This stage (1) samples multiple reasoning chains and (2) calculates state values for the sampled chains.

    ```
    python state_selection_score.py --model_name qwen2.5-7b --dataset_name wtq --judge_model_name Qwen2.5-72B-Instruct --model_path PATH_TO_THE_FINETUNED_TQA_MODEL --judge_model_path PATH_TO_THE_JUDGE_MODEL --dataset_path datasets/wtq_dev.jsonl --base_url BASE_URL_TO_QUERY_JUDGE_MODEL --state_evaluator correctness --add_sep
    ```

    (`--base_url` does not need to be specified if you are not using the judge model.)

  - We provide an example of the stage 1 output in `data/generated_data/p2_tqa_stage1_output_wtq_val.json`:
    - `sft`: marked as True if all reasoning chains lead to correct answers. We do not consider these instances in stage 2 sampling.
    - `tree`: a list of tree nodes collected from each reasoning chain if `sft` is false. Each tree node is structured as `{'state content': {'MC value': float, 'number of reasoning steps': [], 'state expansion history': [], 'llm judge evaluation': {}}}`.
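As a rough illustration of how the stage 1 output can be consumed, tree nodes can be ranked by their MC values. The record below is invented but follows the field descriptions above; it is a sketch, not part of the release:

```python
# A toy stage-1 record mimicking the schema described above (contents invented).
record = {
    "sft": False,  # at least one sampled chain was wrong, so the instance enters stage 2
    "tree": [
        {"Step 1: locate the 2019 row.": {
            "MC value": 0.75, "number of reasoning steps": [3],
            "state expansion history": [], "llm judge evaluation": {}}},
        {"Step 1: sum the wrong column.": {
            "MC value": 0.25, "number of reasoning steps": [4],
            "state expansion history": [], "llm judge evaluation": {}}},
    ],
}

def rank_states(rec):
    """Return (state content, MC value) pairs sorted from highest to lowest value."""
    if rec["sft"]:  # all chains were correct: nothing to contrast in stage 2
        return []
    pairs = [(content, stats["MC value"])
             for node in rec["tree"]
             for content, stats in node.items()]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

ranked = rank_states(record)
print(ranked[0][0])  # prints the highest-value state
```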
- Sample stepwise DPO data:
  - Deploy an LLM judge model with vLLM, as above. (You can skip this step if you only want to use Monte Carlo sampling to estimate state values.)

    ```
    vllm serve MODEL_PATH --api-key token-abc123 --max_model_len 14000 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
    ```

  - Run the following command to start stage 2 collection, which samples stepwise DPO data.

    ```
    python state_sample_dpo.py --model_name qwen2.5-7b --dataset_name wtq --judge_model_name Qwen2.5-72B-Instruct --model_path PATH_TO_THE_FINETUNED_TQA_MODEL --judge_model_path PATH_TO_THE_JUDGE_MODEL --dataset_path datasets/wtq_dev.jsonl --base_url BASE_URL_TO_QUERY_JUDGE_MODEL --s1_input PATH_TO_THE_OUTPUT_FROM_STAGE1 --state_evaluator correctness
    ```

    (`--base_url` does not need to be specified if you are not using the judge model.)

  - We provide an example of the stage 2 output in `data/generated_data/p2_tqa_stage_2_output_wtq_mix_example.json`:
    - `q_idx`: question index from the original source dataset.
    - `bad_step`: a sampled rejected step.
    - `good_step`: a sampled preferred step.
    - `history`: previous steps.
    - `score`: `[bad step, good step]` aggregated scores based on the value function.
    - `original_score`: `[bad step (llm_evaluation score, mc estimation), good step (...)]` scores from the LLM judge and from Monte Carlo sampling.
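The stage 2 fields map naturally onto a preference pair. The sketch below is illustrative only: the record is invented following the field descriptions above, and assembling the prompt from the question plus `history` is an assumption, not the exact training format:

```python
# A toy stage-2 record following the fields described above (contents invented).
record = {
    "q_idx": 42,
    "history": ["Step 1: locate the 2019 row."],
    "bad_step": "Step 2: read the wrong column.",
    "good_step": "Step 2: read the 'Revenue' column.",
    "score": [0.1, 0.9],                         # aggregated [bad step, good step] values
    "original_score": [[0.0, 0.2], [1.0, 0.8]],  # [llm judge, MC estimate] per step
}

def to_preference_pair(rec, question):
    """Turn one sampled record into a {prompt, chosen, rejected} DPO pair."""
    assert rec["score"][1] > rec["score"][0], "preferred step should outscore rejected step"
    prompt = question + "\n" + "\n".join(rec["history"])
    return {"prompt": prompt, "chosen": rec["good_step"], "rejected": rec["bad_step"]}

pair = to_preference_pair(record, "What was the revenue in 2019?")
print(pair["chosen"])  # prints the preferred step
```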
- DPO: Fine-tune a TQA model with the sampled DPO data. We provide already-sampled DPO data in `data/generated_data/p2_tqa_step_dpo_correct_0.9_qwen_8k.json` and `data/generated_data/p2_tqa_step_dpo_correct_0.9_llama_7k.json`.
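Since fine-tuning is done with LLaMA-Factory, the sampled pairs must be written in a dataset format it accepts. The snippet below is a hedged sketch: the `instruction`/`input`/`chosen`/`rejected` keys are an assumption based on LLaMA-Factory's alpaca-style preference format, and the input pairs are invented stand-ins; consult the LLaMA-Factory documentation for the exact schema expected by your version.

```python
import json

# Invented pairs standing in for records from the p2_tqa_step_dpo_*.json files.
pairs = [
    {"prompt": "What was the revenue in 2019?\nStep 1: locate the 2019 row.",
     "chosen": "Step 2: read the 'Revenue' column.",
     "rejected": "Step 2: read the wrong column."},
]

# Assumed alpaca-style preference keys; verify against the LLaMA-Factory docs.
dataset = [{"instruction": p["prompt"], "input": "",
            "chosen": p["chosen"], "rejected": p["rejected"]} for p in pairs]

with open("step_dpo_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```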
If you find our paper useful, please cite it:
```
@article{zhou2025p2tqaprocessbasedpreferencelearning,
  title={p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models},
  author={Wei Zhou and Mohsen Mesgar and Heike Adel and Annemarie Friedrich},
  journal={arXiv preprint arXiv:2505.17565},
  year={2025}
}
```
