p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models
This repository contains the official code and data for the AACL 2025 paper, "p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models." To get started, clone the repository and install the dependencies:
```
git clone https://github.com/Table-R1/Table-R1.git
cd Table-R1
pip install -r requirements.txt
```
```
Table-R1/
├── p2tqa/
│   ├── code/
│   │   ├── prompts.py
│   │   ├── state_sample_dpo.py
│   │   ├── state_selection_score.py
│   │   └── utils.py
│   └── data/
│       ├── datasets/
│       └── generated_data/
├── README.md
└── requirements.txt
```
- `p2tqa/code/`: Core Python scripts for generating step DPO data.
  - `prompts.py`: Prompts used in the project.
  - `state_selection_score.py`: Stage 1 generation, which samples reasoning chains and calculates state values.
  - `state_sample_dpo.py`: Stage 2 generation, which collects step DPO pairs based on state values.
  - `utils.py`: Miscellaneous utility functions.
- `p2tqa/data/`: Data used and sampled in the project.
  - `datasets/`: Source datasets from which DPO data is created (e.g., WTQ, TabFact, HiTab).
  - `generated_data/`: Generated data. We include data for rejection-sampling fine-tuning, full-chain DPO, and our method for two models (Qwen-2.5-7B and Llama-3.1-8B), as well as example outputs from stages 1 and 2. Data for fine-tuning a TQA model is provided in `tqa_model_finetune.json`.
- `README.md`: The main documentation file for the repository.
- `requirements.txt`: Lists the Python dependencies required to run the project.
This repo contains code for collecting stepwise DPO data. For fine-tuning, we use LLaMA-Factory with the parameters specified in the paper. To collect data, follow the steps below:
- SFT: Fine-tune a TQA model with step separation. We provide the fine-tuning data in `data/generated_data/tqa_model_finetune.json`.
- Calculate state values:
  - Deploy an LLM judge model with vLLM. (You can skip this step if you only want to use Monte Carlo sampling to estimate state values.)

    ```
    vllm serve MODEL_PATH --api-key token-abc123 --max_model_len 14000 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
    ```

  - Run the following command to start stage 1 collection. This stage (1) samples multiple reasoning chains and (2) calculates state values for the sampled chains.

    ```
    python state_selection_score.py --model_name qwen2.5-7b --dataset_name wtq --judge_model_name Qwen2.5-72B-Instruct --model_path PATH_TO_THE_FINETUNED_TQA_MODEL --judge_model_path PATH_TO_THE_JUDGE_MODEL --dataset_path datasets/wtq_dev.jsonl --base_url BASE_URL_TO_QUERY_JUDGE_MODEL --state_evaluator correctness --add_sep
    ```

    (`--base_url` does not need to be specified if you are not using the judge model.)

  - We provide an example of the stage 1 output in `data/generated_data/p2_tqa_stage1_output_wtq_val.json`:
    - `sft`: marked as True if all reasoning chains lead to correct answers. We do not consider these instances in stage 2 sampling.
    - `tree`: a list of tree nodes collected from each reasoning chain if `sft` is false. Each tree node is structured as `{'state content': {'MC value': float, 'number of reasoning steps': [], 'state expansion history': [], 'llm judge evaluation': {}}}`.
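As a rough illustration of how the stage 1 output can be consumed, tree nodes can be ranked by their MC values. The record below is invented but follows the field descriptions above; it is a sketch, not part of the release:

```python
# A toy stage-1 record mimicking the schema described above (contents invented).
record = {
    "sft": False,  # at least one sampled chain was wrong, so the instance enters stage 2
    "tree": [
        {"Step 1: locate the 2019 row.": {
            "MC value": 0.75, "number of reasoning steps": [3],
            "state expansion history": [], "llm judge evaluation": {}}},
        {"Step 1: sum the wrong column.": {
            "MC value": 0.25, "number of reasoning steps": [4],
            "state expansion history": [], "llm judge evaluation": {}}},
    ],
}

def rank_states(rec):
    """Return (state content, MC value) pairs sorted from highest to lowest value."""
    if rec["sft"]:  # all chains were correct: nothing to contrast in stage 2
        return []
    pairs = [(content, stats["MC value"])
             for node in rec["tree"]
             for content, stats in node.items()]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

ranked = rank_states(record)
print(ranked[0][0])  # prints the highest-value state
```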
- Sample stepwise DPO data:
  - Deploy an LLM judge model with vLLM, as above. (You can skip this step if you only want to use Monte Carlo sampling to estimate state values.)

    ```
    vllm serve MODEL_PATH --api-key token-abc123 --max_model_len 14000 --tensor-parallel-size 2 --disable-custom-all-reduce --enforce-eager
    ```

  - Run the following command to start stage 2 collection, which samples stepwise DPO data.

    ```
    python state_sample_dpo.py --model_name qwen2.5-7b --dataset_name wtq --judge_model_name Qwen2.5-72B-Instruct --model_path PATH_TO_THE_FINETUNED_TQA_MODEL --judge_model_path PATH_TO_THE_JUDGE_MODEL --dataset_path datasets/wtq_dev.jsonl --base_url BASE_URL_TO_QUERY_JUDGE_MODEL --s1_input PATH_TO_THE_OUTPUT_FROM_STAGE1 --state_evaluator correctness
    ```

    (`--base_url` does not need to be specified if you are not using the judge model.)

  - We provide an example of the stage 2 output in `data/generated_data/p2_tqa_stage_2_output_wtq_mix_example.json`:
    - `q_idx`: question index from the original source dataset.
    - `bad_step`: a sampled rejected step.
    - `good_step`: a sampled preferred step.
    - `history`: previous steps.
    - `score`: `[bad step, good step]` aggregated scores based on the value function.
    - `original_score`: `[bad step (llm_evaluation score, mc estimation), good step (...)]` scores from the LLM judge and from Monte Carlo sampling.
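The stage 2 fields map naturally onto a preference pair. The sketch below is illustrative only: the record is invented following the field descriptions above, and assembling the prompt from the question plus `history` is an assumption, not the exact training format:

```python
# A toy stage-2 record following the fields described above (contents invented).
record = {
    "q_idx": 42,
    "history": ["Step 1: locate the 2019 row."],
    "bad_step": "Step 2: read the wrong column.",
    "good_step": "Step 2: read the 'Revenue' column.",
    "score": [0.1, 0.9],                         # aggregated [bad step, good step] values
    "original_score": [[0.0, 0.2], [1.0, 0.8]],  # [llm judge, MC estimate] per step
}

def to_preference_pair(rec, question):
    """Turn one sampled record into a {prompt, chosen, rejected} DPO pair."""
    assert rec["score"][1] > rec["score"][0], "preferred step should outscore rejected step"
    prompt = question + "\n" + "\n".join(rec["history"])
    return {"prompt": prompt, "chosen": rec["good_step"], "rejected": rec["bad_step"]}

pair = to_preference_pair(record, "What was the revenue in 2019?")
print(pair["chosen"])  # prints the preferred step
```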
- DPO: Fine-tune a TQA model with the sampled DPO data. We provide already-sampled DPO data in `data/generated_data/p2_tqa_step_dpo_correct_0.9_qwen_8k.json` and `data/generated_data/p2_tqa_step_dpo_correct_0.9_llama_7k.json`.
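Since fine-tuning is done with LLaMA-Factory, the sampled pairs must be written in a dataset format it accepts. The snippet below is a hedged sketch: the `instruction`/`input`/`chosen`/`rejected` keys are an assumption based on LLaMA-Factory's alpaca-style preference format, and the input pairs are invented stand-ins; consult the LLaMA-Factory documentation for the exact schema expected by your version.

```python
import json

# Invented pairs standing in for records from the p2_tqa_step_dpo_*.json files.
pairs = [
    {"prompt": "What was the revenue in 2019?\nStep 1: locate the 2019 row.",
     "chosen": "Step 2: read the 'Revenue' column.",
     "rejected": "Step 2: read the wrong column."},
]

# Assumed alpaca-style preference keys; verify against the LLaMA-Factory docs.
dataset = [{"instruction": p["prompt"], "input": "",
            "chosen": p["chosen"], "rejected": p["rejected"]} for p in pairs]

with open("step_dpo_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```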
If you find our paper useful, please cite it:
```
@article{zhou2025p2tqaprocessbasedpreferencelearning,
  title={p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models},
  author={Wei Zhou and Mohsen Mesgar and Heike Adel and Annemarie Friedrich},
  journal={arXiv preprint arXiv:2505.17565},
  year={2025}
}
```
