This repository contains replication code for a research project evaluating large language models' ability to fact-check PolitiFact claims using various approaches.
[TODO: Add citation information when paper is up on arxiv]
- `code/` - all code for the project
- `data/` - all data for the project
- `figures/` - generated publication figures
- `reports/` - generated reports/text files
- `tables/` - generated publication tables
These experiments were conducted using:
- Python 3.12.7
- GNU bash, version 3.2.57(1)-release (arm64-apple-darwin24)
Set up virtual environment and install dependencies:
```shell
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install local packages
cd code/package
pip install -e ./
cd ../data_collection/politifact_scraper
pip install -e ./
cd ../../..
```

Download the required data from Zenodo via the link below.
- Zenodo DOI/link: https://doi.org/10.5281/zenodo.17693220
Then, extract the data.
This will create a directory called `data/`, which must be saved in the root directory of this repository.
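As a quick sanity check (a sketch, assuming you run it from the repository root), you can confirm the archive was extracted to the expected location:

```shell
# Returns 0 if data/ exists in the current (repo root) directory
check_data_dir() {
  [ -d data ]
}

if check_data_dir; then
  echo "data/ found in repo root"
else
  echo "data/ missing: extract the Zenodo archive into the repo root"
fi
```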
All steps, from data collection and cleaning through analysis and figure generation, can be replicated by running the bash scripts below, after setting up the virtual environment and downloading the data as described above.
```shell
# From the root directory of this project, run the below

# Change to code directory
cd code/

# Run pipelines
bash 00-db-collection-and-generation-pipeline.sh  # Scrape data, build DB
bash 01-run-factchecking-tests.sh                 # Run LLM tests
bash 02-data-analysis-pipeline.sh                 # Clean & analyze data
bash 03-generate-results-and-figures.sh           # Generate outputs
```

Certain scripts within the above pipeline are commented out by default because they take extremely long to run or would incur thousands of dollars in costs for the user. Moreover, scraped data and data generated by LLMs are unlikely to be exactly the same if collected at a later date (see this article for more details on LLM nondeterminism). We therefore prioritize transparency about our pipeline and replication, given the data we have. Nonetheless, we include all steps to show our work and allow users to replicate the entire pipeline, should they choose to do so. The bash pipeline scripts print notes about what is excluded as they execute.
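If you want a record of each stage, you can wrap the pipeline scripts with `tee` to keep timestamped logs. This is a convenience sketch, not part of the repository's scripts; the `logs/` directory is an assumption, and the loop runs from `code/`:

```shell
# Convenience sketch: run each pipeline stage and keep a timestamped
# log of its output in a logs/ directory (created here if absent)
mkdir -p logs
for script in 00-db-collection-and-generation-pipeline.sh \
              01-run-factchecking-tests.sh \
              02-data-analysis-pipeline.sh \
              03-generate-results-and-figures.sh; do
  bash "$script" 2>&1 | tee "logs/${script%.sh}-$(date +%Y%m%d-%H%M%S).log"
done
```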
Unfortunately, we cannot share the proprietary NewsGuard data that we purchased for this study.
The pipeline is set up to replicate with the Lin et al. (2023) domain-quality list, which we included in the domain-quality sensitivity analysis in the Appendix.
Should you have your own version of the NewsGuard data, you can save it in the proper location and rerun the pipeline; it will be included automatically.
The `code/data_analysis/enrich_web_url_data.py` script has more information about including the NewsGuard data.
You will also need to uncomment one script call at the bottom of the `03-generate-results-and-figures.sh` pipeline script to generate the main text figure.
See the notes in that script for details.
Various scripts require API keys. By default, these are commented out in the pipeline scripts above, as they will incur costs for the user.
Should you want to run these scripts, you will need to set the following environment variables in your system:
- `OPENAI_API_KEY`
- `GEMINI_API_KEY`
- `TOGETHER_API_KEY`
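For example, you can export them in your shell session (or add the lines to your shell profile). The values below are hypothetical placeholders; substitute your real keys:

```shell
# Replace the placeholder values with your actual API keys
export OPENAI_API_KEY="your-openai-key"
export GEMINI_API_KEY="your-gemini-key"
export TOGETHER_API_KEY="your-together-key"
```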
For questions, please reach out to Matt DeVerna by visiting his personal website for his latest contact email.