This repository is the codebase for our paper on evaluating factual reasoning in large language models.
Here you’ll find everything needed to:
- Generate and inspect our three Trilemma datasets (city locations, medical indications, word definitions),
- Run zero-shot prompts,
- Train and evaluate a suite of probe models (from mean-difference to our sAwMIL).
Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several flawed assumptions. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.
- The Trilemma of Truth in Large Language Models
- Table of Contents
- 📘 Repository Overview
- ⚡ Installation
- 📝 Usage & Examples
- 🗂️ Dataset
- ✍️ How to Cite?
- 📝 To Do
- 📃 Licenses
This repository contains the code used to generate the results presented in the paper. Along with the code, we provide usage examples and results:
- `datasets/` contains the datasets (e.g., statements) that we use. The subfolders contain the notebooks we used to generate the datasets, as well as the synthetic entities and statements.
- `outputs/probes/prompt` contains the scores for zero-shot prompting (for every model, dataset, and instruction phrasing). These can be loaded with the `DataHandler` class.
- `outputs/probes/mean_diff` contains an example of results for the mean-difference probe (`llama-3-8b` model, `city_locations` dataset, based on the activations of the 7th decoder).
- `configs/` contains experiment configurations; `Hydra` uses these to run experiments.
- `outputs/activations/llama-3-8b` contains activations for the `city_locations` dataset (13th decoder).
- `outputs/probes` contains examples of coefficients and statistics for the probes trained on the `llama-3-8b` activations (`city_locations` dataset).
- Activations and coefficients for the trained probes (we only include activations for the 13th decoder of the `llama-3-8b` model and the `city_locations` dataset).
- Code to generate plots.
The code for `sAwMIL` is partially based on the garydoranjr/misvm repository (which contains an `sbMIL` implementation for older versions of Python and `cvxopt`). We adapted the MISVM code for `python=3.11.11` and `cvxopt=1.3.2`. The patched code for `sAwMIL` is located in the `probes/sawmil` script.
Note
An alpha version of the standalone `sAwMIL` package is available on PyPI and at carlomarxd/sawmil.
Clone the repository:
git clone https://github.com/carlomarxdk/trilemma-of-truth.git
cd trilemma-of-truth
Warning
The activation files stored in `outputs/activations/llama-3-8b` may take up to 4 GB (you may want to exclude them when cloning the repository). They are stored with Git LFS, so you can skip them while cloning:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/carlomarxdk/trilemma-of-truth.git
cd trilemma-of-truth
### If you want to fetch these files later, run:
# git lfs pull
Install dependencies:
pip install -r requirements.txt
For additional help on macOS, refer to macOS using Homebrew, Pyenv, and Pipenv.
Get HuggingFace Access Tokens for gated models:
Note
If you intend to use gated LLMs, you need to update the `configs/model` files for some of the models. For example, in the case of `base_gemma.yaml`, you need to set the `token` field to a valid Access Token; see huggingface.co/settings/tokens. The same applies to `base_llama`, `_llama-3-8b-med`, and `_llama-3.1-8b-bio`.
We use `Hydra` to run and manage our experiments; refer to the Hydra Documentation for help. To see the full error trace when a run fails, you can prepend `HYDRA_FULL_ERROR=1` to any command. For example:
HYDRA_FULL_ERROR=1 python run_zero_shot.py model=llama-3-8b
1. Collect Hidden Activations
To run experiments (e.g., train probes) on your machine, you first need to collect hidden activations. The command below collects hidden activations for every statement in the datasets; you only have to specify the name of the model. See `configs/activations.yaml` for more information on the attributes.
# Collect hidden activations (for every statement) with a specific model
python collect_activations.py model=llama-3-8b # see configs/activations.yaml for all the parameters
After you have collected the activations, you can load them using the code in the notebooks/load_and_split_dataset notebook.
Activation files are large. You can run `compress_activations.py` to further reduce their size (the `DataHandler` object can handle both uncompressed and compressed activations):
python compress_activations.py model=llama-3-8b # see configs/activations.yaml for all the parameters
This reduces the file size by 15-60% (earlier layers have a lower compression rate).
You can collect the zero-shot prompting scores without having activations.
# Collect scores with the zero-shot prompting method (aka replies to multiple choice questions)
python run_zero_shot.py model=llama-3-8b variation=default batch_size=12 # see configs/probe_prompt.yaml for all the available parameters
Note that we provide scores for every model in the `outputs/probes/prompt` folder. An example of how to load the zero-shot prompting scores is provided in the notebooks/load_and_split_dataset notebook; a short sketch is also shown below.
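For instance, the precomputed scores can be attached when assembling a dataset with the `DataHandler` class (a minimal sketch, assuming the `llama-3-8b` activations for `city_locations` are available locally and that `load_scores=True` triggers the score loading described above):

```python
from data_handler import DataHandler

# Sketch: assemble city_locations with the precomputed zero-shot scores attached.
dh = DataHandler(
    model='llama-3-8b',
    datasets=['city_locations', 'city_locations_synthetic'],
    activation_type='last',   # last-token representations
    with_calibration=True,    # include a calibration set
    load_scores=True,         # append the zero-shot prompting scores, if they have been computed
)
dh.assemble(test_size=0.25, calibration_size=0.25, seed=42, exclusive_split=True)
```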
Note that you must collect activations before training this probe. Generally, you need to train three one-vs-all probes: one with `task=0`, one with `task=1`, and one with `task=2` (see Task Specification).
# Train one-vs-all probe (an example without the hyperparameter search)
python run_training.py --config-name=probe_mil.yaml \
model=llama-3-8b datapack=city_locations probe=sawmil task=0 search=False
After you have collected all the activations and trained the three one-vs-all `sAwMIL` probes, you can proceed with training the multiclass one.
The `run_mc_training.py` script runs only with `task=-1`.
python run_mc_training.py --config-name=probe_mil.yaml \
model=llama-3-8b datapack=city_locations probe=sawmil task=-1 search=False
A small example is provided in the `make_predictions.ipynb` notebook.
The Single-Instance Learning (SIL) probes use only the representation of the last token (instead of bags).
Generally, you need to train three SVM probes: one with `task=0`, one with `task=1`, and one with `task=2` (see Task Specification).
python run_training.py --config-name=probe_sil.yaml \
model=llama-3-8b datapack=city_locations probe=svm task=1
After you have collected all the activations and trained the three one-vs-all SVM probes, you can proceed with training the multiclass one.
The `run_mc_training.py` script runs only with `task=-1`.
python run_mc_training.py --config-name=probe_sil.yaml \
model=llama-3-8b datapack=city_locations probe=svm task=-1
The mean-difference probe is trained to separate true from false statements; thus, use `task=3`.
python run_training.py --config-name=probe_sil.yaml \
model=llama-3-8b datapack=city_locations probe=mean_diff task=3
To check the performance of a probe on another dataset, you can run `run_generalization.py`. It loads the probe trained on `datapack` and evaluates it on the test split of `datapack@datapack_test`.
python run_generalization.py --config-name=probe_mil.yaml model=llama-3-8b datapack=city_locations datapack@datapack_test=med_indications probe=sawmil search=True task=-1
The code for interventions is located in `run_intervention.py`.
python run_intervention.py --config-name=interventions.yaml model=llama-3-8b datapack=city_locations task=0
You can train probes using different task configurations (see misc/task.py). We have 5 tasks (a short label-mapping sketch follows the list):
- True-vs-All (`task=0`): Separate true instances from all others (false and neither-valued cases);
- False-vs-All (`task=1`): Separate false instances from all others (true and neither cases);
- Neither-vs-All (`task=2`): Separate neither instances from all others (true and false cases);
- True-vs-False (`task=3`): Separate true and false cases (the neither statements are filtered out);
- Multiclass (`task=-1`): Multiclass setup, where labels correspond to `0=true`, `1=false`, and `2=neither`.
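As a quick reference, the label convention from the list above can be written as a plain Python mapping (a sketch for illustration only; the canonical definitions live in misc/task.py):

```python
# Label convention for the multiclass setup (task=-1).
MULTICLASS_LABELS = {
    0: "true",
    1: "false",
    2: "neither",
}

# The one-vs-all tasks use the same categories as the positive class:
# task=0 -> true-vs-all, task=1 -> false-vs-all, task=2 -> neither-vs-all,
# task=3 -> true-vs-false (neither statements are filtered out).
```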
The dataset scripts and files are located in the `datasets/` folder. This includes everything from data generation to the final preprocessed splits used in our experiments.
- `datasets/generators/`: Jupyter notebooks for data preprocessing and generation, along with intermediate data.
- `datasets/generators/synthetic/`: synthetic object/name lists (`*_raw.txt`) and manually filtered name lists (`*_checked.csv`).
- `datasets/`: final preprocessed CSV files used to assemble the following datasets:
  - City Locations: `["city_locations.csv", "city_locations_synthetic.csv"]`
  - Medical Indications: `["med_indications", "med_indications_synthetic"]`
  - Word Definitions: `["word_instances", "word_types", "word_synonyms", "word_types_synthetic", "word_instances_synthetic", "word_synonyms_synthetic"]`
These datasets are used across our scripts to train probes and evaluate results.
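If you just want to inspect the raw statements, the final CSV files can be read directly (a minimal sketch, assuming pandas is installed; the column layout is defined by the files themselves):

```python
import pandas as pd

# Sketch: peek at one of the preprocessed CSV files.
df = pd.read_csv("datasets/city_locations.csv")
print(df.shape)
print(df.head())
```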
You can load and assemble datasets using the `DataHandler` class:
from data_handler import DataHandler

dh = DataHandler(
    model='llama-3-8b',
    datasets=['city_locations', 'city_locations_synthetic'],
    activation_type='full',  # load the representations of all tokens in each statement (alternatively, you can use `last`)
    with_calibration=True,   # include a calibration set
    load_scores=False        # if you ran zero-shot prompting with the `default`, `shuffled`, or `tf` template,
                             # this appends those scores to the data (if they have been calculated)
)
dh.assemble(
    test_size=0.25,
    calibration_size=0.25,
    seed=42,
    exclusive_split=True     # ensures entities do not appear in multiple splits;
                             # `True` makes the train, test, and calibration splits only approximately match
                             # your specifications (here, the test set is roughly 25% of all samples)
)
For more usage examples, see the notebooks/ folder.
The final preprocessed datasets, including standardized splits, are also available on Hugging Face Datasets. These are ideal if you want to skip local preprocessing and directly load ready-to-use datasets into your workflow. They follow the same structure and splitting scheme we use internally. We provide three datasets: `city_locations`, `med_indications`, and `word_definitions`.
Important
Note I: These Hugging Face-hosted datasets are not used in our experiments.
Note II: All experiments in this repository (e.g., `collect_activations.py`, probe evaluations) rely on the `DataHandler` class, which assembles the datasets locally from the `datasets/` folder.
Note III: The calibration split is labeled as `validation`, following Hugging Face naming conventions (`train`, `validation`, `test`).
How to use HF? First, install the 🤗 Datasets and `pandas` libraries:
pip install datasets pandas
Then load the data with the `datasets` package. The dataset identifier is `carlomarxx/trilemma-of-truth`.
from datasets import load_dataset
# 1. Load the full dataset with train/validation/test splits
ds = load_dataset("carlomarxx/trilemma-of-truth", name="word_definitions")
# Convert to pandas
df = ds["train"].to_pandas()
# Access the first example
print(ds["train"][0])
# 2. Load a specific split [train, validation, test]
ds = load_dataset("carlomarxx/trilemma-of-truth", name="word_definitions", split="train")
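Since the calibration split is published under the `validation` name (see Note III above), it can be loaded in the same way; a minimal sketch:

```python
# Sketch: load the calibration split (exposed as "validation" on Hugging Face).
calib = load_dataset("carlomarxx/trilemma-of-truth", name="word_definitions", split="validation")
print(len(calib))
```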
@inproceedings{trilemma2025preprint,
title={The Trilemma of Truth in Large Language Models},
author={Savcisens, Germans and Eliassi‐Rad, Tina},
booktitle={arXiv preprint arXiv:2506.23921},
year={2025}
}
To cite the latest version of the codebase:
@software{trilemma2025code,
author = {Savcisens, Germans and
Eliassi-Rad, Tina},
title = {carlomarxdk/trilemma-of-truth: SEE VERSION AT THE TOP OF THE REPOSITORY}, #example: v0.5.1
month = aug,
year = 2025,
publisher = {Zenodo},
version = {SEE VERSION AT THE TOP OF THE REPOSITORY}, #example: v0.5.1
doi = {INSERT ZENODO DOI AT THE TOP}, #example: 10.5281/zenodo.15779092
url = {https://doi.org/_INSERT ZENODO DOI AT THE TOP_}, #example: 10.5281/zenodo.15779092
}
@misc{trilemma2025data,
author = { Germans Savcisens and Tina Eliassi-Rad },
title = { trilemma-of-truth (Revision cd49e0e) },
year = 2025,
url = { https://huggingface.co/datasets/carlomarxx/trilemma-of-truth },
doi = { 10.57967/hf/5900 },
publisher = { Hugging Face }
}
Warning
We have refactored the code to improve readability. Please let us know if something does not work.
- Check `run_zero_shot.py`
- Check `collect_activations.py`
- Check `run_training.py` for SIL probes (SVM and mean-difference)
- Check `run_training.py` for `sAwMIL`
- Add the multiclass SIL and MIL script
- Check the multiclass SIL (SVM)
- Check the multiclass MIL (`sAwMIL`)
- Upload `llama-3-8b` activations for the `city_locations` dataset
- Add code for interventions and cross-dataset generalization
- Check the script for interventions and the cross-dataset generalization
- Add scripts/notebooks for plot generation
- Add examples: data loading
- Describe the contents of the repository
Contacts:
- Germans Savcisens (@carlomarxdk)
- Tina Eliassi-Rad (@eliassi)
Important
This code is licensed under the MIT License. See LICENSE for more information. The data is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0).