
The Trilemma of Truth in Large Language Models

arXiv 🤗 Datasets License: MIT Email DOI

This repository is the codebase for our paper on evaluating factual reasoning in large language models.
Here you’ll find everything needed to:

  1. Generate and inspect our three Trilemma datasets (city locations, drug indications, word definitions),
  2. Run zero-shot prompts,
  3. Train and evaluate a suite of probe models (from mean-difference to our sAwMIL).

Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.

Figure: Abstract pipeline.


Table of Contents

  • 📘 Repository Overview
  • ⚡ Installation
  • 📝 Usage & Examples
  • 🗂️ Dataset
  • ✍️ How to Cite?
  • 📝 To Do
  • 📃 Licenses

📘 Repository Overview

This repository contains the code used to generate the results presented in the paper. Along with the code, we provide usage examples and results.

What is included?

  1. datasets folder contains the datasets (e.g., statements) that we use. The subfolders contain the notebooks that we used to generate the datasets, as well as the synthetic entities and statements.
  2. outputs/probes/prompt contains the scores for the zero-shot prompting (for every model, dataset, and instruction phrasing). These can be loaded using the DataHandler class.
  3. outputs/probes/mean_diff contains an example of results for the mean-difference probe (Llama-3-8b model, city_locations dataset, based on the activations of the 7th decoder).
  4. configs contains experiment configurations; Hydra uses these to run experiments.
  5. outputs/activations/llama-3-8b contains activations for the city_locations dataset (13th decoder).
  6. outputs/probes contains examples of coefficients and statistics for the probes trained on the llama-3-8b activations (city_locations dataset).

What is not included?

  1. Activations and coefficients for the trained probes (we only include activations for the 13th decoder of the llama-3-8b model and the city_locations dataset).
  2. Code to generate plots.

sAwMIL (Sparse Aware Multiple Instance Learning) Implementation

The code for sAwMIL is partially based on the garydoranjr/misvm repository (which contains an sbMIL implementation for older versions of Python and cvxopt). We adapted the MISVM code to python=3.11.11 and cvxopt=1.3.2. The patched code for sAwMIL is located in the probes/sawmil script.

Note

A standalone alpha version of the sAwMIL package is available on PyPI and at carlomarxd/sawmil.

⚡ Installation

Clone the repository:

git clone https://github.com/carlomarxdk/trilemma-of-truth.git
cd trilemma-of-truth

Warning

Activation files stored in outputs/activations/llama-3-8b might take up to 4GB (you may decide to exclude them when cloning the repository). These files are stored using Git LFS; you can skip them while cloning with:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/carlomarxdk/trilemma-of-truth.git
cd trilemma-of-truth
### If you later want to fetch these files, run:
# git lfs pull

Install dependencies:

pip install -r requirements.txt

Additionally, refer to macOS using Homebrew, Pyenv, and Pipenv for help.

Get HuggingFace Access Tokens for gated models:

Note

If you intend to use gated LLMs, you need to update the configs/model files for some of the models. For example, for base_gemma.yaml, you need to set the token field to a valid access token (see huggingface.co/settings/tokens). The same applies to base_llama, _llama-3-8b-med, and _llama-3.1-8b-bio.

📝 Usage & Examples

We use Hydra to run and manage our experiments. Refer to Hydra Documentation for help.

Run the Scripts

0. Return full error log in Hydra

To see the full error trace from Hydra, prepend HYDRA_FULL_ERROR=1 to a command. For example:

HYDRA_FULL_ERROR=1 python run_zero_shot.py model=llama-3-8b 

1. Collect Hidden Activations

To run experiments (e.g., train probes) on your machine, you first need to collect hidden activations. The command below collects hidden activations for every statement in the datasets; you only have to specify the name of the model. See configs/activations.yaml for more information on the attributes.

# Collect hidden activations (for every statement) with a specific model
python collect_activations.py model=llama-3-8b # see configs/activations.yaml for all the parameters

After you have collected the activations, you can load them using the notebooks/load_and_split_dataset notebook; a minimal sketch is shown below.
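
As a quick preview, the collected activations can be loaded with the DataHandler class. This is a minimal sketch; the arguments mirror the fuller example in the Dataset section below, and the notebook covers the complete workflow.

from data_handler import DataHandler

# Minimal sketch: load the activations collected above for llama-3-8b.
# `activation_type='full'` loads the representations of all tokens in each
# statement; use 'last' to load only the last-token representations.
dh = DataHandler(
    model='llama-3-8b',
    datasets=['city_locations', 'city_locations_synthetic'],
    activation_type='full',
    with_calibration=True,
    load_scores=False,
)
dh.assemble(test_size=0.25, calibration_size=0.25, seed=42, exclusive_split=True)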

(Optional) Compress the activations

Files that store activations are quite large. You can run compress_activations.py to further reduce their size (the DataHandler object can handle both uncompressed and compressed activations):

python compress_activations.py model=llama-3-8b # see configs/activations.yaml for all the parameters

This reduces the file size by 15-60% (earlier layers have a lower compression rate).

2. Run zero-shot prompt (and collect scores)

You can collect the zero-shot prompting scores without collecting activations first.

# Collect scores with the zero-shot prompting method (i.e., replies to multiple-choice questions)
python run_zero_shot.py model=llama-3-8b variation=default batch_size=12 # see configs/probe_prompt.yaml for all the available parameters

Note that we provide scores for every model in the outputs/probes/prompt folder. An example of how to load the zero-shot prompting scores is given in the notebooks/load_and_split_dataset notebook and sketched below.
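
A minimal sketch of attaching these scores to the assembled data via the DataHandler (the arguments mirror the example in the Dataset section; the key difference is load_scores=True):

from data_handler import DataHandler

# Sketch: `load_scores=True` appends the zero-shot scores (computed with the
# `default`, `shuffled`, or `tf` templates) to the assembled data, if available.
dh = DataHandler(
    model='llama-3-8b',
    datasets=['city_locations', 'city_locations_synthetic'],
    activation_type='last',
    with_calibration=True,
    load_scores=True,
)
dh.assemble(test_size=0.25, calibration_size=0.25, seed=42, exclusive_split=True)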

3. Train sAwMIL probe

3.1. One-vs-all

Note that you must collect activations before training this probe. Generally, you need to train three one-vs-all sAwMIL probes: one with task=0, one with task=1, and one with task=2; see Task Specification.

# Train one-vs-all probe (an example without the hyperparameter search)
python run_training.py --config-name=probe_mil.yaml \
model=llama-3-8b datapack=city_locations probe=sawmil task=0 search=False

3.2 Multiclass

After you have collected all the activations and trained the three one-vs-all sAwMIL probes, you can proceed with training the multiclass probe. The run_mc_training.py script runs only with task=-1.

python run_mc_training.py --config-name=probe_mil.yaml \
model=llama-3-8b datapack=city_locations probe=sawmil task=-1 search=False 

A small example is provided in the make_predictions.ipynb notebook.
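
If you prefer to script the full sequence, here is a minimal sketch (assuming the same Hydra overrides as in the commands above) that trains the three one-vs-all sAwMIL probes and then the multiclass probe:

import subprocess

# Sketch: train the three one-vs-all sAwMIL probes (task=0, 1, 2), then the
# multiclass probe (task=-1), reusing the Hydra overrides shown above.
common = [
    "--config-name=probe_mil.yaml",
    "model=llama-3-8b",
    "datapack=city_locations",
    "probe=sawmil",
    "search=False",
]

for task in (0, 1, 2):
    subprocess.run(["python", "run_training.py", *common, f"task={task}"], check=True)

# The multiclass probe relies on the three one-vs-all probes trained above.
subprocess.run(["python", "run_mc_training.py", *common, "task=-1"], check=True)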

4. Single Instance Probe

The Single Instance Learning (SIL) probes use only the representation of the last token (instead of bags).

4.1 Train one-vs-all SVM probe

Generally, you need to train three SVM probes: one with task=0, one with task=1, and one with task=2; see Task Specification.

python run_training.py --config-name=probe_sil.yaml \
model=llama-3-8b datapack=city_locations probe=svm task=1

4.2 Train multiclass SVM probe

After you have collected all the activations and trained the three one-vs-all SVM probes, you can proceed with training the multiclass probe. The run_mc_training.py script runs only with task=-1.

python run_mc_training.py --config-name=probe_sil.yaml \
model=llama-3-8b datapack=city_locations probe=svm task=-1

4.3 Train the mean-difference probe

The mean-difference probe is trained to separate true from false statements; thus, use task=3.

python run_training.py --config-name=probe_sil.yaml \
model=llama-3-8b datapack=city_locations probe=mean_diff task=3

5. Extra

5.1 Generalization Performance

To check the performance of a probe on another dataset, you can run run_generalization.py. It loads the probe trained on datapack and evaluates it on the test split of datapack@datapack_test.

python run_generalization.py --config-name=probe_mil.yaml \
model=llama-3-8b datapack=city_locations datapack@datapack_test=med_indications \
probe=sawmil search=True task=-1

5.2 Interventions

The code for interventions is located in run_intervention.py.

python run_intervention.py --config-name=interventions.yaml model=llama-3-8b datapack=city_locations  task=0

Task specification

You can train probes using different task configurations (see misc/task.py). We have 5 tasks (a sketch of the label mapping follows the list):

  • True-vs-All (task=0): Separate true instances from all others (false and neither-valued cases);
  • False-vs-All (task=1): Separate false instances from all others (true and neither cases);
  • Neither-vs-All (task=2): Separate neither instances from all others (true and false cases);
  • True-vs-False (task=3): Separate true and false cases (the neither statements are filtered out);
  • Multiclass (task=-1): Multiclass setup, where labels correspond to 0=true, 1=false and 2=neither.
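
For illustration only, here is a hypothetical sketch of how these task IDs could map raw labels to training targets. The actual mapping is defined in misc/task.py and may differ; the names below (TRUE, FALSE, NEITHER, to_target) are illustrative, not part of the codebase.

# Hypothetical sketch of the task-to-target mapping described above.
# The real logic lives in misc/task.py; the names below are illustrative.
TRUE, FALSE, NEITHER = 0, 1, 2   # multiclass labels used when task=-1

def to_target(label: int, task: int) -> int | None:
    """Map a raw label to a training target for the given task (None = drop)."""
    if task == -1:                      # Multiclass: keep the label as-is
        return label
    if task == 0:                       # True-vs-All
        return int(label == TRUE)
    if task == 1:                       # False-vs-All
        return int(label == FALSE)
    if task == 2:                       # Neither-vs-All
        return int(label == NEITHER)
    if task == 3:                       # True-vs-False: filter out 'neither'
        return None if label == NEITHER else int(label == TRUE)
    raise ValueError(f"Unknown task: {task}")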

🗂️ Dataset

The dataset scripts and files are located in the datasets/ folder. This includes everything from data generation to the final preprocessed splits used in our experiments.

Structure

  1. datasets/generators/: Jupyter notebooks for data preprocessing and generation, along with intermediate data.
  2. datasets/generators/synthetic/: Contains synthetic object/name lists (*_raw.txt) and manually filtered name lists (*_checked.csv).
  3. datasets/: Final preprocessed CSV files used to assemble the following datasets:
    • City Locations: ["city_locations.csv", "city_locations_synthetic.csv"]
    • Medical Indications: ["med_indications", "med_indications_synthetic"]
    • Word Definitions: ["word_instances", "word_types", "word_synonyms", "word_types_synthetic", "word_instances_synthetic", "word_synonyms_synthetic"]

These datasets are used across our scripts to train probes and evaluate results.
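
To quickly inspect the raw statements, here is a minimal sketch using pandas (assuming the file lives at datasets/city_locations.csv, per the list above; column names may differ):

import pandas as pd

# Sketch: peek at one of the final preprocessed CSV files directly.
# The path is assumed from the file list above.
df = pd.read_csv("datasets/city_locations.csv")
print(df.shape)
print(df.head())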

Load Data with DataHandler

You can load and assemble datasets using the DataHandler class:

from data_handler import DataHandler

dh = DataHandler(
    model='llama-3-8b',
    datasets=['city_locations', 'city_locations_synthetic'],
    activation_type='full', # load the representation of all the tokens in each statement (alternatively, you can use `last`)
    with_calibration=True,    # Include a calibration set
    load_scores=False         # if you ran zero-shot prompting with the `default`,
    # `shuffled`, or `tf` template, setting this to True appends those scores to the data (if computed)
)

dh.assemble(
    test_size=0.25,
    calibration_size=0.25,
    seed=42,
    exclusive_split=True      # Ensures entities don’t appear in multiple splits;
    # with `True`, the train, test, and calibration splits only approximately match
    # the requested sizes (here, the test set will be roughly 25% of all samples).
)

For more usage examples, see the notebooks/ folder.

Processed Data on Hugging Face 🤗

The final preprocessed datasets, including standardized splits, are also available on Hugging Face Datasets. These are ideal if you want to skip local preprocessing and directly load ready-to-use datasets into your workflow. They follow the same structure and splitting scheme we use internally. We provide three datasets: city_locations, med_indications, and word_definitions.

Important

Note I: These Hugging Face-hosted datasets are not used in our experiments.

Note II: All experiments in this repository (e.g., collect_activations.py, probe evaluations) rely on the DataHandler class, which assembles the datasets locally from the datasets/ folder.

Note III: The calibration split is labeled as validation, following Hugging Face naming conventions (train, validation, test).

How to use HF? First, install the 🤗 Datasets and pandas libraries:

pip install datasets pandas

Then load the data with the datasets package. The dataset identifier is carlomarxx/trilemma-of-truth.

from datasets import load_dataset

# 1. Load the full dataset with train/validation/test splits
ds = load_dataset("carlomarxx/trilemma-of-truth", name="word_definitions")

# Convert to pandas
df = ds["train"].to_pandas()

# Access the first example
print(ds["train"][0])

# 2. Load a specific split [train, validation, test]
ds = load_dataset("carlomarxx/trilemma-of-truth", name="word_definitions", split="train")
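
Since the calibration split is published under the validation name (see Note III above), here is a small sketch that loads all three configurations and treats validation as the calibration set:

from datasets import load_dataset

# Sketch: load every configuration; 'validation' corresponds to the calibration split.
for name in ("city_locations", "med_indications", "word_definitions"):
    ds = load_dataset("carlomarxx/trilemma-of-truth", name=name)
    calibration = ds["validation"]
    print(name, {split: len(ds[split]) for split in ds})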

✍️ How to Cite?

Preprint

@inproceedings{trilemma2025preprint,
      title={The Trilemma of Truth in Large Language Models},
      author={Savcisens, Germans and Eliassi-Rad, Tina},
      booktitle={arXiv preprint arXiv:2506.23921},
      year={2025}
    }

Code

The citation for the latest version:

@software{trilemma2025code,
  author       = {Savcisens, Germans and
                  Eliassi-Rad, Tina},
  title        = {carlomarxdk/trilemma-of-truth: SEE VERSION AT THE TOP OF THE REPOSITORY}, #example: v0.5.1
  month        = aug,
  year         = 2025,
  publisher    = {Zenodo},
  version      = {SEE VERSION AT THE TOP OF THE REPOSITORY},  #example: v0.5.1
  doi          = {INSERT ZENODO DOI AT THE TOP}, #example: 10.5281/zenodo.15779092
  url          = {https://doi.org/_INSERT ZENODO DOI AT THE TOP_}, #example: 10.5281/zenodo.15779092
}

Data

@misc{trilemma2025data,
  author       = { Germans Savcisens and Tina Eliassi-Rad },
  title        = { trilemma-of-truth (Revision cd49e0e) },
  year         = 2025,
  url          = { https://huggingface.co/datasets/carlomarxx/trilemma-of-truth },
  doi          = { 10.57967/hf/5900 },
  publisher    = { Hugging Face }
}

📝 To Do

Warning

We have refactored the code to improve readability. Please let us know if something does not work.

  • Check run_zero_shot.py
  • Check collect_activations.py
  • Check run_training.py for SIL probes (SVM and Mean Difference)
  • Check run_training.py for sAwMIL
  • Add the multiclass SIL and MIL script
  • Check the multiclass SIL (SVM)
  • Check the multiclass MIL (sAwMIL)
  • Upload llama-3-8b activations for the city_locations dataset
  • Add code for interventions and cross-dataset generalization
  • Check the script for interventions and the cross-dataset generalization
  • Add scripts/notebooks for plot generation
  • Add examples: data loading
  • Describe the contents of the repository

📃 Licenses

Contacts:

Important

This code is licensed under the MIT License. See LICENSE for more information. The data is licensed under the Creative Commons Attribution 4.0 (CC BY 4.0).