This repo computes the faithfulness of Natural Language Explanations (NLEs) generated by decoder-only transformer LLMs.
The activation patching code is adapted from https://github.com/kmeng01/rome.
Install the requirements first by running `pip install -r requirements.txt`.
The datasets for the 3 tasks are stored in data/. For CoS-E and e-SNLI there are only 100 manually edited counterfactuals each, whereas ComVE contains many more since no manual annotation is needed: we simply swap in the correct sentence.
e-SNLI is taken from https://github.com/OanaMariaCamburu/e-SNLI
ComVE is taken from https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation. We adapt Task A so that only the incorrect statement is presented as the original input, and the correct statement is patched in as the counterfactual. The gold explanations used for plausibility scoring are derived from Task C.
The causal matrix has size T x L, where T is the input length and L is the number of layers in the model. Note that T only covers tokens from the corrupted position onwards, since patching tokens before the corruption is meaningless and yields zero indirect effect.
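For intuition, here is a minimal sketch of how such a T x L matrix could be assembled; the `indirect_effect` helper is a hypothetical stand-in for a single patching run and is not part of this repo:

```python
import numpy as np

def build_causal_matrix(indirect_effect, corrupt_start, seq_len, num_layers):
    # Rows index token positions from the corrupted position onwards, columns index layers.
    T = seq_len - corrupt_start
    C = np.zeros((T, num_layers))
    for i, pos in enumerate(range(corrupt_start, seq_len)):
        for j in range(num_layers):
            C[i, j] = indirect_effect(pos, j)  # one patching run at (position, layer)
    return C
```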
We currently support Gemma-2 models, as the counterfactuals were created to align with their tokenization scheme, i.e. STR requires edits of matching token length: `cf_subject` and `subject` must occupy the same number of tokens when tokenized.
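A quick way to verify this alignment is to compare token lengths with the model's tokenizer; a small sketch (the checkpoint id below is an example, use whichever Gemma-2 checkpoint you run):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-2b")  # example checkpoint

def same_token_length(subject: str, cf_subject: str) -> bool:
    a = tok(subject, add_special_tokens=False)["input_ids"]
    b = tok(cf_subject, add_special_tokens=False)["input_ids"]
    return len(a) == len(b)

print(same_token_length("the red car", "the blue car"))
```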
The faithfulness score, Causal Faithfulness, is computed as the cosine similarity between C^a and C^e, where C^a is the causal matrix for the answer, with rows indexing token positions and columns indexing layers; C^a_{i,j} is thus the causal indirect effect after patching at position i and layer j. C^e is defined analogously for the explanation.
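A minimal sketch of this computation, assuming both matrices are flattened into vectors before taking the cosine similarity (the repo's own scoring lives in compute_scores.py):

```python
import numpy as np

def causal_faithfulness(C_a: np.ndarray, C_e: np.ndarray) -> float:
    # Cosine similarity between the flattened answer and explanation causal matrices.
    a, e = C_a.ravel(), C_e.ravel()
    return float(a @ e / (np.linalg.norm(a) * np.linalg.norm(e)))
```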
If using STR, the most important requirement is that the edits have the same token length; otherwise the patched positions will not correspond to the same tokens, which invalidates the findings.
Other models can be supported by modifying the `layername` function in utils/causal_trace.py so that the model's layer names are detected by the function; this is used to hook the inner computations so that activations can be cached. Also modify `get_model_path` in utils/model_data_utils.py to load the models.
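As a rough guide, a branch like the following could be added for a LLaMA-style decoder whose blocks live under `model.model.layers`; the signature follows the ROME code this repo adapts, but the attribute names below are assumptions about that architecture rather than this repo's exact implementation:

```python
def layername(model, num, kind=None):
    # ... existing branches for supported models (e.g. Gemma-2) ...
    if hasattr(model, "model") and hasattr(model.model, "layers"):
        # LLaMA/Gemma-style decoders keep their blocks under model.model.layers.
        if kind == "embed":
            return "model.embed_tokens"
        return f"model.layers.{num}" + ("" if kind is None else f".{kind}")
    raise ValueError(f"Unsupported architecture: {type(model).__name__}")
```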
Run `run_faithfulness.sh`, which runs two scripts: `get_predictions.py` and `main.py`.
`get_predictions.py` obtains the answer and NLE from the model, as well as the low scores (logits and probabilities) of both the answer and the NLE. Low scores refer to pure corruption without any patching, i.e. p*(y).
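For reference, a hedged sketch of how p*(y) could be computed under ROME-style corruption (Gaussian noise on the embeddings of the corrupted span); the actual get_predictions.py may differ in details:

```python
import torch

def low_score(model, tokenizer, prompt, answer, corrupt_slice, noise=0.1):
    # p*(y): score of the answer under pure corruption, with no patching.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        embeds = model.get_input_embeddings()(inputs["input_ids"])
        # Corrupt the chosen span (e.g. the subject tokens) with Gaussian noise.
        embeds[:, corrupt_slice] += noise * torch.randn_like(embeds[:, corrupt_slice])
        logits = model(inputs_embeds=embeds).logits[0, -1]
    answer_id = tokenizer(answer, add_special_tokens=False)["input_ids"][0]
    probs = torch.softmax(logits, dim=-1)
    return logits[answer_id].item(), probs[answer_id].item()
```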
`main.py` performs activation patching over L and T for each sample and stores the values to a file. You can change the `metric` field to run different evaluations from [cc_shap, cf_edit, plausibility]. Note that you need to set your OPENAI_API_KEY for plausibility.
Run `get_score.sh` to compute the scores.
`compute_scores.py` computes the scores according to `metric`; `causal` is our metric, while `cf_edit` and `cc_shap` refer to the other tests. `plot.py` produces the causal plots.
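For reference, a causal matrix of this shape can be rendered as a simple heatmap; this is a minimal sketch, not the repo's plot.py:

```python
import matplotlib.pyplot as plt

def plot_causal_matrix(C, title="Causal indirect effects"):
    # Rows = token positions (from the corrupted position onwards), columns = layers.
    fig, ax = plt.subplots()
    im = ax.imshow(C, aspect="auto")
    ax.set_xlabel("Layer")
    ax.set_ylabel("Token position")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    return fig
```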
Please cite this work if you found it useful!
@inproceedings{yeo-etal-2025-towards,
title = "Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models",
author = "Yeo, Wei Jie and
Satapathy, Ranjan and
Cambria, Erik",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.529/",
doi = "10.18653/v1/2025.emnlp-main.529",
pages = "10436--10458",
ISBN = "979-8-89176-332-6",
abstract = "Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be readily trusted at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching, to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness quantifies the consistency of causal attributions between explanations and the corresponding model outputs as the indicator of faithfulness. We experimented across models varying from 2B to 27B parameters and found that models that underwent alignment-tuning tend to produce more faithful and plausible explanations. We find that Causal Faithfulness is a promising improvement over existing faithfulness tests by taking into account the model{'}s internal computations and avoiding out-of-distribution concerns that could otherwise undermine the validity of faithfulness assessments."
}

