Measure and remove surface bias in LLM judges. Detect, certify, debias, ship.
pytruth is a Python library and HTTP service for diagnosing and repairing
biases in LLM-as-judge evaluation. It instruments any judge (OpenAI, Anthropic,
or open-weights) with:
- Surface perturbations: orbits of meaning-preserving rewrites
- Cognitive probes: designed experiments for sycophancy, position bias, reasoning theater, authority cues, confidence vs correctness, identity proxies
- The Depth Stack: five layers of bias beyond the surface, namely epistemic, causal, adversarial (Goodhart), mechanistic, and structural consistency
- A trained debiased judge: invariance-regularized LoRA fine-tune of meta-llama/Llama-3.1-8B-Instruct, shipped per-domain
- A minimal frontend: terminal-style web UI for scoring, comparing, and probing
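
As a rough illustration of what an orbit of meaning-preserving rewrites is, the sketch below builds a handful of surface variants of one response and measures how much a judge's score moves across them. Everything here is hypothetical and not pytruth API: `make_orbit`, `orbit_spread`, and the deliberately length-biased `toy_length_biased_judge` exist only to show the idea.

```python
def make_orbit(response):
    """A few hand-written rewrites that change surface form but not content."""
    return [
        response,
        response + "\n",                # trailing newline
        "  " + response,                # leading whitespace
        response.replace(". ", ".\n"),  # one sentence per line
        "**" + response + "**",         # markdown bold wrapper
    ]


def toy_length_biased_judge(prompt, response):
    """Stand-in judge whose score grows with character count (a deliberate bias)."""
    return min(10.0, 5.0 + 0.05 * len(response))


def orbit_spread(judge, prompt, response):
    """Max minus min score over the orbit; 0 means the judge is invariant on it."""
    scores = [judge(prompt, r) for r in make_orbit(response)]
    return max(scores) - min(scores)


spread = orbit_spread(toy_length_biased_judge, "What is 2+2?", "The answer is 4.")
print(round(spread, 2))  # non-zero: the toy judge reacts to formatting alone
```

A judge that only looked at content would give a spread of zero on this orbit; any non-zero spread is attributable to surface form.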
pip install pytruth  # core
pip install "pytruth[openai,anthropic]"  # closed-API judges
pip install "pytruth[hf]"  # open-weights judges
pip install "pytruth[serve]"  # FastAPI + frontend
pip install "pytruth[train]"  # LoRA training stack
pip install "pytruth[all]"  # everything

import asyncio
import pytruth as pt

async def main():
    judge = pt.Judge.from_mock()  # or .from_openai("gpt-4o")

    # Score one (prompt, response)
    result = await judge.score("What is 2+2?", "The answer is 4.")
    print(result.score, result.score_dist, result.confidence)

    # Probe a judge end-to-end (orbits + probes + L6 consistency)
    report = await pt.probe(judge, domain="general", suite="standard")
    print(report.bias_card.to_markdown())

    # Compare multiple judges (j1, j2, j3 are placeholders for Judge instances you have built)
    comp = await pt.compare([j1, j2, j3], domain="code", suite="full")
    print(comp.to_dataframe())

    # The full diagnostic: surface + cognitive + L2 + L3 + L4 + L6
    # (epistemic_pairs, candidates, and halo_items are placeholders for your own datasets)
    full = await pt.api.run_full_pipeline(
        judge, domain="general", suite="full",
        epistemic_pairs=epistemic_pairs,
        adversarial_candidates=candidates,
        halo_items=halo_items,
    )

asyncio.run(main())

pytruth probe mock --suite standard --output bias.md
pytruth probe openai:gpt-4o --domain code --output bias.json --fmt json
pytruth compare openai:gpt-4o anthropic:claude-sonnet-4-7 mock --domain code
pytruth serve mock --host 0.0.0.0 --port 8000
pytruth train --base meta-llama/Llama-3.1-8B-Instruct --domain code --data prefs.jsonl

`pt.serve(judge)` boots a FastAPI service with the minimal Ollama-style frontend at `/`.
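
A client for the service could look like the sketch below. It uses only the Python standard library and assumes a server started with `pytruth serve ... --port 8000` is reachable at `http://localhost:8000`. The helper names `build_score_request` and `score_remote` are hypothetical and not part of pytruth; the request fields follow the `POST /v1/score` endpoint of the service.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: matches the --port used with `pytruth serve`


def build_score_request(prompt, response, rubric=None, base_url=BASE_URL):
    """Build (but do not send) a POST /v1/score request."""
    payload = {"prompt": prompt, "response": response}
    if rubric is not None:  # rubric is optional in the endpoint description
        payload["rubric"] = rubric
    return urllib.request.Request(
        url=f"{base_url}/v1/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_remote(prompt, response, rubric=None, base_url=BASE_URL, timeout=30):
    """Send the request and return the decoded JSON body (needs a running server)."""
    req = build_score_request(prompt, response, rubric, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Building the request needs no network, so it is safe to inspect offline.
req = build_score_request("What is 2+2?", "The answer is 4.")
print(req.get_method(), req.full_url)
```

With a server running, `score_remote("What is 2+2?", "The answer is 4.")` should return a JSON object whose keys, per the endpoint description, are `score`, `score_dist`, and `confidence`.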
POST /v1/score { prompt, response, rubric? } → { score, score_dist, confidence }
POST /v1/compare { prompt, a, b, check_invariance } → { winner, margin, invariance_passed, ... }
POST /v1/probe { domain, suite, n_per_probe } → BiasCard JSON
GET /v1/health
GET /v1/judges
GET / → minimal HTML frontend

LLM-as-judge is the de facto standard for ranking model outputs, RLHF reward modeling, and offline regression testing. It is also unreliable: judges reward length, formatting, hedged-confident phrasing, citations (real or fake), and the appearance of reasoning, independently of correctness.
pytruth extends standard surface-perturbation detection along three axes:
- Distributional scoring: every `ScoreResult` carries a 10-bin probability distribution, not just a float. Sub-integer effects that a regex-parsed integer score would drop remain visible.
- The Depth Stack, five layers of bias below the surface:
  - L2 epistemic: judge as plausibility prior; novel-but-correct vs canonical-but-vague
  - L3 causal: direct vs mediated effects (length is both); halo audit; trigger-phrase token search
  - L4 adversarial: Goodhart-style exploit generator co-trained against the judge; mines bias families nobody enumerated
  - L5 mechanistic: multi-layer linear probes; deflection check (does bias re-emerge in earlier layers post-debias?)
  - L6 structural: test-retest σ, Condorcet ranking cycles, score↔justification alignment, calibration ECE
- Closing the loop: the same orbit and probe data become the training signal for a small open-weights judge with an invariance regularizer. Released judges ship with a quantitative Bias Card.
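
To make the first axis concrete, here is a small sketch of why a score distribution can reveal what a single parsed integer hides. It assumes the 10 bins correspond to scores 1 through 10, which the text above does not specify. `expected_score` and `argmax_score` are hypothetical helpers, not pytruth API, and the two distributions are made-up illustrative numbers rather than measurements.

```python
def expected_score(dist):
    """Mean of a discrete score distribution; bin i is assumed to mean score i + 1."""
    total = sum(dist)
    return sum((i + 1) * p for i, p in enumerate(dist)) / total


def argmax_score(dist):
    """The single most likely score, roughly what parsing one integer from text yields."""
    return max(range(len(dist)), key=lambda i: dist[i]) + 1


# Illustrative (made-up) distributions for one response and a padded rewrite of it.
original = [0.0, 0.0, 0.0, 0.0, 0.05, 0.15, 0.50, 0.25, 0.05, 0.0]
padded = [0.0, 0.0, 0.0, 0.0, 0.00, 0.05, 0.45, 0.40, 0.10, 0.0]

shift = expected_score(padded) - expected_score(original)
print(argmax_score(original), argmax_score(padded))  # same modal score for both
print(round(shift, 2))  # the expected score still moves
```

In this toy case both versions would be reported as the same integer, while the expected score shifts by a fraction of a point; that fractional shift is the kind of effect distributional scoring is meant to keep visible.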
See DESIGN.md for the full architecture.
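
Two of the L6 structural checks named above, Condorcet ranking cycles and calibration ECE, are standard enough to sketch in plain Python. The functions below are illustrative only and are not taken from pytruth: the names `find_condorcet_cycles` and `expected_calibration_error`, the input shapes, and the 10-bin default are all assumptions.

```python
from itertools import combinations


def find_condorcet_cycles(items, beats):
    """Return triples (a, b, c) where a beats b, b beats c, and c beats a.

    `beats` is a set of (winner, loser) pairs from pairwise judge comparisons.
    """
    cycles = []
    for a, b, c in combinations(items, 3):
        if (a, b) in beats and (b, c) in beats and (c, a) in beats:
            cycles.append((a, b, c))
        elif (a, c) in beats and (c, b) in beats and (b, a) in beats:
            cycles.append((a, c, b))
    return cycles


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# A judge that prefers A over B, B over C, and C over A has no consistent ranking.
cycles = find_condorcet_cycles(["A", "B", "C"], {("A", "B"), ("B", "C"), ("C", "A")})
print(cycles)

# A judge that reports 0.9 confidence but is right half the time is overconfident.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(round(ece, 2))
```

A non-empty cycle list means no total ordering of the candidates is consistent with the judge's pairwise verdicts, and a large ECE means the judge's stated confidence is a poor guide to how often it is right.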
pytruth/
├── pyproject.toml
├── README.md
├── DESIGN.md # full architecture
├── pytruth/
│ ├── __init__.py # top-level API (Judge, probe, compare, serve, ...)
│ ├── api.py # high-level pipelines (probe, compare, run_full_pipeline)
│ ├── cli.py # `pytruth probe|compare|serve|train`
│ ├── judges/ # OpenAI, Anthropic, HF, Ensemble, Mock
│ ├── orbits/ # deterministic, paraphrase, math, code,
│ │ # summarization, support, creative, compositional
│ ├── probes/ # position, sycophancy, authority,
│ │ # reasoning_theater, confidence, identity, self_preference
│ ├── oracles/ # math (sympy), code (sandboxed exec), nli, ensemble, ...
│ ├── stats/ # bootstrap, mixed-effects, BiasCard
│ ├── depth/ # L2 epistemic, L3 causal, L4 adversarial, L6 consistency
│ ├── interp/ # L5: linear probes, multi-layer stack, deflection check
│ ├── train/ # OrbitDataset, depth-stack losses, JudgeTrainer
│ └── serve/ # FastAPI + Ollama-style frontend
├── examples/ # probe_judge.py, compare_judges.py, serve_judge.py, run_e2e_demo.py
├── tests/pytruth/ # smoke + unit tests
└── docs/
├── getting_started.md
├── building_a_domain.md
├── bias_card_spec.md
└── api_reference.md
Apache-2.0