RyanRana/pytruth

pytruth

Measure and remove surface bias in LLM judges. Detect, certify, debias, ship.

pytruth is a Python library and HTTP service for diagnosing and repairing biases in LLM-as-judge evaluation. It instruments any judge (OpenAI, Anthropic, or open-weights) with:

  1. Surface perturbations — orbits of meaning-preserving rewrites
  2. Cognitive probes — designed experiments for sycophancy, position bias, reasoning theater, authority cues, confidence vs correctness, identity proxies
  3. The Depth Stack — five layers of bias beyond surface: epistemic, causal, adversarial (Goodhart), mechanistic, structural consistency
  4. A trained debiased judge — invariance-regularized LoRA fine-tune of meta-llama/Llama-3.1-8B-Instruct, shipped per-domain
  5. A minimal frontend — terminal-style web UI for scoring, comparing, and probing

pip install pytruth                      # core
pip install "pytruth[openai,anthropic]"  # closed-API judges
pip install "pytruth[hf]"                # open-weights judges
pip install "pytruth[serve]"             # FastAPI + frontend
pip install "pytruth[train]"             # LoRA training stack
pip install "pytruth[all]"               # everything

30-second tour

import asyncio
import pytruth as pt

async def main():
    judge = pt.Judge.from_mock()           # or .from_openai("gpt-4o")

    # Score one (prompt, response)
    result = await judge.score("What is 2+2?", "The answer is 4.")
    print(result.score, result.score_dist, result.confidence)

    # Probe a judge end-to-end (orbits + probes + L6 consistency)
    report = await pt.probe(judge, domain="general", suite="standard")
    print(report.bias_card.to_markdown())

    # Compare multiple judges (three mock judges here; swap in real ones)
    j1, j2, j3 = (pt.Judge.from_mock() for _ in range(3))
    comp = await pt.compare([j1, j2, j3], domain="code", suite="full")
    print(comp.to_dataframe())

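    # The full diagnostic: surface + cognitive + L2 + L3 + L4 + L6.
    # epistemic_pairs, candidates, and halo_items are datasets you supply;
    # they are not defined in this snippet.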
    full = await pt.api.run_full_pipeline(
        judge, domain="general", suite="full",
        epistemic_pairs=epistemic_pairs,
        adversarial_candidates=candidates,
        halo_items=halo_items,
    )

asyncio.run(main())

CLI

pytruth probe   mock --suite standard --output bias.md
pytruth probe   openai:gpt-4o --domain code --output bias.json --fmt json
pytruth compare openai:gpt-4o anthropic:claude-sonnet-4-7 mock --domain code
pytruth serve   mock --host 0.0.0.0 --port 8000
pytruth train   --base meta-llama/Llama-3.1-8B-Instruct --domain code --data prefs.jsonl

HTTP API

pt.serve(judge) boots a FastAPI service with the minimal Ollama-style frontend at /.

POST /v1/score    { prompt, response, rubric? } → { score, score_dist, confidence }
POST /v1/compare  { prompt, a, b, check_invariance } → { winner, margin, invariance_passed, ... }
POST /v1/probe    { domain, suite, n_per_probe } → BiasCard JSON
GET  /v1/health
GET  /v1/judges
GET  /             → minimal HTML frontend
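
A minimal client sketch for the /v1/score endpoint listed above, using only the standard library. The helper names build_score_request and score_remote are hypothetical and not part of pytruth. The field names come from the endpoint listing; the JSON value types and the base URL http://localhost:8000 (chosen to match the --port 8000 in the CLI example) are assumptions.

```python
import json
import urllib.request
from typing import Optional


def build_score_request(base_url: str, prompt: str, response: str,
                        rubric: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request for /v1/score (hypothetical helper, not a pytruth API)."""
    payload = {"prompt": prompt, "response": response}
    if rubric is not None:  # rubric is marked optional in the endpoint listing
        payload["rubric"] = rubric
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score_remote(base_url: str, prompt: str, response: str,
                 rubric: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body.

    The body is assumed to contain score, score_dist, and confidence.
    """
    req = build_score_request(base_url, prompt, response, rubric)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, assuming `pytruth serve mock --port 8000` is running locally:
#   result = score_remote("http://localhost:8000", "What is 2+2?", "The answer is 4.")
#   print(result["score"], result["confidence"])
```

Separating request construction from sending keeps the payload easy to inspect or log without a running server.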

Why pytruth exists

LLM-as-judge is the de facto standard for ranking model outputs, RLHF reward modeling, and offline regression testing. It is also unreliable: judges reward length, formatting, hedged-confident phrasing, citations (real or fake), and the appearance of reasoning — independently of correctness.
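
To make the failure mode concrete, here is a self-contained toy. It is not pytruth's implementation: toy_judge is a deliberately length-biased stand-in, and make_orbit and orbit_spread are hypothetical names. The sketch scores one answer across an orbit of meaning-preserving rewrites; a judge that is invariant to surface form would show zero spread.

```python
from typing import Callable, List


def make_orbit(response: str) -> List[str]:
    """Return rewrites that change surface form but not meaning."""
    return [
        response,
        response + " ",                          # trailing whitespace
        "**" + response + "**",                  # markdown emphasis
        "To answer your question: " + response,  # filler preamble
        response + " I hope this helps!",        # filler closing
    ]


def toy_judge(prompt: str, response: str) -> float:
    """Stand-in judge with a built-in length bias; returns a score in [0, 10]."""
    correctness = 5.0 if "4" in response else 0.0
    length_bonus = min(len(response) / 20.0, 5.0)  # longer answers score higher
    return correctness + length_bonus


def orbit_spread(judge: Callable[[str, str], float],
                 prompt: str, response: str) -> float:
    """Max minus min score over the orbit; 0.0 means surface-invariant."""
    scores = [judge(prompt, r) for r in make_orbit(response)]
    return max(scores) - min(scores)


spread = orbit_spread(toy_judge, "What is 2+2?", "The answer is 4.")
print(f"score spread across the orbit: {spread:.2f}")
```

Every member of the orbit is equally correct, so any nonzero spread is attributable to surface form alone.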

pytruth extends standard surface-perturbation detection along three axes:

  • Distributional scoring — every ScoreResult carries a 10-bin probability distribution, not just a float. Sub-integer effects that a regex parser reading a single integer from the judge's output would drop stay visible.
  • The Depth Stack — five layers of bias beneath the surface:
    • L2 epistemic — judge as plausibility prior; novel-but-correct vs canonical-but-vague
    • L3 causal — direct vs mediated effects (length is both); halo audit; trigger-phrase token search
    • L4 adversarial — Goodhart-style exploit-generator co-trained against the judge; mines bias families nobody enumerated
    • L5 mechanistic — multi-layer linear probes; deflection check (does bias re-emerge in earlier layers post-debias?)
    • L6 structural — test-retest σ, Condorcet ranking cycles, score↔justification alignment, calibration ECE
  • Closing the loop — the same orbit and probe data become the training signal for a small open-weights judge with an invariance regularizer. Released judges ship with a quantitative Bias Card.
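
The distributional-scoring point can be illustrated with plain arithmetic. The sketch below assumes the 10 bins map to integer scores 1 through 10 (the bin layout is not specified here) and uses made-up distributions. It compares the expected score with the single most likely integer, which is roughly what parsing one number out of the judge's text recovers.

```python
from typing import Sequence


def expected_score(dist: Sequence[float], lowest: int = 1) -> float:
    """Mean score under the distribution; bin i is assumed to mean score lowest + i."""
    total = sum(dist)
    if total <= 0:
        raise ValueError("distribution must have positive mass")
    return sum((lowest + i) * p for i, p in enumerate(dist)) / total


def modal_score(dist: Sequence[float], lowest: int = 1) -> int:
    """Most likely integer score (ties resolve to the lowest bin)."""
    return lowest + max(range(len(dist)), key=lambda i: dist[i])


# Illustrative distributions for a plain answer and a padded rewrite of it.
plain = [0.0, 0.0, 0.0, 0.0, 0.0, 0.05, 0.30, 0.50, 0.15, 0.00]
padded = [0.0, 0.0, 0.0, 0.0, 0.0, 0.00, 0.15, 0.50, 0.30, 0.05]

# Both modes are 8, so an integer-only comparison reports no difference,
# while the expected scores differ by about half a point.
shift = expected_score(padded) - expected_score(plain)
print(modal_score(plain), modal_score(padded), round(shift, 2))
```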

See DESIGN.md for the full architecture.
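
Two of the L6 structural checks are simple enough to sketch directly. The code below is an illustration under assumed data shapes, not pytruth's implementation: test-retest σ is taken as the sample standard deviation of repeated scores for one item, and a Condorcet cycle is a triple where the judge prefers A over B, B over C, and C over A. The names retest_sigma, condorcet_cycles, and the prefers mapping are hypothetical.

```python
from itertools import combinations
from statistics import stdev
from typing import Dict, Hashable, List, Sequence, Tuple


def retest_sigma(scores: Sequence[float]) -> float:
    """Sample standard deviation of repeated scores for the same item."""
    if len(scores) < 2:
        raise ValueError("need at least two repeated scores")
    return stdev(scores)


def condorcet_cycles(
    items: Sequence[Hashable],
    prefers: Dict[Tuple[Hashable, Hashable], bool],
) -> List[Tuple[Hashable, Hashable, Hashable]]:
    """Find intransitive triples in pairwise preferences.

    prefers[(x, y)] is True when the judge ranks x above y.
    Missing pairs are treated as "not preferred".
    """
    def beats(x: Hashable, y: Hashable) -> bool:
        return prefers.get((x, y), False)

    cycles = []
    for a, b, c in combinations(items, 3):
        # Each unordered triple has two possible cyclic orientations.
        if beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))
        elif beats(a, c) and beats(c, b) and beats(b, a):
            cycles.append((a, c, b))
    return cycles


cyclic = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
print(condorcet_cycles(["A", "B", "C"], cyclic))
print(retest_sigma([7.0, 7.0, 8.0]))
```

A judge with any cycle cannot produce a consistent ranking of those items, regardless of how accurate its individual pairwise calls are.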

Repository layout

pytruth/
├── pyproject.toml
├── README.md
├── DESIGN.md                     # full architecture
├── pytruth/
│   ├── __init__.py               # top-level API (Judge, probe, compare, serve, ...)
│   ├── api.py                    # high-level pipelines (probe, compare, run_full_pipeline)
│   ├── cli.py                    # `pytruth probe|compare|serve|train`
│   ├── judges/                   # OpenAI, Anthropic, HF, Ensemble, Mock
│   ├── orbits/                   # deterministic, paraphrase, math, code,
│   │                             # summarization, support, creative, compositional
│   ├── probes/                   # position, sycophancy, authority,
│   │                             # reasoning_theater, confidence, identity, self_preference
│   ├── oracles/                  # math (sympy), code (sandboxed exec), nli, ensemble, ...
│   ├── stats/                    # bootstrap, mixed-effects, BiasCard
│   ├── depth/                    # L2 epistemic, L3 causal, L4 adversarial, L6 consistency
│   ├── interp/                   # L5: linear probes, multi-layer stack, deflection check
│   ├── train/                    # OrbitDataset, depth-stack losses, JudgeTrainer
│   └── serve/                    # FastAPI + Ollama-style frontend
├── examples/                     # probe_judge.py, compare_judges.py, serve_judge.py, run_e2e_demo.py
├── tests/pytruth/                # smoke + unit tests
└── docs/
    ├── getting_started.md
    ├── building_a_domain.md
    ├── bias_card_spec.md
    └── api_reference.md

Documentation

License

Apache-2.0
