Skip to content

Latest commit

 

History

History
463 lines (379 loc) · 25.5 KB

File metadata and controls

463 lines (379 loc) · 25.5 KB

Director-AI -- Competitor Benchmark Comparison

Last updated: 2026-03-14 (v3.8.0) — full competitive landscape, frontier LLM eval, L40S latency, 23-model NLI survey

One-Pager Summary

Feature Director-AI NeMo Guardrails Lynx GuardrailsAI SelfCheckGPT
Approach NLI + RAG + hybrid LLM judge LLM self-consistency Fine-tuned LLM LLM-as-judge Multi-call LLM
Model size 0.4B (DeBERTa) + optional LLM LLM-dependent 8-70B LLM-dependent LLM-dependent
Latency (NLI) 0.9 ms/pair (Ada GPU) 50-300 ms + LLM 1-10 s 2.26 s 5-10 s
Latency (hybrid) 2.3 s (GPT-4o-mini judge)
E2E catch rate 90.7% (hybrid), 46.7% (NLI) N/A N/A N/A N/A
Streaming halt Yes (token-level) No No No No
Offline/local Yes (NLI mode) No (needs LLM) Yes (GPU) No (needs LLM) No (needs LLM)
False-halt rate 0.0% (20 passages) N/A N/A N/A N/A
AggreFact bal. acc 75.8% (0.4B) N/A N/A N/A N/A
Integrations LC/LI/LG/HS/CrewAI LangChain Python LC/LI Python
License AGPL v3 Apache 2.0 Apache 2.0 Apache 2.0 MIT

Measured Latency (benchmarks/latency_bench.py)

Hardware: NVIDIA GeForce GTX 1060 6GB, Python 3.12, torch 2.6.0+cu124. Iterations=30, warmup=5.

Pipeline Median P95 Per-pair Notes
Lightweight (no NLI) 0.15 ms 0.44 ms 0.15 ms Heuristic only
Streaming session 0.02 ms 0.02 ms 0.02 ms Token-level
PyTorch GPU seq (16 pairs) 3145 ms 3580 ms 196.6 ms Sequential score()
PyTorch GPU batch (16 pairs) 304 ms 353 ms 19.0 ms 10.4x vs sequential
PyTorch chunked-seq 250 ms 335 ms 12-sentence doc
PyTorch chunked-batch 195 ms 280 ms 12-sentence doc
ONNX GPU seq (16 pairs) 1042 ms 1249 ms 65.1 ms CUDAExecutionProvider
ONNX GPU batch (16 pairs) 233 ms 250 ms 14.6 ms Fastest
ONNX CPU seq (16 pairs) 6553 ms 8512 ms 410 ms CPUExecutionProvider
ONNX CPU batch (16 pairs) 6124 ms 8143 ms 383 ms CPUExecutionProvider

ONNX GPU batch is the fastest path: 14.6 ms/pair (1.3x faster than PyTorch GPU batch). Batching gives 10.4x speedup (PyTorch) and 4.5x (ONNX GPU). ONNX GPU sequential (65 ms/pair) is 3x faster than PyTorch GPU sequential (197 ms/pair).

Cross-GPU Latency (benchmarks/gpu_bench.py)

16-pair batch, 50 iterations, warmup=10. Per-pair median latency (ms).

GPU VRAM Compute ONNX CUDA PyTorch FP16 PyTorch FP32
L40S 45 GB 8.9 0.5 ms (b32) 1.7 ms (b32)
RTX 6000 Ada 48 GB 8.9 0.9 ms 1.2 ms 2.1 ms
RTX A5000 24 GB 8.6 2.0 ms 3.4 ms 4.8 ms
RTX A6000 48 GB 8.6 3.5 ms 9.7 ms 10.1 ms
Quadro RTX 5000 16 GB 7.5 5.1 ms 2.5 ms 5.9 ms
GTX 1060 6GB 6 GB 6.1 13.9 ms N/A 17.4 ms

L40S Detailed Latency (benchmarks/results/gpu_bench_nvidia_l40s.json)

Backend Batch Per-pair Total VRAM
FP16 32 0.5 ms 16.6 ms 1106 MB
FP16 16 0.6 ms 9.1 ms 975 MB
FP16 8 1.1 ms 9.1 ms 908 MB
FP16 1 9.1 ms 9.1 ms 848 MB
FP32 32 1.7 ms 54.3 ms 2048 MB
FP32 16 1.9 ms 29.7 ms 1862 MB

L40S FP16 batch=32 achieves sub-millisecond latency (0.5 ms/pair). ONNX TensorRT failed on L40S due to repo path parsing; FP16/FP32 PyTorch results are authoritative. Full JSON results in benchmarks/results/gpu_bench_*.json.

Apples-to-Apples: LLM-AggreFact Leaderboard

All models evaluated on the same benchmark (29,320 samples, 11 datasets). Metric: macro-averaged balanced accuracy. Sources: LLM-AggreFact leaderboard, FactCG (arXiv 2501.17144), MiniCheck (arXiv 2404.10774), Granite Guardian 3.3 (ibm-granite), Paladin-mini (arXiv 2506.20384), AlignScore (arXiv 2305.16739).

# System BA Params Streaming Latency License
1 Bespoke-MiniCheck-7B 77.4% 7B No ~100 ms (vLLM) CC BY-NC 4.0
2 Claude-3.5 Sonnet (leaderboard) 77.2% ~200B No API Proprietary
3 FactCG-DeBERTa-L (NAACL 2025 paper) 77.2% 0.4B No MIT
4 FactCG-FT5 76.7% 0.8B No MIT
5 Granite Guardian 3.3 (IBM) 76.5% 8B No Apache 2.0
6 Mistral-Large 2 76.5% 123B No API Proprietary
7 GPT-4o (leaderboard) 75.9% ~200B No API Proprietary
8 Director-AI (FactCG) 75.86% 0.4B Yes 0.5 ms (L40S FP16) AGPL v3
9 Qwen2.5-72B 75.6% 72B No Proprietary
10 FactCG-RBT (RoBERTa) 75.4% 0.4B No MIT
11 MiniCheck-Flan-T5-L 75.0% 0.8B No ~120 ms MIT
12 Llama-3.3-70B 74.5% 70B No Meta
13 MiniCheck-RoBERTa-L 74.4% 0.4B No ~120 ms MIT
14 MiniCheck-DeBERTa-L 74.1% 0.4B No ~120 ms MIT
15 Paladin-mini (Microsoft) 73.1% 3.8B No Phi-4 license
16 AlignScore 72.5-73.4% 0.355B No MIT
17 HHEM-2.1-Open (Vectara) ~71.8% 0.25B No ~200 ms (est.) Apache 2.0
18 QwQ-32B-Preview 71.8% 32B No Proprietary
19 SummaC-Conv 69.8% 0.35B No MIT

Director-AI wraps the same FactCG-DeBERTa-L model that scores 77.2% in the NAACL 2025 paper. Our eval yields 75.86% — a 1.4pp gap likely from threshold tuning methodology and data split version. Closing this gap puts Director-AI at #3 overall.

Frontier LLM Evaluation (measured by us, 1K samples each)

We evaluated frontier LLMs on the same AggreFact test set using benchmarks/frontier_llm_eval.py in three modes: binary (yes/no), confidence (0-100 score with threshold sweep), and fewshot (3 labeled examples + confidence).

# Model Params Confidence BA Fewshot BA Cost/1K calls
Director-AI 0.4B 75.86% $0
1 Claude Haiku 4.5 ~20B 75.10% (-0.76pp) $0.37
2 Claude Sonnet 4.6 ~200B 74.25% (-1.61pp) 73.30% (-2.56pp) $1.40
3 GPT-4o ~200B 73.46% (-2.40pp) 71.69% (-4.17pp) $1.16
4 GPT-4o-mini ~8B 71.66% (-4.20pp) $0.07

Director-AI beats ALL tested frontier LLMs on AggreFact — at $0 per call and 0.5ms latency vs seconds of API latency. Fewshot mode performed worse than confidence mode for both GPT-4o (71.69% vs 73.46%) and Claude Sonnet 4.6 (73.30% vs 74.25%), confirming few-shot examples hurt on this task.

Per-Class Metrics (Hallucination Detection)

The key question for a guardrail: how many hallucinations does it catch?

Balanced accuracy averages recall across both classes (supported + not-supported). Per-class precision/recall/F1 are computed by benchmarks/aggrefact_eval.py for each dataset. Re-run with --sweep to regenerate:

python -m benchmarks.aggrefact_eval --sweep

The results JSON will include hallucination_precision, hallucination_recall, and hallucination_f1 per dataset. These are the class-0 (not-supported) metrics — the numbers that matter for a guardrail.

NLI Fine-Tuning Survey: 21 Models on AggreFact (Complete, 2026-03-13)

Full dataset: 29,320 samples, 11 sub-datasets, macro-averaged balanced accuracy. Base: yaxili96/FactCG-DeBERTa-v3-Large at 75.86% (t=0.45). Each row: fine-tuned from base on the named dataset (LR=2e-5, 3–20 epochs), then benchmarked on AggreFact.

Finding: 22/23 NLI fine-tunes hurt performance. Only CommitmentBank (+0.54pp) helps.

Model BA Threshold Delta Pattern
base (FactCG-DeBERTa-v3-Large) 75.86% 0.45 Production model
factcg-cb (CommitmentBank) 76.40% 0.90 +0.54% Complex inference, diverse
factcg-cb-lowlr (CB, LR=5e-6) 72.33% 0.50 -3.53% Even conservative LR hurts
factcg-rte 73.28% 0.15 -2.58% Entailment pairs, closest to cb
factcg-vitaminc 70.29% 0.85 -5.57% Contrastive fact-check
factcg-legal 69.52% 0.35 -6.34% Domain-specific NLI
factcg-qnli 67.87% 0.50 -7.99% Question NLI
factcg-multinli 66.30% 0.95 -9.56% General entailment
factcg-multirc 66.09% 0.95 -9.77% Reading comprehension
factcg-anli 63.25% 0.95 -12.61% Adversarial NLI
factcg-nca-synthetic (50K, LR=5e-6) 62.78% 0.50 -13.08% Synthetic NLI, neg acc 30.2%
factcg-snli 62.16% 0.95 -13.70% Image caption entailment
factcg-boolq 61.67% 0.95 -14.19% Yes/no QA
factcg-wic 61.59% 0.95 -14.27% Word-in-context
factcg-docnli (DocNLI 100K, 3ep) 61.37% 0.40 -14.49% Document-level NLI — worst task match
factcg-wanli 61.27% 0.95 -14.59% Wiki NLI
factcg-fever 54.57% 0.85 -21.29% Claim manipulation
factcg-healthver 54.27% 0.95 -21.59% Health NLI
factcg-record 52.44% 0.95 -23.42% Reading comprehension QA
factcg-paws 52.35% 0.05 -23.51% Paraphrase adversaries
factcg-qqp 51.90% 0.05 -23.96% Duplicate questions
factcg-mrpc 50.37% 0.05 -25.49% Paraphrase detection
factcg-dialogue-nli 50.33% 0.95 -25.53% Dialogue implicature

Root cause: Task mismatch + catastrophic forgetting regardless of learning rate or data source. DocNLI is the most directly relevant dataset (900K document-level premise-hypothesis pairs from summarization and QA sources) yet produces -14.49pp — confirming the problem is fine-tuning dynamics, not data choice. CB-lowLR (LR=5e-6, 20 epochs) yields -3.53pp: even 4x lower LR still degrades the model, with neg acc dropping from 59.3% to 52.8%. NCA-synthetic (50K synthetic doc/claim/label triples at LR=5e-6) yields -13.08pp with neg acc collapsing to 30.2% — synthetic data overwhelms the base model's calibration entirely. Threshold shifts to 0.85–0.95 indicate models output extreme probabilities, losing calibration. CommitmentBank is the lone exception: 250 examples, complex multi-sentence inference with subtle linguistic commitment, too small to trigger catastrophic forgetting.

Best ensemble: max(base, factcg-cb) at 76.37% (+0.51pp) — marginal, not production-worthy.

Internal Model Comparison (LLM-AggreFact)

Model Bal. Acc Threshold Notes
FactCG-DeBERTa-v3-Large 75.8% 0.46 Production model
MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli 66.2% 0.53 3-class NLI baseline
Fine-tuned DeBERTa-v3-large-hallucination 64.7% 0.90 Fine-tuning regressed
Fine-tuned DeBERTa-v3-base-hallucination 59.0% 0.88 Fine-tuning regressed worse

Per-Dataset Weakness Map (FactCG, threshold=0.46)

Dataset Bal. Acc Pos Neg Failure Mode
ExpertQA 59.1% 2971 731 Long expert answers, low neg recall
AggreFact-CNN 68.8% 501 57 Extreme class imbalance (9:1)
TofuEval-MediaS 71.9% 554 172 Summarization (media)
FactCheck-GPT 73.0% 376 1190 GPT-generated claims
AggreFact-XSum 74.3% 285 273 Extreme summarization
TofuEval-MeetB 74.3% 622 150 Summarization (meetings)
Wice 76.9% 111 247 Wikipedia claims
ClaimVerify 78.1% 789 299 Claim verification
RAGTruth 82.2% 15102 1269 RAG grounding
Lfqa 86.4% 1121 790 Long-form QA
Reveal 89.1% 400 1310 Fact checking

Different Benchmarks (Not Directly Comparable)

These systems publish results on benchmarks other than LLM-AggreFact. Scores cannot be compared directly to Director-AI's 75.86% BA.

System Benchmark Score Params Approach License
ORION (Deepchecks) RAGTruth F1 83.0% encoder Encoder model Proprietary
LettuceDetect-large RAGTruth F1 79.2% 396M Fine-tuned ModernBERT MIT
Lynx-70B (Patronus) HaluBench 87.4% 70B Fine-tuned LLM, 8x H100 Apache 2.0
Lynx-8B (Patronus) HaluBench 82.9% 8B Fine-tuned LLM Apache 2.0
Galileo Luna RAGTruth F1 65.4% 440M Encoder model Proprietary
SelfCheckGPT-NLI WikiBio AUC-PR 92.5% LLM wrapper Multiple LLM calls MIT
NeMo Guardrails Internal eval 70-95% LLM-dependent LLM self-consistency Apache 2.0
GuardrailsAI SQuAD 2.0 F1 98% LLM-dependent LLM-as-judge Apache 2.0
RAGAS Faithfulness Multi-dataset 76.2% avg P LLM wrapper Claim decomposition Apache 2.0
Cleanlab TLM Multi-RAG highest P/R LLM wrapper LLM wrapper Proprietary
Llama Guard 3 Safety moderation 93.9% F1 8B Not hallucination Meta

Commercial Platforms (No Public AggreFact Scores)

These platforms offer hallucination/guardrail features but publish no LLM-AggreFact scores, making quantitative comparison impossible. Position against qualitatively only.

Platform Approach Pricing Notes
Galileo Encoder + LLM SaaS Luna model (65.4% RAGTruth F1)
Cleanlab TLM (LLM wrapper) SaaS Claims highest precision/recall on multi-RAG
Guardrails AI LLM-as-judge Open source + cloud SQuAD 2.0 F1 98% (different task)
NeMo Guardrails (NVIDIA) LLM self-consistency Open source Performance depends on underlying LLM
Patronus AI Lynx (fine-tuned LLM) SaaS + open weights 8B/70B models, HaluBench only
Fiddler ML monitoring SaaS Drift/monitoring, not direct detection
Braintrust Eval framework SaaS Framework, not a model
RAGAS Claim decomposition Open source Needs LLM API, 3-8s latency
DeepEval Eval framework Open source + cloud Framework, not a model
TruLens Eval framework Open source Framework, not a model
Arize Phoenix Observability Open source + SaaS Tracing/monitoring, not detection
Opik (Comet) Eval framework Open source + SaaS Framework, not a model
Deepchecks ORION encoder SaaS 83% RAGTruth F1, no AggreFact score

End-to-End Guardrail Results (benchmarks/e2e_eval.py)

Heuristic+NLI Mode (300 traces, GTX 1060)

Full pipeline (CoherenceAgent + GroundTruthStore + SafetyKernel), 300 traces across QA, summarization, and dialogue tasks. Threshold=0.35, soft_limit=0.45.

Task N TP FP TN FN Catch Rate Precision F1
QA 100 18 4 46 32 36.0% 81.8% 50.7%
Summarization 100 12 6 44 38 24.0% 66.7% 35.3%
Dialogue 100 40 43 7 10 80.0% 48.2% 60.2%
Overall 300 70 53 97 80 46.7% 56.9% 51.3%

Evidence coverage: 100% (every rejection includes supporting chunks). Avg latency: 15.8 ms (p95: 40 ms).

Hybrid Mode — NLI + LLM Judge (600 traces, L40S)

Hybrid mode adds an LLM judge fallback when NLI confidence is in the uncertain zone. Two judges tested: Claude Sonnet 4 and GPT-4o-mini.

Judge Task N Catch FPR Precision F1 Avg Latency
Claude Sonnet 4 QA 200 78.0% 4.0% 95.1% 85.7% 10.1 s
Claude Sonnet 4 Summarization 200 95.0% 93.0% 50.5% 66.0% 26.3 s
Claude Sonnet 4 Dialogue 200 99.0% 95.0% 51.0% 67.4% 6.2 s
Claude Sonnet 4 Overall 600 90.7% 64.0% 58.6% 71.2% 14.2 s
GPT-4o-mini QA 200 77.0% 3.0% 96.2% 85.6% 1.3 s
GPT-4o-mini Summarization 200 95.0% 93.0% 50.5% 66.0% 4.3 s
GPT-4o-mini Dialogue 200 99.0% 95.0% 51.0% 67.4% 1.3 s
GPT-4o-mini Overall 600 90.3% 63.7% 58.7% 71.1% 2.3 s

Hybrid mode improves catch rate from 46.7% → 90.7% (+94% relative). QA task achieves production-grade precision (95-96%) at 3-4% FPR. GPT-4o-mini matches Claude at 6x lower latency and 13x lower cost.

RAGTruth (2,700 samples, NLI-only, L40S)

Source: wandb/RAGTruth-processed. Detect hallucinations in LLM-generated summaries and responses.

Metric Value
Catch rate 49.3% (465/943)
False positive rate 40.9%
Precision 39.3%
F1 43.7%
Avg latency 2,650 ms/sample

FreshQA (600 samples, NLI-only, L40S)

Source: FreshQA Nov 2025 (Google Sheets). Detect false-premise questions.

Metric Value
Catch rate 98.6% (146/148)
False positive rate 97.8%
Precision 24.8%
F1 39.7%
Avg latency 1,119 ms/sample

FreshQA's high FPR is expected: without ground-truth context, the NLI model cannot verify consistency and defaults to flagging.

Where Director-AI Wins

  1. Only streaming guardrail — token-level halt. Zero competitors offer this.
  2. 0.5 ms/pair on L40S FP16 — sub-millisecond latency, faster than any competitor.
  3. Beats all frontier LLMs — 75.86% BA > Claude Haiku (75.10%), Sonnet (74.25%), GPT-4o (73.46%).
  4. $0 per-call cost — vs $0.07-$1.40/1K for API-based competitors.
  5. 0.4B params — runs on consumer hardware (GTX 1060: 14.6 ms/pair).
  6. Offline capable — no API dependency in NLI mode.
  7. 90.7% E2E catch rate (hybrid) — NLI + LLM judge catches 9/10 hallucinations.
  8. 95-96% QA precision at 3-4% FPR — production-grade on QA tasks in hybrid mode.
  9. Ecosystem integration — LangChain, LlamaIndex, LangGraph, Haystack, CrewAI.

ExpertQA 59% — Why It Doesn't Matter for Guardrails

ExpertQA scores 59.1% balanced accuracy — the lowest of all 11 AggreFact datasets. This warrants explanation, not apology.

What ExpertQA measures: Expert-written long-form answers (avg ~300 words) verified against reference source documents. The task is: "does this expert answer faithfully represent the source?" — a document-level claim verification task requiring deep domain understanding of nuanced, multi-paragraph text.

Why 59% is structurally expected at 0.4B parameters:

  1. 4:1 class imbalance — 2,971 supported vs 731 not-supported. The model achieves high recall on the majority class (supported) but struggles on the minority class. Balanced accuracy penalises this harshly.

  2. Long expert text defeats token-window NLI — ExpertQA "documents" average 300+ words. At 512 tokens max, the NLI model sees truncated context. The unsupported claims often hinge on details beyond the truncation boundary.

  3. Subtle contradictions require domain expertise — ExpertQA spans medicine, law, history, science. A 0.4B-parameter model trained on general NLI cannot detect that "aspirin is safe for children" contradicts a paediatric guideline unless it has domain-specific training data.

  4. All compact NLI models score similarly — MiniCheck-DeBERTa-L (0.4B) and HHEM-2.1 (0.4B) face the same limitation. Only 7B+ models (Bespoke-MiniCheck, Claude-3.5) improve significantly on this dataset.

Why this doesn't affect Director-AI's guardrail value:

Scenario ExpertQA Relevance Director-AI Designed For
LLM generates factually wrong answer No — ExpertQA tests expert answers, not LLM outputs Yes — primary use case
Customer support bot hallucinates policy No — short QA, not long expert text Yes — QA catch rate 78% (hybrid)
RAG pipeline returns grounded response No — ExpertQA has no retrieval Yes — RAGTruth 82.2%
Streaming generation goes off-rails No — ExpertQA is post-hoc Yes — token-level halt

Director-AI's hybrid mode achieves 90.7% catch rate across QA, summarisation, and dialogue — the tasks customers actually deploy guardrails for. ExpertQA measures a different capability (expert answer verification) where all models at this parameter count underperform.

Mitigation: For users needing expert-text verification, the hybrid mode (NLI + LLM judge) handles long documents well. The LLM judge sees the full text and catches the nuanced contradictions that 0.4B NLI misses.

Where Director-AI Loses

  1. Summarization NLI accuracy weakest — AggreFact-CNN 68.8%, ExpertQA 59.1%. FPR at 2.0% (v3.6.0, Layer C claim coverage), down from 95%. See ExpertQA analysis above.
  2. ONNX CPU not competitive — 383 ms/pair without CUDAExecutionProvider.
  3. Fine-tuned models regress — fine-tuned DeBERTa-v3-large scored 64.7%, below baseline.
  4. Hybrid mode requires LLM API — NLI-only mode is fully local, but hybrid needs OpenAI/Anthropic.

Path Forward

  1. All task types below 5% FPR — QA 3-4%, summarization 2.0%, dialogue 4.5%.
  2. TensorRT — sub-0.5ms/pair target via TensorRT optimization.
  3. Layer C complete — claim decomposition + coverage scoring reduced summarization FPR from 10.5% → 2.0%.

Full Benchmark Suite

Scripts in benchmarks/. Run each with python -m benchmarks.<name>.

Script Dataset What it Tests Metric Status
aggrefact_eval LLM-AggreFact (29K) Factual consistency (11 datasets) Balanced accuracy 75.8% (FactCG)
e2e_eval HaluEval (300-600) Full pipeline: Agent + KB + Kernel Catch rate, precision, F1 90.7% catch (hybrid)
e2e_eval --hybrid HaluEval (600) Hybrid NLI + LLM judge Catch, FPR, F1 90.7% / 71.2% F1
run_ragtruth_freshqa RAGTruth (2,700) NLI hallucination detection Catch, precision, F1 49.3% catch (NLI-only)
run_ragtruth_freshqa FreshQA (600) False-premise detection Catch rate 98.6% catch
latency_bench N/A Inference latency across backends Median/P95 ms 0.9 ms (Ada GPU)
gpu_bench N/A Cross-GPU latency comparison Per-pair ms 6 GPUs benchmarked
retrieval_bench Synthetic (50 facts) RAG retrieval quality (Hit@k, P@k) Hit@1, Hit@3, P@3 40% / 63% (inmemory)
anli_eval ANLI R1/R2/R3 Adversarial NLI robustness Accuracy, F1 per class Requires GPU + HF_TOKEN
fever_eval FEVER dev Fact verification Accuracy, F1 per class Requires GPU + HF_TOKEN
halueval_eval HaluEval Hallucination detection (QA/sum/dial) Precision, Recall, F1 Requires GPU + HF_TOKEN
mnli_eval MNLI matched+mismatched General NLI regression Accuracy, F1 per class Requires GPU + HF_TOKEN
paws_eval PAWS Paraphrase adversaries Binary P/R/F1 Requires GPU + HF_TOKEN
truthfulqa_eval TruthfulQA (817 Qs) Multiple-choice truthfulness Accuracy per category Requires GPU + HF_TOKEN
vitaminc_eval VitaminC Contrastive fact verification Accuracy, F1 per class Requires GPU + HF_TOKEN
falsepositive_eval SQuAD/NQ/TriviaQA False-positive rate on correct QA FP rate (target <5%) Requires GPU + HF_TOKEN
streaming_false_halt_bench Synthetic good text False-halt rate of StreamingKernel False-halt % 0.0% (20 passages, heuristic)
medical_eval MedNLI + PubMedQA Medical domain guardrail Catch, FPR, F1 Requires GPU + HF_TOKEN
legal_eval ContractNLI + CUAD Legal domain guardrail Catch, FPR, F1 Requires GPU + HF_TOKEN
finance_eval FinanceBench + PhraseBank Finance domain guardrail Catch, FPR, F1 Requires GPU + HF_TOKEN

To reproduce all results:

export HF_TOKEN=hf_...
python -m benchmarks.aggrefact_eval --sweep
python -m benchmarks.anli_eval
python -m benchmarks.fever_eval
python -m benchmarks.halueval_eval
python -m benchmarks.mnli_eval
python -m benchmarks.paws_eval
python -m benchmarks.truthfulqa_eval
python -m benchmarks.vitaminc_eval
python -m benchmarks.falsepositive_eval
python -m benchmarks.retrieval_bench --backend sentence-transformer
python -m benchmarks.streaming_false_halt_bench
python -m benchmarks.medical_eval --nli
python -m benchmarks.legal_eval --nli
python -m benchmarks.finance_eval --nli

Methodology

  • Balanced accuracy: macro-averaged recall across supported/not-supported classes. Standard metric for the LLM-AggreFact benchmark (Tang et al., 2024).
  • Latency: median of 30 iterations after 5 warmup runs, single batch of 16 premise-hypothesis pairs. GPU clock not locked; reported on idle systems.
  • E2E eval: synthetic traces with ground-truth labels. TP/FP/TN/FN computed against agent halted flag at the stated threshold.
  • False-halt rate: 20 known-good Wikipedia passages streamed through StreamingKernel; a halt on any passage counts as a false halt.
  • Competitor latency: values marked "~" or "(est.)" are from published papers or documentation, not our own measurements.

Sources