Measure hallucination rates in production systems #10775

terrywerk · 2026-03-08T16:48:23Z

We've been experimenting with stress-testing LLM systems
for hallucinations and prompt injection.

Curious how people here measure hallucination rates
in production systems?

Thanks!
Terry

bogdankostic · 2026-03-09T15:15:56Z

Maintainer

Hi @terrywerk! I've converted this issue into a Discussion, as it's a great topic for community input.

If you are looking at how to measure hallucination rates, we have several resources that talk about this:

- FaithfulnessEvaluator component
- Cookbook: Calculating a Hallucination Score with the OpenAIChatGenerator (based on the paper "LLMs are Bayesian, In Expectation, Not in Realization")
- Blog posts:
  - Measuring LLM Groundedness in RAG Systems with Evaluation Metrics
  - Detecting LLM Hallucinations in Haystack

Let me know if you have questions about any of these resources!

aniruddhaadak80 · 2026-03-09T22:53:37Z

From my point of view, one global hallucination rate is usually less useful than a small set of failure classes. Unsupported claims, retrieval misses, bad transformations, and missing abstentions all look similar from far away but need very different fixes.

What tends to work in production is a human-labeled evaluation slice for calibration plus online sampling that measures grounding, retrieval coverage, and abstention behavior separately.
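
As a rough illustration of reporting per-class rates instead of one global number, here is a minimal sketch. It assumes each sampled response has already been labeled (by a human or a judge model) with zero or more failure classes. The class names, the `per_class_rates` helper, and the choice of a Wilson interval are my own assumptions for the example, not something from a specific library.

```python
from collections import Counter
from math import sqrt

# Hypothetical class names, mirroring the failure classes listed above.
FAILURE_CLASSES = (
    "unsupported_claim",
    "retrieval_miss",
    "bad_transformation",
    "missing_abstention",
)


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def per_class_rates(samples: list[set[str]]) -> dict[str, dict]:
    """Compute a rate and interval per failure class.

    Each element of `samples` is the set of failure classes assigned to one
    sampled response; an empty set means the response was judged fine.
    """
    n = len(samples)
    counts: Counter[str] = Counter()
    for failures in samples:
        unknown = set(failures) - set(FAILURE_CLASSES)
        if unknown:
            raise ValueError(f"unknown failure classes: {sorted(unknown)}")
        counts.update(set(failures))

    report = {}
    for name in FAILURE_CLASSES:
        k = counts[name]
        report[name] = {
            "count": k,
            "rate": k / n if n else 0.0,
            "ci95": wilson_interval(k, n),
        }
    return report


if __name__ == "__main__":
    labeled = [{"unsupported_claim"}, set(), {"retrieval_miss", "unsupported_claim"}, set()]
    for name, stats in per_class_rates(labeled).items():
        print(name, stats)
```

The interval is there because online samples are usually small, so a per-class rate without an uncertainty range can be misleading when comparing releases.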

0 replies

Nyrok · 2026-03-10T14:15:29Z

Measuring hallucination is the detection layer. The prevention layer is about prompt structure.

A lot of hallucinations come from the model inferring what it's supposed to do rather than being told explicitly. Vague prompts leave room for confabulation. When role, context, constraints, and output format are blended into flat prose, the model has fuzzy task boundaries and fills gaps with guesses.

Named semantic blocks tighten this. Explicit sections for each part of the instruction give the model clearer scope for what counts as a valid response. I've been building flompt for exactly this, a visual prompt builder that decomposes prompts into 12 semantic blocks and compiles to Claude-optimized XML. Pairs naturally with FaithfulnessEvaluator: structured input on the prevention side, measurement on the detection side. Open-source: github.com/Nyrok/flompt
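
To make the idea of named blocks concrete, here is a minimal sketch of compiling a few labeled sections into XML-style tags. This is not flompt's actual block set, API, or output format; the four block names come from the paragraph above, and `compile_prompt` is a hypothetical helper written only for this illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical block names and ordering, assumed for this example only.
BLOCK_ORDER = ("role", "context", "constraints", "output_format")


def compile_prompt(blocks: dict[str, str]) -> str:
    """Render named prompt blocks as XML-style sections in a fixed order.

    Empty or missing blocks are skipped. Block text is escaped so that
    stray '<' or '&' characters cannot be mistaken for section tags.
    """
    unknown = set(blocks) - set(BLOCK_ORDER)
    if unknown:
        raise ValueError(f"unknown blocks: {sorted(unknown)}")

    parts = []
    for name in BLOCK_ORDER:
        text = blocks.get(name, "").strip()
        if text:
            parts.append(f"<{name}>\n{escape(text)}\n</{name}>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        compile_prompt(
            {
                "role": "You are a support assistant.",
                "context": "Answer only from the provided documents.",
                "constraints": "If the documents do not contain the answer, say so.",
                "output_format": "One short paragraph.",
            }
        )
    )
```

Escaping is a design choice here: it keeps section boundaries unambiguous, at the cost of changing how literal angle brackets in the block text appear to the model.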

A star on github.com/Nyrok/flompt is the best way to support the project. It's solo open-source, so every star helps.
