Hi DeepEval team,
DeepEval already does a great job providing metrics and tests for LLM / RAG systems. A missing piece for many teams is a structured way to tag why a sample failed.
I maintain the WFGY RAG 16 Problem Map, an open-source failure taxonomy for RAG / LLM pipelines, together with a Global Debug Card and triage prompt.
Repo (MIT):
https://github.com/onestardao/WFGY
Main reference page:
https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md
This map is already integrated or cited in projects such as RAGFlow, LlamaIndex, ToolUniverse (Harvard MIMS Lab), Rankify (Univ. of Innsbruck), Multimodal RAG Survey (QCRI LLM Lab) and curated lists like Awesome LLM Apps.
Proposal:
Add WFGY’s 16-problem map as an optional built-in tag set inside DeepEval, for example:
- Provide a small helper that:
  - Given a failing test case (question, context, answer), calls an LLM with the WFGY Global Debug Card.
  - Returns one of the 16 failure labels as a tag on the sample.
- Add a short “RAG failure modes” doc that explains the taxonomy and demonstrates:
  - How to enable these tags in a test suite.
  - How to aggregate results by failure type (e.g. more “retrieval blind spots” vs “prompt leakage”).
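To make the helper concrete, here is a minimal sketch of what I have in mind. All names (`tag_failure`, `aggregate_tags`, the label strings, the prompt template) are placeholders of my own, not the actual WFGY label set or DeepEval's API; the model is injected as a plain callable so the helper stays model-agnostic:

```python
from collections import Counter
from typing import Callable

# Hypothetical subset of the 16 WFGY failure labels (placeholders only;
# the real list would come from the Problem Map).
WFGY_LABELS = [
    "retrieval_blind_spot",
    "prompt_leakage",
    "semantic_drift",
    # ... remaining labels from the Problem Map would go here
]

# Placeholder stand-in for the WFGY Global Debug Card prompt.
DEBUG_CARD_TEMPLATE = (
    "You are a RAG failure triage assistant.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Classify the failure as exactly one of: {labels}.\n"
    "Reply with the label only."
)


def tag_failure(
    question: str,
    context: str,
    answer: str,
    llm: Callable[[str], str],
) -> str:
    """Ask an LLM to classify a failing sample with one WFGY label.

    `llm` is any callable mapping a prompt string to a completion.
    Off-list replies fall back to "unclassified" rather than guessing.
    """
    prompt = DEBUG_CARD_TEMPLATE.format(
        question=question,
        context=context,
        answer=answer,
        labels=", ".join(WFGY_LABELS),
    )
    raw = llm(prompt).strip().lower()
    return raw if raw in WFGY_LABELS else "unclassified"


def aggregate_tags(tags: list[str]) -> Counter:
    """Count failures per label, for the aggregation view in the doc."""
    return Counter(tags)
```

With a real model plugged in, aggregation over a failing test suite is then one `Counter` away, e.g. `aggregate_tags([tag_failure(...) for case in failures])`. Happy to adapt the shape to whatever pattern DeepEval already uses for LLM-as-judge metrics.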
This would nicely complement DeepEval’s metric-based view with a semantic failure map that is already being used by other RAG frameworks and labs.
If this sounds useful, I can draft the helper code and example usage, following your existing API style.