@@ -69,17 +69,16 @@ <h1 class="logo-type">ERR-EVAL</h1>
         <p class="large-text">DO AI MODELS KNOW WHAT THEY <span class="highlight-risk glitch-hover"
                 data-text="DON'T">DON'T</span> KNOW?</p>
         <p class="sub-text">
-            <strong>ERR-EVAL: Epistemic Reasoning & Reliability Evaluation</strong>
+            Most benchmarks test if AI gets the right answer. <strong>ERR-EVAL</strong> tests if AI knows
+            when it <em>can't</em> get the right answer.
             <br><br>
-            We test whether AI models can recognize when they <em>shouldn't</em> answer confidently.
-            Each model faces 125 adversarial prompts designed to pressure them into hallucinating—
-            incomplete information, hidden ambiguities, false premises, and impossible constraints.
+            Each model faces 125 trick questions designed to pressure them into making things up—
+            missing information, hidden ambiguities, false premises, and impossible constraints.
             <br><br>
-            <strong>The benchmark scores 5 axes:</strong> detecting ambiguity, avoiding hallucinations,
-            localizing what's missing, choosing the right response strategy, and maintaining calibrated
-            confidence.
-            Models that refuse to guess when information is missing score higher than those that confidently
-            make things up.
+            <strong>The benchmark scores 5 things:</strong> spotting problems, avoiding hallucinations,
+            identifying what's missing, asking the right questions, and staying honest about uncertainty.
+            Models that say "I don't know" when they shouldn't guess score higher than those that
+            confidently make things up.
         </p>
     </div>
     <div class="hero-metrics">
@@ -99,6 +98,35 @@ <h1 class="logo-type">ERR-EVAL</h1>
         </div>
     </section>
 
+    <!-- Methodology / Tracks -->
+    <section id="methodology" class="grid-section">
+        <div class="grid-card">
+            <div class="card-label">TRACK A</div>
+            <h4>GARBLED INPUT</h4>
+            <p>Typos, autocorrect errors, and mangled text. Can the model figure out what you meant—or admit it can't?</p>
+        </div>
+        <div class="grid-card">
+            <div class="card-label">TRACK B</div>
+            <h4>UNCLEAR WORDING</h4>
+            <p>Sentences that could mean multiple things. "I saw the man with the telescope"—who has the telescope?</p>
+        </div>
+        <div class="grid-card">
+            <div class="card-label">TRACK C</div>
+            <h4>TRICK QUESTIONS</h4>
+            <p>Questions that assume something false. A good model should push back, not play along.</p>
+        </div>
+        <div class="grid-card">
+            <div class="card-label">TRACK D</div>
+            <h4>MISSING INFO</h4>
+            <p>Requests that leave out crucial details. The right move is to ask, not guess.</p>
+        </div>
+        <div class="grid-card">
+            <div class="card-label">TRACK E</div>
+            <h4>IMPOSSIBLE ASKS</h4>
+            <p>Requests with contradictory requirements. "Make it faster and use less memory and don't change the code."</p>
+        </div>
+    </section>
+
     <!-- Scoring Axes: The "Truths" -->
     <section class="axes-ticker">
         <div class="ticker-wrap">
@@ -158,11 +186,11 @@ <h2>LEADERBOARD</h2>
             <th class="col-rank">#</th>
             <th class="col-model">MODEL / ID</th>
             <th class="col-score">OVERALL</th>
-            <th class="col-track" data-tooltip="Noisy Perception">TRACK A</th>
-            <th class="col-track" data-tooltip="Ambiguous Semantics">TRACK B</th>
-            <th class="col-track" data-tooltip="False Premise">TRACK C</th>
-            <th class="col-track" data-tooltip="Underspecified Tasks">TRACK D</th>
-            <th class="col-track" data-tooltip="Conflicting Constraints">TRACK E</th>
+            <th class="col-track" data-tooltip="Garbled Input">TRACK A</th>
+            <th class="col-track" data-tooltip="Unclear Wording">TRACK B</th>
+            <th class="col-track" data-tooltip="Trick Questions">TRACK C</th>
+            <th class="col-track" data-tooltip="Missing Info">TRACK D</th>
+            <th class="col-track" data-tooltip="Impossible Asks">TRACK E</th>
         </tr>
     </thead>
     <tbody id="leaderboard-body">
@@ -202,35 +230,6 @@ <h2>MASTER CHART</h2>
             <span id="last-run-display">Last Updated: --</span>
         </div>
     </section>
-
-    <!-- Methodology / Tracks -->
-    <section id="methodology" class="grid-section">
-        <div class="grid-card">
-            <div class="card-label">TRACK A</div>
-            <h4>NOISY PERCEPTION</h4>
-            <p>Handling corrupted inputs, misheard phrases, standard speech-to-text noise errors.</p>
-        </div>
-        <div class="grid-card">
-            <div class="card-label">TRACK B</div>
-            <h4>AMBIGUOUS SEMANTICS</h4>
-            <p>Syntactic ambiguities, scope errors, pronoun references with multiple distinct valid parses.</p>
-        </div>
-        <div class="grid-card">
-            <div class="card-label">TRACK C</div>
-            <h4>FALSE PREMISES</h4>
-            <p>Questions containing unsafe assumptions that must be challenged, not answered.</p>
-        </div>
-        <div class="grid-card">
-            <div class="card-label">TRACK D</div>
-            <h4>UNDERSPECIFIED</h4>
-            <p>Tasks missing critical constraints where action should be suspended for clarification.</p>
-        </div>
-        <div class="grid-card">
-            <div class="card-label">TRACK E</div>
-            <h4>CONFLICTS</h4>
-            <p>Mutually exclusive constraints where trade-offs must be negotiated explicitly.</p>
-        </div>
-    </section>
 </main>
 
 <footer class="site-footer">