Commit a38f17b

fix(ui): rephrase abstruse language on homepage
1 parent cfe0ab5 commit a38f17b

1 file changed, 42 additions and 43 deletions

frontend/index.html

Lines changed: 42 additions & 43 deletions
@@ -69,17 +69,16 @@ <h1 class="logo-type">ERR-EVAL</h1>
 <p class="large-text">DO AI MODELS KNOW WHAT THEY <span class="highlight-risk glitch-hover"
 data-text="DON'T">DON'T</span> KNOW?</p>
 <p class="sub-text">
-<strong>ERR-EVAL: Epistemic Reasoning & Reliability Evaluation</strong>
+Most benchmarks test if AI gets the right answer. <strong>ERR-EVAL</strong> tests if AI knows
+when it <em>can't</em> get the right answer.
 <br><br>
-We test whether AI models can recognize when they <em>shouldn't</em> answer confidently.
-Each model faces 125 adversarial prompts designed to pressure them into hallucinating—
-incomplete information, hidden ambiguities, false premises, and impossible constraints.
+Each model faces 125 trick questions designed to pressure them into making things up—
+missing information, hidden ambiguities, false premises, and impossible constraints.
 <br><br>
-<strong>The benchmark scores 5 axes:</strong> detecting ambiguity, avoiding hallucinations,
-localizing what's missing, choosing the right response strategy, and maintaining calibrated
-confidence.
-Models that refuse to guess when information is missing score higher than those that confidently
-make things up.
+<strong>The benchmark scores 5 things:</strong> spotting problems, avoiding hallucinations,
+identifying what's missing, asking the right questions, and staying honest about uncertainty.
+Models that say "I don't know" when they shouldn't guess score higher than those that
+confidently make things up.
 </p>
 </div>
 <div class="hero-metrics">
@@ -99,6 +98,35 @@ <h1 class="logo-type">ERR-EVAL</h1>
 </div>
 </section>
 
+<!-- Methodology / Tracks -->
+<section id="methodology" class="grid-section">
+<div class="grid-card">
+<div class="card-label">TRACK A</div>
+<h4>GARBLED INPUT</h4>
+<p>Typos, autocorrect errors, and mangled text. Can the model figure out what you meant—or admit it can't?</p>
+</div>
+<div class="grid-card">
+<div class="card-label">TRACK B</div>
+<h4>UNCLEAR WORDING</h4>
+<p>Sentences that could mean multiple things. "I saw the man with the telescope"—who has the telescope?</p>
+</div>
+<div class="grid-card">
+<div class="card-label">TRACK C</div>
+<h4>TRICK QUESTIONS</h4>
+<p>Questions that assume something false. A good model should push back, not play along.</p>
+</div>
+<div class="grid-card">
+<div class="card-label">TRACK D</div>
+<h4>MISSING INFO</h4>
+<p>Requests that leave out crucial details. The right move is to ask, not guess.</p>
+</div>
+<div class="grid-card">
+<div class="card-label">TRACK E</div>
+<h4>IMPOSSIBLE ASKS</h4>
+<p>Requests with contradictory requirements. "Make it faster and use less memory and don't change the code."</p>
+</div>
+</section>
+
 <!-- Scoring Axes: The "Truths" -->
 <section class="axes-ticker">
 <div class="ticker-wrap">
@@ -158,11 +186,11 @@ <h2>LEADERBOARD</h2>
 <th class="col-rank">#</th>
 <th class="col-model">MODEL / ID</th>
 <th class="col-score">OVERALL</th>
-<th class="col-track" data-tooltip="Noisy Perception">TRACK A</th>
-<th class="col-track" data-tooltip="Ambiguous Semantics">TRACK B</th>
-<th class="col-track" data-tooltip="False Premise">TRACK C</th>
-<th class="col-track" data-tooltip="Underspecified Tasks">TRACK D</th>
-<th class="col-track" data-tooltip="Conflicting Constraints">TRACK E</th>
+<th class="col-track" data-tooltip="Garbled Input">TRACK A</th>
+<th class="col-track" data-tooltip="Unclear Wording">TRACK B</th>
+<th class="col-track" data-tooltip="Trick Questions">TRACK C</th>
+<th class="col-track" data-tooltip="Missing Info">TRACK D</th>
+<th class="col-track" data-tooltip="Impossible Asks">TRACK E</th>
 </tr>
 </thead>
 <tbody id="leaderboard-body">
@@ -202,35 +230,6 @@ <h2>MASTER CHART</h2>
 <span id="last-run-display">Last Updated: --</span>
 </div>
 </section>
-
-<!-- Methodology / Tracks -->
-<section id="methodology" class="grid-section">
-<div class="grid-card">
-<div class="card-label">TRACK A</div>
-<h4>NOISY PERCEPTION</h4>
-<p>Handling corrupted inputs, misheard phrases, standard speech-to-text noise errors.</p>
-</div>
-<div class="grid-card">
-<div class="card-label">TRACK B</div>
-<h4>AMBIGUOUS SEMANTICS</h4>
-<p>Syntactic ambiguities, scope errors, pronoun references with multiple distinct valid parses.</p>
-</div>
-<div class="grid-card">
-<div class="card-label">TRACK C</div>
-<h4>FALSE PREMISES</h4>
-<p>Questions containing unsafe assumptions that must be challenged, not answered.</p>
-</div>
-<div class="grid-card">
-<div class="card-label">TRACK D</div>
-<h4>UNDERSPECIFIED</h4>
-<p>Tasks missing critical constraints where action should be suspended for clarification.</p>
-</div>
-<div class="grid-card">
-<div class="card-label">TRACK E</div>
-<h4>CONFLICTS</h4>
-<p>Mutually exclusive constraints where trade-offs must be negotiated explicitly.</p>
-</div>
-</section>
 </main>
 
 <footer class="site-footer">