
fix(engine): replace full-vocab percentile with top-N rank scoring#74

Merged
FlorentPoinsaut merged 8 commits into main from fix/67-top-n-scoring on Apr 28, 2026

Conversation

@FlorentPoinsaut
Member

Summary

Fixes #67 — percentile compression where ranks 1–1500 all display as 99%.

Root cause

The old formula, (effective_vocab - rank) / effective_vocab, spread scores across the full ~150 000-word vocabulary, making each rank step worth only ~0.00067%, so every semantically meaningful guess was truncated to 99% by Math.floor(score * 100).

Fix

Replace with a top-N neighbourhood score:

# rank within the top-N window
if rank > self._top_n:
    return 0.0
return (self._top_n - rank) / self._top_n
# rank 1 → 0.999 (99%), rank 1000 → 0.0 (0%), exact match → 1.0 (100%)

This restores a continuous, visible gradient across the full 0–99% range, with 100% reserved exclusively for exact matches.

Changes

File / Change
game/engine.py: New scoring formula; top_n constructor param (default from env); ValueError guard; updated docstrings
config.py: Add SCORING_TOP_N env var (default 1000)
overlay/static/index.html: Recalibrate gauge gradient: blue→green at 60%, green→gold at 90%
tests/test_engine.py: Inject _top_n in test helper; add test_word_beyond_top_n_returns_zero

Score mapping (TOP_N = 1000)

Rank Score Display Colour
exact match 1.0 100% — (game ends)
1 0.999 99% 🟡 gold
100 0.9 90% 🟡 gold
101 0.899 89% 🟢 green
400 0.6 60% 🟢 green
401 0.599 59% 🔵 blue
1000 0.001 0% 🔵 blue
> 1000 0.0 0% 🔵 blue
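The mapping in the table can be reproduced with a short sketch (top_n_score is an illustrative stand-in for the engine method, not its actual API):

```python
import math

def top_n_score(rank: int, top_n: int = 1000) -> float:
    """Linear neighbourhood score: rank 1 -> 0.999, rank >= top_n -> 0.0."""
    if top_n <= 0:
        raise ValueError("top_n must be a positive integer")
    if rank > top_n:
        return 0.0
    return (top_n - rank) / top_n

# Display value is floor(score * 100), matching the table above.
for rank in (1, 100, 101, 400, 401, 1000):
    print(rank, math.floor(top_n_score(rank) * 100))
```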

Tests

20 tests pass, including the new test_word_beyond_top_n_returns_zero.

FlorentPoinsaut and others added 2 commits April 26, 2026 06:13
Fixes #67

The previous formula mapped rank across the entire ~150 000-word
vocabulary, compressing ranks 1-1500 into 99% and destroying the
score gradient that makes the game engaging.

Replace with a top-N neighbourhood score:
  rank <= top_n → (top_n - rank) / top_n   (rank 1 → 0.999, rank top_n → 0)
  rank >  top_n → 0.0

This restores a continuous, visible gradient from 0% (outside the
neighbourhood) to 99% (closest non-exact word), with 100% reserved
for exact matches only.

Changes:
- game/engine.py: new formula + top_n constructor param + ValueError guard
- config.py: add SCORING_TOP_N env var (default 1000)
- overlay/static/index.html: recalibrate gauge gradient to match thresholds
  (blue 0%, green 60%, gold 90%, red 100%)
- tests/test_engine.py: inject _top_n in helper, add beyond-top-N test
The linear formula (top_n - rank) / top_n with top_n=1000 assigned 0%
to any word ranked beyond the 1000th nearest neighbour in frWac. Since
the vocabulary contains ~150 000 entries, even loosely related words
easily exceed rank 1000, causing every manual guess to display 0%.

Replace with a logarithmic formula over a larger top-N window (100 000
by default):

    score = 1 - log(rank + 1) / log(top_n + 1)

This gives a visible, continuous gradient with no compression:
  rank      1 →  94 %   (very close synonyms)
  rank     10 →  79 %
  rank    100 →  61 %
  rank  1 000 →  42 %
  rank 10 000 →  22 %
  rank 100 000 →  0 %  (hard cutoff)

The SCORING_TOP_N default is updated from 1 000 to 100 000 in both places it appears (config.py and the engine fallback).
Tests updated to reflect the new formula and _top_n=4 in the mock helper.
@FlorentPoinsaut
Member Author

Fix: 0% scores on manual tests

Diagnosis

The linear formula (top_n - rank) / top_n with top_n = 1000 was too restrictive for the frWac model (≈ 150 000 words). Any word ranked beyond 1 000 automatically scored 0%, including semantically related words, which routinely fall past that threshold in such a large vocabulary.

Fix

Replaced with a logarithmic formula over a window of 100 000 neighbours:

score = 1.0 - math.log(rank + 1) / math.log(self._top_n + 1)

Score distribution (top_n = 100 000)

Rank Score Display Colour
exact match 1.0 100 % 🏆 win
1 0.942 94 % 🟡 gold
10 0.799 79 % 🟢 green
100 0.613 61 % 🟢 green
1 000 0.421 42 % 🔵 blue
10 000 0.227 22 % 🔵 blue
100 000 0.0 0 % 🔵 blue
> 100 000 0.0 0 % 🔵 blue

This distribution fixes both problems at once: close guesses get a visible gradient instead of a flat 99%, and related-but-distant words no longer collapse to 0%.

20/20 tests pass.
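The logarithmic window can be sketched and sanity-checked as follows (log_score is an illustrative helper, assuming the top_n = 100 000 default named above):

```python
import math

def log_score(rank: int, top_n: int = 100_000) -> float:
    """score = 1 - log(rank + 1) / log(top_n + 1), hard cutoff beyond top_n."""
    if rank > top_n:
        return 0.0
    return 1.0 - math.log(rank + 1) / math.log(top_n + 1)

# Unlike the linear formula, a rank-1000 word still scores well above zero,
# and the gradient is strictly decreasing all the way down to the cutoff.
samples = [log_score(r) for r in (1, 10, 100, 1_000, 10_000, 100_000)]
assert all(a > b for a, b in zip(samples, samples[1:]))
assert log_score(1_000) > 0.3
assert log_score(100_000) == 0.0
```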

…wer agents

NLP/Data:
- Lower SCORING_TOP_N default from 100 000 to 10 000 so that gold (≥90%)
  is reachable for a true synonym and the blue zone stays informative
- Fix score_guess docstring: remove incorrect '0.99 = top 1%' claim;
  describe the logarithmic scale accurately

Reviewer:
- Import MODEL_PATH and SCORING_TOP_N from config instead of calling
  os.getenv() directly (violates project convention)
- Fix stale class docstring: 'or 1000 when unset' → 'or 10 000 when unset'
- Wrap __init__ signature to comply with PEP 8 E501 (88 chars max)
- Avoid redundant clean_word() calls in score_guess (computed once)

Tester:
- Rename test_score_guess_raises_when_not_loaded →
  test_similarity_raises_when_not_loaded (it tested similarity())
- Add test_score_guess_raises_when_not_loaded (line 109 was uncovered)
- Add test_invalid_top_n_raises_value_error (top_n=0, line 36 uncovered)
- Add test_negative_top_n_raises_value_error (top_n=-1)
- Add test_similarity_unknown_word_returns_none (line 85 uncovered)
- Add test_score_at_exactly_top_n_returns_zero (boundary condition)

Coverage: game/engine.py 91% → 97% — 25/25 tests pass
…rror

Importing config at module level triggered _require('TWITCH_CHANNEL')
during pytest collection, causing an ERROR in CI environments without
a .env file.

Moving 'import config' inside __init__ defers execution until actual
instantiation. This also removes the duplicated os.getenv() defaults
(_DEFAULT_MODEL_PATH, _DEFAULT_TOP_N): config.py remains the single
source of truth for both values.
Executing _require('TWITCH_CHANNEL') at module scope caused pytest to
crash during collection/instantiation in CI environments without a .env.

Introduce config.validate() which must be called once at application
startup (main.py). TWITCH_CHANNEL defaults to '' at import time; the
guard fires at startup as before, keeping production fail-fast behaviour.

game/engine.py can now import config at module level cleanly, with
config.py as the single source of truth for SCORING_TOP_N and
MODEL_PATH (no duplication).
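The deferred-validation pattern this commit describes can be sketched as follows (names mirror the commit message; the defaults shown are placeholders, not the project's real values):

```python
# config.py (sketch): env vars are read with safe defaults at import time;
# hard requirements are only enforced when validate() runs at startup.
import os

TWITCH_CHANNEL = os.getenv("TWITCH_CHANNEL", "")  # '' at import time
SCORING_TOP_N = int(os.getenv("SCORING_TOP_N", "10000"))

def validate() -> None:
    """Called once from main.py at startup: production still fails fast,
    but pytest collection no longer crashes when no .env is present."""
    if not TWITCH_CHANNEL:
        raise RuntimeError("TWITCH_CHANNEL is required")
```

With this split, game/engine.py can do a plain module-level import of config, and config.py stays the single source of truth for the defaults.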
Every valid guess now scores strictly > 0, with no configurable cutoff.

Formula: score = 1 - log(rank+1) / log(vocab_size+1)
  where vocab_size = len(model.key_to_index) set at load() time.

Because rank <= vocab_size - 1 < vocab_size for any in-vocab word,
the score is always positive. No hard cutoff, no SCORING_TOP_N config.

Score distribution (frWac ~150 000 words):
  rank      1 →  94 %
  rank     10 →  80 %
  rank    100 →  61 %
  rank  1 000 →  42 %
  rank 10 000 →  23 %
  rank 149 999 →  0.003 %

Remove: top_n param, _top_n/_top_n_override attrs, SCORING_TOP_N config,
        _DEFAULT_TOP_N module constant, and related tests.
Add: _vocab_size attr set in load(), test_all_vocab_words_score_above_zero.
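The positivity argument can be checked directly (vocab_log_score is an illustrative helper; V ≈ 150 000 per the commit message):

```python
import math

def vocab_log_score(rank: int, vocab_size: int) -> float:
    """score = 1 - log(rank + 1) / log(vocab_size + 1)."""
    return 1.0 - math.log(rank + 1) / math.log(vocab_size + 1)

V = 150_000
# The worst in-vocab rank is V - 1, so rank + 1 = V < V + 1 and the log
# ratio is strictly below 1: every in-vocabulary word scores > 0.
assert vocab_log_score(V - 1, V) > 0.0
assert vocab_log_score(1, V) > vocab_log_score(V - 1, V)
```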
… match

Cache _max_score = 1 - log(2)/log(V+1) at load() time and rescale:

    score = score_raw * 0.99 / _max_score

This maps rank 1 exactly to 0.99 (99%) while preserving the logarithmic
gradient. 1.0 (100%) remains exclusive to exact matches.

Score distribution (frWac ~150 000 words):
  rank      1 →  99 %   (closest neighbour)
  rank     10 →  85 %
  rank    100 →  65 %
  rank  1 000 →  44 %
  rank 10 000 →  24 %
  rank 149 999 →   0.003 %  (always > 0)

Update tests: replace absolute 0.5 thresholds with relative comparisons
(unrelated < close), add _max_score=None to _make_engine() helper.
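A sketch of the rescaling step (names are illustrative; V assumed ≈ 150 000 as above):

```python
import math

V = 150_000  # assumed frWac vocabulary size
MAX_SCORE = 1.0 - math.log(2) / math.log(V + 1)  # raw score at rank 1, cached at load()

def rescaled_score(rank: int) -> float:
    """Rescale the raw log score so rank 1 lands exactly on 0.99."""
    raw = 1.0 - math.log(rank + 1) / math.log(V + 1)
    return raw * 0.99 / MAX_SCORE

assert abs(rescaled_score(1) - 0.99) < 1e-12  # 100% stays exclusive to exact matches
assert rescaled_score(1) > rescaled_score(10) > rescaled_score(100) > 0.0
```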
Replace rescaled log formula with formula E (offset k=9) recommended
by NLP/Data agent after analysis of cemantix.certitudes.org approach:

    score = 0.99 * log((V+9) / (rank+9)) / log((V+9) / 10)

Mathematical guarantees (V = 150 000, frWac):
  - rank 1 → exactly 99% (100% reserved for exact match)
  - step rank 1→2 = 0.98 pp ≤ 1 pp → no integer % gaps (1–99 all reachable)
  - score > 0 for every in-vocabulary word
  - strictly monotone decreasing

Score distribution:
  rank      1 →  99 %
  rank      2 →  98 %
  rank      3 →  97 %
  rank     10 →  92 %
  rank    100 →  74 %
  rank  1 000 →  51 %
  rank 10 000 →  27 %
  rank 149 999 →   0.0001 %

Remove _max_score attr (no longer needed). Update test docstring.
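The guarantees listed above can be verified numerically with a small sketch (V = 150 000 and offset k = 9 as in the commit message; formula_e is an illustrative name):

```python
import math

V = 150_000  # frWac vocabulary size used in the commit message
K = 9        # offset k from formula E

def formula_e(rank: int) -> float:
    """score = 0.99 * log((V + k) / (rank + k)) / log((V + k) / 10)."""
    return 0.99 * math.log((V + K) / (rank + K)) / math.log((V + K) / 10)

# rank 1 lands exactly on 0.99 because rank + 9 == 10 matches the denominator
assert abs(formula_e(1) - 0.99) < 1e-12
# the rank 1 -> 2 step is under one percentage point, so 1–99% are all reachable
assert formula_e(1) - formula_e(2) < 0.01
# strictly decreasing and positive across sampled ranks
scores = [formula_e(r) for r in (1, 2, 10, 100, 1_000, 10_000, V - 1)]
assert all(a > b > 0 for a, b in zip(scores, scores[1:]))
```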
@FlorentPoinsaut FlorentPoinsaut merged commit d61393b into main Apr 28, 2026
2 checks passed
@FlorentPoinsaut FlorentPoinsaut deleted the fix/67-top-n-scoring branch April 28, 2026 12:13

Development

Successfully merging this pull request may close these issues.

[P0] Percentile compression: ranks 1–1500 all display as 99%
