Improve vocab decoding#27

Merged
benlebrun merged 2 commits into main from fix-byte-decoding
Mar 10, 2025

Conversation

@benlebrun
Member

This PR fixes an issue in which vocabulary decoding was failing because some tokenizers contained special tokens with characters that were not in the byte decoder. We now convert special tokens to bytes by directly encoding their string representation instead of passing it through the byte decoder.
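
To illustrate the failure mode and the fix described above, here is a minimal sketch. It is not the code in `genlm_backend/tokenization/bytes.py`: it assumes a GPT-2 style byte-to-unicode table, and the names `bytes_to_unicode`, `BYTE_DECODER`, and `token_to_bytes`, as well as the example special token, are hypothetical and chosen only for illustration.

```python
def bytes_to_unicode():
    """GPT-2 style mapping from each of the 256 byte values to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token, special_tokens):
    """Convert a token string to bytes (hypothetical helper).

    Special tokens may contain characters that are absent from the byte
    decoder, so they are encoded directly as UTF-8. Ordinary tokens go
    through the byte decoder character by character.
    """
    if token in special_tokens:
        return token.encode("utf-8")
    return bytes(BYTE_DECODER[ch] for ch in token)


# A made-up special token containing U+2581, which the byte decoder lacks.
SPECIAL = "<s\u2581>"

ordinary = token_to_bytes("\u0120hello", set())   # "\u0120" stands for a space byte
special = token_to_bytes(SPECIAL, {SPECIAL})      # encoded directly, no KeyError
```

Without the special-token branch, looking up `"\u2581"` in `BYTE_DECODER` raises a `KeyError`, which is the kind of decoding failure this PR addresses.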

I also got rid of some decoding functionality that wasn't being used by any of our tests. We can add it back in if we identify a tokenizer for which it is needed.

@gitstream-cm

gitstream-cm bot commented Mar 10, 2025

🥷 Code experts: sritchie

sritchie, benlebrun have most 👩‍💻 activity in the files.
benlebrun, sritchie have most 🧠 knowledge in the files.

genlm_backend/tokenization/bytes.py

Activity based on git-commit:

sritchie benlebrun
MAR
FEB
JAN 77 additions & 350 deletions 303 additions & 10 deletions
DEC 273 additions & 0 deletions
NOV
OCT

Knowledge based on git-blame:
benlebrun: 74%
sritchie: 26%

requirements-dev.txt

Activity based on git-commit:

sritchie benlebrun
MAR 8 additions & 0 deletions
FEB
JAN
DEC
NOV
OCT

Knowledge based on git-blame:
sritchie: 100%

tests/test_vocabulary.py

Activity based on git-commit:

sritchie benlebrun
MAR
FEB
JAN 37 additions & 17 deletions 1 additions & 1 deletions
DEC 90 additions & 2 deletions
NOV
OCT

Knowledge based on git-blame:
benlebrun: 66%
sritchie: 34%

benlebrun merged commit f3e99ce into main Mar 10, 2025
7 checks passed
benlebrun deleted the fix-byte-decoding branch March 10, 2025 18:47
