duplicate tokens in tokenizers #78

@mmoskal

Description

For example, the llama tokenizer has the byte token "<0x20>" as id 35 and "▁" (space) as id 29871, as well as "<0x21>" as id 36 and "!" as id 29991, etc.

We need to:

  • pick the canonical form (29871 probably)
  • maintain a side mapping so that whenever 29871 is allowed in the TokenSet, 35 is allowed as well (apply it after "compute_bias()" etc.)
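A minimal sketch of what the side mapping could look like. This is not the repo's actual implementation; the function names, the dict-of-strings input, and the "prefer the higher id as canonical" heuristic (which happens to pick 29871 over 35 for the llama examples above) are all assumptions for illustration:

```python
def build_duplicate_map(token_strings):
    """Map canonical token id -> other ids that decode to the same string.

    token_strings: dict of token id -> decoded string (hypothetical input;
    in practice this would come from the tokenizer's vocabulary).
    """
    by_string = {}
    for tok_id, s in token_strings.items():
        by_string.setdefault(s, []).append(tok_id)
    dup_map = {}
    for ids in by_string.values():
        if len(ids) > 1:
            # Assumption: treat the highest id as canonical; for llama this
            # prefers 29871 (space) over the byte token 35, as suggested above.
            canonical = max(ids)
            dup_map[canonical] = [i for i in ids if i != canonical]
    return dup_map


def expand_token_set(allowed, dup_map):
    """Apply after compute_bias(): if a canonical id is allowed,
    also allow its duplicates (e.g. allowing 29871 also allows 35)."""
    expanded = set(allowed)
    for tok_id in allowed:
        expanded.update(dup_map.get(tok_id, []))
    return expanded
```

For the llama examples from the issue, `build_duplicate_map({35: " ", 29871: " ", 36: "!", 29991: "!"})` yields `{29871: [35], 29991: [36]}`, and expanding an allowed set containing 29871 then pulls in 35.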
