Open
Labels
bug: Something isn't working
Description
Describe the bug
TokenClfDataset is initialized without a model_name argument, so it defaults to bert-base-multilingual-cased. As a result, LLMLingua-2 uses the wrong special tokens, i.e.
if "bert-base-multilingual-cased" in model_name:
    self.cls_token = "[CLS]"
    self.sep_token = "[SEP]"
    self.unk_token = "[UNK]"
    self.pad_token = "[PAD]"
    self.mask_token = "[MASK]"
instead of
elif "xlm-roberta-large" in model_name:
    self.bos_token = "<s>"
    self.eos_token = "</s>"
    self.sep_token = "</s>"
    self.cls_token = "<s>"
    self.unk_token = "<unk>"
    self.pad_token = "<pad>"
    self.mask_token = "<mask>"
The tokenizer simply treats these wrong special tokens (bos/eos/pad) as unknown tokens; I don't know exactly what effect that has. The difference in compression is not very significant, but there is some difference.
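To make the mismatch concrete, here is a minimal sketch of the selection logic quoted above. It is not the library's code: select_special_tokens and DEFAULT_MODEL_NAME are hypothetical names, and the XLM-RoBERTa checkpoint name is only used as an example string that contains "xlm-roberta-large".

```python
# Hypothetical stand-in for the special-token selection quoted in this issue;
# it only mirrors the two branches shown above, not the real class.
DEFAULT_MODEL_NAME = "bert-base-multilingual-cased"


def select_special_tokens(model_name: str = DEFAULT_MODEL_NAME) -> dict:
    """Return the special tokens chosen for a given model name."""
    if "bert-base-multilingual-cased" in model_name:
        return {
            "cls_token": "[CLS]",
            "sep_token": "[SEP]",
            "unk_token": "[UNK]",
            "pad_token": "[PAD]",
            "mask_token": "[MASK]",
        }
    elif "xlm-roberta-large" in model_name:
        return {
            "bos_token": "<s>",
            "eos_token": "</s>",
            "sep_token": "</s>",
            "cls_token": "<s>",
            "unk_token": "<unk>",
            "pad_token": "<pad>",
            "mask_token": "<mask>",
        }
    raise NotImplementedError(f"no special tokens defined for {model_name!r}")


# What happens when the caller omits model_name (the situation reported here).
used = select_special_tokens()

# What an XLM-RoBERTa based compressor would need (example checkpoint name).
expected = select_special_tokens(
    "microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
)

# Tokens that are present in both sets but differ in value.
mismatched = {
    key: (used[key], expected[key])
    for key in used
    if used[key] != expected[key]
}
print(mismatched)
```

Under these assumptions every token shared by the two sets differs, so any text wrapped with the defaults carries BERT-style markers into an XLM-RoBERTa tokenizer.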
Steps to reproduce
Add print(tokenized_text) at line 57 of utils.py to see the wrong tokens being used for the xlm-roberta-large based compression model.
Expected Behavior
The correct special tokens should be used for the respective compression model.
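One possible shape for a fix, sketched under assumptions: the caller forwards the compressor's model name when it builds the dataset instead of relying on the default. SketchTokenClfDataset, wrap, and build_dataset are hypothetical names for illustration and do not reflect the actual constructor signature or call site in LLMLingua.

```python
class SketchTokenClfDataset:
    """Hypothetical, simplified dataset: only special-token selection is modelled."""

    def __init__(self, texts, model_name="bert-base-multilingual-cased"):
        self.texts = list(texts)
        if "bert-base-multilingual-cased" in model_name:
            self.cls_token = "[CLS]"
            self.sep_token = "[SEP]"
        elif "xlm-roberta-large" in model_name:
            self.cls_token = "<s>"
            self.sep_token = "</s>"
        else:
            # Fail loudly rather than silently falling back to BERT tokens.
            raise ValueError(f"unsupported model name: {model_name!r}")

    def wrap(self, tokens):
        """Surround a token list with the model's start and end markers."""
        return [self.cls_token] + list(tokens) + [self.sep_token]


def build_dataset(texts, model_name):
    """Hypothetical call site: model_name is always passed explicitly."""
    return SketchTokenClfDataset(texts, model_name=model_name)


dataset = build_dataset(
    ["hello world"],
    "microsoft/llmlingua-2-xlm-roberta-large-meetingbank",  # example name
)
print(dataset.wrap(["hello", "world"]))
```

Raising on an unrecognized name is a design choice in this sketch, not something the issue asks for; it simply makes a missing or wrong model_name visible instead of changing the tokens silently.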
Additional Information
- LLMLingua version: 0.2.2