[Bug]: LLMLingua-2 uses wrong special tokens by default #181

### Describe the bug

`TokenClfDataset` is [initialized without a `model_name`](https://github.com/microsoft/LLMLingua/blob/60abc0f94939b24e000fe6a33a954de72055fa0c/llmlingua/prompt_compressor.py#L2341) argument and therefore [defaults to `bert-base-multilingual-cased`](https://github.com/microsoft/LLMLingua/blob/60abc0f94939b24e000fe6a33a954de72055fa0c/llmlingua/utils.py#L19). As a result, LLMLingua-2 uses the wrong special tokens, i.e.

```python
if "bert-base-multilingual-cased" in model_name:
    self.cls_token = "[CLS]"
    self.sep_token = "[SEP]"
    self.unk_token = "[UNK]"
    self.pad_token = "[PAD]"
    self.mask_token = "[MASK]"
```
instead of the tokens for the `xlm-roberta-large` branch:
```python
elif "xlm-roberta-large" in model_name:
    self.bos_token = "<s>"
    self.eos_token = "</s>"
    self.sep_token = "</s>"
    self.cls_token = "<s>"
    self.unk_token = "<unk>"
    self.pad_token = "<pad>"
    self.mask_token = "<mask>"
```

The tokenizer simply treats these wrong special tokens (bos/eos/pad) as unknown tokens. I don't know exactly what effect that has: the difference in the compression output is small, but it is there.
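To make the mismatch concrete, here is a minimal offline sketch. It assumes a simplified stand-in for the branching quoted above (`pick_special_tokens` is a hypothetical helper, not part of LLMLingua) and a toy vocabulary. The special-token ids follow the XLM-RoBERTa convention, while the word ids are made up.

```python
# Hypothetical stand-in for the special-token selection quoted above.
def pick_special_tokens(model_name="bert-base-multilingual-cased"):
    if "bert-base-multilingual-cased" in model_name:
        return {"cls": "[CLS]", "sep": "[SEP]", "unk": "[UNK]", "pad": "[PAD]"}
    elif "xlm-roberta-large" in model_name:
        return {"cls": "<s>", "sep": "</s>", "unk": "<unk>", "pad": "<pad>"}
    raise NotImplementedError(model_name)

# Toy vocabulary: XLM-RoBERTa special-token ids, invented ids for the words.
TOY_XLMR_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "hello": 4, "world": 5}

def to_ids(tokens, vocab=TOY_XLMR_VOCAB):
    # Strings missing from the vocabulary fall back to the <unk> id.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def wrap(words, special):
    # Surround the words with the chosen begin/end markers.
    return [special["cls"]] + words + [special["sep"]]

words = ["hello", "world"]
default_ids = to_ids(wrap(words, pick_special_tokens()))                     # no model_name passed
correct_ids = to_ids(wrap(words, pick_special_tokens("xlm-roberta-large")))  # matching model_name

print(default_ids)  # [3, 4, 5, 3]
print(correct_ids)  # [0, 4, 5, 2]
```

With the default, the sequence is framed by `<unk>` ids instead of the `<s>` / `</s>` ids the model presumably saw during training.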

### Steps to reproduce

Add `print(tokenized_text)` at line 57 of `utils.py` to see the wrong tokens being used with the xlm-roberta-large based compression model.

### Expected Behavior
The correct special tokens should be used for the respective compression model.
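One possible direction, sketched under assumptions: rather than matching on `model_name`, the special tokens could be read from the tokenizer object itself, since Hugging Face tokenizers expose attributes such as `cls_token` and `pad_token`. `special_tokens_from_tokenizer` is a hypothetical helper, and the stub below only stands in for an XLM-RoBERTa tokenizer so the example runs without downloading a model. This is not the project's actual fix.

```python
from types import SimpleNamespace

SPECIAL_TOKEN_ATTRS = (
    "bos_token", "eos_token", "sep_token", "cls_token",
    "unk_token", "pad_token", "mask_token",
)

def special_tokens_from_tokenizer(tokenizer):
    """Collect special tokens from the tokenizer's own attributes (hypothetical helper)."""
    # Attributes the tokenizer does not define come back as None.
    return {name: getattr(tokenizer, name, None) for name in SPECIAL_TOKEN_ATTRS}

# Stub with the values from the xlm-roberta-large branch quoted in this issue.
stub_tokenizer = SimpleNamespace(
    bos_token="<s>", eos_token="</s>", sep_token="</s>", cls_token="<s>",
    unk_token="<unk>", pad_token="<pad>", mask_token="<mask>",
)

tokens = special_tokens_from_tokenizer(stub_tokenizer)
print(tokens["cls_token"], tokens["sep_token"], tokens["pad_token"])  # <s> </s> <pad>
```

Passing the compression model's name through to `TokenClfDataset` would be the smaller change. Reading from the tokenizer avoids keeping a per-model table in sync.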

### Additional Information

- LLMLingua version: 0.2.2
