OpenKiwi always downloads the tokenizer files for XLMRoberta even if a local path is configured.

While training the XLM-Roberta-based QE system, I pre-downloaded the pre-trained XLM-Roberta model from Hugging Face and changed the `system.model.encoder.model_name` field in `xlmroberta.yaml` from the default `xlm-roberta-base` to the local path containing the pre-downloaded model. However, when running the code, I found that OpenKiwi still downloads `config.json` and `sentencepiece.bpe.model` instead of using the pre-downloaded copies.

I eventually traced this to the following code in `kiwi/systems/encoders/xlmroberta.py:48~49`:
```
        if tokenizer_name not in XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST:
            tokenizer_name = 'xlm-roberta-base'
```
which means that if `model_name` is set to a local path, it is rewritten to `xlm-roberta-base`. However, Bert and XLM have no such issue. Is this a bug, or is it intentional?
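
To illustrate the behaviour I expected, here is a minimal sketch of a fallback that leaves local paths untouched. It is not the actual OpenKiwi code: `resolve_tokenizer_name` and `KNOWN_XLMR_NAMES` are hypothetical names I made up, and the list is only a stand-in for `XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST`.

```python
import os
import tempfile

# Assumption: a stand-in for XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST.
# The real list lives in the transformers library and may contain other names.
KNOWN_XLMR_NAMES = ['xlm-roberta-base', 'xlm-roberta-large']


def resolve_tokenizer_name(model_name, known_names=KNOWN_XLMR_NAMES,
                           default='xlm-roberta-base'):
    """Hypothetical helper: keep known model names and existing local
    directories as-is; fall back to `default` for anything else."""
    if model_name in known_names or os.path.isdir(model_name):
        return model_name
    return default


# Usage: a local directory is preserved instead of being overwritten.
with tempfile.TemporaryDirectory() as local_dir:
    kept_local = resolve_tokenizer_name(local_dir) == local_dir

kept_known = resolve_tokenizer_name('xlm-roberta-large')
fallback = resolve_tokenizer_name('no-such-model-or-dir')
```

With a check like this, a configured local path would be passed through to the tokenizer loader, while unknown names would still fall back to `xlm-roberta-base` as they do now.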



OpenKiwi always downloads the tokenizer files for XLMRoberta even if a local path is configured. #102
