GPT-2 encoder breaks in new version of PyTorch/huggingface #13

After switching to the pytorch_april_patched environment and installing with -r requirements.txt, encoding the wiki dataset fails with a KeyError in the GPT-2 tokenizer:

```
Producing dataset wiki...
encoding file testdata/wikiextracted/AA/wiki_01.txt ...
Traceback (most recent call last):
  File "train.py", line 1036, in <module>
    eval(f'test_{g.args.test}()')
  File "<string>", line 1, in <module>
  File "train.py", line 940, in test_checkpoint_wiki
    data_setup()
  File "train.py", line 333, in data_setup
    g.corpus = get_lm_corpus(g.args.data, g.args.dataset, use_bpe=g.args.bpe)
  File "/home/ubuntu/data_utils.py", line 381, in get_lm_corpus
    corpus = Corpus(datadir, dataset, use_bpe, **kwargs)
  File "/home/ubuntu/data_utils.py", line 309, in __init__
    self.valid = self.vocab.encode_file(valid_path, ordered=True)
  File "/home/ubuntu/utils/vocabulary.py", line 204, in encode_file
    tokens: List[int] = self.tokenizer.encode(text) + [self.EOT]
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8212
```
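A possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: 8212 is the code point of the em dash (U+2014), and the failing line indexes `byte_encoder` with `ord(b)` for each character of a Python 3 `str`. If `byte_encoder` only has keys 0..255, as in the GPT-2 reference encoder, any non-Latin-1 character in the wiki text would raise this `KeyError`. The snippet below rebuilds such a table and reproduces the failure without the library. The names `encode_token_by_codepoint` and `encode_token_by_byte` are hypothetical, and the byte-based variant is an assumed fix, not necessarily what the library does.

```python
def bytes_to_unicode():
    """Byte (0..255) -> printable character table, modeled on the GPT-2 reference encoder."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))
          + list(range(0xAE, 0xFF + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()


def encode_token_by_codepoint(token):
    # Mirrors the line in the traceback: indexes the table by code point,
    # which can exceed 255 for a Python 3 str.
    return "".join(byte_encoder[ord(ch)] for ch in token)


def encode_token_by_byte(token):
    # Hypothetical fix: index by UTF-8 byte value, which is always 0..255.
    return "".join(byte_encoder[b] for b in token.encode("utf-8"))


EM_DASH = "\u2014"  # ord(EM_DASH) == 8212, the key in the reported KeyError

try:
    encode_token_by_codepoint(EM_DASH)
except KeyError as err:
    print("KeyError:", err)

# The em dash is three bytes in UTF-8, so it maps to three table characters.
print(len(encode_token_by_byte(EM_DASH)))
```

If this reading is right, the failure depends on the input text (any file containing characters outside the table's key range), not only on the environment switch, so checking whether the installed `tokenization_gpt2.py` iterates over `token.encode('utf-8')` on Python 3 would be a reasonable next step.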
