
Sub-tokenization with certain transformers  #150

@lfoppiano

Description

@pjox and I are working on a model trained with RoBERTa and the BPE tokenizer, in particular with zeldarose, which uses slightly different special tokens.

We have a problem when the data is tokenized.
In particular, the sub-tokenization produced by the tokenizer somehow gets messed up when is_split_into_words=True, with transformers version 4.15.0 (tokenizers version 0.10.3):

The code here (preprocess.py:304):

# sub-tokenization
encoded_result = self.tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=True,
            max_length=max_seq_length, truncation=True, return_offsets_mapping=True)

text_tokens = ['We', 'are', 'studying', 'the', 'material', 'La', '3', 'A', '2', 'Ge', '2', '(', 'A', '=', 'Ir', ',', 'Rh', ')', '.', 'The', 'critical', 'temperature', 'T', 'C', '=', '4', '.', '7', 'K', 'discovered', 'for', 'La', '3', 'Ir', '2', 'Ge', '2', 'in', 'this', 'work', 'is', 'by', 'about', '1', '.', '2', 'K', 'higher', 'than', 'that', 'found', 'for', 'La', '3', 'Rh', '2', 'Ge', '2', '.']

the output offsets are as follows: [(0, 0), (0, 2), (1, 3), (1, 8), (1, 3), (1, 8), (1, 2), (1, 1), (1, 1), (1, 1), (1, 2), (1, 1), (1, 1), (1, 1), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 1), (1, 3), (1, 8), (1, 11), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 10), (1, 3), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 4), (1, 4), (1, 2), (1, 2), (1, 5), (1, 1), (1, 1), (1, 1), (1, 1), (1, 6), (1, 4), (1, 4), (1, 5), (1, 3), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 1), (0, 0)]

The first two items are correct; from the third one on, the sequence gets messed up: the third should be (0, 3), then (0, 8), etc., and this then gets wrongly reconstructed by the delft code. When the offset pair does not start with 0, the code is unclear to me; I don't understand why it adds <PAD>:

                else:
                    # propagate the data to the new sub-token or 
                    # dummy/empty input for sub-tokens
                    label_ids.append("<PAD>")
                    chars_blocks.append(self.empty_char_vector)
                    # 2 possibilities, either empty features for sub-tokens or repeating the 
                    # feature vector of the prefix sub-token 
                    #feature_blocks.append(self.empty_features_vector)
                    feature_blocks.append(features_tokens[word_idx])
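
As a side note, the fast tokenizers also expose a word_ids() mapping that tells, for each sub-token, which original word it comes from (None for special tokens). A label-propagation step based on word_ids() instead of the offset pairs could look roughly like the sketch below; this is only an illustration assuming a standard RoBERTa fast tokenizer and a per-word labels list, not the current delft code, and whether word_ids() is affected by the same issue in transformers 4.15.0 would still need checking:

from transformers import AutoTokenizer

# illustrative tokenizer; RoBERTa needs add_prefix_space=True for pre-split input
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

encoded = tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=True,
                    max_length=512, truncation=True, return_offsets_mapping=True)

label_ids = []
previous_word_idx = None
for word_idx in encoded.word_ids(batch_index=0):
    if word_idx is None:
        # special tokens (<s>, </s>) get the dummy label
        label_ids.append("<PAD>")
    elif word_idx != previous_word_idx:
        # first sub-token of a word keeps the original label
        label_ids.append(labels[word_idx])
    else:
        # continuation sub-tokens get the dummy label, as in preprocess.py
        label_ids.append("<PAD>")
    previous_word_idx = word_idx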

If I instead pass the joined string and set is_split_into_words=False:

self.tokenizer("".join(text_tokens), add_special_tokens=True, is_split_into_words=False,             max_length=max_seq_length, truncation=True, return_offsets_mapping=True)

I obtain the correct result: [(0, 0), (0, 2), (2, 7), (7, 10), (10, 13), (13, 14), (14, 17), (17, 24), (24, 26), (26, 27), (27, 28), (28, 29), (29, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 37), (37, 38), (38, 40), (40, 42), (42, 45), (45, 53), (53, 64), (64, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 74), (74, 81), (81, 84), (84, 86), (86, 87), (87, 89), (89, 90), (90, 92), (92, 93), (93, 95), (95, 99), (99, 103), (103, 105), (105, 107), (107, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 122), (122, 124), (124, 128), (128, 130), (130, 135), (135, 138), (138, 140), (140, 141), (141, 143), (143, 144), (144, 146), (146, 147), (147, 148), (0, 0)]
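
If we went down this road, we would need to map the character offsets of the joined string back to the original tokens. A rough sketch (assuming the tokens are joined without separators, as in the call above, and that its result is stored in a variable encoded_joined):

# character span of each original token in the joined string
boundaries = []
position = 0
for token in text_tokens:
    boundaries.append((position, position + len(token)))
    position += len(token)

# assign each sub-token to the original token containing its start offset
word_indices = []
for start, end in encoded_joined["offset_mapping"]:
    if (start, end) == (0, 0):
        # special tokens
        word_indices.append(None)
        continue
    word_indices.append(next(i for i, (s, e) in enumerate(boundaries) if s <= start < e))

One caveat: sub-tokens produced on the joined string can straddle two original tokens (e.g. the (2, 7) span above covers "are" plus the start of "studying"), so the labels would not always align cleanly.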

The option is_split_into_words was apparently thought only for input already split on spaces, which is not the case for most of our use cases.

There is an explanation here, but I did not understand it well: huggingface/transformers#8217
(in any case it works only with the Python tokenizers)

Probably, we should consider

self.tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=False, max_length=max_seq_length, truncation=True, return_offsets_mapping=True)

which will return a list of lists, one for each token:

[
    [(0, 0), (0, 2), (0, 0)], 
    [(0, 0), (0, 3), (0, 0)], 
    [(0, 0), (0, 8), (0, 0)], 
    [...]
]

and then, with some additional work, we should be able to reconstruct the output correctly.
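
A minimal sketch of that reconstruction, assuming the batched call above (dropping the per-word special tokens via their (0, 0) offsets is an assumption, not something tested against zeldarose models):

encoded_batch = self.tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=False,
                               max_length=max_seq_length, truncation=True,
                               return_offsets_mapping=True)

input_ids = []
word_indices = []
for word_idx, (ids, offsets) in enumerate(zip(encoded_batch["input_ids"],
                                              encoded_batch["offset_mapping"])):
    for token_id, (start, end) in zip(ids, offsets):
        # drop the special tokens added around each individual word
        if (start, end) == (0, 0):
            continue
        input_ids.append(token_id)
        word_indices.append(word_idx)

# re-add the sequence-level special tokens around the flattened sub-tokens
input_ids = self.tokenizer.build_inputs_with_special_tokens(input_ids)

The word_indices list would then drive the same label/<PAD> propagation as in the current preprocess.py code.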

I've also found that updating the transformers library to 4.25.1 solves the problem on my M1 Mac, but it may open up new problems on Linux.
