
Sub-tokenization with certain transformers  #150

@lfoppiano

Description

@pjox and I are working on a model trained with RoBERTa and the BPE tokenizer, in particular with zeldarose, which uses slightly different special tokens.

We have a problem when the data is tokenized.
In particular, the sub-tokenization produced by the tokenizer somehow gets messed up when is_split_into_words=True, with transformers version 4.15.0 (tokenizers version 0.10.3):

The code here (preprocess.py:304):

# sub-tokenization
encoded_result = self.tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=True,
            max_length=max_seq_length, truncation=True, return_offsets_mapping=True)

text_tokens = ['We', 'are', 'studying', 'the', 'material', 'La', '3', 'A', '2', 'Ge', '2', '(', 'A', '=', 'Ir', ',', 'Rh', ')', '.', 'The', 'critical', 'temperature', 'T', 'C', '=', '4', '.', '7', 'K', 'discovered', 'for', 'La', '3', 'Ir', '2', 'Ge', '2', 'in', 'this', 'work', 'is', 'by', 'about', '1', '.', '2', 'K', 'higher', 'than', 'that', 'found', 'for', 'La', '3', 'Rh', '2', 'Ge', '2', '.']

the output offsets are as follows: [(0, 0), (0, 2), (1, 3), (1, 8), (1, 3), (1, 8), (1, 2), (1, 1), (1, 1), (1, 1), (1, 2), (1, 1), (1, 1), (1, 1), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 1), (1, 3), (1, 8), (1, 11), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 10), (1, 3), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 4), (1, 4), (1, 2), (1, 2), (1, 5), (1, 1), (1, 1), (1, 1), (1, 1), (1, 6), (1, 4), (1, 4), (1, 5), (1, 3), (1, 2), (1, 1), (1, 2), (1, 1), (1, 2), (1, 1), (1, 1), (0, 0)]

The first two items are correct; from the third one on, the sequence gets messed up: the third should be (0, 3), then (0, 8), etc., and this then gets wrongly reconstructed by the delft code. When the offset pair does not start with 0, the code is unclear to me; I don't understand why it adds <PAD>:

                else:
                    # propagate the data to the new sub-token or 
                    # dummy/empty input for sub-tokens
                    label_ids.append("<PAD>")
                    chars_blocks.append(self.empty_char_vector)
                    # 2 possibilities, either empty features for sub-tokens or repeating the 
                    # feature vector of the prefix sub-token 
                    #feature_blocks.append(self.empty_features_vector)
                    feature_blocks.append(features_tokens[word_idx])
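
As a side note, the fast tokenizers also expose a word_ids() mapping that tells, for each sub-token, which original word it comes from (None for special tokens). A label-propagation step based on word_ids() instead of the offset pairs could look roughly like the sketch below; this is only an illustration assuming a standard RoBERTa fast tokenizer and a per-word labels list, not the current delft code, and whether word_ids() is affected by the same issue in transformers 4.15.0 would still need checking:

from transformers import AutoTokenizer

# illustrative tokenizer; RoBERTa needs add_prefix_space=True for pre-split input
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

encoded = tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=True,
                    max_length=512, truncation=True, return_offsets_mapping=True)

label_ids = []
previous_word_idx = None
for word_idx in encoded.word_ids(batch_index=0):
    if word_idx is None:
        # special tokens (<s>, </s>) get the dummy label
        label_ids.append("<PAD>")
    elif word_idx != previous_word_idx:
        # first sub-token of a word keeps the original label
        label_ids.append(labels[word_idx])
    else:
        # continuation sub-tokens get the dummy label, as in preprocess.py
        label_ids.append("<PAD>")
    previous_word_idx = word_idx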

If I instead pass the joined string and set is_split_into_words=False:

self.tokenizer("".join(text_tokens), add_special_tokens=True, is_split_into_words=False,             max_length=max_seq_length, truncation=True, return_offsets_mapping=True)

I obtain the correct result: [(0, 0), (0, 2), (2, 7), (7, 10), (10, 13), (13, 14), (14, 17), (17, 24), (24, 26), (26, 27), (27, 28), (28, 29), (29, 31), (31, 32), (32, 33), (33, 34), (34, 35), (35, 37), (37, 38), (38, 40), (40, 42), (42, 45), (45, 53), (53, 64), (64, 66), (66, 67), (67, 68), (68, 69), (69, 70), (70, 71), (71, 74), (74, 81), (81, 84), (84, 86), (86, 87), (87, 89), (89, 90), (90, 92), (92, 93), (93, 95), (95, 99), (99, 103), (103, 105), (105, 107), (107, 112), (112, 113), (113, 114), (114, 115), (115, 116), (116, 122), (122, 124), (124, 128), (128, 130), (130, 135), (135, 138), (138, 140), (140, 141), (141, 143), (143, 144), (144, 146), (146, 147), (147, 148), (0, 0)]
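
If we went down this road, we would need to map the character offsets of the joined string back to the original tokens. A rough sketch (assuming the tokens are joined without separators, as in the call above, and that its result is stored in a variable encoded_joined):

# character span of each original token in the joined string
boundaries = []
position = 0
for token in text_tokens:
    boundaries.append((position, position + len(token)))
    position += len(token)

# assign each sub-token to the original token containing its start offset
word_indices = []
for start, end in encoded_joined["offset_mapping"]:
    if (start, end) == (0, 0):
        # special tokens
        word_indices.append(None)
        continue
    word_indices.append(next(i for i, (s, e) in enumerate(boundaries) if s <= start < e))

One caveat: sub-tokens produced on the joined string can straddle two original tokens (e.g. the (2, 7) span above covers "are" plus the start of "studying"), so the labels would not always align cleanly.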

The option is_split_into_words was apparently thought only for input already split on spaces, which is not the case for most of our use cases.

There is an explanation here, but I did not understand it well: huggingface/transformers#8217
(in any case it works only with the Python tokenizers)

Probably, we should consider

self.tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=False, max_length=max_seq_length, truncation=True, return_offsets_mapping=True)

which will return a list of lists, one for each token:

[
    [(0, 0), (0, 2), (0, 0)], 
    [(0, 0), (0, 3), (0, 0)], 
    [(0, 0), (0, 8), (0, 0)], 
    [...]
]

and then, with some additional work, we should be able to reconstruct the output correctly.
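
A minimal sketch of that reconstruction, assuming the batched call above (dropping the per-word special tokens via their (0, 0) offsets is an assumption, not something tested against zeldarose models):

encoded_batch = self.tokenizer(text_tokens, add_special_tokens=True, is_split_into_words=False,
                               max_length=max_seq_length, truncation=True,
                               return_offsets_mapping=True)

input_ids = []
word_indices = []
for word_idx, (ids, offsets) in enumerate(zip(encoded_batch["input_ids"],
                                              encoded_batch["offset_mapping"])):
    for token_id, (start, end) in zip(ids, offsets):
        # drop the special tokens added around each individual word
        if (start, end) == (0, 0):
            continue
        input_ids.append(token_id)
        word_indices.append(word_idx)

# re-add the sequence-level special tokens around the flattened sub-tokens
input_ids = self.tokenizer.build_inputs_with_special_tokens(input_ids)

The word_indices list would then drive the same label/<PAD> propagation as in the current preprocess.py code.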

I've also found that updating the transformers library to 4.25.1 solves the problem on my M1 Mac, but it may open up new problems on Linux.
