Receiving the warning message 'Token indices are too long' even after validating doc length is under max sequence length #9277
Hi, I'm using the transformer pipeline for NER and I am trying to fine-tune it. I have texts that can be quite long, so I split them into chunks to ensure that the resulting documents are below the maximum sequence length of 512 when tokenised. However, when I pass the texts and annotations into an Example and update the model, I receive a warning about the token indices exceeding the max sequence length:

    Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 512). Running this sequence through the model will result in indexing errors

I have implemented a function that checks the number of tokens in the would-be document and raises a warning if the doc length is above the 512 max sequence length. No warning is thrown from this function, but I still receive the warning about the tokenisation.

from typing import List
import warnings

# Function to check the number of tokens in the would-be document
def check_text_size_batch(texts: List[str], model):
    for text in texts:
        doc = model.make_doc(text)
        if len(doc) > 512:
            warnings.warn('A document is over the max sequence length')

import spacy
from typing import List
from spacy.training import Example

# Function that creates a batch of Examples
def create_batch_examples(texts: List[str], annotations: List[dict], nlp: spacy.language.Language) -> List[Example]:
    batch = []
    for i in range(len(texts)):
        example = Example.from_dict(nlp.make_doc(texts[i]), {"entities": annotations[i]["entities"]})
        batch.append(example)
    return batch

import random
from spacy.util import minibatch

# Training loop
# epochs, training_data, batch_size, optimizer and hyperparameters are defined elsewhere
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in ["ner", "transformer"]]
with nlp.disable_pipes(*other_pipes):  # only train NER (and the transformer)
    for itn in range(epochs):
        random.shuffle(training_data)
        losses = {}
        batches = minibatch(training_data, size=batch_size)
        for batch in batches:
            texts, annotations = zip(*batch)
            check_text_size_batch(texts, nlp)
            examples = create_batch_examples(texts, annotations, nlp)
            nlp.update(examples, sgd=optimizer, drop=hyperparameters["dropout"], losses=losses)

Any help in understanding why this happens would be much appreciated. Is my logic wrong regarding using the make_doc function? I presumed the tokenisation that happens there is the same as the one in the actual pipeline, so I'm not really sure why the warning is thrown. Thanks.
Replies: 1 comment 4 replies
The short answer is that you can ignore this warning and you don't need to do anything to truncate or split your docs. The transformer component in spacy uses overlapping strided spans internally by default (see the settings in your config under [components.transformer.model.get_spans]) to be able to process longer texts.
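For reference, the span getter block in a default transformer config looks roughly like this (the window/stride values below are the usual defaults, but check your own config.cfg):

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96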
The spacy tokenization (word tokenization with len(doc)) is not the same as the internal transformer tokenization (BPE, wordpiece, etc.). Usually the transformer tokenization has more tokens, but not necessarily. If 128 spacy tokens correspond to more than the transformer max_length tokens (which is unusual but can happen with things like long URLs), then spacy-transformers ends up passing a span that is too long to the transformer, which is what triggers this warning. There's just no (good) way for spacy to suppress this warning from the tokenize step in the underlying transformers library.

Edited: Ah, wait. It's likely that you're only seeing this message because the spans are indeed too long. If it's happening once or twice in a large corpus you can probably ignore it, but if it's happening frequently, you should lower the window setting under [components.transformer.model.get_spans] so that shorter spans are passed to the transformer.