Receiving the warning message 'Token indices are too long' even after validating doc length is under max sequence length #9277
Hi, I'm using the transformer pipeline for NER and I am trying to fine-tune it. I have texts that can be quite long, so I split them into chunks to ensure that the resulting documents are below the maximum sequence length of 512 when tokenised. However, when I pass the texts and annotations into an Example and update the model, I receive a warning about the token indices exceeding the max sequence length:

    Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 512). Running this sequence through the model will result in indexing errors

I have implemented a function that checks the number of tokens in the would-be document and raises a warning if the doc length is above the 512 max sequence length. No warning is thrown from this function, but I still receive the warning about the tokenisation.

from typing import List
import warnings

# Function to check the number of tokens in the would-be document
def check_text_size_batch(texts: List[str], model):
    for text in texts:
        doc = model.make_doc(text)
        if len(doc) > 512:
            warnings.warn('A document is over the max sequence length')

import spacy
from typing import List
from spacy.training import Example

# Function that creates a batch of Examples
def create_batch_examples(texts: List[str], annotations: List[dict], nlp: spacy.language.Language) -> List[Example]:
    batch = []
    for i in range(len(texts)):
        example = Example.from_dict(nlp.make_doc(texts[i]), {"entities": annotations[i]["entities"]})
        batch.append(example)
    return batch

import random
from spacy.util import minibatch

# Training loop
# epochs, training_data, batch_size, optimizer and hyperparameters are defined elsewhere
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in ["ner", "transformer"]]
with nlp.disable_pipes(*other_pipes):  # only train NER (and the transformer)
    for itn in range(epochs):
        random.shuffle(training_data)
        losses = {}
        batches = minibatch(training_data, size=batch_size)
        for batch in batches:
            texts, annotations = zip(*batch)
            check_text_size_batch(texts, nlp)
            examples = create_batch_examples(texts, annotations, nlp)
            nlp.update(examples, sgd=optimizer, drop=hyperparameters["dropout"], losses=losses)

Any help in understanding why this happens would be much appreciated. Is my logic wrong regarding using the make_doc function? I presumed the tokenisation that happens there is the same as the one in the actual pipeline, so I'm not really sure why the warning is thrown. Thanks.
Replies: 1 comment 4 replies
The short answer is that you can ignore this warning and you don't need to do anything to truncate or split your docs. The transformer component in spacy uses overlapping strided spans internally by default (see the settings in your config under [components.transformer.model.get_spans]) to be able to process longer texts.
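For reference, the span getter block in a default transformer config looks roughly like this (the window/stride values below are the usual defaults, but check your own config.cfg):

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96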
The spacy tokenization (word tokenization with len(doc)) is not the same as the internal transformer tokenization (BPE, wordpiece, etc.). Usually the transformer tokenization has more tokens, but not necessarily. If 128 spacy tokens correspond to more than the transformer max_length tokens (which is unusual but can happen with things like long URLs), then spacy-transformers ends up passing a span that is too long to the transformer, which is what triggers this warning. There's just no (good) way for spacy to suppress this warning from the tokenize step in the underlying transformers library.

Edited: Ah, wait. It's likely that you're only seeing this message because the spans are indeed too long. If it's happening once or twice in a large corpus you can probably ignore it, but if it's happening frequently, you should lower the window setting under [components.transformer.model.get_spans] so that shorter spans are passed to the transformer.