It's probably not possible to 100% avoid this problem for random/gibberish input using spaCy's default tokenizer, because it can end up with very long single tokens that get split into many transformer wordpiece tokens.

In practice, with relatively sensible natural-language input, you can probably avoid most of these cases by removing the url_match tokenizer pattern and letting the tokenizer split long URLs on punctuation, which brings the spaCy tokenization closer to the wordpiece tokenization.
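
For a quick runtime check, a minimal sketch looks something like this (a blank English pipeline and a made-up URL, just for illustration):

```python
import spacy

nlp = spacy.blank("en")

# Drop the url_match pattern so long URLs are split on punctuation
# instead of being kept as single very long tokens.
nlp.tokenizer.url_match = None

doc = nlp("See https://example.com/a/very/long/path?q=1&page=2 for details")
print([t.text for t in doc])
```

With url_match removed, the prefix/suffix/infix punctuation rules take over and break the URL into smaller pieces.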

Docs on customizing the tokenizer for training: https://spacy.io/usage/training#custom-tokenizer
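
For training, roughly following those docs, one option is to register a tokenizer factory that wraps the default tokenizer and disables url_match, then point the config's `[nlp.tokenizer]` block at it. This is an untested sketch; the registered name `no_url_match_tokenizer.v1` is just a placeholder, and the file would be passed to `spacy train` via `--code`:

```python
import spacy
from spacy.util import registry

# Reference this in the training config as:
#   [nlp.tokenizer]
#   @tokenizers = "no_url_match_tokenizer.v1"
@registry.tokenizers("no_url_match_tokenizer.v1")
def create_no_url_match_tokenizer():
    def create_tokenizer(nlp):
        # Build the language's default tokenizer first...
        default_factory = registry.tokenizers.get("spacy.Tokenizer.v1")
        tokenizer = default_factory()(nlp)
        # ...then drop the url_match pattern so URLs split on punctuation.
        tokenizer.url_match = None
        return tokenizer
    return create_tokenizer
```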

But if this warning is rare, it is probably fine to ignore it. Since it only affects a singl…
