How does spaCy handle the transformer sequence limit? #8166
-
I like spaCy and use it for document-level entity recognition on documents that are 2,000-3,000 words long. In spaCy v3 we can choose a transformer as the pretrained model, but BERT-like models have a sequence limit of 512 tokens. So how can I use spaCy with my documents for entity recognition? I want to understand how spaCy overcomes this limit, and what options I have here.
-
spaCy supports a number of strategies for splitting up documents to fit in the transformer window. The quickstart uses strided spans with some overlap. You can read more about that strategy and other options in the spaCy Transformers docs.
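
In the quickstart-generated config, this strategy is selected via the `spacy-transformers.strided_spans.v1` span getter with `window` and `stride` settings. To make the idea concrete, here is a minimal sketch in plain Python of what strided spans do: cover a long `Doc` with fixed-size, overlapping windows so every token lands in at least one window. This is illustrative only, not spacy-transformers' actual implementation; the `strided_spans` helper and the example text are made up for the demo.

```python
# Sketch of the strided-span idea (not spacy-transformers' own code):
# fixed-size windows that advance by `stride` tokens, so consecutive
# windows overlap by (window - stride) tokens.
import spacy

def strided_spans(doc, window=128, stride=96):
    """Return overlapping spans of up to `window` tokens, advancing by `stride`."""
    spans = []
    start = 0
    while start < len(doc):
        spans.append(doc[start : start + window])  # Doc slicing clamps at the end
        if start + window >= len(doc):
            break
        start += stride
    return spans

nlp = spacy.blank("en")
doc = nlp("some long text " * 200)  # ~600 tokens, over a 512-token limit
for span in strided_spans(doc):
    print(span.start, span.end)  # note the 32-token overlap between windows
```

With the values sketched here, each window overlaps the next by 32 tokens, so the model still sees context on both sides of a window boundary when the per-window predictions are combined back into a single document.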