Regex matcher and spancat - cannot get it to train #12479
-
Hi everyone, I am experimenting with spancat's rule-based matching and wrote the code below to match a regex pattern on the word "Company" (including quotation marks) and to also capture the five preceding tokens:
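A minimal sketch of this kind of rule-based annotation (the exact pattern, the label name, and the "sc" spans key are assumptions here, since the original snippet is not reproduced in this thread):

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Match the token "Company" enclosed in straight quotation marks.
pattern = [{"TEXT": '"'}, {"TEXT": {"REGEX": "Company"}}, {"TEXT": '"'}]
matcher.add("DEFINED_TERM", [pattern])

doc = nlp('Acme Ltd, a company incorporated in England (the "Company"), agrees as follows.')
spans = []
for _, start, end in matcher(doc):
    # Extend each match to include the five preceding tokens.
    spans.append(Span(doc, max(0, start - 5), end, label="DEFINED_TERM"))
doc.spans["sc"] = spans
print(doc.spans["sc"])

Docs annotated this way and saved to a DocBin can then serve as spancat training data under whichever spans_key the pipeline is configured to use.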
This is then saved to a DocBin and gets me about 250 examples to work with (recognised by debug data). However, when I initialize training, it never actually starts (it does not show a single evaluation, and if it does after a very long time, everything is zero except for the high losses). I tweaked the suggesters to match the lengths that debug data showed me (10-15), but this did not change anything. Is there a specific component or annotating component I need to add? My config.cfg:
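The config.cfg itself is not reproduced in this thread; as a minimal sketch, the relevant part of such a config (assuming the ngram range suggester with the 10-15 lengths mentioned above, and the default "sc" spans key) could look like:

[components.spancat]
factory = "spancat"
spans_key = "sc"

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 10
max_size = 15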
Thanks!
-
Hey Kau832, the issue seems to be that in the code the spans are stored in …
-
Bump. Still trying to figure out why this does not work. I would normally use the sentence_suggester to train my data and it trained just fine (although I had severe memory issues and the computer would freeze regularly during the 4-hour training). Could it be that an ngram span suggester would simply be too much for my machine? I run this on: NVIDIA GeForce RTX 3060 Laptop GPU, 16 GB RAM, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz, 2304 MHz, 8 cores, 16 logical processors.
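For a rough sense of scale (simple arithmetic, not a measurement from this thread): an ngram range suggester proposes every contiguous span of each length between min_size and max_size, so a document of N tokens yields roughly (max_size - min_size + 1) * N candidate spans:

def ngram_range_count(n_tokens, min_size=10, max_size=15):
    # One candidate span per start position for each length k, i.e. n_tokens - k + 1.
    return sum(max(0, n_tokens - k + 1) for k in range(min_size, max_size + 1))

print(ngram_range_count(2500))  # 14931 candidate spans for a single 2500-token document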
-
Hey Kau832,

Getting memory errors is really annoying. The allocation error here is coming not from the GPU, but from the CPU ops, which I can see from the fact that the operation that fails to do the allocation is here: … The line that fails is this one: … I think this is coming from the … layer of … If I understand everything correctly, in one of the batches of documents in your collection the suggester seems to produce … Running

suggester = registry.get("misc", "spacy.ngram_range_suggester.v1")(min_size=10, max_size=15)

on a data set of documents with around 2000-2500 each, for 100 documents I've found 19736 spans. Is it possible that a very long document ends up being in a batch that causes the memory error?
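To spot such an outlier document, one option is to run the suggester over the training docs directly and look at the per-document counts; a sketch, assuming spaCy v3 and a DocBin on disk (the path and blank "en" pipeline are placeholders):

import spacy
from spacy import registry
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumption: adjust to your pipeline's language
docs = list(DocBin().from_disk("./train.spacy").get_docs(nlp.vocab))  # placeholder path

suggester = registry.get("misc", "spacy.ngram_range_suggester.v1")(min_size=10, max_size=15)
ragged = suggester(docs)  # one block of candidate spans per doc

for doc, n in zip(docs, ragged.lengths):
    print(f"{len(doc)} tokens -> {int(n)} suggested spans")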
-
Tried running the code you've provided. The first minor issue I encountered was that it is missing a ' in the line {"TEXT": 'Start}. Just nitpicking a bit, but for visibility for other users I'm mentioning here that it's always helpful to post code that runs beforehand, to make sure it's easier for us to provide help on the parts that you are actually interested in.

Nitpicking aside, when running the code it actually did not print anything, because the pattern did not match. If you print doc.spans after the line doc = nlp('This is a sentence. This is a test sentence written on 27th October 1984 (the "Start Date"). This is another sentence.') you will see the output: …

Before answering the rest of the questions I will focus on how to inspect what is in the DocBin:

import spacy
from spacy.tokens import DocBin

nlp = spacy.load(nlp_path)
docbin = DocBin().from_disk(data_path)
docs = list(docbin.get_docs(nlp.vocab))

Hope this will help you to inspect your data and to move forward with debugging.
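For example, once the docs are loaded, a quick way to check whether the annotations actually made it into the training data (a sketch; iterate over whatever spans keys your data uses):

for doc in docs:
    for key, group in doc.spans.items():
        # Print the key, how many spans it holds, and a few example span texts.
        print(key, len(group), [span.text for span in group][:3])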