Regex matcher and spancat - cannot get it to train #12479
-
Hi everyone, I am experimenting with spancat's rule-based matching and wrote the code below to match a regex pattern on the word "Company" (including quotation marks) and to also capture the five preceding tokens:
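A minimal sketch of this kind of rule-based annotation (the exact pattern, the label name, and the "sc" spans key are assumptions here, since the original snippet is not reproduced in this thread):

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Match the token "Company" enclosed in straight quotation marks.
pattern = [{"TEXT": '"'}, {"TEXT": {"REGEX": "Company"}}, {"TEXT": '"'}]
matcher.add("DEFINED_TERM", [pattern])

doc = nlp('Acme Ltd, a company incorporated in England (the "Company"), agrees as follows.')
spans = []
for _, start, end in matcher(doc):
    # Extend each match to include the five preceding tokens.
    spans.append(Span(doc, max(0, start - 5), end, label="DEFINED_TERM"))
doc.spans["sc"] = spans
print(doc.spans["sc"])

Docs annotated this way and saved to a DocBin can then serve as spancat training data under whichever spans_key the pipeline is configured to use.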
This is then saved to a DocBin and gets me about 250 examples to work with (recognised by debug data). However, when I initialize training, it never actually starts (it does not show a single evaluation, and if it does after a very long time, everything is zero except for the high losses). I tweaked the suggesters to match the lengths that debug data showed me (10-15), but this did not change anything. Is there a specific component or annotating component I need to add? My config.cfg:
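The config.cfg itself is not reproduced in this thread; as a minimal sketch, the relevant part of such a config (assuming the ngram range suggester with the 10-15 lengths mentioned above, and the default "sc" spans key) could look like:

[components.spancat]
factory = "spancat"
spans_key = "sc"

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 10
max_size = 15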
Thanks!
-
Hey Kau832, the issue seems to be that in the code the spans are stored in …
-
Bump. Still trying to figure out why this does not work. I would normally use the sentence_suggester to train my data and it trained just fine (although I had severe memory issues and the computer would freeze regularly during the 4-hour training). Could it be that an ngram span suggester would simply be too much for my machine? I run this on: NVIDIA GeForce RTX 3060 Laptop GPU, 16 GB RAM, 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz, 2304 MHz, 8 cores, 16 logical processors.
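For a rough sense of scale (simple arithmetic, not a measurement from this thread): an ngram range suggester proposes every contiguous span of each length between min_size and max_size, so a document of N tokens yields roughly (max_size - min_size + 1) * N candidate spans:

def ngram_range_count(n_tokens, min_size=10, max_size=15):
    # One candidate span per start position for each length k, i.e. n_tokens - k + 1.
    return sum(max(0, n_tokens - k + 1) for k in range(min_size, max_size + 1))

print(ngram_range_count(2500))  # 14931 candidate spans for a single 2500-token document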
-
Hey Kau832,

Getting memory errors is really annoying. The allocation error here is coming not from the GPU, but from the CPU ops, which I can see from the fact that the operation that fails to do the allocation is here: … The line that fails is this one: … I think this is coming from the … layer of … If I understand everything correctly, in one of the batches of documents in your collection the suggester seems to produce … Running

suggester = registry.get("misc", "spacy.ngram_range_suggester.v1")(min_size=10, max_size=15)

on a data set of documents with around 2000-2500 each, for 100 documents I've found 19736 spans. Is it possible that a very long document ends up being in a batch that causes the memory error?
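To spot such an outlier document, one option is to run the suggester over the training docs directly and look at the per-document counts; a sketch, assuming spaCy v3 and a DocBin on disk (the path and blank "en" pipeline are placeholders):

import spacy
from spacy import registry
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # assumption: adjust to your pipeline's language
docs = list(DocBin().from_disk("./train.spacy").get_docs(nlp.vocab))  # placeholder path

suggester = registry.get("misc", "spacy.ngram_range_suggester.v1")(min_size=10, max_size=15)
ragged = suggester(docs)  # one block of candidate spans per doc

for doc, n in zip(docs, ragged.lengths):
    print(f"{len(doc)} tokens -> {int(n)} suggested spans")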
-
Tried running the code you've provided. The first minor issue I encountered was that it is missing a ' in the line {"TEXT": 'Start}. Just nitpicking a bit, but for visibility for other users I'm mentioning here that it's always helpful to post code that runs beforehand, to make sure it's easier for us to provide help on the parts that you are actually interested in.

Nitpicking aside, when running the code it actually did not print anything, because the pattern did not match. If you print doc.spans after the line doc = nlp('This is a sentence. This is a test sentence written on 27th October 1984 (the "Start Date"). This is another sentence.') you will see the output: …

Before answering the rest of the questions I will focus on how to inspect what is in the DocBin:

import spacy
from spacy.tokens import DocBin

nlp = spacy.load(nlp_path)
docbin = DocBin().from_disk(data_path)
docs = list(docbin.get_docs(nlp.vocab))

Hope this will help you to inspect your data and to move forward with debugging.
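For example, once the docs are loaded, a quick way to check whether the annotations actually made it into the training data (a sketch; iterate over whatever spans keys your data uses):

for doc in docs:
    for key, group in doc.spans.items():
        # Print the key, how many spans it holds, and a few example span texts.
        print(key, len(group), [span.text for span in group][:3])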