Spacy "spancat" not training (possible config mistake) #10562

jussikuusisto · 2022-03-28T11:46:15Z

Hi,

Disclaimer:
Apologies if something similar has already been covered. I did look through the spancat category and googled similar issues, but didn't find anything specifically relevant. Also, I'm pretty sure the issue is just a mistake in my config, but after looking at this for several days I'm getting pretty blind to it, hence this post.

Situation:
I am trying to utilise Span Categorization from Spacy to train a model to categorise parts of emails as the signature. I've set up the config file using the fill-config option. My dataset has not been produced with Prodigy. My OS is Windows 10 and I'm using PyCharm terminal to run spacy train.

Issue:
When I attempt training the model, spacy train just stops after the initialization and the start of the actual training without producing any output or throwing errors.

Data:
The data I'm using is saved with DocBin.to_disk, and each doc in those files has its spans set like this:

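from spacy.tokens import Span, SpanGroup

# doc, signature and message are created earlier (not shown here)
span1 = Span(doc, signature.start, signature.end, label="SIGNATURE")
span2 = Span(doc, message.start, message.end, label="MESSAGE")
spans = SpanGroup(doc, name="sc", spans=[span1, span2])
doc.spans["sc"] = spans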

where signature would be the span from doc containing the signature and message is the span containing the rest of the message. I think there's some redundancy in how I've set it up above, but I have attempted this in a number of ways...

EDIT: I didn't make an Entity Ruler and then save the entities to doc.spans. Is this a necessary step? Are my spans perhaps missing something and that's why they're not being used in the training?

Config:
Below is the config I've used. I have attempted to change a few things here, but nothing so far has actually helped. I suspect the issue is either something very simple in the config or a problem with how I save the data, but I just can't figure out exactly what. I'm also unsure whether I can use the [components.spancat.suggester] sizes the way I have, but I've also tried it with just [1, 2, 3, 4, 5] and the outcome is identical. Also note: I'm not using a GPU.

[paths]
train = "corpus/train.spacy"
dev = "corpus/valid.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2000,1000,1000]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = "en_core_web_lg"
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Output:
As mentioned, no output files are produced, but here's the printout after running:
python -m spacy train .\config\signatures_test.cfg --output .\output\

2022-03-28 12:17:12.127184: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-28 12:17:12.127423: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
ℹ Saving to output directory: output
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-03-28 12:17:17,630] [INFO] Set up nlp object from config
[2022-03-28 12:17:17,642] [INFO] Pipeline: ['spancat']
[2022-03-28 12:17:17,645] [INFO] Created vocabulary
[2022-03-28 12:17:19,203] [INFO] Added vectors: en_core_web_lg
[2022-03-28 12:17:20,805] [INFO] Finished initializing nlp object
[2022-03-28 12:17:22,065] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  ------------  ----------  ----------  ----------  ------

After which it just returns me to command line.

Spacy Info:

============================== Info about spaCy ==============================

spaCy version    3.2.2                                                                                          
Location         C:\Users\Jussi Kuusisto\PycharmProjects\python_environment\venv\lib\site-packages\spacy        
Platform         Windows-10-10.0.19041-SP0                                                                      
Python version   3.8.6                                                                                          
Pipelines        en_core_web_lg (3.2.0), en_core_web_md (3.2.0), en_core_web_sm (3.2.0), en_core_web_trf (3.2.0)

Afterword:
Hopefully that's sufficient level of detail. I appreciate any help with this, and indeed apologise if this is obvious, in the wrong place, already covered etc. Thank you!

Answered by thomashacker

thomashacker · 2022-03-30T10:33:54Z

Hello 😄 Thanks for the detailed description!
At first glance, I don't see anything wrong with the config so my first guess is that either something is wrong with the data or you're running out of memory.

If it fails/hangs silently, one thing that you could do to further debug this is to run the training repeatedly and kill it with ctrl+c. The traceback will show what it was doing. If you do that a few times and it's always in the same place you can be confident that's where it's spending time processing.

I also saw that your batch_size in [nlp] is set to 1000. Have you tried lowering that number?
And to spare you the many sizes in the suggester, you can also use this:

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 699
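
As a rough illustration of why memory could be a factor here (a back-of-the-envelope sketch, not spaCy's actual code): an n-gram suggester proposes every contiguous span of each configured size, so a doc with n tokens yields n - k + 1 candidates for size k. The helper name `count_ngram_candidates` and the example doc lengths are made up for illustration.

```python
def count_ngram_candidates(n_tokens: int, min_size: int, max_size: int) -> int:
    """Count the spans an n-gram suggester would propose for a single doc.

    For each size k there are n_tokens - k + 1 contiguous spans,
    or none at all if the doc is shorter than k tokens.
    """
    total = 0
    for size in range(min_size, max_size + 1):
        total += max(0, n_tokens - size + 1)
    return total


# Hypothetical doc lengths (in tokens), comparing sizes 1..699 with sizes 2..15
for n_tokens in (50, 200, 700):
    wide = count_ngram_candidates(n_tokens, 1, 699)
    narrow = count_ngram_candidates(n_tokens, 2, 15)
    print(n_tokens, wide, narrow)
# prints:
# 50 1275 595
# 200 20100 2695
# 700 245349 9695
```

With sizes 1 to 699, a single 700-token email already gives roughly 245k candidate spans that each need to be scored, so long docs combined with a large batch could plausibly exhaust memory.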

You could also enable the progress_bar of the logger and see if it's printed out before it fails.

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

12 replies

thomashacker Apr 1, 2022

Ah, it's good to hear that you found the problem! However, I'd like to hear more about your use case, if you're ok to share some details. Because the way I understand it is that you want to separate email content from signatures, right? Have you tried to do this rule-based?
I don't think spancat will perform well with your current approach, since its predictions are based on context and signatures don't really have any. Additionally, such long spans make accurate predictions even harder... Instead of predicting whole spans, I'd suggest finding the point where the message ends and the signature starts (e.g. after "Sincerely yours").
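
For what it's worth, a rule-based version of that idea could look something like the sketch below. The phrase list and the `split_signature` helper are hypothetical, and it simply splits at the last sign-off phrase it finds, so it will miss any sign-off that isn't on the list.

```python
import re

# Hypothetical sign-off phrases; a real list would need to be much longer.
SIGN_OFFS = ["sincerely yours", "kind regards", "best regards", "best", "thanks"]

# Match any phrase as a whole word, case-insensitively, with optional punctuation.
_SIGN_OFF_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in SIGN_OFFS) + r")\b[,.!]?",
    re.IGNORECASE,
)


def split_signature(text):
    """Split an email at its last sign-off phrase.

    Returns (body, signature); signature is "" if no phrase was found.
    """
    matches = list(_SIGN_OFF_RE.finditer(text))
    if not matches:
        return text, ""
    start = matches[-1].start()
    return text[:start].rstrip(), text[start:].strip()


body, signature = split_signature(
    "Hi, that would work well for our purposes! Kind Regards, Jussi Kuusisto"
)
print(body)       # Hi, that would work well for our purposes!
print(signature)  # Kind Regards, Jussi Kuusisto
```

Taking the last match is a deliberate simplification: it avoids splitting on an early "thanks" in the body, but it will go wrong if a sign-off word also appears in text after the signature.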

jussikuusisto Apr 1, 2022
Author

I think you are right that Spancat isn't ideal for something like this, but I wanted to test it out.

"the way I understand it is that you want to separate email content from signatures" That's exactly right.

I already have a rule-based system in place. The difficulty with it is that people rarely sign off in the same way, and it's hard to account for every variation, so I wanted to explore an ML solution. I was already using Spacy for email categorization (which works wonderfully, even without a GPU), so I thought I'd check whether I could use it for this kind of task too. I shall see what the output is like, and if it's unhelpful, I will probably just use the tokens and try some straightforward ML approach (from sklearn or similar) or return to improving my rule-based signature detector. :)

EDIT: In fact, the email signature detection is very much a part of the cleaning process for the email categorization project I'm working on.

jussikuusisto Apr 4, 2022
Author

Short "update":
This seems to sort of work if I'm only looking for very contained and specific "sign-offs" or signatures. An example:

>>> doc = model("Hi, that would work well for our purposes! Kind Regards, Jussi Kuusisto")
>>> doc.spans["sc"]
[Kind Regards, Jussi Kuusisto]
>>> doc.spans["sc"][0].label_
'SIGNATURE'
>>> doc.spans["sc"].attrs
{'scores': array([0.9709481], dtype=float32)}

I still need to figure out whether this is usable in the long run, but I did eventually get it to work by drastically shortening the spans I was looking for. I also dropped the label for the body of the message itself, as I only wanted to test how this might work for detecting where signatures start so they can be cleaned from the messages. The scores during training are below 0.5, but that's to be expected, as my training set is still quite small and it's more of a proof of concept at this stage.

Thanks for all your helpful suggestions!

thomashacker Apr 4, 2022

Oh wow, that's great to see that it seems to work so well! You should definitely keep experimenting. With the spancat still being a relatively new component, I'm super interested to see these kinds of use-cases, so feel free to share your findings in the discussion. 😄

jussikuusisto Apr 5, 2022
Author

Update:

The good news:
I guess you could call this a somewhat successful "proof of concept". I used about five times as many examples as in the previous test (still below 500 in total, though), and the model has started to recognise the patterns even with words it hasn't seen in the examples:

>>> doc = model("Hi, this is exactly what we need! Grand. Jussi Data Scientist")
>>> doc.spans["sc"]
[Jussi Data Scientist, Grand. Jussi Data Scientist]
>>> doc.spans["sc"].attrs
{'scores': array([0.9903593, 0.8450801], dtype=float32)}

The model has not seen data with either my name, or the use of the word "Grand." as a sign off or job title "Data Scientist" prior to this. It has seen examples of regular sign-offs ("Kind Regards," "Best," "thanks," etc.) followed by either names or job titles or both. I'm using actual emails as examples and extracting the signatures manually first to produce the dataset.
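
Since the output above contains two overlapping spans, one possible way to pick a single signature is to take the highest-scoring one. This is only a sketch: `best_span` is a made-up helper, and it assumes the scores sit in `doc.spans["sc"].attrs["scores"]` in the same order as the spans, as in the printout above.

```python
def best_span(doc, spans_key="sc", min_score=0.5):
    """Return (span, score) for the highest-scoring span, or None.

    Assumes doc.spans[spans_key].attrs["scores"] is aligned with the spans.
    """
    if spans_key not in doc.spans:
        return None
    group = doc.spans[spans_key]
    scores = group.attrs.get("scores", [])
    # Pair every span with its score and drop anything below the cut-off
    scored = [
        (span, float(score))
        for span, score in zip(group, scores)
        if score >= min_score
    ]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])
```

Called as `best_span(doc)` on the example above, this would be expected to return the shorter "Jussi Data Scientist" span, since it has the higher score.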

I gave the "suggester" a mandate to look for spans of 2 to 15 tokens, and I try to keep the signature examples I'm using under 15 tokens long. Also, the example data now only contains spans labeled "SIGNATURE" and no "MESSAGE" spans anymore, to make things simpler.
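
One thing worth checking with a setup like this: an n-gram suggester only proposes spans of the configured sizes, so any annotated span shorter than 2 or longer than 15 tokens can never be predicted. A minimal sketch of such a check, assuming the annotations are available as (start, end) token offsets; the helper name `unreachable_spans` and the sample offsets are hypothetical.

```python
def unreachable_spans(spans, min_size=2, max_size=15):
    """Return the (start, end) token spans the suggester could never propose.

    A span's length is end - start, with end exclusive as in spaCy.
    """
    return [
        (start, end)
        for start, end in spans
        if not (min_size <= end - start <= max_size)
    ]


# Hypothetical annotations: a 3-token, a 20-token and a 1-token span
annotated = [(10, 13), (40, 60), (5, 6)]
print(unreachable_spans(annotated))  # [(40, 60), (5, 6)]
```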

The bad news is that this seems to only work if the email ends with the signature. If for instance I add "Please do not print this email" at the end of that example, the model does not recognise the signature span as a signature. Adding a company name, however, does not confuse it too much, even if it's one the model has not seen before during training. I think this, too, can be mitigated with enough examples containing all sorts of disclaimers after the email signature. The dataset is still very small.

I'm not entirely sure what it is actually looking for. It could just be that the model is looking for capitalized NEs at the end of the sign-off + name combo and anything else will confuse it.

While training, the highest scores are now topping out at ~0.65. That's an improvement on the previous test, so I think more data is still the next step in assessing the viability of this method for detecting signatures.
