Problems migrating from 2.2.4 to Current Version - Misalignment and Training Loop #11303

Zagsss · 2022-08-14T19:30:04Z

Hello

My project runs on spaCy 2.2.4 and uses an NER model trained on tagged data with a custom training loop.
I am now updating it to the current version of spaCy.
The training data contains 300 documents, some as big as 300kb.
My results range from 75% to 85%.

Problem 1 - Misalignment warnings
In 2.2.4 I had no problems tagging words that end in a comma or period.
Example
"The deputy of this account is James West."
"The deputy of this account is [James West]."
Is there any configuration I should change to fix this? Or should I tag the period inside the span?

Problem 2 - Training Loop
The training loop in 2.2.4 for me, uses about 150 hours of training with incremental loop of 50 interactions.
So, the new one, using spacy file or custom loop, only take a couple of minutes?
(I can't test the results to compare right now because of problem 1: my data is misaligned.)

Thanks in advance

polm · 2022-08-15T04:16:49Z

It's not surprising that you might run into some changes in tokenization between v2 and v3, but periods at the end of a sentence after normal-looking words shouldn't be a general problem - the latest tokenizer doesn't attach the period to "West" for me with your example sentence. Could you check your data again to be sure that's the problem?
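One way to check this is to print the tokens and convert the character offsets to BILUO tags. This is only a minimal sketch: it assumes a blank English pipeline, and the "PERSON" label and the cut-short annotation are made up for illustration. Tokens that an annotation only partly covers come back tagged "-", and spaCy emits its misalignment warning for them.

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
text = "The deputy of this account is James West."
doc = nlp.make_doc(text)
tokens = [token.text for token in doc]

start = text.index("James West")
# Annotation that ends exactly at the end of "West" (period excluded).
aligned = [(start, start + len("James West"), "PERSON")]
# Hypothetical bad annotation that stops in the middle of "West".
cut_short = [(start, start + len("James We"), "PERSON")]

aligned_tags = offsets_to_biluo_tags(doc, aligned)
cut_short_tags = offsets_to_biluo_tags(doc, cut_short)  # triggers the warning

print(tokens)
print(aligned_tags)
print(cut_short_tags)
```

If the "-" tags only show up for a handful of annotations, it is probably easier to correct those offsets by hand than to change any settings.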

If you have a lot of alignment errors, and it's safe to handle them in a uniform way, you can use the alignment_mode setting to always expand (include half-covered tokens) or contract (exclude half-covered tokens).
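For reference, alignment_mode is an argument of Doc.char_span. The sketch below shows the three modes on an offset pair that deliberately ends mid-token. The "PERSON" label and the bad offsets are assumptions for illustration, not taken from your data.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only
text = "The deputy of this account is James West."
doc = nlp.make_doc(text)

start = text.index("James")
end = start + len("James We")  # deliberately ends inside the token "West"

# Default mode is "strict": offsets that do not match token boundaries give None.
strict = doc.char_span(start, end, label="PERSON")
# "expand" grows the span to include the half-covered token.
expanded = doc.char_span(start, end, label="PERSON", alignment_mode="expand")
# "contract" shrinks the span to the fully covered tokens only.
contracted = doc.char_span(start, end, label="PERSON", alignment_mode="contract")

print(strict)           # None
print(expanded.text)    # James West
print(contracted.text)  # James
```

Which mode is safe depends on how the offsets went wrong, so it is worth spot-checking a sample of the adjusted spans before applying one mode to everything.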

> The training data contains 300 documents, some as big as 300kb

NER doesn't benefit from context beyond roughly a paragraph, and working with very long documents can cause a variety of other issues, so it would probably be easier to work with your data if you could break your documents up on paragraph or page boundaries.
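A rough sketch of that kind of splitting in plain Python, assuming the annotations are (start, end, label) character offsets and that paragraphs are separated by blank lines. The function name split_on_paragraphs and the sample text are hypothetical. Annotations that cross a paragraph boundary are simply dropped here, which may not be what you want.

```python
import re

def split_on_paragraphs(text, entities):
    """Split text on blank lines and re-base entity offsets per paragraph."""
    chunks = []
    pos = 0
    # Separator spans, plus a sentinel at the end of the text.
    boundaries = [m.span() for m in re.finditer(r"\n\s*\n", text)]
    boundaries.append((len(text), len(text)))
    for sep_start, sep_end in boundaries:
        paragraph = text[pos:sep_start]
        if paragraph.strip():
            # Keep only entities that sit fully inside this paragraph.
            local = [
                (s - pos, e - pos, label)
                for s, e, label in entities
                if s >= pos and e <= sep_start
            ]
            chunks.append((paragraph, local))
        pos = sep_end
    return chunks

text = "Account summary.\n\nThe deputy of this account is James West."
start = text.index("James West")
entities = [(start, start + len("James West"), "PERSON")]

chunks = split_on_paragraphs(text, entities)
for paragraph, local in chunks:
    print(repr(paragraph), local)
```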

> The training loop in 2.2.4 for me, uses about 150 hours of training with incremental loop of 50 interactions.
> So, the new one, using spacy file or custom loop, only take a couple of minutes?

Sorry, I don't understand. Is an "interaction" an iteration or something else? If the new training only takes a couple of minutes, that sounds like something is wrong and your data is being skipped. As you mention, you should probably fix the alignment issues first.
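To see how much of the data actually survives conversion, you could count kept and skipped annotations while building the DocBin. This is a sketch under assumptions: the samples list below is made up and stands in for v2-style (text, [(start, end, label)]) training data.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only

# Hypothetical samples; the second annotation ends mid-token on purpose.
samples = [
    ("The deputy of this account is James West.", [(30, 40, "PERSON")]),
    ("The deputy of this account is James West.", [(30, 38, "PERSON")]),
]

doc_bin = DocBin()
kept = 0
skipped = 0
for text, annotations in samples:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)  # None if misaligned
        if span is None:
            skipped += 1
        else:
            spans.append(span)
            kept += 1
    doc.ents = spans
    doc_bin.add(doc)

docs = list(doc_bin.get_docs(nlp.vocab))
print(f"docs: {len(docs)}, entities kept: {kept}, skipped: {skipped}")
```

If the skipped count is large compared with the kept count, the model is seeing very few entities, which would fit with training finishing suspiciously fast.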
