spacy debug data -V question #10352

kanayer · 2022-02-22T10:55:10Z

kanayer
Feb 22, 2022

Hello, could you please let me know whether it is normal for the number of deprel tags reported by spacy debug data -V to differ from the number a user gets when counting them manually? For example, debug data reports 12 mark deprel tags in the training set, while the Sublime editor and my own code both counted 38. The same happens for the other tags. Is this expected, or is it a mistake on my side?

Also, if possible, could you please explain why a Doc object groups several sentences from the training set together?

Answered by adrianeboyd

Feb 25, 2022


adrianeboyd · 2022-02-25T09:04:26Z

adrianeboyd
Feb 25, 2022

debug data counts the labels for the projectivized, aligned trees, so if there are a lot of misaligned tokens or non-projective trees (-V also gives counts for this), the counts can look different. If you have a clear case where you think there's a bug in the counts, you can attach it here and we can double-check.

If each training doc contains only one sentence, then the parser does not learn to split sentences. Since a lot of training corpora provide annotation in sentences rather than longer documents, we recommend grouping them into paragraph-sized chunks for training. If you have the details for your training corpus, it's probably even better to create real paragraphs rather than fake paragraphs in a custom conversion, but the details are usually corpus-specific, so spacy convert currently just supports the -n option for most formats.
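The grouping idea can be illustrated with a short sketch in plain Python. This is only a conceptual illustration of "fake paragraphs" made from a fixed number of consecutive sentences; it is not the converter's actual code, and `group_sentences` and `n_sents` are hypothetical names. In real training data the sentence boundaries inside each chunk are kept as annotation, which is what lets the parser learn to split sentences.

```python
from itertools import islice


def group_sentences(sentences, n_sents):
    """Yield lists of up to n_sents consecutive sentences."""
    if n_sents < 1:
        raise ValueError("n_sents must be at least 1")
    iterator = iter(sentences)
    while True:
        chunk = list(islice(iterator, n_sents))
        if not chunk:
            return
        yield chunk


# Hypothetical corpus with one sentence per entry.
corpus_sentences = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
    "Fourth sentence.",
    "Fifth sentence.",
]

# Each chunk stands in for one training doc containing several sentences.
chunked_docs = [" ".join(chunk) for chunk in group_sentences(corpus_sentences, 2)]

for doc_text in chunked_docs:
    print(doc_text)
```

The last chunk may be shorter than the others when the number of sentences is not a multiple of the chunk size.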

0 replies
