Training data for spancat #11351
Wanted to try out the span categorizer for a text dataset with overlapping entities, but I can't find anywhere what the training data format looks like. Can someone possibly share a training.jsonl file to ease this process?
Hi @RandomArnab,

In spaCy v3, we use the serialized `.spacy` files instead of JSONL. To prepare the training data for span categorization, you need to assign the entities in the `doc.spans` attribute. Here's an example for the sentence "Welcome to the Bank of China.":

```python
import spacy
from spacy.tokens import Span

text = "Welcome to the Bank of China."
nlp = spacy.blank("en")
doc = nlp(text)

# Overlapping spans live in a named span group; "sc" is the default
# key that the spancat component reads its annotations from.
doc.spans["sc"] = [
    Span(doc, 3, 6, "ORG"),  # "Bank of China"
    Span(doc, 5, 6, "GPE"),  # "China"
]
```

To serialize them into a `.spacy` file, you need to collect them inside a `DocBin` object and call the `to_disk()` method. Something like this:
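A minimal sketch of that step, assuming a single annotated doc and `./train.spacy` as the output path (the file name is just a placeholder):

```python
from spacy.tokens import DocBin

# Collect the annotated docs in a DocBin and write them to disk.
# Note: doc.spans is serialized by DocBin in spaCy v3.1 and later.
doc_bin = DocBin(docs=[doc])
doc_bin.to_disk("./train.spacy")
```

You can then point your training config at the resulting file, e.g. via `--paths.train ./train.spacy` when running `spacy train`.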
You can check some of our spaCy projects to better understand this process. We have one for overlapping spans.