Training SpanCat without Prodigy? #12124

SharbelWired · 2023-01-19T00:27:39Z

Hi everyone, I am trying to train a spancat model manually using the command-line spacy train workflow. When I use Prodigy, I can label my text samples, highlight spans, and train via Prodigy's training integrations with spaCy. When I train a custom spancat model this way, everything works fine.

When I tried to use annotations created in another labelling app (e.g. Label Studio), I first exported from Label Studio in the CoNLL-2003 format (following the docs' suggestion: https://labelstud.io/guide/export.html#spaCy) and ran spacy convert. This did not work well for me, since it created a single document instead of the 300+ that are actually there. So instead, I figured I would just import the raw JSON file from Label Studio, iterate over the documents, and manually create the DocBin using the start/end offsets for each label in the file.

I am using the following function, which takes in a JSON dict and attempts to create a new DocBin with the spans and their corresponding labels attached to each doc.

When I train on this, the scores are basically 0, although I sometimes get marginal results. Again, when I train with Prodigy on far fewer samples (e.g. only 10!), the scores come back as expected. I must be doing something wrong when rebuilding the DocBin manually. Is the code below essentially what is needed to create a valid DocBin for spancat training?

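import spacy
from spacy.tokens import DocBin, SpanGroup


def create_spacy_file(json_file, filename):
    """Build a DocBin with one Doc per Label Studio item and write it to disk."""
    db = DocBin()
    nlp = spacy.blank('en')
    for item in json_file:
        item_text = item["text"]
        doc = nlp(item_text)

        spans = []

        for annotation in item["lbl"]:
            start = annotation["start"]
            end = annotation["end"]
            label = annotation["labels"][0]
            # char_span returns None when the offsets do not fall on token boundaries
            span = doc.char_span(start, end, label=label)

            if span is not None:
                spans.append(span)

        # spancat reads its training spans from doc.spans["sc"] (spans_key in the config)
        group = SpanGroup(doc, name="sc", spans=spans)
        # Note: set_ents raises an error if any of the spans overlap
        doc.set_ents(spans)
        doc.spans["sc"] = group
        db.add(doc)
    db.to_disk(filename)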

This is the config that I am using. I basically used the quickstart, ran the fill command afterwards, and left things at their defaults for the most part:

[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

adrianeboyd · 2023-01-19T10:27:15Z

I'd recommend double-checking all the else cases here:

            if span is not None:
                spans.append(span)

If there's something wrong with the character offsets coming from your original annotation, it's possible that no spans are being added to the training docs. You can also use spacy debug data to check the details for the spans in your training data.
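A minimal sketch of how the else case could be inspected, assuming the same annotation layout as the original post ("start", "end", "labels"). The helper name find_misaligned and the sample text are hypothetical. It collects every annotation for which doc.char_span returns None, together with the raw text slice, so stray leading or trailing whitespace is easy to spot:

```python
import spacy


def find_misaligned(text, annotations, nlp):
    """Return (start, end, raw_slice) for annotations that char_span rejects."""
    doc = nlp(text)
    problems = []
    for ann in annotations:
        span = doc.char_span(ann["start"], ann["end"], label=ann["labels"][0])
        if span is None:
            # The offsets do not line up with token boundaries; keep the raw slice
            problems.append((ann["start"], ann["end"], text[ann["start"]:ann["end"]]))
    return problems


nlp = spacy.blank("en")
text = "Acme Corp hired Bob"
annotations = [
    {"start": 0, "end": 9, "labels": ["ORG"]},    # exactly "Acme Corp"
    {"start": 15, "end": 19, "labels": ["PER"]},  # begins on the space before "Bob"
]
for start, end, raw in find_misaligned(text, annotations, nlp):
    print(start, end, repr(raw))
```

Doc.char_span also accepts an alignment_mode argument ("strict", "contract" or "expand") that can snap offsets to token boundaries, but it seems safer to look at the mismatches first and decide whether the offsets or the text need fixing.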

If the spans are being converted correctly, the next thing to check is how many of the annotated spans are covered by the suggester, which suggests 1-3-grams by default. If you have longer spans, this default wouldn't be suitable for your task.
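As a rough illustration of that coverage check, the sketch below computes the share of annotated spans whose token length is one of the suggester's sizes. The function name suggester_coverage and the sample lengths are made up for the example; in practice the lengths could be collected with something like len(span) for each span in doc.spans["sc"]:

```python
from collections import Counter


def suggester_coverage(span_lengths, sizes):
    """Fraction of annotated spans whose token length is one of `sizes`."""
    if not span_lengths:
        return 0.0
    allowed = set(sizes)
    covered = sum(1 for length in span_lengths if length in allowed)
    return covered / len(span_lengths)


# Hypothetical token lengths of the annotated spans in a training set
lengths = [1, 2, 2, 3, 5, 7, 7, 8]
print(Counter(lengths))
print(suggester_coverage(lengths, sizes=[1, 2, 3]))    # 0.5
print(suggester_coverage(lengths, sizes=range(1, 9)))  # 1.0
```

A span that the suggester never proposes cannot be predicted, so a low coverage value puts a ceiling on recall regardless of how well the model trains.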

4 replies

SharbelWired Jan 19, 2023
Author

Thanks @adrianeboyd, your thinking was spot on and your second point unlocked the answer for me, so thank you! I was actually able to figure out your first point, about the spans being set to None, by realizing that the indexes emitted by Label Studio did not jibe with what spaCy was expecting. Bottom line, it's a text pre-processing issue: the text samples contained a lot of repeated spaces, and I read that those need to be removed.

I started writing some regular expressions to clean things up, but thought I should first revisit the convert command, since I originally went down this rabbit hole because Label Studio's CoNLL-2003 export file doesn't load directly into spaCy the way LS's docs indicate. In general, the LS docs say that you need to add an additional O to the first line of the CoNLL file. However, keeping that first line, which is the DocStart line, prompts the convert command to treat the CoNLL file as a single document. When I removed the first line of the CoNLL file (the DocStart line) and re-ran the convert command, it created a DocBin with the proper number of Docs. For those using Label Studio, this was the key to converting the CoNLL export correctly into a valid DocBin. From there, the code that I originally shared became much smaller, because it was just a matter of copying each doc's entities to the spancat key, like this:

for doc in connl_docs:
    doc.spans["sc"] = doc.ents
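For readers who want the loop above in a self-contained form, here is a sketch that reads docs from a DocBin, copies doc.ents into doc.spans["sc"], and collects the result in a new DocBin. The function name copy_ents_to_spans and the tiny in-memory example are hypothetical stand-ins for the real output of spacy convert:

```python
import spacy
from spacy.tokens import DocBin, Span


def copy_ents_to_spans(doc_bin, vocab, spans_key="sc"):
    """Return a new DocBin whose docs carry their entities under doc.spans[spans_key]."""
    out = DocBin()
    for doc in doc_bin.get_docs(vocab):
        # doc.ents is a tuple of Span objects; assigning a list creates a SpanGroup
        doc.spans[spans_key] = list(doc.ents)
        out.add(doc)
    return out


# Tiny in-memory stand-in for a converted CoNLL export (hypothetical data)
nlp = spacy.blank("en")
doc = nlp("Acme Corp hired Bob")
doc.ents = [Span(doc, 0, 2, label="ORG"), Span(doc, 3, 4, label="PER")]

converted = copy_ents_to_spans(DocBin(docs=[doc]), nlp.vocab)
result = next(iter(converted.get_docs(nlp.vocab)))
print([(span.text, span.label_) for span in result.spans["sc"]])
```

With files on disk, the input could be loaded with DocBin().from_disk(path) and the result written with to_disk(path) before splitting into train/dev sets.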

This got me to a point where I had a valid DocBin, and I was able to split out my train/dev/test datasets and so on. I was super geeked and quickly started a training session. Things immediately looked different, all of the training output info looked better, and I saw loss reducing and scores rising. However, it basically got stuck at a MaP of ~0.43, whereas when I did a training in Prodigy with a much smaller dataset, I achieved (much) higher scores. I then saw your post this morning, and changed the config to correlate with the suggester ranges I saw in the Prodigy config and I immediately got scores in the high 80s, which is what I was able to achieve in Prodigy.

I read in Prodigy's docs that they automatically set the proper ngram sizes/ranges based on your labelled data, so I guess that when doing labelling outside of Prodigy, you definitely need to look at the [components.spancat.suggester] section of your config, as it made a huge difference for me.
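One way to derive a suggester setting from labelled data is sketched below. This is an assumption about a reasonable approach, not a description of what Prodigy does internally. The helper suggester_config is hypothetical; it formats a config block for spaCy's spacy.ngram_range_suggester.v1, which takes min_size and max_size, from the observed span lengths in tokens:

```python
def suggester_config(span_lengths):
    """Format a suggester block covering the observed span lengths (in tokens)."""
    if not span_lengths:
        raise ValueError("need at least one span length")
    lines = [
        "[components.spancat.suggester]",
        '@misc = "spacy.ngram_range_suggester.v1"',
        f"min_size = {min(span_lengths)}",
        f"max_size = {max(span_lengths)}",
    ]
    return "\n".join(lines)


# Hypothetical token lengths of the annotated spans
print(suggester_config([1, 2, 2, 3, 5, 7, 7, 8]))
```

A single very long outlier span widens the range for every document, which increases the number of candidate spans the model has to score, so it may be worth checking the length distribution before using the maximum blindly.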

@adrianeboyd, again, thank you so much for your response. It was definitely the key to solving this for me!

adrianeboyd Jan 20, 2023

Glad to hear it's working!

goonhoon Mar 6, 2023

Hi @SharbelWired! Apologies for reviving an older thread, but I wonder if you could share the final .json to DocBin span converter script? I do not own Prodigy and thought that this might be a good way to test whether spancat brings better results than NER for my use case.

SharbelWired Mar 6, 2023
Author

Hi there, unfortunately I don't really have a Label Studio -> spaCy DocBin convert script that I can share, since this was part of a larger code base. The first block of code in my original post does work, though. One thing, @goonhoon: if you are looking for some steps to get Label Studio working with spaCy, you may want to export using CoNLL-2003 instead. You can do something like this:

1. Export from Label Studio using the CoNLL-2003 option.
2. Use spaCy's built-in convert CLI command to convert that CoNLL-2003 file to a spaCy DocBin. GOTCHA ALERT: remove the first line of the CoNLL file to force spaCy's convert to import the docs as multiple documents (it uses the blank lines as a delimiter if the DocStart line is omitted).
3. By default, the convert command will put your spans in the doc.ents collection. Copy them into a SpanGroup object, set that SpanGroup's name to the default sc name, then assign it to doc.spans['sc'].
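The gotcha in the second step can be handled with a few lines of plain Python before running convert. The sketch below assumes the export begins with a line starting with -DOCSTART- followed by a blank line; the function name strip_leading_docstart and the sample rows are illustrative only:

```python
def strip_leading_docstart(conll_text):
    """Drop a leading -DOCSTART- line (and the blank lines after it), if present."""
    lines = conll_text.splitlines()
    if lines and lines[0].startswith("-DOCSTART-"):
        lines = lines[1:]
    # Remove blank lines left at the top so the text starts with a token row
    while lines and not lines[0].strip():
        lines.pop(0)
    return "\n".join(lines) + "\n"


# Hypothetical two-sentence export; blank lines separate the samples
sample = (
    "-DOCSTART- -X- O\n"
    "\n"
    "Acme -X- _ B-ORG\n"
    "hired -X- _ O\n"
    "\n"
    "Bob -X- _ B-PER\n"
)
print(strip_leading_docstart(sample))
```

The cleaned text can then be written back to a file and passed to spacy convert as described in the steps above.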
