Active learning in spacy 3 #11201
-
Hi all! There are some other discussions dissuading people from trying to run `Language.update()` manually and suggesting a full retrain instead.

For context, I'd like to use a tool like Label Studio (no help needed on that end, just for reference) to label documents, then send each one off to a model we've already trained on textcat and named entity recognition, update it on that single new document, and use the resulting updated `Language` object to determine the best next documents to label. I also understand that an incremental update won't give the same results as training from scratch, which we definitely intend to do after a big batch of docs.

We've been able to make an `Example` doc easily enough and run `nlp.update()` on it, which does spit out losses. Running the same doc through multiple times (passing along the resultant losses) does seem to further update the losses, but try as we might we can't get any behavior of our trained NER+textcat model to change: no change to any predicted entities, nor to the textcat prediction confidence. Surely I'm doing something wrong, but looking at the documentation and other discussions here, I can't figure out how to update a custom spaCy 3 model one doc at a time. Some code:
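For reference, a minimal sketch of the kind of single-doc update loop described above. It uses a blank pipeline with a textcat component so it is self-contained; in practice you would `spacy.load()` your own trained pipeline. The labels and example texts here are made up for illustration.

```python
# Sketch: incrementally updating a pipeline one annotated doc at a time.
# A blank pipeline stands in for a trained one so the example is runnable.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Initialize with a couple of examples so the component has weights to update.
train_examples = [
    Example.from_dict(nlp.make_doc("I love this"),
                      {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    Example.from_dict(nlp.make_doc("I hate this"),
                      {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
nlp.initialize(get_examples=lambda: train_examples)

# A single new annotation arriving from the labelling tool:
example = Example.from_dict(
    nlp.make_doc("This is wonderful"),
    {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}},
)

# resume_training() returns an optimizer suitable for updating an existing model.
optimizer = nlp.resume_training()
losses = {}
# One gradient step on one doc barely moves the weights, which is one reason
# predictions may look unchanged; repeating the update makes the effect visible.
for _ in range(5):
    nlp.update([example], sgd=optimizer, losses=losses)
print(losses)
```

Note that `losses` accumulates across calls, so a growing number there does not by itself mean the model is changing meaningfully; comparing predictions before and after several updates is a better check.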
-
Active learning and incremental training are different things. "Active learning" means you make some annotations, update the model, and make some more annotations; exactly how you update the model is not critical. "Incremental training" is one term for updating a model with a small number of examples rather than completely retraining; "online learning" is a closely related concept. Both are hard to do with neural networks due to the catastrophic forgetting problem. So it's entirely possible to do active learning without incremental updates: for CPU models with small to moderate amounts of data, training should be fast enough that it's no big deal to retrain models repeatedly in an annotation loop.
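The "retrain repeatedly in an annotation loop" idea above can be sketched roughly like this. `train`, `select_batch`, and `annotate` are hypothetical placeholders standing in for your training command, uncertainty-based selection, and labelling tool respectively:

```python
# Skeleton of an annotate -> retrain-from-scratch -> annotate loop.
# All three helpers are placeholders, not real spaCy or Label Studio APIs.
def train(labelled):
    """Placeholder: retrain from scratch on all labelled data; return a model."""
    return {"n_seen": len(labelled)}

def select_batch(model, unlabelled, k=2):
    """Placeholder: pick the k docs the current model is least sure about."""
    return unlabelled[:k]

def annotate(batch):
    """Placeholder: send the batch to the labelling tool, get labels back."""
    return [(text, "SOME_LABEL") for text in batch]

labelled, unlabelled = [], ["doc1", "doc2", "doc3", "doc4"]
while unlabelled:
    model = train(labelled)              # full retrain each round, no nlp.update()
    batch = select_batch(model, unlabelled)
    labelled += annotate(batch)
    unlabelled = [d for d in unlabelled if d not in batch]
```

Because each round retrains from scratch on all labelled data, there is no catastrophic forgetting to worry about; the trade-off is training time per round.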
-
Active learning is about choosing which unlabelled data to label (manually) next. After adding the new labels, you train from scratch or fine-tune the existing model to get a new set of unlabelled data to look at. The problem is that neural networks can be inappropriately confident, so you might miss some data points that should have been manually labelled. But hopefully your active learning framework is also able to detect outlier data points and route them to your manual process. You should also double-check your existing labels with something like doubtlab, which is maintained by an Explosion employee, or cleanlab from some MIT researchers.
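A common way to choose which unlabelled data to label next is uncertainty sampling: pick the doc whose predicted class distribution has the highest entropy. A minimal sketch with made-up confidence scores (the doc names and probabilities are illustrative, not from any real model):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical model confidences for three unlabelled docs (e.g. textcat scores).
predictions = {
    "doc_a": [0.98, 0.02],   # model is confident
    "doc_b": [0.55, 0.45],   # model is unsure -> label this one next
    "doc_c": [0.80, 0.20],
}

# Route the least-certain doc to manual annotation first.
next_doc = max(predictions, key=lambda d: entropy(predictions[d]))
print(next_doc)  # doc_b
```

The caveat in the reply applies here too: an overconfident model assigns low entropy to docs it is wrong about, so entropy-only selection can systematically skip exactly the points that most need a human look.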