Training textcat without using the CLI tool #11566

thefirebanks · 2022-09-30T19:46:17Z

Hello!

I know that the recommended way of training a model is the CLI tool, but I need to add some custom code to measure performance during the training loop, so I'm looking for a code-based way to run the training loop rather than the CLI. Previous examples I've seen look like this:

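import random

import spacy
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
from spacy.training import Example
from spacy.util import compounding, minibatch

# n_iter and get_train_data_in_spacy_format() are assumed to be defined elsewhere
nlp = spacy.blank('en')
if 'textcat' not in nlp.pipe_names:
    textcat = nlp.add_pipe('textcat', config={
        "threshold": 0.5,
        "model": DEFAULT_SINGLE_TEXTCAT_MODEL
    })

# get names of other pipes to disable them during training
pipe_exceptions = ["textcat"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

# list of (text, annotation) pairs
train_data = get_train_data_in_spacy_format()

# only train textcat (select_pipes replaces the deprecated disable_pipes in spaCy v3)
with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.create_optimizer()
    # batch size grows from 4 to 32, multiplied by 1.001 after each batch
    batch_sizes = compounding(4.0, 32.0, 1.001)
    for i in range(n_iter):
        print('Iteration: {}'.format(i))
        losses = {}
        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch in batches:
            # pair each raw text with its gold annotation
            examples = [Example.from_dict(nlp.make_doc(text), annotation) for text, annotation in batch]
            nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)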

Which works well! However, my main concern is - what if I want to use pre-trained vectors from spaCy's en_core_web_lg model? How can I train a textcat model using en_core_web_lg as a base model, instead of using a blank pipeline spacy.blank("en")?

I saw a similar post in #6898, where the answer was: "You can also try the training quickstart with textcat + CPU + accuracy for a language that has a pretrained _lg model available like English." Is there a way of implementing this in my loop without the CLI tool?

Thanks!

Answered by ljvmiranda921

ljvmiranda921 · 2022-10-03T09:03:09Z

Hi @thefirebanks,

> However, my main concern is - what if I want to use pre-trained vectors from spaCy's en_core_web_lg model? How can I train a textcat model using en_core_web_lg as a base model, instead of using a blank pipeline spacy.blank("en")?

If you want to do this programmatically, you can import the train() function from spacy.cli.train and pass it the config overrides you need. To use the en_core_web_lg vectors, set the paths.vectors parameter in the configuration file (change null to en_core_web_lg).
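As a rough sketch of what that call could look like: the helper names (build_overrides, run_training) and the file paths below are hypothetical, and the dotted keys assume a config generated by the training quickstart, which defines paths.train, paths.dev and paths.vectors. Whether the vectors are actually used also depends on the model architecture in the config (for example, an embedding layer with static vectors enabled), so treat this as a starting point rather than a complete recipe.

```python
from pathlib import Path


def build_overrides(train_path, dev_path, vectors="en_core_web_lg"):
    """Build config overrides as a dict of dotted section.key names to values."""
    return {
        "paths.train": str(train_path),
        "paths.dev": str(dev_path),
        # Replaces the default `vectors = null` in the [paths] section.
        "paths.vectors": vectors,
    }


def run_training(config_path, output_dir, overrides):
    """Run spaCy's training from Python instead of `python -m spacy train`."""
    # Imported here so build_overrides() can be used without spaCy installed.
    from spacy.cli.train import train

    train(Path(config_path), Path(output_dir), overrides=overrides)


# Example usage (hypothetical file names, not run here):
# overrides = build_overrides("train.spacy", "dev.spacy")
# run_training("config.cfg", "output", overrides)
```

Because this runs inside your own Python process, you can wrap the run_training() call with whatever timing or profiling code you need.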

From the code example, it's not clear to me what you're trying to measure. In any case, you can write a custom logger that suits your needs.
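A minimal sketch of such a logger, assuming the spaCy v3 logger protocol: a registered function returns a setup function, which in turn returns a log_step callback and a finalize callback. The name timing_logger.v1 is made up for this example, and the "step" and "score" keys are assumed to be present in the info dict that the training loop passes at evaluation steps (it passes None at the other steps).

```python
import sys
import time
from typing import Any, Callable, Dict, IO, Optional, Tuple


def timing_logger():
    """Logger factory that reports wall-clock time at each evaluation step."""

    def setup(nlp, stdout: IO = sys.stdout, stderr: IO = sys.stderr) -> Tuple[Callable, Callable]:
        start = time.perf_counter()

        def log_step(info: Optional[Dict[str, Any]]) -> None:
            # Nothing to report for steps without an evaluation.
            if info is None:
                return
            elapsed = time.perf_counter() - start
            stdout.write(
                f"step={info.get('step')} score={info.get('score')} elapsed={elapsed:.2f}s\n"
            )

        def finalize() -> None:
            total = time.perf_counter() - start
            stdout.write(f"training finished after {total:.2f}s\n")

        return log_step, finalize

    return setup


# Register under a hypothetical name so a config can refer to it.
try:
    import spacy

    spacy.registry.loggers("timing_logger.v1")(timing_logger)
except ImportError:
    pass  # spaCy not installed; the logger can still be exercised directly
```

To use it, the [training.logger] block of the config would point @loggers at "timing_logger.v1", and the module that registers it has to be imported before training starts.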

thefirebanks · 2022-10-03T21:28:21Z

Thanks @ljvmiranda921 ! I'm mainly trying to measure how long it takes for the textcat model to train on certain datasets, as well as how fast it processes characters.
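For the characters-per-second part, one possible approach is a small timing helper like the sketch below. The function name measure_throughput is hypothetical, and it takes any callable so it is not tied to a particular pipeline; with spaCy, the callable could be something like `lambda texts: list(nlp.pipe(texts))`.

```python
import time
from typing import Callable, Dict, Sequence


def measure_throughput(process: Callable[[Sequence[str]], object], texts: Sequence[str]) -> Dict[str, float]:
    """Time one pass of `process` over `texts` and report characters per second."""
    n_chars = sum(len(text) for text in texts)
    start = time.perf_counter()
    process(texts)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    rate = n_chars / elapsed if elapsed > 0 else float("inf")
    return {"seconds": elapsed, "chars": n_chars, "chars_per_second": rate}


# Stand-in workload; swap in the trained pipeline for real measurements.
stats = measure_throughput(lambda texts: [t.upper() for t in texts], ["some text", "more text"])
print(stats)
```

The same perf_counter pattern can be wrapped around the training loop itself to get the total training time per dataset.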

Do you know if there's a noticeable difference in accuracy when a textcat model is trained from scratch (using spacy.blank("en")) vs. when we use the pretrained vectors from en_core_web_lg?

2 replies

ljvmiranda921 Oct 4, 2022

> Do you know if there's a noticeable difference in accuracy when a textcat model is trained from scratch (using spacy.blank("en")) vs. when we use the pretrained vectors from en_core_web_lg?

Hi @thefirebanks, in most cases you should see an improvement in accuracy when using these vectors, because you're taking advantage of what was already learned from larger corpora. Before diving headfirst into writing the training script, you could try running the training from the CLI to sanity-check the results.

thefirebanks Oct 4, 2022
Author

Gotcha, thank you very much!
