Training textcat without using the CLI tool #11566
Hello! I know that the recommended way of training a model is the CLI tool, but I need to add some custom code to measure performance during the training loop, so I'm looking for a code-based way to run the training loop instead of the CLI tool. I've seen previous examples go like this:

    nlp = spacy.blank('en')
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.add_pipe('textcat', config={
            "threshold": 0.5,
            "model": DEFAULT_SINGLE_TEXTCAT_MODEL
        })
    # get names of other pipes to disable them during training
    pipe_exceptions = ["textcat"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    train_data = get_train_data_in_spacy_format()
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.create_optimizer()
        batch_sizes = compounding(4.0, 32.0, 1.001)
        for i in range(n_iter):
            print('Iteration: {}'.format(i))
            losses = {}
            # batch up the examples using spaCy's minibatch
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                examples = [Example.from_dict(nlp.make_doc(text), annotation) for text, annotation in batch]
                nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)

Which works well! However, my main concern is: what if I want to use pre-trained vectors from the `en_core_web_lg` model? I saw a "similar" post here: #6898 and the answer "You can also try the training quickstart with textcat + CPU + accuracy for a language that has a pretrained _lg model available like English." Is there a way of implementing this in my loop without the CLI tool? Thanks!
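Hi @thefirebanks,
If you want to do this programmatically, you can use the `train()` function in `spacy.cli.train`. Import it and then provide the overrides you need. To use the `en_core_web_lg` model, you can set the `paths.vectors` parameter in the configuration file (`null` -> `en_core_web_lg`).

From the code example, it's not clear to me what it's trying to measure. In any case, you can write a custom logger that suits your needs.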
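For reference, here is a minimal sketch of what calling `train()` from Python could look like. The file names (`config.cfg`, `output`, `train.spacy`, `dev.spacy`) and the helper names `build_overrides` and `run_training` are hypothetical. The `paths.train` and `paths.dev` keys assume a quickstart-style config with a `[paths]` section. It also assumes `en_core_web_lg` is installed and that the config is set up to use static vectors.

```python
from pathlib import Path
from typing import Any, Dict


def build_overrides(
    train_path: str, dev_path: str, vectors: str = "en_core_web_lg"
) -> Dict[str, Any]:
    """Collect config overrides as dotted keys (hypothetical helper)."""
    return {
        "paths.train": train_path,  # assumed key from a quickstart-style config
        "paths.dev": dev_path,      # assumed key from a quickstart-style config
        "paths.vectors": vectors,   # replaces the default null value
    }


def run_training(config_path: str, output_dir: str, overrides: Dict[str, Any]) -> None:
    """Run spaCy training from Python instead of the CLI (hypothetical helper)."""
    # Imported here so the helpers above can be used without spaCy installed.
    from spacy.cli.train import train

    train(Path(config_path), output_path=Path(output_dir), overrides=overrides)


# Example usage (requires spaCy, a config file, and .spacy corpora on disk):
# overrides = build_overrides("train.spacy", "dev.spacy")
# run_training("config.cfg", "output", overrides)
```

Keeping the overrides in a plain dictionary makes it easy to loop over several datasets and call `run_training` once per dataset.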
Thanks @ljvmiranda921! I'm mainly trying to measure how long it takes for the textcat model to train on certain datasets, as well as how fast it processes characters. Do you know if there's a noticeable difference in accuracy when a textcat model is trained from scratch (using
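For the timing goal mentioned in this reply (training duration and characters processed per second), one possible approach is a small wrapper around `time.perf_counter`. This is only a sketch: `time_texts` is a hypothetical helper, not part of spaCy, and it times whatever callable it is given.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_texts(
    process: Callable[[List[str]], object], texts: Iterable[str]
) -> Tuple[float, float]:
    """Run `process` on `texts`; return (elapsed seconds, characters per second)."""
    batch = list(texts)
    n_chars = sum(len(text) for text in batch)
    start = time.perf_counter()
    process(batch)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast calls or coarse timers.
    chars_per_sec = n_chars / elapsed if elapsed > 0 else float("inf")
    return elapsed, chars_per_sec


# Example usage with a loaded pipeline (assumes `nlp` and `texts` already exist):
# elapsed, cps = time_texts(lambda batch: list(nlp.pipe(batch)), texts)
```

The same wrapper could time one training epoch by passing a callable that runs the update loop over the batch.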