Is it possible to train Textcat with rules/matcher? #9636

info2000 · 2021-11-05T23:10:26Z

info2000
Nov 5, 2021

I want to use the power of the Matcher (or PhraseMatcher), which works on tokens and lemmas, to build a textcat training dataset, instead of a Python regex that always needs an exact word match.
I would also like to know whether this is the right approach, or whether there is a better way to use rules in the dataset-building process.

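import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumed here; only the tokenizer is needed for LOWER matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercase token text
patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
matcher.add("Names", patterns)

doc = nlp("angela merkel and us president barack Obama")
doc.cats = categoriesText  # dict of label -> score, from the CSV in the example
for match_id, start, end in matcher(doc):
    doc.cats["Names"] = 1.0  # set the category when any phrase matches
    print("Matched based on lowercase token text:", doc.vocab.strings[match_id], doc[start:end])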

Is this the correct way? Or is there a better way in spaCy to augment a textcat dataset?

Answered by polm

Nov 7, 2021

It is definitely possible to use the Matcher to create training data for a textcat model. That's a form of "weak supervision", where you train a statistical model using the output of a rule-based model.

The code you have works. I assume it's example code, but just in case, I will note that "names" is kind of a weird category for a document. Also note you can use entities from existing pipelines if you actually need to match on names.
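As a rough sketch of the entity-based alternative: the snippet below sets a category from `doc.ents` rather than from matcher output. To keep it runnable without downloading a model, it uses a blank pipeline with an `entity_ruler` as a stand-in for a trained NER component; with a trained pipeline (for example `spacy.load("en_core_web_sm")`, assuming that package is installed) the entities would come from the statistical model instead. The label `MENTIONS_PERSON` and the helper `mentions_person` are hypothetical names used only for illustration.

```python
import spacy

# Stand-in pipeline: blank English plus an entity ruler.
# A trained pipeline would supply doc.ents from its NER component instead.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": [{"LOWER": "angela"}, {"LOWER": "merkel"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "barack"}, {"LOWER": "obama"}]},
])


def mentions_person(doc):
    """Return True if the doc contains at least one PERSON entity."""
    return any(ent.label_ == "PERSON" for ent in doc.ents)


doc = nlp("angela merkel and us president barack Obama")
# Hypothetical category name; use whatever label fits your task.
doc.cats["MENTIONS_PERSON"] = 1.0 if mentions_person(doc) else 0.0
print([(ent.text, ent.label_) for ent in doc.ents], doc.cats)
```

Only the labeling logic matters here; the way entities get into `doc.ents` is interchangeable.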

We recently released a weak supervision tutorial project that you might find useful.
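The weak supervision idea described in this answer could look roughly like the following. This is a minimal sketch, not the tutorial project's code: the label `POLITICS`, the trigger phrases, the sample texts, and the helper `weak_label` are all made up for illustration. Every document gets an explicit score for the label (1.0 on a match, 0.0 otherwise), and the results are collected in a `DocBin`, spaCy's binary training format.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; enough for LOWER-based matching

# Hypothetical label and trigger phrases, chosen only for illustration.
LABEL = "POLITICS"
PHRASES = ["Angela Merkel", "Barack Obama", "president"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add(LABEL, [nlp.make_doc(phrase) for phrase in PHRASES])


def weak_label(text):
    """Tokenize the text and set doc.cats from the rule-based matcher."""
    doc = nlp.make_doc(text)
    has_match = len(matcher(doc)) > 0
    # Give every doc an explicit score so negatives are labeled too.
    doc.cats = {LABEL: 1.0 if has_match else 0.0}
    return doc


texts = [
    "angela merkel and us president barack Obama",
    "the recipe needs two eggs and some flour",
]
docs = [weak_label(text) for text in texts]

# Collect the weakly labeled docs in spaCy's binary training format.
doc_bin = DocBin(docs=docs)
data = doc_bin.to_bytes()  # doc_bin.to_disk("train.spacy") would write a file
```

A single label like this fits a multilabel setup (`textcat_multilabel`); for the mutually exclusive `textcat` component you would likely want at least two labels per document. Since the labels come from rules, it is worth checking a sample by hand before training on them.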


polm · 2021-11-07T04:01:29Z

