Keyword extraction with spaCy #10080
-
Could you share more about your problem to give us a better idea of what you're trying to do? Some examples of the output you expect to get as well as what you've tried so far would really help.
Tokenization is generally the process of breaking a long string of text into smaller units, typically words and punctuation. If you need those tokens combined differently, I wouldn't make that the responsibility of the tokenizer (unless you're talking about whether things separated by punctuation, like hyphens, should be a single token).
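For example, combining tokens after the fact is straightforward with `doc.retokenize()`; a minimal sketch (the sentence and span are just for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("They walked across the suspension bridge.")

# Merge the two tokens of "suspension bridge" into one. Merges are
# applied when the context manager exits, so the tokenizer itself
# is left alone.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[4:6])

print([token.text for token in doc])
# ['They', 'walked', 'across', 'the', 'suspension bridge', '.']
```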
You can filter out entity types after the NER model has run over your text. For example:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Keep only the organization entities.
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
```

As for a pretrained Hugging Face model: most of those are transformer-based, which will take longer to process than a simpler architecture or a keyword extraction algorithm. For your timing in Colab, which model did you use? How much of that time was downloading the model vs. running your text through it? Could you share that Colab notebook?
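If it helps, one way to separate those two costs is to time the load and the inference steps independently; a rough sketch (the model name and text are placeholders):

```python
import time
import spacy

t0 = time.perf_counter()
nlp = spacy.load("en_core_web_sm")  # load time (download happens separately, once)
t1 = time.perf_counter()

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")  # inference time
t2 = time.perf_counter()

print(f"load: {t1 - t0:.2f}s, inference: {t2 - t1:.2f}s")
```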
-
I’d like to do keyword extraction with spaCy. My previous attempt wasn’t perfect. I presume I need to modify both the tokenizer, since multiword expressions (if I recall correctly) weren’t tokenized perfectly, and the entity recognizer, since many of the entities were insignificant concepts like the number “1” rather than key, topically related concepts like “suspension bridge”.
I’d like to open a discussion about this as I work on it.
What are some known ways to improve tokenization of multiword expressions? Is there a different pretrained model we could use in the tokenizer pipeline, maybe from Hugging Face?
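One approach along these lines that I'm aware of is spaCy's built-in merging components, which retokenize after parsing rather than changing the tokenizer itself; a minimal sketch (assuming spaCy 3):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Built-in component that merges each noun chunk into a single token;
# "merge_entities" does the same for named entities.
nlp.add_pipe("merge_noun_chunks")

doc = nlp("The Golden Gate Bridge is a suspension bridge.")
print([token.text for token in doc])
# e.g. ['The Golden Gate Bridge', 'is', 'a suspension bridge', '.']
```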
And the same question for entity recognition. I think it’d be smarter to try a pre-existing one than to build my own just yet. I tried KeyBERT, but I didn’t find it fast enough: on a basic Colab GPU it took almost a minute to extract 3 keywords from 3 paragraphs. My aim would ideally be 1000 good keywords in 1 second, though I can come down from that, of course. Is it possible to use KeyBERT on a CPU in spaCy? Is it possible to pass KeyBERT or PyTextRank as the entity recognition pipeline?
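For reference, this is roughly the PyTextRank integration I have in mind; a sketch, assuming spaCy 3 and a recent pytextrank release (the input text is a placeholder):

```python
import spacy
import pytextrank  # registers the "textrank" factory with spaCy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")  # runs on CPU alongside the regular pipeline

doc = nlp("Machine learning methods for keyword extraction vary widely in speed and quality.")

# Ranked keyphrases are exposed on the doc via an extension attribute.
for phrase in doc._.phrases[:5]:
    print(phrase.text, round(phrase.rank, 3))
```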
Thanks very much