Keyword extraction with spaCy #10080
-
Could you share more about your problem to give us a better idea of what you're trying to do? Some examples of the output you expect to get as well as what you've tried so far would really help.
Tokenization is generally the process of breaking a long string of text into smaller units, typically words and punctuation. If you need those tokens combined differently, I wouldn't make that the responsibility of the tokenizer (unless you're talking about whether things separated by punctuation, like hyphens, should be a single token).
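For example, combining tokens after the fact is straightforward with `doc.retokenize()`; a minimal sketch (the sentence and span are just for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("They walked across the suspension bridge.")

# Merge the two tokens of "suspension bridge" into one. Merges are
# applied when the context manager exits, so the tokenizer itself
# is left alone.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[4:6])

print([token.text for token in doc])
# ['They', 'walked', 'across', 'the', 'suspension bridge', '.']
```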
You can filter out entity types after the NER model has run over your text. For example:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Keep only the organization entities.
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
```

As for a pretrained Hugging Face model: most of those are transformer-based, which will take longer to process than a simpler architecture or a keyword extraction algorithm. For your timing in Colab, which model did you use? How much of that time was downloading the model vs. running your text through it? Could you share that Colab notebook?
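If it helps, one way to separate those two costs is to time the load and the inference steps independently; a rough sketch (the model name and text are placeholders):

```python
import time
import spacy

t0 = time.perf_counter()
nlp = spacy.load("en_core_web_sm")  # load time (download happens separately, once)
t1 = time.perf_counter()

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")  # inference time
t2 = time.perf_counter()

print(f"load: {t1 - t0:.2f}s, inference: {t2 - t1:.2f}s")
```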
-
I’d like to do keyword extraction with spaCy. My previous attempt wasn’t perfect. I presume I need to modify both the tokenizer, since multiword expressions (if I recall correctly) weren’t tokenized perfectly, and the entity recognizer, since many of the entities were insignificant concepts like the number “1” rather than key, topically related concepts like “suspension bridge”.
I’d like to open a discussion about this as I work on it.
What are some known ways to improve tokenization of multiword expressions? Is there a different pretrained model we could use in the tokenizer pipeline, maybe from Hugging Face?
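One approach along these lines that I'm aware of is spaCy's built-in merging components, which retokenize after parsing rather than changing the tokenizer itself; a minimal sketch (assuming spaCy 3):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Built-in component that merges each noun chunk into a single token;
# "merge_entities" does the same for named entities.
nlp.add_pipe("merge_noun_chunks")

doc = nlp("The Golden Gate Bridge is a suspension bridge.")
print([token.text for token in doc])
# e.g. ['The Golden Gate Bridge', 'is', 'a suspension bridge', '.']
```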
And the same question for entity recognition. I think it’d be smarter to try a pre-existing one than to build my own just yet. I tried KeyBERT, but I didn’t find it fast enough: on a basic Colab GPU it took almost a minute to extract 3 keywords from 3 paragraphs. My aim would ideally be 1000 good keywords in 1 second, though I can come down from that, of course. Is it possible to use KeyBERT on a CPU in spaCy? Is it possible to pass KeyBERT or PyTextRank as the entity recognition pipeline?
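For reference, this is roughly the PyTextRank integration I have in mind; a sketch, assuming spaCy 3 and a recent pytextrank release (the input text is a placeholder):

```python
import spacy
import pytextrank  # registers the "textrank" factory with spaCy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")  # runs on CPU alongside the regular pipeline

doc = nlp("Machine learning methods for keyword extraction vary widely in speed and quality.")

# Ranked keyphrases are exposed on the doc via an extension attribute.
for phrase in doc._.phrases[:5]:
    print(phrase.text, round(phrase.rank, 3))
```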
Thanks very much