Automatically categorise keywords #9921
-
I would like to extract keywords and automatically categorise them. The algorithm should work out from scratch how to group them and what each group's category label should be, with no suggested keywords. I know of some spaCy methods for keyword extraction, but I don’t know whether auto-categorisation can be added as one more step in that procedure or whether I should use a different algorithm altogether.

I am pretty sure the algorithm needs to evaluate every keyword for its relevance to the entire text. Maybe one algorithm can rank the similarity of each keyword to every other keyword (using BERT). Then another algorithm can “cluster” terms with high degrees of interlinking. Lastly, GPT-3 or BERT can suggest a label for each group of terms.

I’m trying to think of ways to combine algorithms that are based on pure statistics, ones that are heavily pre-trained, and ones that are trained on the specific data. For example: training BERT on the text I’m using, clustering key terms either graph-theoretically or more linguistically using semantic relations like those in WordNet, and then using either something like WordNet or the trained BERT to suggest a category.

Is there a more elegant way to do this? What’s the most standard way to cluster any kind of data points, and could it be applied to language data? Thank you
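To make the "rank similarity, then cluster by interlinking" idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not a recommended implementation: the keyword vectors are made-up toy values standing in for real embeddings (for example from BERT), the threshold of 0.8 is arbitrary, and `cluster_by_similarity` is a hypothetical helper name. Keywords are linked when their cosine similarity reaches the threshold, and each connected component of the resulting graph becomes one group. Suggesting a label for each group is left out.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_similarity(vectors, threshold=0.8):
    """Group keywords into connected components of the similarity graph.

    Two keywords are linked when their cosine similarity is >= threshold.
    """
    # Union-find structure: every keyword starts as its own group.
    parent = {key: key for key in vectors}

    def find(key):
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path compression
            key = parent[key]
        return key

    # Merge the groups of every sufficiently similar pair.
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for key in vectors:
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


# Toy vectors (assumed values, not real embeddings).
toy_vectors = {
    "neuron": [0.9, 0.1, 0.0],
    "synapse": [0.8, 0.2, 0.1],
    "axon": [0.85, 0.05, 0.0],
    "inflation": [0.1, 0.9, 0.1],
    "interest rate": [0.0, 0.95, 0.2],
}

clusters = cluster_by_similarity(toy_vectors)
print(clusters)
```

With these toy values the three neuroscience terms end up in one group and the two economics terms in another. On real data the threshold would need tuning, since a single spurious link is enough to merge two groups.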
Replies: 2 comments
-
Clustering data is basically its own research field, and there are many, many approaches. A simple one you can get started with is k-means.
There are techniques to generate labels for clusters, but that kind of NLG is out of scope for spaCy.
About your approach more broadly, it's not clear to me what you're trying to do or why this would be useful. Can you give a more concrete example? More practically speaking, while I can see how a human could do this, how would you say whether one solution was better or worse than another? If you can't tell the computer whether it's doing a good job, there's no real way to train a model for this.
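As a rough illustration of a simple clustering baseline, here is a small k-means sketch in plain Python. Everything in it is an assumption made for the example: the 2-D points are invented stand-ins for keyword vectors, the number of clusters `k` is chosen by hand, and the centroids are naively initialised from the first `k` points. A library implementation would normally be used in practice.

```python
import math


def kmeans(points, k, iterations=20):
    """Naive k-means: returns (labels, centroids).

    Centroids start at the first k points (an assumption for this sketch;
    real implementations usually use a smarter initialisation).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Invented 2-D points standing in for keyword vectors.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Note that k-means needs the number of clusters up front, which is itself a judgment call when the categories are not known in advance.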
-
Thank you very much.
I’m attempting to make a glossary creation tool: given source text as input, generate a glossary of key terms. Ideally the terms should be categorized and should include definitions and context sentences pulled from the text showing their use.
I hear you that right and wrong are fuzzier here than in some cases, but I don’t see that as a huge limitation. I do think it hints that, however I pull this off, it will be with something unsupervised that groups terms rather than with any kind of training set highlighting right and wrong answers.
Someone suggested “topic modelling” to me, so I may explore that.
Thanks very much.
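For the "context sentences pulled from the text" part of such a glossary tool, a minimal sketch might look like the following. It assumes the key terms have already been extracted by some other step, uses a crude regex sentence splitter rather than a proper one (spaCy's sentence segmentation would be a more robust choice), and `context_sentences` is a hypothetical helper name. Definitions and categories are not handled here.

```python
import re


def context_sentences(text, terms):
    """Map each term to the sentences of `text` that mention it."""
    # Crude splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    glossary = {}
    for term in terms:
        # Whole-word, case-insensitive match for the term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        glossary[term] = [s for s in sentences if pattern.search(s)]
    return glossary


# Invented sample text and term list for illustration.
sample_text = (
    "A neuron transmits signals along its axon. "
    "Each axon ends in a synapse. "
    "The synapse passes the signal to the next neuron."
)
glossary = context_sentences(sample_text, ["neuron", "axon", "synapse"])
for term, examples in glossary.items():
    print(term, "->", examples)
```

The whole-word match will miss inflected forms such as plurals, so matching on lemmas would probably be needed for real text.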
On Wed 22. Dec 2021 at 05:49, polm ***@***.***> wrote:
Clustering data is basically its own research field, and there are many,
many approaches. A simple one you can get started with is k-means.
There are techniques to generate labels for clusters but that kind of NLG
is out of scope for spaCy.
About your approach more broadly here, it's not clear to me what you're
trying to do or why this would be useful. Can you give a more concrete
example? More practically speaking, while I can see how a human could do
this, how would you say whether one solution was better or worse than
another? If you can't tell the computer if it's doing a good job or not
there's no real way to train a model for this.