Automatically categorise keywords #9921

hmltn-0 · 2021-12-21T19:42:20Z

hmltn-0
Dec 21, 2021

I would like to extract keywords and automatically categorise them. The algorithm should guess how to group them together and what their category label should be from scratch, with no suggested keywords.

I know of some spaCy methods for keyword extraction, but I don’t know whether the need for auto-categorisation means I can just add one more step to the procedure or whether I should use a different algorithm altogether.

I am pretty sure the algorithm needs to evaluate every keyword with relevance to the entire text. Maybe an algorithm can rank the similarity of each keyword to each other keyword (using BERT). Then another algorithm can “cluster” terms with high degrees of interlinking. Lastly, GPT-3 or BERT can suggest a label for each group of terms.

I’m trying to think of ways to combine different kinds of algorithms: some are based on pure statistics, some are heavily pre-trained, and some are trained on the specific data. For example: training BERT on the text I’m using; clustering key terms either graph-theoretically or more linguistically, using semantic relations such as those in WordNet; and then using either something like WordNet or the trained BERT, as mentioned above, to suggest a category.
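The "rank similarity, then cluster highly interlinked terms" idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for real embeddings (which might come from spaCy word vectors or a BERT encoder), the 0.8 threshold is arbitrary, and `cluster_by_similarity` is a hypothetical helper name. Terms are linked when their cosine similarity passes the threshold, and each connected component of the resulting graph becomes one group.

```python
import math
from itertools import combinations

# Toy 3-d vectors standing in for real embeddings (hypothetical values).
VECTORS = {
    "neuron": (0.9, 0.1, 0.0),
    "synapse": (0.8, 0.2, 0.1),
    "axon": (0.85, 0.15, 0.05),
    "inflation": (0.1, 0.9, 0.1),
    "interest rate": (0.0, 0.95, 0.2),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def cluster_by_similarity(vectors, threshold=0.8):
    """Link terms with cosine similarity >= threshold and return the
    connected components of that graph as sorted lists of terms."""
    terms = list(vectors)
    neighbours = {term: set() for term in terms}
    for a, b in combinations(terms, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for term in terms:
        if term in seen:
            continue
        # Depth-first walk collects everything reachable from this term.
        stack, component = [term], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


clusters = cluster_by_similarity(VECTORS)
print(clusters)
```

With these toy vectors the three neuroscience terms end up in one group and the two economics terms in another. Suggesting a label for each group is a separate step that this sketch does not attempt.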

Is there a more elegant way to do this? What’s the most standard way to cluster any kind of data points, and could it be applied to language data?

Thank you

Answered by polm

Dec 22, 2021


polm · 2021-12-22T04:49:22Z

polm
Dec 22, 2021

Clustering data is basically its own research field, and there are many, many approaches. A simple one you can get started with is k-means.

There are techniques to generate labels for clusters but that kind of NLG is out of scope for spaCy.

About your approach more broadly here, it's not clear to me what you're trying to do or why this would be useful. Can you give a more concrete example? More practically speaking, while I can see how a human could do this, how would you say whether one solution was better or worse than another? If you can't tell the computer if it's doing a good job or not there's no real way to train a model for this.
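As a rough illustration of a simple clustering method, here is a minimal pure-Python sketch of k-means, together with one internal quality measure (within-cluster sum of squared distances, often called inertia). Everything here is an assumption made for the example: the 2-d points stand in for keyword embeddings, the first `k` points are used as starting centroids to keep the run deterministic, and the function names are hypothetical. A lower inertia only says the groups are tighter; it does not say whether the categories would make sense to a human reader, which is the harder evaluation question raised above.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats.

    The first k points seed the centroids (a simplification; real
    implementations usually use random or k-means++ initialisation).
    Returns (labels, centroids).
    """
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


# Hypothetical 2-d points standing in for keyword embeddings.
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.2), (0.1, 0.4)]
labels, centroids = kmeans(points, 2)
print(labels, inertia(points, labels, centroids))
```

Comparing inertia across different values of `k` gives a rough, purely geometric signal for choosing the number of groups.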


hmltn-0 · 2021-12-22T12:36:34Z

hmltn-0
Dec 22, 2021
Author

Thank you very much. I’m attempting to make a glossary creation tool: given source text as input, generate a glossary of key terms. Ideally the terms should be categorized and should also include definitions and context sentences pulled from the text showing their use. I hear you that right and wrong are fuzzier here than in some cases, but I don’t see that as a huge limitation. I do think it hints that, however I pull this off, it will be with something unsupervised that groups terms rather than with any kind of training set marking right and wrong answers. Someone suggested “topic modelling” to me, so I may explore that. Thanks very much.
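The "context sentences pulled from the text" part of such a glossary tool can be sketched with the standard library alone. This is a minimal sketch under assumptions: the term list is taken as given (keyword extraction is not shown), the sentence splitter is a naive regular expression, and `context_sentences` is a hypothetical helper name. A proper sentence segmenter, such as the one exposed through spaCy's `doc.sents`, would likely be sturdier on real text.

```python
import re


def context_sentences(text, terms):
    """Map each term to the sentences in which it appears (case-insensitive)."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    glossary = {}
    for term in terms:
        # Word boundaries stop "axon" from matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        glossary[term] = [s for s in sentences if pattern.search(s)]
    return glossary


# Hypothetical source text and term list, for illustration only.
text = (
    "A neuron transmits signals along its axon. "
    "The axon ends at a synapse. "
    "Inflation erodes purchasing power."
)
glossary = context_sentences(text, ["axon", "synapse", "inflation"])
for term, sentences in glossary.items():
    print(term, "->", sentences)
```

Each glossary entry could then be combined with a category from whichever unsupervised grouping step is chosen.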

