Splitting compound words into parts #11521
After reading the docs and searching through the discussions, I still have a question about spaCy's capabilities: is there a model or pipeline that can split compound words into their parts?

Background: I am working on search-keyword clustering for German, where I need to split compound words into their parts before lemmatizing the tokens.

Replies: 1 comment
We don't envisage adding this capability to the core spaCy library in the foreseeable future. However, the Holmes information extraction library, which builds on spaCy, does split German compound words. In some ways it's quite a rough-and-ready implementation: because I didn't have time to train and evaluate a proper model, it uses scoring weights based on linguistic intuitions and best guesses, and there is no measure of its accuracy. Still, it seems to work quite well, so I can imagine you might find it useful. You can find the relevant code here.

Unless you want to use other features of Holmes, you will probably be better off copying and adapting the code, because using Holmes imposes various constraints (e.g. you have to download a coreference model) that are probably unnecessary for what you want to achieve. Note that the code is tested with the rule-based German lemmatizer rather than the edit-tree German lemmatizer, so it will still make sense for you to download the lookup data for the rule-based lemmatizer and use that in your pipeline.

To do a quick test, install Holmes for German and try it on a compound word such as "Informationsextraktion".