Splitting compound words into parts #11521
After reading the docs and searching through the discussions, I still have a question about spaCy's capabilities: is there a model or pipeline that can split compound words into their parts?

Background: I am working on search-keyword clustering for German, where I need to split compound words into their parts before lemmatizing the tokens.

Replies: 1 comment
We don't envisage adding this capability to the core spaCy library in the foreseeable future. However, the Holmes information extraction library, which builds on spaCy, does split German compound words. In some ways it's quite a rough-and-ready implementation: because I didn't have time to train and evaluate a proper model, it uses scoring weights based on linguistic intuitions and best guesses, and there is no measure of its accuracy. Still, it seems to work quite well, so I can imagine you might find it useful. You can find the relevant code here.

Unless you want to use other features of Holmes, you will probably be better off copying and adapting the code, because using Holmes imposes various constraints (e.g. you have to download a coreference model) that are probably unnecessary for what you want to achieve. Note that the code is tested with the rule-based German lemmatizer rather than the edit-tree German lemmatizer, so it will still make sense for you to download the lookup data for the rule-based lemmatizer and use that in your pipeline.

To do a quick test, install Holmes for German and try it on a compound word such as "Informationsextraktion".