Advice on designing custom morphologizer #12025
Unanswered
megamattc asked this question in Help: Coding & Implementations
It sounds like what you want is a custom component. A component is anything in the pipeline, like the Morphologizer or Parser. You can write a component as a plain function that takes a Doc, modifies it, and returns it. In this case you could put your component after the Morphologizer in the pipeline, then check the Morphologizer's output against your table of parses and modify the Doc as necessary. The custom functions you linked to are for things that don't go directly in the pipeline: for example, code that runs before the pipeline is loaded, or functions passed as arguments to other components, such as a span suggester.
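As a rough illustration of that suggestion, here is a minimal sketch of a function component, assuming spaCy v3. The component name `table_morph_corrector`, the `PARSE_TABLE` dict, and the form `FORM` with its parses are all made up for the example; the real table would come from your own data. The matching rule (most shared feature=value pairs) is just one simple choice, not something spaCy prescribes.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical table: surface form -> allowed parses in FEATS notation.
PARSE_TABLE = {
    "FORM": [
        "Number=Sing|Person=3|Tense=Past",
        "Case=Nom|Gender=Masc|Number=Sing",
    ],
}


def feats_pairs(feats: str) -> set:
    """Split 'Case=Nom|Number=Sing' into {('Case', 'Nom'), ('Number', 'Sing')}."""
    return {tuple(pair.split("=", 1)) for pair in feats.split("|") if pair}


@Language.component("table_morph_corrector")
def table_morph_corrector(doc: Doc) -> Doc:
    """Replace each token's morphology with the closest parse from the table."""
    for token in doc:
        candidates = PARSE_TABLE.get(token.text)
        if not candidates:
            continue  # form not in the table: keep the existing analysis
        predicted = set(token.morph.to_dict().items())
        # Pick the candidate sharing the most feature=value pairs.
        best = max(candidates, key=lambda c: len(predicted & feats_pairs(c)))
        token.set_morph(best)
    return doc


# Demo on a blank pipeline. In a trained pipeline you would instead call
# nlp.add_pipe("table_morph_corrector", after="morphologizer").
nlp = spacy.blank("xx")
nlp.add_pipe("table_morph_corrector", last=True)

doc = Doc(nlp.vocab, words=["FORM"])
# Simulate an incoherent prediction mixing noun and verb features.
doc[0].set_morph("Case=Nom|Gender=Masc|Number=Sing|Tense=Past")
doc = table_morph_corrector(doc)
```

Because the component only reads and writes `token.morph`, it does not require touching morphologizer.pyx or subclassing the Morphologizer.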
Hello,
I was wondering how exactly I should implement an idea I have for changing the default morphologizer component of my pipeline for Akkadian. I have, via separate means, a table of the linguistic forms appearing in a corpus along with various possible morphological parses of those forms in UFEATS format. For a given form there are multiple possible parses. All of them are semantically coherent, but in a given context only one or perhaps a few are possible.
I was thinking of taking the predicted morphological analysis of a given form from the default morphologizer and comparing it to the list of possible parses from my table. Sometimes the default morphologizer suggests parses that are incoherent (e.g. mixing noun and verb features), incomplete, or wrong in a certain number of features. The comparison would yield the parse from the table that most closely matches the morphologizer's suggestion.
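To make the comparison concrete, here is a minimal sketch in plain Python of one possible scoring rule. The function names and the rule itself are assumptions for illustration: agreements count in a candidate's favour, while conflicting values and features the candidate does not license count against it. Features missing from the prediction are not penalized, since the prediction may simply be incomplete.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def score(predicted: Dict[str, str], candidate: Dict[str, str]) -> int:
    """Hypothetical similarity: agreements minus conflicts minus spurious features."""
    agree = sum(1 for k, v in candidate.items() if predicted.get(k) == v)
    conflict = sum(1 for k, v in candidate.items()
                   if k in predicted and predicted[k] != v)
    spurious = sum(1 for k in predicted if k not in candidate)
    return agree - conflict - spurious


def closest_parse(predicted: str, candidates: List[str]) -> str:
    """Return the table parse closest to the prediction (first one wins ties)."""
    if not candidates:
        return predicted  # nothing to compare against: keep the prediction
    pred = parse_feats(predicted)
    return max(candidates, key=lambda c: score(pred, parse_feats(c)))


# Made-up example: a prediction mixing noun and verb features.
predicted = "Case=Nom|Gender=Masc|Number=Plur|Tense=Past"
candidates = [
    "Number=Sing|Person=3|Tense=Past",
    "Case=Nom|Gender=Masc|Number=Sing",
]
best = closest_parse(predicted, candidates)
```

The weights here are all equal; whether conflicts should cost more than spurious features is a design choice that would need tuning against real data.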
I am wondering where exactly I should implement this. If my understanding of the code is right, the morphologizer.pyx file contains the set_annotations function, which has a section that goes through a document and assigns the predicted features to each token. I wonder if I should modify that section.
Equally important, I am not certain how to properly 'modify' the morphologizer in a way conducive to integrating it into the rest of the pipeline, as well as into my own custom language class 'ak' (for Akkadian). The file morphologizer.pyx is Cython, so I cannot (if I am correct) simply extend the Morphologizer class in Python, as I did when modifying the default Lemmatizer for my language class. The modifications I've tried so far create problems in the pipeline. Should I be looking at modifying only the 'Model Architecture' as briefly discussed here (https://spacy.io/api/architectures), or perhaps defining a custom function using the decorator notation, as illustrated here (https://spacy.io/usage/training/#custom-functions)? I do not really understand how these features work.
Thank you