Advice on designing custom morphologizer #12025

megamattc · 2022-12-24T20:52:26Z

Hello,

I was wondering how exactly I should implement an idea I have to change the default morphologizer component of my pipeline for Akkadian. I have, via separate means, a table of the linguistic forms appearing in a corpus, along with the possible morphological parses of each form in UFEATS format. A given form can have multiple possible parses. All of them are semantically coherent, but in a given context only one or perhaps a few are possible.

My idea is to take the morphological analysis that the default morphologizer predicts for a given form and compare it to the list of possible parses in my table. Sometimes the default morphologizer suggests parses that are incoherent (e.g. a mix of noun and verb features), incomplete, or wrong in a certain number of features. The comparison would yield the parse from the table that most closely matches the morphologizer's suggestion.

I am wondering where I should implement this exactly? If my understanding of the code is right, within the morphologizer.pyx file there is the set_annotations function, which has a section that goes through a document and assigns the predicted features to each token:

        for i, doc in enumerate(docs):
            doc_tag_ids = batch_tag_ids[i]
            if hasattr(doc_tag_ids, "get"):
                doc_tag_ids = doc_tag_ids.get()
            for j, tag_id in enumerate(doc_tag_ids):
                morph = labels[tag_id]
                # set morph
                if doc.c[j].morph == 0 or overwrite or extend:
                    if overwrite and extend:
                        # morphologizer morph overwrites any existing features
                        # while extending
                        extended_morph = Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph])
                        extended_morph.update(Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0)))
                        doc.c[j].morph = self.vocab.morphology.add(extended_morph)
                    elif extend:
                        # existing features are preserved and any new features
                        # are added
                        extended_morph = Morphology.feats_to_dict(self.cfg["labels_morph"].get(morph, 0))
                        extended_morph.update(Morphology.feats_to_dict(self.vocab.strings[doc.c[j].morph]))

                        ### Add code here??
                        #Compare extended_morph dictionary to my own dictionary???

                        doc.c[j].morph = self.vocab.morphology.add(extended_morph)
                    else:
                        # clobber
                        doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"].get(morph, 0))
                # set POS
                if doc.c[j].pos == 0 or overwrite:
                    doc.c[j].pos = self.cfg["labels_pos"].get(morph, 0)

I wonder if I should modify the indicated section. Equally important, I am not certain how to modify the morphologizer in a way that integrates with the rest of the pipeline and with my own custom language class 'ak' (for Akkadian). The file morphologizer.pyx is Cython, so I cannot (if I am correct) simply extend the Morphologizer class in Python, as I did to modify the default Lemmatizer for my language class. The modifications I've tried so far create problems in the pipeline. Should I be looking at modifying only the 'Model Architecture' as briefly discussed here (https://spacy.io/api/architectures), or perhaps defining a custom function using the decorator notation, as illustrated here (https://spacy.io/usage/training/#custom-functions)? I do not really understand how these features work.

Thank you

polm · 2022-12-26T04:51:33Z

It sounds like what you want is a custom component. A component is anything in the pipeline, like the Morphologizer or Parser. You can write a component as just a function that takes in a Doc, modifies it, and returns it.

In this case you could put your component after the Morphologizer in the pipeline, then check the Morphologizer's output against your table of parses and modify the Doc as necessary.

The custom functions you linked to are for things that don't go directly in the pipeline - for example, they happen before the pipeline is loaded, or are used as arguments to other components, like a span suggester or something.
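The comparison step itself does not depend on spaCy, so it can be worked out separately. Below is a minimal sketch, assuming the table stores each parse as a FEATS string such as "Case=Nom|Number=Sing". The function names and the scoring rule (most shared feature=value pairs, then fewest extra features) are assumptions for illustration, not anything spaCy provides, and the feature values are made up.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a FEATS string such as "Case=Nom|Number=Sing" into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def closest_parse(predicted: str, candidates: List[str]) -> str:
    """Return the candidate sharing the most feature=value pairs with the prediction.

    Ties go to the candidate with fewer features absent from the prediction,
    then to the earlier candidate in the list. With no candidates, the
    prediction is returned unchanged.
    """
    if not candidates:
        return predicted
    pred = set(parse_feats(predicted).items())

    def score(candidate: str):
        cand = set(parse_feats(candidate).items())
        return (len(pred & cand), -len(cand - pred))

    return max(candidates, key=score)


# Hypothetical example: the prediction mixes noun and verb features.
predicted = "Case=Nom|Number=Sing|Tense=Past"
candidates = [
    "Case=Nom|Gender=Masc|Number=Sing",
    "Mood=Ind|Number=Sing|Person=3|Tense=Past",
]
best = closest_parse(predicted, candidates)
```

Both candidates share two pairs with the prediction here, so the tie-break on extra features decides. A different scoring rule (for example, weighting POS-specific features more heavily) may suit real data better.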

4 replies

megamattc Dec 26, 2022
Author

I see. Should I then model my code like the *.py files in /spacy/pipeline which specify default pipeline components (attribute_ruler.py, lemmatizer.py)? Since my custom pipeline component is meant to be language specific to my custom language (Akkadian), could my file be placed in the folder defining my language class, or should it be placed in the /spacy/pipeline folder?

polm Dec 27, 2022

You don't need to add this to the main spaCy source tree, and since it's not a trained component you don't need to follow the core components closely - even the rule based ones are more thorough than most user components have to be. In particular it sounds like you can make your component a stateless one, which can be a simple function.

You could put this in a directory for your language class, or just in a file wherever you're working. It could be something like this...

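# my_component.py

from spacy.language import Language

@Language.component("custom_morphologizer")
def custom_morphologizer(doc):
    # Do something to the doc here
    return doc

# main.py
import spacy
import my_component  # importing the module registers the component

nlp = spacy.load(... trained Akkadian pipeline ...)
# The name must match the one passed to @Language.component above.
# By default add_pipe appends the component to the end of the pipeline.
nlp.add_pipe("custom_morphologizer")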

When you import my_component, the my_component.py file will be read in and executed. Since the file contains just a function definition, all that happens is that the decorator runs and registers the function under the name "custom_morphologizer", which is what allows add_pipe to find it.
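To make the "do something to the doc" part more concrete, here is a rough sketch of a component body that reads each token's predicted features and replaces them when they are not among the allowed parses. PARSE_TABLE, its contents, and the overlap rule are hypothetical. The blank "xx" pipeline and the hand-built Doc only stand in for a trained Akkadian pipeline so the example can run on its own.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical table: surface form -> allowed FEATS strings (made-up values).
PARSE_TABLE = {
    "sharrum": [
        "Case=Nom|Gender=Masc|Number=Sing",
        "Case=Acc|Gender=Masc|Number=Sing",
    ],
}


def feats_to_dict(feats):
    """Split a FEATS string into a {feature: value} dict."""
    return dict(pair.split("=", 1) for pair in feats.split("|") if pair)


@Language.component("custom_morphologizer")
def custom_morphologizer(doc):
    for token in doc:
        candidates = PARSE_TABLE.get(token.text)
        # Keep the prediction if the form is unknown or the parse is already allowed.
        if not candidates or str(token.morph) in candidates:
            continue
        predicted = token.morph.to_dict()

        def overlap(candidate):
            # Number of feature=value pairs the candidate shares with the prediction.
            return sum(predicted.get(k) == v for k, v in feats_to_dict(candidate).items())

        token.set_morph(max(candidates, key=overlap))
    return doc


# Stand-in for a trained pipeline: a blank multi-language pipeline.
nlp = spacy.blank("xx")
nlp.add_pipe("custom_morphologizer")

# Simulate a morphologizer prediction by setting the features by hand.
doc = Doc(nlp.vocab, words=["sharrum"], morphs=["Case=Nom|Tense=Past"])
doc = nlp.get_pipe("custom_morphologizer")(doc)
```

In a real pipeline the component would simply run as part of nlp(text), after the morphologizer has set token.morph.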

megamattc Dec 27, 2022
Author

I see. Thank you. I have made such a stateless function as you describe. I was curious, though, if I did want to make a more complicated component that gets included in the training process, I would need to create a Python class with initialization functions similar to those specifying regular components like lemmatizer.py, attributeruler.py, and entityruler.py, yes?

polm Dec 27, 2022

If you want to make a component that participates in the training process (see the note below), then you would need to implement a class with a number of methods such as update. None of the components you mentioned are trainable, they're all rule-based, so you'd need to refer to something else, like the textcat.

Note that this is only necessary for components doing updates during training, and that simple rule-based components can also be turned on during training without any special preparation.
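For reference, there is a middle ground between a stateless function and a trainable component: a rule-based class that holds some state (such as the parse table) and is registered with @Language.factory. This is only a sketch. The factory name "table_morph_filter", the class, the "table" setting, and the rule it applies are all made up for illustration, and it does not take part in training.

```python
from typing import Dict, List

import spacy
from spacy.language import Language
from spacy.tokens import Doc


class TableMorphFilter:
    """Rule-based, non-trainable component that keeps a parse table as state."""

    def __init__(self, name: str, table: Dict[str, List[str]]):
        self.name = name
        self.table = table

    def __call__(self, doc: Doc) -> Doc:
        for token in doc:
            candidates = self.table.get(token.text, [])
            # Deliberately simple rule: only act when the table leaves one choice.
            if len(candidates) == 1 and str(token.morph) != candidates[0]:
                token.set_morph(candidates[0])
        return doc


# Hypothetical factory name and setting; spaCy passes nlp and name automatically.
@Language.factory("table_morph_filter", default_config={"table": {}})
def create_table_morph_filter(nlp: Language, name: str, table: Dict[str, List[str]]):
    return TableMorphFilter(name, table)


nlp = spacy.blank("xx")  # stand-in for a trained pipeline
pipe = nlp.add_pipe(
    "table_morph_filter",
    config={"table": {"sharrum": ["Case=Nom|Gender=Masc|Number=Sing"]}},
)
doc = pipe(Doc(nlp.vocab, words=["sharrum"], morphs=["Tense=Past"]))
```

Passing the table through the config is fine for a toy example. A large table would more likely be loaded from a file.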
