Help on building Akkadian language model from scratch #11516
-
Hi Matthew, I hope I can help. I used to work for Oracc as an annotator of Akkadian texts and have trained many models for ancient Greek. For a project like yours, I think you should use as templates the spaCy [Project: Part-of-speech Tagging & Dependency Parsing (Universal Dependencies)](https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_ud) and [Project: Universal Dependencies v2.5 Benchmarks](https://github.com/explosion/projects/tree/v3/benchmarks/ud_benchmark). For the data that you need to produce, you can look at the two existing Universal Dependencies Akkadian corpora, which are in a format that spaCy can process. You can find those corpora here: https://universaldependencies.org/. In the CoNLL-U files you can see the annotation scheme that you need. For instance, the word ēkal is annotated as:

1	ēkal	ēkallu	NOUN	N	Gender=Fem|NounBase=Bound|Number=Sing	0	root	_	E₂.GAL

That is, for the first word of the line, you get the lemma, the part of speech, and the morphological annotation (Gender=Fem|NounBase=Bound|Number=Sing).

You would need to produce some files with similar annotations. Since the Oracc corpus does not include dependency annotations, I would create my own annotated corpus following the UD conventions. For this you could use whatever tools the UD page offers for annotating, and try to convert the ATF files of Oracc to something close to CoNLL-U that you could use as a draft.

Jacobo
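A practical way to get that data into spaCy, for what it's worth, is the documented `spacy convert` command, which turns CoNLL-U files into spaCy's binary training format. A minimal sketch (the treebank path is a placeholder for wherever the downloaded UD Akkadian files live):

```python
# Sketch: convert a UD Akkadian CoNLL-U file into spaCy's binary
# training format via the documented "spacy convert" CLI. The input
# path is hypothetical; the output directory must already exist.
import subprocess

subprocess.run(
    ["python", "-m", "spacy", "convert",
     "UD_Akkadian-RIAO/akk_riao-ud-train.conllu",  # hypothetical path
     "corpus/",
     "--converter", "conllu"],
    check=True,
)
```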
-
Hi Jacobo,

If I may respond to a few things:

"I do not think that the morphologizer's function is to decompose the word into stem and affixes; this is something that the edit tree lemmatizer does, and it is part of the lemmatization process."

If this is so, I wonder how spaCy is to do things like recognize pronominal suffixes as being arguments of a verb, or objects of prepositions? It seems like the lemmatizer just wants to strip away affixes to get to the lemma of the token and doesn't retain affix info. So it would not be able to help the dependency parser or anything else that needs to know about such suffixes...

"I personally would keep bēliya as a single word and add the pronominal information to the FEATS field. I would not decompose the token because this is not the format of the raw data."

That sounds more promising. However, the UD list of universal features (https://universaldependencies.org/u/feat/index.html) doesn't include anything like this (Poss just means an adjective is possessive). According to their guidelines, I should just use underscores to create a new feature category. Thus for bēliya I would put PossSuffNum=Sing|PossSuffPerson=1, and for šupuršu, AccSuffNum=Sing|AccSuffPerson=3?

"I notice that the RIAO corpus annotates the suffixes this way:

4-5	ikribīšu	_	_	_	_	_	_	_	_
4	ikribī	ikribu	NOUN	N	Gender=Masc|NounBase=Suffixal|Number=Plur	6	obj	_	ŠUD₃-šu₂
5	šu	_	PRON	_	Gender=Masc|Number=Sing|Person=3	4	det:poss	_	_

It gives ikribīšu first, as token 4-5, and then the decomposition. Why don't you try to train a model with this corpus and see how it behaves?"

Indeed, I noticed this. Is this not, however, going against what you said above about working only with what is in the text and not separating the pronominal suffixes by hand?

Matt
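(For what it's worth, spaCy itself doesn't restrict FEATS to the universal inventory, so custom features along these lines can at least be set and queried on tokens. A quick sketch, using the feature names proposed above:)

```python
# Sketch: spaCy accepts arbitrary UD-style feature strings, so the
# custom PossSuffNum/PossSuffPerson features proposed above can be
# attached to a token. "xx" is spaCy's multi-language fallback.
import spacy

nlp = spacy.blank("xx")
doc = nlp.make_doc("bēliya")
doc[0].set_morph("PossSuffNum=Sing|PossSuffPerson=1")
print(doc[0].morph.get("PossSuffPerson"))  # ['1']
```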
-
Apparently, RIAO does indicate in its FEATS that a noun has a suffix (NounBase=Suffixal), but it does not give features to the suffix, nor does it say anything for verbs.
-
Hi Matt,
This is all very interesting, and I think you should contact me directly at jmyerstonATucsd.edu, since we could collaborate on writing an Akkadian module for spaCy. Since we are in the UC system, there are some resources we could use for collaborating. I was also at Berkeley once for an Oracc training with Niek.
Going back to your question (I hope a spaCy main developer jumps in here to help).
It is possible that spaCy's tokenizer learns from annotations like the ones we are seeing in cases like ikribīšu. (I need to correct myself here: this is the task of the tokenizer, not of the lemmatizer.)
Once the text is properly tokenized, you can annotate the deps. I face such issues all the time annotating ancient Greek with Prodigy. If the tokenizer is not splitting the tokens properly, I must go back to the source code and fix it.
My argument against separating pronominal suffixes from nouns is that we do not do this with verbs, although we could. One could argue that the ending -o in Latin and Spanish is a first-person marker and split it off the stem. This could help with coreference resolution, for sure, but it also creates an overly analytical and nonstandard representation of the language.
Maybe the felt need to separate the pronominal suffixes from nouns comes from the idea that nouns do not have person. But Akkadian verbs have gender like nouns and adjectives do, and we don't find that as offensive as the idea that the inflection of a noun can indicate person.
The problem with my approach is that it breaks compatibility with what has already been annotated by Luukko's team. Since you need as much data as possible to train models, it would be good if you can use the existing UD corpora.
So, the first step, I would say, is to train models with the existing UD corpora and test them well. Then examine the advantages and shortcomings of the annotation schema they are using. If spaCy is learning from Luukko's decomposition annotations (you will see this while tokenizing a text that contains the same vocabulary), then you can just take over the same annotation model. If not, you have a good reason either to write Akkadian support files for spaCy that can handle such annotations, or to carefully develop your own annotation system.
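Training itself can follow the project templates linked above. As a minimal sketch, the documented `spacy train` CLI run over the converted corpus (the config and all paths here are placeholders, following the tagger_parser_ud layout):

```python
# Sketch: train a tagger/parser pipeline on the converted UD data with
# the documented "spacy train" CLI. config.cfg would come from
# "spacy init config" or the tagger_parser_ud project template; all
# paths are placeholders.
import subprocess

subprocess.run(
    ["python", "-m", "spacy", "train", "config.cfg",
     "--output", "training/",
     "--paths.train", "corpus/train.spacy",
     "--paths.dev", "corpus/dev.spacy"],
    check=True,
)
```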
-
Not knowing Akkadian, I don't feel I can usefully comment on the linguistic question of whether such words are better split into two separate tokens or analysed morphologically as single tokens. However, when considering the feasibility of the two approaches, it's important to bear in mind that the tokenizer is based on hand-written rules, while the morphologizer/lemmatizer are based on trained machine-learning models. The tokenizer is thus more likely to be an appropriate choice to capture a small number of distinctive suffixes, e.g. the situation in Spanish, where there is a closed class of clitics that can attach to the verb.

I'm currently doing some work on the morphologizer/lemmatizer to try and fix the problem that the affixes they use for training are of fixed length and are too short for many languages. If Akkadian regularly has suffixes on the lemma stem with a combined length of more than 3 characters, you could benefit from what I am doing. Unfortunately it's still very much work in progress, hasn't been tested or reviewed and isn't officially supported yet, but if you're feeling brave you could have a look.

Please get back in touch if there are more specific questions: I heard the call for a spaCy main developer but am not sure I've fully understood what the open points are.
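To make the rule-based option concrete, here is a minimal sketch of a tokenizer special case that splits a known form into host and suffix, using the ikribīšu example quoted earlier in the thread (note that the pieces must concatenate back to the original string):

```python
# Sketch: a hand-written tokenizer rule ("special case") that splits a
# closed-class suffix off its host word, analogous to Spanish clitic
# handling. The form and split come from the RIAO example above.
import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("xx")  # multi-language fallback; no "akk" code yet
nlp.tokenizer.add_special_case("ikribīšu", [{ORTH: "ikribī"}, {ORTH: "šu"}])
print([t.text for t in nlp("ikribīšu")])  # ['ikribī', 'šu']
```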
-
Hi Richard,

Thanks for your input. I have actually tried just encoding suffixes and other morphological features of a token as attributes of the lemma itself. It seems to work reasonably well with enough training examples, even though it is not rule-based.

I was wondering how you would recommend I go about adding some concord/agreement rules to the dependency parser so that, e.g., it assigns a noun as subject of a verb only if their explicit person/number/gender features match? I looked at the code of the dependency parser (https://github.com/explosion/spaCy/blob/master/spacy/pipeline/dep_parser.pyx), but it isn't clear where the concrete parsing occurs (although I know that parsers are evaluated according to likelihood). Similarly, the DependencyMatcher assumes the parser has already been run on the text.

In addition, I've made an early version of an Akkadian Language class for spaCy modeled on the existing classes, and I was wondering how formalized and tidy the package has to be in order to be allowed into the spaCy repository?

Matt
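(For reference, the DependencyMatcher mentioned above does indeed run over an already-parsed Doc, so it can query the parse but not constrain it. A sketch using the documented pattern keys; the pattern itself and the loaded pipeline are illustrative:)

```python
# Sketch: DependencyMatcher matches over an existing parse. This
# pattern finds a verb together with its nsubj dependent; the loaded
# pipeline is a placeholder for any pipeline with a parser.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")  # placeholder pipeline
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("VERB_SUBJ", [pattern])
matches = matcher(nlp("The king wrote a letter."))
```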
-
There are certainly constraint-based dependency parsers that support such rules, but the spaCy dependency parser is a purely statistical transition-based parser that is based completely on machine learning. It generally works surprisingly well, even though morphological tags do not form part of its input. Just as a morphologizer model can learn the mappings from words, word shapes and affixes to morphological tags, a dependency parser model can directly learn how those features affect dependency parsing without capturing the morphology as an overt intermediate step.

Note that because the morphologizer and the dependency parser are autonomous models, there is no guarantee that their outputs line up if one makes an error: a word might be marked as the subject of a verb and simultaneously be analysed as having some case marking that is incompatible with that.

An "Akkadian Language class" could mean several different things. If you've checked it into GitHub, perhaps you could share a link to it? Then I can comment further.

Richard
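Since nothing enforces that the two models agree, one pragmatic option is a post-hoc consistency check over the parsed Doc rather than a constraint inside the parser. A rough sketch (the nsubj label and Number feature are just examples):

```python
# Sketch: flag tokens whose dependency label and morphological features
# look inconsistent, since the parser and morphologizer are independent
# models. "nsubj" and "Number" are illustrative choices.
def flag_agreement_clashes(doc):
    for token in doc:
        if token.dep_ == "nsubj":
            t_num = token.morph.get("Number")
            h_num = token.head.morph.get("Number")
            if t_num and h_num and t_num != h_num:
                yield token, token.head  # potential mismatch to review
```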
-
Thanks for the link, I hope to look at this over the next week or so.

The dependency parser isn't aiming to derive a semantic representation of each sentence, but rather to recognise and classify the syntactic relationships between the words in each sentence. This means that pronominal suffixes on a verb are not something the dependency parser will (or indeed can) ever learn to recognise, simply because they're outside the scope of what it's trying to do. That said, it should learn that a verb with a certain pronominal suffix doesn't have an overt nominal object of the corresponding type, if that is indeed so for the language in question.
-
I've had a look at your code and I think I've found a cause for the poor parser accuracy you describe. The
-
Ah, that is interesting. Thank you for looking at this issue. I look forward to hearing from you next week!

Matt
-
At the moment the
The
The lengths 1 for prefixes and 3 for suffixes are set with these methods, which you could override in your own, language-specific version of
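(The surrounding text is truncated, but the lengths mentioned match spaCy's lexeme PREFIX/SUFFIX attributes, which default to the first character and the last three characters of a token. A hedged sketch of a language-specific override, assuming the lex_attrs pattern used by the in-tree languages:)

```python
# Hypothetical lex_attrs.py for an Akkadian language module: spaCy's
# default lexeme attributes use string[:1] as PREFIX and string[-3:]
# as SUFFIX; widening the suffix window could help with longer
# Akkadian suffix chains. All names here are assumptions.
from spacy.attrs import PREFIX, SUFFIX


def prefix(string: str) -> str:
    return string[:1]  # keep the default single-character prefix


def suffix(string: str) -> str:
    return string[-5:]  # widen the default 3-character suffix to 5


LEX_ATTRS = {PREFIX: prefix, SUFFIX: suffix}
```

These getters would then be wired in via the Defaults of a custom Language class, as in the skeleton sketched further down the thread.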
-
The other question was whether or not this work could be incorporated into the spaCy repository. If you look under https://spacy.io/usage/models#languages, you'll see that there is a group of languages that have published pipelines/models and a second group of languages that have basic support consisting of tokenization, examples, recognizing numbers etc. (you can have a look at some of the directories for languages with basic support in https://github.com/explosion/spaCy/tree/master/spacy/lang to get an idea). We only publish pipelines/models that support a full range of features including named-entity recognition, which (at least yet) doesn't apply to this model. However, when it's finished it would be great to include it in the spaCy Universe.

What would also be great, again when/if everything is finished, would be if you could submit a PR to add basic language support for Akkadian. This would essentially entail copying the files from https://github.com/megamattc/Akkadian-language-models/tree/main/ak that don't refer to trained models into a new subdirectory of https://github.com/explosion/spaCy/tree/master/spacy/lang, and also writing regression tests in a corresponding subdirectory of https://github.com/explosion/spaCy/tree/master/spacy/tests/lang.
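For anyone following along, the basic-support languages under spacy/lang share a common skeleton; a hypothetical version for Akkadian might look like this (all module and class names are assumptions, and "akk" is the ISO 639 code for Akkadian):

```python
# Hypothetical spacy/lang/akk/__init__.py, following the pattern of the
# basic-support languages in spacy/lang. The sibling modules
# (lex_attrs, stop_words) are assumptions, not an existing API.
from spacy.language import BaseDefaults, Language

from .lex_attrs import LEX_ATTRS      # hypothetical sibling module
from .stop_words import STOP_WORDS    # hypothetical sibling module


class AkkadianDefaults(BaseDefaults):
    lex_attr_getters = LEX_ATTRS
    stop_words = STOP_WORDS


class Akkadian(Language):
    lang = "akk"
    Defaults = AkkadianDefaults


__all__ = ["Akkadian"]
```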
-
Hello,
I am a graduate student in Assyriology trying to build a language model for Akkadian (a Semitic language written in cuneiform in ancient Mesopotamia) using spaCy, with a lemmatized and coarse POS-tagged corpus of letters available on Oracc (http://oracc.museum.upenn.edu/saao/saa01/corpus). I ultimately hope to use this model as part of a project in automatic metaphor detection. Initially, however, I am interested in building some of the basic NLP components for the model. I've already got a basic lemmatizer and coarse POS-tagger using lookup tables, and a dependency parser using hand-annotated training data made in Inception.
I was wondering if there are more detailed instructions somewhere for how to build a Morphologizer? The specification of the Morphologizer class (https://spacy.io/api/morphologizer) indicates one can supply morphology examples, but does not indicate what those examples should look like for a (partially) concatenative language like Akkadian. It's also not totally clear to me from the specification what one should do to the config.cfg file to enable training the model on these data examples, or how one could import an external morphologizer into the pipeline.
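For concreteness, here is my best guess at what such examples look like, based on the documented Example.from_dict API (the tokens and feature values are illustrative):

```python
# Sketch: constructing a Morphologizer training example with the
# documented Example.from_dict API. "xx" is spaCy's multi-language
# fallback; the Akkadian tokens and feature strings are illustrative.
import spacy
from spacy.training import Example

nlp = spacy.blank("xx")
doc = nlp.make_doc("ēkal māti")
example = Example.from_dict(
    doc,
    {
        "pos": ["NOUN", "NOUN"],
        "morphs": [
            "Gender=Fem|NounBase=Bound|Number=Sing",
            "Gender=Fem|Number=Sing",
        ],
    },
)
print(example.reference[0].morph)
```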
Matthew Ong