Help on building Akkadian language model from scratch #11516
-
Hi Matthew, I hope I can help. I used to work for Oracc as an annotator of Akkadian texts and have trained many models for ancient Greek. For a project like yours, I think you should use as templates the spaCy [Project: Part-of-speech Tagging & Dependency Parsing (Universal Dependencies)](https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_ud) and [Project: Universal Dependencies v2.5 Benchmarks](https://github.com/explosion/projects/tree/v3/benchmarks/ud_benchmark). For the data that you need to produce, you can look at the two existing Universal Dependencies Akkadian corpora, which are in a format that spaCy can process. You can find those corpora here: https://universaldependencies.org/. In the CoNLL-U files you can see the annotation scheme that you need. For instance, the word ēkal is annotated as:

1	ēkal	ēkallu	NOUN	N	Gender=Fem|NounBase=Bound|Number=Sing	0	root	_	E₂.GAL

That is, for the first word of the line, you get the lemma, the part of speech, and the morphological annotation (Gender=Fem|NounBase=Bound|Number=Sing).

You would need to produce some files with similar annotations. Since the Oracc corpus does not include dependency annotations, I would create my own annotated corpus following the UD conventions. For this you could use whatever tools the UD page offers for annotating, and try to convert the ATF files of Oracc to something close to CoNLL-U that you could use as a draft.

Jacobo
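A practical way to get that data into spaCy, for what it's worth, is the documented `spacy convert` command, which turns CoNLL-U files into spaCy's binary training format. A minimal sketch (the treebank path is a placeholder for wherever the downloaded UD Akkadian files live):

```python
# Sketch: convert a UD Akkadian CoNLL-U file into spaCy's binary
# training format via the documented "spacy convert" CLI. The input
# path is hypothetical; the output directory must already exist.
import subprocess

subprocess.run(
    ["python", "-m", "spacy", "convert",
     "UD_Akkadian-RIAO/akk_riao-ud-train.conllu",  # hypothetical path
     "corpus/",
     "--converter", "conllu"],
    check=True,
)
```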
-
Hi Jacobo,

If I may respond to a few things:

"I do not think that the morphologizer's function is to decompose the word into stem and affixes; this is something that the edit tree lemmatizer does, and it is part of the lemmatization process."

If this is so, I wonder how spaCy is to do things like recognize pronominal suffixes as being arguments of a verb, or objects of prepositions? It seems like the lemmatizer just wants to strip away affixes to get to the lemma of the token and doesn't retain affix info. So it would not be able to help the dependency parser or anything else that needs to know about such suffixes...

"I personally would keep bēliya as a single word and add the pronominal information to the FEATS field. I would not decompose the token because this is not the format of the raw data."

That sounds more promising. However, the UD list of universal features (https://universaldependencies.org/u/feat/index.html) doesn't include anything like this (Poss just means an adjective is possessive). According to their guidelines, I should just use underscores to create a new feature category. Thus for bēliya I would put PossSuffNum=Sing|PossSuffPerson=1, and for šupuršu, AccSuffNum=Sing|AccSuffPerson=3?

"I notice that the RIAO corpus annotates the suffixes this way:

4-5	ikribīšu	_	_	_	_	_	_	_	_
4	ikribī	ikribu	NOUN	N	Gender=Masc|NounBase=Suffixal|Number=Plur	6	obj	_	ŠUD₃-šu₂
5	šu	_	PRON	_	Gender=Masc|Number=Sing|Person=3	4	det:poss	_	_

It gives ikribīšu first, as token 4-5, and then the decomposition. Why don't you try to train a model with this corpus and see how it behaves?"

Indeed, I noticed this. Is this not, however, going against what you said above about working only with what is in the text and not separating the pronominal suffixes by hand?

Matt
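(For what it's worth, spaCy itself doesn't restrict FEATS to the universal inventory, so custom features along these lines can at least be set and queried on tokens. A quick sketch, using the feature names proposed above:)

```python
# Sketch: spaCy accepts arbitrary UD-style feature strings, so the
# custom PossSuffNum/PossSuffPerson features proposed above can be
# attached to a token. "xx" is spaCy's multi-language fallback.
import spacy

nlp = spacy.blank("xx")
doc = nlp.make_doc("bēliya")
doc[0].set_morph("PossSuffNum=Sing|PossSuffPerson=1")
print(doc[0].morph.get("PossSuffPerson"))  # ['1']
```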
-
Apparently, RIAO does indicate in its FEATS that a noun has a suffix (NounBase=Suffixal), but it does not give features to the suffix, nor does it say anything for verbs.
-
Hi Matt,
This is all very interesting, and I think you should contact me directly at jmyerstonATucsd.edu, since we could collaborate on writing an Akkadian module for spaCy. Since we are in the UC system, there are some resources we could use for collaborating. I was also at Berkeley once for an Oracc training with Niek.
Going back to your question (I hope a spaCy main developer jumps in here to help).
It is possible that spaCy's tokenizer learns from annotations like the ones we are seeing in cases like ikribīšu. (I need to correct myself here: this is the task of the tokenizer, not of the lemmatizer.)
Once the text is properly tokenized, you can annotate the deps. I face such issues all the time annotating ancient Greek with Prodigy. If the tokenizer is not splitting the tokens properly, I must go back to the source code and fix it.
My argument against separating pronominal suffixes from nouns is that we do not do this with verbs, although we could. One could argue that the ending -o in Latin and Spanish is a first-person marker and split it off the stem. This could help with coreference resolution, for sure, but it also creates an overly analytical and nonstandard representation of the language.
Maybe the felt need to separate the pronominal suffixes from nouns comes from the idea that nouns do not have person. But Akkadian verbs have gender like nouns and adjectives do, and we don't find that as offensive as the idea that the inflection of a noun can indicate person.
The problem with my approach is that it breaks compatibility with what has already been annotated by Luukko's team. Since you need as much data as possible to train models, it would be good if you can use the existing UD corpora.
So, the first step, I would say, is to train models with the existing UD corpora and test them well. Then examine the advantages and shortcomings of the annotation schema they are using. If spaCy is learning from Luukko's decomposition annotations (you will see this while tokenizing a text that contains the same vocabulary), then you can just take over the same annotation model. If not, you have a good reason either to write Akkadian support files for spaCy that can handle such annotations, or to carefully develop your own annotation system.
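Training itself can follow the project templates linked above. As a minimal sketch, the documented `spacy train` CLI run over the converted corpus (the config and all paths here are placeholders, following the tagger_parser_ud layout):

```python
# Sketch: train a tagger/parser pipeline on the converted UD data with
# the documented "spacy train" CLI. config.cfg would come from
# "spacy init config" or the tagger_parser_ud project template; all
# paths are placeholders.
import subprocess

subprocess.run(
    ["python", "-m", "spacy", "train", "config.cfg",
     "--output", "training/",
     "--paths.train", "corpus/train.spacy",
     "--paths.dev", "corpus/dev.spacy"],
    check=True,
)
```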
-
Not knowing Akkadian, I don't feel I can usefully comment on the linguistic question of whether such words are better split into two separate tokens or analysed morphologically as single tokens. However, when considering the feasibility of the two approaches, it's important to bear in mind that the tokenizer is based on hand-written rules, while the morphologizer/lemmatizer are based on trained machine-learning models. The tokenizer is thus more likely to be an appropriate choice to capture a small number of distinctive suffixes, e.g. the situation in Spanish, where there is a closed class of clitics that can attach to the verb.

I'm currently doing some work on the morphologizer/lemmatizer to try and fix the problem that the affixes they use for training are of fixed length and are too short for many languages. If Akkadian regularly has suffixes on the lemma stem with a combined length of more than 3 characters, you could benefit from what I am doing. Unfortunately it's still very much work in progress, hasn't been tested or reviewed and isn't officially supported yet, but if you're feeling brave you could have a look.

Please get back in touch if there are more specific questions: I heard the call for a spaCy main developer but am not sure I've fully understood what the open points are.
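To make the rule-based option concrete, here is a minimal sketch of a tokenizer special case that splits a known form into host and suffix, using the ikribīšu example quoted earlier in the thread (note that the pieces must concatenate back to the original string):

```python
# Sketch: a hand-written tokenizer rule ("special case") that splits a
# closed-class suffix off its host word, analogous to Spanish clitic
# handling. The form and split come from the RIAO example above.
import spacy
from spacy.attrs import ORTH

nlp = spacy.blank("xx")  # multi-language fallback; no "akk" code yet
nlp.tokenizer.add_special_case("ikribīšu", [{ORTH: "ikribī"}, {ORTH: "šu"}])
print([t.text for t in nlp("ikribīšu")])  # ['ikribī', 'šu']
```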
-
Hi Richard,

Thanks for your input. I have actually tried just encoding suffixes and other morphological features of a token as attributes of the lemma itself. It seems to work reasonably well with enough training examples, even though it is not rule-based.

I was wondering how you would recommend I go about adding some concord/agreement rules to the dependency parser so that, e.g., it assigns a noun as subject of a verb only if their explicit person/number/gender features match? I looked at the code of the dependency parser (https://github.com/explosion/spaCy/blob/master/spacy/pipeline/dep_parser.pyx), but it isn't clear where the concrete parsing occurs (although I know that parsers are evaluated according to likelihood). Similarly, the DependencyMatcher assumes the parser has already been run on the text.

In addition, I've made an early version of an Akkadian Language class for spaCy modeled on the existing classes, and I was wondering how formalized and tidy the package has to be in order to be allowed into the spaCy repository?

Matt
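(For reference, the DependencyMatcher mentioned above does indeed run over an already-parsed Doc, so it can query the parse but not constrain it. A sketch using the documented pattern keys; the pattern itself and the loaded pipeline are illustrative:)

```python
# Sketch: DependencyMatcher matches over an existing parse. This
# pattern finds a verb together with its nsubj dependent; the loaded
# pipeline is a placeholder for any pipeline with a parser.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")  # placeholder pipeline
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("VERB_SUBJ", [pattern])
matches = matcher(nlp("The king wrote a letter."))
```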
-
There are certainly constraint-based dependency parsers that support such rules, but the spaCy dependency parser is a purely statistical transition-based parser that is based completely on machine learning. It generally works surprisingly well, even though morphological tags do not form part of its input. Just as a morphologizer model can learn the mappings from words, word shapes and affixes to morphological tags, a dependency parser model can directly learn how those features affect dependency parsing without capturing the morphology as an overt intermediate step.

Note that because the morphologizer and the dependency parser are autonomous models, there is no guarantee that their outputs line up if one makes an error: a word might be marked as the subject of a verb and simultaneously be analysed as having some case marking that is incompatible with that.

An "Akkadian Language class" could mean several different things. If you've checked it into GitHub, perhaps you could share a link to it? Then I can comment further.

Richard
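Since nothing enforces that the two models agree, one pragmatic option is a post-hoc consistency check over the parsed Doc rather than a constraint inside the parser. A rough sketch (the nsubj label and Number feature are just examples):

```python
# Sketch: flag tokens whose dependency label and morphological features
# look inconsistent, since the parser and morphologizer are independent
# models. "nsubj" and "Number" are illustrative choices.
def flag_agreement_clashes(doc):
    for token in doc:
        if token.dep_ == "nsubj":
            t_num = token.morph.get("Number")
            h_num = token.head.morph.get("Number")
            if t_num and h_num and t_num != h_num:
                yield token, token.head  # potential mismatch to review
```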
-
Thanks for the link, I hope to look at this over the next week or so.

The dependency parser isn't aiming to derive a semantic representation of each sentence, but rather to recognise and classify the syntactic relationships between the words in each sentence. This means that pronominal suffixes on a verb are not something the dependency parser will (or indeed can) ever learn to recognise, simply because they're outside the scope of what it's trying to do. That said, it should learn that a verb with a certain pronominal suffix doesn't have an overt nominal object of the corresponding type, if that is indeed so for the language in question.
-
I've had a look at your code and I think I've found a cause for the poor parser accuracy you describe. The
-
Ah, that is interesting. Thank you for looking at this issue. I look forward to hearing from you next week!

Matt
-
At the moment the
The
The lengths 1 for prefixes and 3 for suffixes are set with these methods, which you could override in your own, language-specific version of
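(The surrounding text is truncated, but the lengths mentioned match spaCy's lexeme PREFIX/SUFFIX attributes, which default to the first character and the last three characters of a token. A hedged sketch of a language-specific override, assuming the lex_attrs pattern used by the in-tree languages:)

```python
# Hypothetical lex_attrs.py for an Akkadian language module: spaCy's
# default lexeme attributes use string[:1] as PREFIX and string[-3:]
# as SUFFIX; widening the suffix window could help with longer
# Akkadian suffix chains. All names here are assumptions.
from spacy.attrs import PREFIX, SUFFIX


def prefix(string: str) -> str:
    return string[:1]  # keep the default single-character prefix


def suffix(string: str) -> str:
    return string[-5:]  # widen the default 3-character suffix to 5


LEX_ATTRS = {PREFIX: prefix, SUFFIX: suffix}
```

These getters would then be wired in via the Defaults of a custom Language class, as in the skeleton sketched further down the thread.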
-
The other question was whether or not this work could be incorporated into the spaCy repository. If you look under https://spacy.io/usage/models#languages, you'll see that there is a group of languages that have published pipelines/models and a second group of languages that have basic support consisting of tokenization, examples, recognizing numbers etc. (you can have a look at some of the directories for languages with basic support in https://github.com/explosion/spaCy/tree/master/spacy/lang to get an idea). We only publish pipelines/models that support a full range of features including named-entity recognition, which (at least yet) doesn't apply to this model. However, when it's finished it would be great to include it in the spaCy Universe.

What would also be great, again when/if everything is finished, would be if you could submit a PR to add basic language support for Akkadian. This would essentially entail copying the files from https://github.com/megamattc/Akkadian-language-models/tree/main/ak that don't refer to trained models into a new subdirectory of https://github.com/explosion/spaCy/tree/master/spacy/lang, and also writing regression tests in a corresponding subdirectory of https://github.com/explosion/spaCy/tree/master/spacy/tests/lang.
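For anyone following along, the basic-support languages under spacy/lang share a common skeleton; a hypothetical version for Akkadian might look like this (all module and class names are assumptions, and "akk" is the ISO 639 code for Akkadian):

```python
# Hypothetical spacy/lang/akk/__init__.py, following the pattern of the
# basic-support languages in spacy/lang. The sibling modules
# (lex_attrs, stop_words) are assumptions, not an existing API.
from spacy.language import BaseDefaults, Language

from .lex_attrs import LEX_ATTRS      # hypothetical sibling module
from .stop_words import STOP_WORDS    # hypothetical sibling module


class AkkadianDefaults(BaseDefaults):
    lex_attr_getters = LEX_ATTRS
    stop_words = STOP_WORDS


class Akkadian(Language):
    lang = "akk"
    Defaults = AkkadianDefaults


__all__ = ["Akkadian"]
```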
-
Hello,
I am a graduate student in Assyriology trying to build a language model for Akkadian (a Semitic language written in cuneiform in ancient Mesopotamia) using spaCy, with a lemmatized and coarse POS-tagged corpus of letters available on Oracc (http://oracc.museum.upenn.edu/saao/saa01/corpus). I ultimately hope to use this model as part of a project in automatic metaphor detection. Initially, however, I am interested in building some of the basic NLP components for the model. I've already got a basic lemmatizer and coarse POS-tagger using lookup tables, and a dependency parser using hand-annotated training data made in Inception.
I was wondering if there are more detailed instructions somewhere for how to build a Morphologizer? The specification of the Morphologizer class (https://spacy.io/api/morphologizer) indicates one can supply morphology examples, but does not indicate what those examples should look like for a (partially) concatenative language like Akkadian. It's also not totally clear to me from the specification what one should do to the config.cfg file to enable training the model on these data examples, or how one could import an external morphologizer into the pipeline.
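For concreteness, here is my best guess at what such examples look like, based on the documented Example.from_dict API (the tokens and feature values are illustrative):

```python
# Sketch: constructing a Morphologizer training example with the
# documented Example.from_dict API. "xx" is spaCy's multi-language
# fallback; the Akkadian tokens and feature strings are illustrative.
import spacy
from spacy.training import Example

nlp = spacy.blank("xx")
doc = nlp.make_doc("ēkal māti")
example = Example.from_dict(
    doc,
    {
        "pos": ["NOUN", "NOUN"],
        "morphs": [
            "Gender=Fem|NounBase=Bound|Number=Sing",
            "Gender=Fem|Number=Sing",
        ],
    },
)
print(example.reference[0].morph)
```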
Matthew Ong