Support for Hebrew #12497
Replies: 3 comments 10 replies
-
Most of the models are trained on UD corpora and most of them have
-
Here are the slightly out-of-date basics for adding a new language to the trained pipelines: #3056

We look for corpora with licenses that allow commercial use and that include annotations for POS tags, lemmas, dependency parses, and named entities. Most languages use a UD corpus for POS tags, lemmas, and dependencies plus a separate NER corpus, but sometimes a single corpus has all annotation layers. NER corpora with appropriate licenses are typically harder to find, and Hebrew would probably need a third-party lemmatizer.

Hebrew in particular may be difficult for two reasons: spaCy has limited ability to handle multi-word tokens from UD corpora, and spaCy's built-in components do not perform well on Hebrew, see e.g. https://aclanthology.org/D19-3044.pdf.
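To get a feel for how much the multi-word token issue affects a given UD corpus, one option is to count them before committing to it. The following is a minimal sketch, not part of spaCy: it relies only on the CoNLL-U convention that a multi-word token line has an ID range such as `2-3` and an empty node has a decimal ID such as `2.1`. The function name `count_multiword_tokens` and the toy sample are made up for illustration; the sample is Spanish-like text, not taken from any real treebank.

```python
def count_multiword_tokens(conllu_text):
    """Count sentences, syntactic words, and multi-word token ranges in CoNLL-U text."""
    stats = {"sentences": 0, "words": 0, "multiword_tokens": 0}
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        token_id = line.split("\t")[0]
        if not in_sentence:
            stats["sentences"] += 1
            in_sentence = True
        if "-" in token_id:
            # ID range, e.g. "2-3": a surface token spanning several words.
            stats["multiword_tokens"] += 1
        elif "." in token_id:
            # Empty node, e.g. "2.1": neither a word nor a multi-word token.
            continue
        else:
            stats["words"] += 1
    return stats


# Toy sample (hypothetical): "del" is one surface token covering "de" + "el".
SAMPLE = "\n".join([
    "# text = Vengo del mar",
    "\t".join(["1", "Vengo", "venir", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2-3", "del", "_", "_", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "de", "de", "ADP", "_", "_", "4", "case", "_", "_"]),
    "\t".join(["3", "el", "el", "DET", "_", "_", "4", "det", "_", "_"]),
    "\t".join(["4", "mar", "mar", "NOUN", "_", "_", "1", "obl", "_", "_"]),
    "",
])

if __name__ == "__main__":
    print(count_multiword_tokens(SAMPLE))
```

A high ratio of multi-word tokens to words would suggest that the mismatch between the raw text and the annotated words needs to be dealt with before training.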
-
Hello again @adrianeboyd

I have some questions regarding annotated corpora for NER in Hebrew. It is mentioned here that the entity recognizer identifies non-overlapping labelled spans of tokens. I'm a bit unsure what this implies exactly: does it mean that the entity recognizer doesn't identify nested entities (an entity within a longer entity)? Should I be concerned if the NER corpus I choose contains nesting? If you have any additional insights or characteristics that I should consider before selecting a suitable NER corpus, I would greatly appreciate it!

On a related note, based on our previous correspondence, I learned about the importance of avoiding corpora whose annotation schemes add covert morphemes. Could you please confirm whether this concern is limited to tokenization and UD corpora, or whether it also applies to NER corpora? Does it matter if my chosen NER corpus adds covert morphemes as part of the NER annotation?

Thank you for your assistance in clarifying these points.
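To make the nesting question concrete, here is a minimal sketch of what reducing a nested annotation to non-overlapping spans could look like. It is plain Python rather than spaCy code; the function name `flatten_nested`, the `(start, end, label)` tuple layout with an exclusive end, and the example labels are all assumptions for illustration. The strategy (keep the longest span, break ties by the earliest start) is only one possible choice, and as far as I understand it is similar in spirit to what `spacy.util.filter_spans` does for `Span` objects.

```python
def flatten_nested(spans):
    """Reduce possibly nested or overlapping spans to a non-overlapping set.

    spans: list of (start, end, label) token offsets, with end exclusive.
    Longer spans win; among equally long spans, the earlier one wins.
    """
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, label in ordered:
        if any(i in taken for i in range(start, end)):
            # Overlaps a span that was already kept, so drop it.
            continue
        kept.append((start, end, label))
        taken.update(range(start, end))
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical annotation of the tokens ["Tel", "Aviv", "University", "hired", "Dana"]:
    # an ORG over tokens 0-2 with a nested GPE over tokens 0-1, plus a separate PER.
    nested = [(0, 3, "ORG"), (0, 2, "GPE"), (4, 5, "PER")]
    print(flatten_nested(nested))
```

With this strategy the inner `GPE` span is discarded, so a corpus with a lot of nesting would lose that information when converted this way.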
-
Hello,
I am interested in adding full support for Hebrew in spaCy. After searching through the spaCy Discussions forum, I found an older thread discussing this same issue.
Recently, a new UD corpus for Hebrew was published under the CC BY-SA 4.0 license. However, I'm uncertain whether this corpus can be used with spaCy because of a possible license incompatibility. Can someone provide clarification on this matter?