When a sentence is annotated in mSUD, there are two different text representations to handle.
For instance, in sentence AN14_06-06 of the mSUD Ika treebank, there are two mSUD tokens: a- and gwasi.
We then have two "text" representations:
- morph level: a- gwasi
- word level: agwasi
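As a rough illustration of how the two levels relate, here is a minimal sketch that glues morphs into words. It assumes that a trailing `-` attaches a morph to the next one and a leading `-` attaches it to the previous one; the function name `word_level_text` is hypothetical, and real word-level text may differ from this naive concatenation (for example because of orthographic conventions).

```python
def word_level_text(morphemic_text: str) -> str:
    """Glue morphs into words, assuming '-' marks the attachment side."""
    words = []
    attach_next = False  # True when the previous morph ended with '-'
    for morph in morphemic_text.split():
        is_marked = len(morph) > 1  # a lone '-' is kept as an ordinary token
        attach_prev = is_marked and morph.startswith("-")
        ends_with_hyphen = is_marked and morph.endswith("-")
        bare = morph.strip("-") if is_marked else morph
        if words and (attach_next or attach_prev):
            words[-1] += bare  # same word as the previous morph
        else:
            words.append(bare)  # start a new word
        attach_next = ends_with_hyphen
    return " ".join(words)


print(word_level_text("a- gwasi"))  # agwasi
```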
We need to decide how to encode this consistently across mSUD treebanks.
Our current mSUD data is in a very inconsistent state:
| lang | morph level text | word level text |
|---|---|---|
| Beja | inconsistent phonetic_text / text | inconsistent |
| Gbaya | text | phonetic_text |
| Pesh | morphemic_text | text |
| Tuwari | text | not available |
| Ika & Bokota (in ArboratorGrew) | morphemic_text | text_ortho & text |
I suggest choosing the "Pesh" line (morphemic_text for the morph level, text for the word level), because the name morphemic_text is more explicit and less confusing than phonetic_text.
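A possible migration toward the "Pesh" convention could be sketched as a rename of the sentence-level metadata keys. The `RENAMES` mapping and the function `rename_metadata` are hypothetical names; the mapping is read off the table above only for the languages where it is unambiguous (Beja and Ika & Bokota are left out because the table does not determine a unique rename for them).

```python
import re

# Hypothetical per-language key renames toward the "Pesh" convention
# (morphemic_text = morph level, text = word level).
RENAMES = {
    "Gbaya": {"text": "morphemic_text", "phonetic_text": "text"},
    "Tuwari": {"text": "morphemic_text"},
    "Pesh": {},  # already follows the target convention
}

# Matches CoNLL-U sentence-level comments of the form "# key = value".
META = re.compile(r"^# (\w+) = (.*)$")


def rename_metadata(lines, lang):
    """Return the lines with metadata keys renamed for the given language."""
    mapping = RENAMES[lang]
    out = []
    for line in lines:
        m = META.match(line)
        # Each line is renamed at most once, so swapping keys is safe.
        if m and m.group(1) in mapping:
            line = f"# {mapping[m.group(1)]} = {m.group(2)}"
        out.append(line)
    return out
```

Because every line is rewritten independently, a key that is both a source and a target (such as `text` for Gbaya) is not renamed twice.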
Note that this would also require adapting ArboratorGrew to take into account the fact that mSUD is special in this respect:
- in UD and SUD: `text` is the concatenation of the tokens' FORM (taking into account MWT tokens and `SpaceAfter`)
- in mSUD: `text` is not the concatenation...
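To make the UD/SUD rule concrete, here is a minimal sketch of how `text` can be rebuilt from token forms. It assumes rows given as `(ID, FORM, MISC)` triples in CoNLL-U order; the function name `ud_text` is hypothetical and this is not ArboratorGrew's actual implementation.

```python
def ud_text(rows):
    """Rebuild sentence text from (ID, FORM, MISC) triples in CoNLL-U order."""
    pieces = []
    skip_until = 0  # last word ID covered by the current multiword token
    for tok_id, form, misc in rows:
        if "." in tok_id:  # empty nodes do not contribute to the text
            continue
        if "-" in tok_id:  # multiword token range, e.g. "1-2"
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue  # word already covered by the multiword token's form
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


rows = [("1-2", "del", "_"), ("1", "de", "_"), ("2", "el", "_"),
        ("3", "mar", "SpaceAfter=No"), ("4", ".", "_")]
print(ud_text(rows))  # del mar.
```

Applied to the two Ika morphs (assuming neither carries a `SpaceAfter` annotation), the same rule yields `a- gwasi`, which is the morph-level text rather than the word-level `agwasi`.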
Note also that in the Chinese, Taigi and Teochew treebanks, morphs do not contain the - symbol, so only one "text" field is needed.