Handling of text metadata in mSUD annotated treebanks #54

@bguil

When a sentence is annotated in mSUD, there are two different text representations to handle.
For instance, sentence AN14_06-06 in the mSUD Ika treebank contains two mSUD tokens: a- and gwasi.
We then have two "text" representations:

  • morph level: a- gwasi
  • word level: agwasi
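
The relation between the two levels can be sketched as follows. This is only an illustration, under the assumption that a trailing - marks a morph attached to the following one and a leading - marks a morph attached to the preceding one; the function name `word_level_text` is hypothetical and is not part of any existing tool.

```python
def word_level_text(morphs):
    """Join mSUD morph forms into a word-level text.

    Assumption (not confirmed by the issue): a trailing "-" glues a morph
    to the next one, a leading "-" glues it to the previous one.
    """
    pieces = []
    glue_next = False  # True when the previous morph ended with "-"
    for morph in morphs:
        attach_left = len(morph) > 1 and morph.startswith("-")
        attach_right = len(morph) > 1 and morph.endswith("-")
        core = morph[1:] if attach_left else morph
        core = core[:-1] if attach_right else core
        if pieces and (glue_next or attach_left):
            pieces[-1] += core  # same word as the previous morph
        else:
            pieces.append(core)  # start of a new word
        glue_next = attach_right
    return " ".join(pieces)


morphs = ["a-", "gwasi"]
print(" ".join(morphs))         # morph level: a- gwasi
print(word_level_text(morphs))  # word level:  agwasi
```

The sketch ignores punctuation spacing and any orthographic changes at morph boundaries, which is one reason why the word-level text may need to be stored explicitly rather than derived.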

We need to decide the way to encode this consistently in mSUD treebanks.

In our current mSUD data we have a very inconsistent state:

| lang | morph level text | word level |
|---|---|---|
| Beja | inconsistent phonetic_text / text | inconsistent |
| Gbaya | text | phonetic_text |
| Pesh | morphemic_text | text |
| Tuwari | text | not available |
| Ika & Bokota (in ArboratorGrew) | morphemic_text | text_ortho & text |

I suggest adopting the "Pesh" convention (morphemic_text for the morph level, text for the word level), because the name morphemic_text is more explicit and less confusing than phonetic_text.
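
As a rough sketch of what such a harmonisation could look like, the snippet below renames metadata keys in the comment lines of a CoNLL-U file. The function `rename_metadata` and the mapping `GBAYA_TO_PESH` are hypothetical names; the mapping is read off the Gbaya row above, and the sample values are borrowed from the Ika example purely for illustration.

```python
import re

# Matches CoNLL-U sentence-level metadata lines such as "# text = ..."
META = re.compile(r"^#\s*(\w+)\s*=\s*(.*)$")


def rename_metadata(lines, mapping):
    """Rename metadata keys according to `mapping`.

    `lines` are assumed to have no trailing newline. Each line is
    rewritten at most once, so swapping two keys is safe.
    """
    out = []
    for line in lines:
        m = META.match(line)
        if m and m.group(1) in mapping:
            out.append(f"# {mapping[m.group(1)]} = {m.group(2)}")
        else:
            out.append(line)  # token lines and other metadata are untouched
    return out


# Hypothetical mapping from the current Gbaya keys to the "Pesh" keys.
GBAYA_TO_PESH = {"text": "morphemic_text", "phonetic_text": "text"}

sample = ["# text = a- gwasi", "# phonetic_text = agwasi"]
for line in rename_metadata(sample, GBAYA_TO_PESH):
    print(line)
```

Treebanks whose current state is inconsistent (Beja) or incomplete (Tuwari) would need more than a key rename.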

Note that this would also require adapting ArboratorGrew to take into account the fact that mSUD is special in this respect:

  • in UD and SUD: text is the concatenation of the tokens' FORM fields (taking into account MWT tokens and SpaceAfter)
  • in mSUD: text is not the concatenation...
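
To make the contrast concrete, here is a minimal sketch of the UD/SUD rule, assuming standard 10-column CoNLL-U token lines. The helpers `text_from_conllu` and `tok` are hypothetical and do not reflect how ArboratorGrew actually computes the text.

```python
def text_from_conllu(token_lines):
    """Rebuild the sentence text from CoNLL-U token lines (UD/SUD rule)."""
    parts = []
    skip_until = 0  # last word ID covered by the current multiword token
    for line in token_lines:
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty node: not part of the surface text
        if "-" in tok_id:
            # MWT range line: its FORM is the surface form
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue  # word inside an MWT range: already covered
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


def tok(tok_id, form, misc="_"):
    """Build a 10-column token line with unused columns set to "_"."""
    return "\t".join([tok_id, form] + ["_"] * 7 + [misc])


ud_sentence = [
    tok("1", "Il"), tok("2", "parle"),
    tok("3-4", "du"), tok("3", "de"), tok("4", "le"),
    tok("5", "livre", "SpaceAfter=No"), tok("6", "."),
]
print(text_from_conllu(ud_sentence))   # Il parle du livre.

msud_sentence = [tok("1", "a-"), tok("2", "gwasi")]
print(text_from_conllu(msud_sentence))  # a- gwasi
```

Applied to the two mSUD tokens, the same rule yields the morph-level string a- gwasi rather than the word-level agwasi, which is why a tool that regenerates text from FORM would have to treat mSUD differently.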

Note also that in the Chinese, Taigi and Teochew treebanks, morphs do not contain the - symbol, so only one "text" field is needed.
