Data augmentation for adding the pronoun 'hen' to Swedish corpora
enhence provides support code for adjusting Swedish text corpora (such as the Stockholm-Umeå Corpus (SUC)) to include examples that use the third person singular gender-neutral pronoun hen. It was written primarily to produce updated corpora for use in training POS taggers for efselab.
A detailed description of the approach can be found in the following paper:
Henrik Björklund and Hannah Devinney (2023). Computer, enhence: POS-tagging improvements for nonbinary pronoun use in Swedish. LT-EDI 2023 -- Third Workshop on Language Technology for Equality, Diversity and Inclusion, 54--61.
To use this code, you will need to edit some paths to point towards the data you wish to modify. enhence uses a relatively simple rule-based approach, which will produce a few minor errors that you may wish to hand-correct. Support code (as well as suggested corrections for SUC) has been provided to aid with this.
- extract-pronoun-sentences-[tab|conll|udt].py
  - Several versions of extract-pronoun-sentences are provided: one for .tab files, one for .conll files, and one for conllu (udt) files. Use the appropriate version for your data.
  - Modify the INPUT_DIR value to reflect the appropriate path before running the code (line 11).
- produce-combined-corpus.py
  - Set the value of HEN_SENTENCES_NO to the number of sentences including 'hen' you would like to augment your corpus with. Preset values reflect the experiments performed in the paper.
  - Modify the INPUT_DIR, henfilename, and outfilename values to reflect the appropriate paths before running the code.
- (optional) hand correction
  - We are aware of two main errors due to the rules being "overapplied", resulting in "hen eller hen" and "hen och hen"-type sentences. Fortunately, there aren't many instances of these, so they can be hand-corrected.
    - "Hen eller hen" almost always needs adjusting, either by condensing a generic "hon eller han" into just "hen", or by reverting one "hen" back to a binary-gendered pronoun.
    - "Hen och hen" only sometimes needs adjusting, by reverting one "hen" back to a binary-gendered pronoun for clarity. Most of the time, this does not produce incorrect or unclear sentences because the string "hen och hen" actually links two clauses.
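A quick way to locate candidates for hand correction is to scan the augmented sentences for the two patterns named above. The following is a minimal sketch and not part of enhence: the function name find_overapplied is hypothetical, and it assumes each sentence is already available as a plain-text string.

```python
import re

# Matches "hen eller hen" / "hen och hen" as whole words, ignoring case.
OVERAPPLIED = re.compile(r"\bhen\s+(eller|och)\s+hen\b", re.IGNORECASE)


def find_overapplied(sentences):
    """Return (index, conjunction, sentence) for each sentence to review.

    Assumption: `sentences` is an iterable of plain-text sentences; adapt the
    loading step to your corpus format (.tab, .conll, or conllu).
    """
    hits = []
    for i, sentence in enumerate(sentences):
        match = OVERAPPLIED.search(sentence)
        if match:
            hits.append((i, match.group(1).lower(), sentence))
    return hits


if __name__ == "__main__":
    # Made-up example sentences, for illustration only.
    examples = [
        "Hen eller hen får bestämma.",
        "Hen kom hem och hen lagade mat.",
        "Hen sa att hen och hen skulle komma.",
    ]
    for index, conjunction, sentence in find_overapplied(examples):
        print(index, conjunction, sentence)
```

Hits with "eller" will almost always need adjusting, while hits with "och" only need a change when the sentence is unclear, so it may be convenient to review the two groups separately.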
Generally speaking, the usual instructions for efselab apply. You can create a new build_*.py file, or just edit the paths, and then build a tagger. To use this tagger in a pipeline, simply change the imports in tagger.py to point towards your new module.
If you wish to use an augmented version of SUC, we also suggest the following changes to supporting files:
- saldo.txt
  - add the following line:
    hen hen PN|UTR|SIN|DEF|SUB/OBJ 0
- swe-brown100.txt
  - add the following lines:
    hen 23
    hen 73
    hens 15
- suc-blogs.tab
  - either run extract-pronoun-sentences and produce-combined-corpus, OR
  - find/replace the 9 instances of han/hon with hen
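For the find/replace route, a small script avoids editing the file by hand. This is a minimal sketch under stated assumptions: it assumes the token is the first tab-separated column on each line, leaves every other column (including tags) untouched, and uses a hypothetical function name, replace_pronouns. Check the result against your copy of suc-blogs.tab.

```python
# Subject forms only, matching the han/hon find/replace suggestion.
BINARY_PRONOUNS = {"han", "hon"}


def replace_pronouns(lines):
    """Replace han/hon tokens with hen, preserving initial capitalisation.

    Assumption: each non-empty line is tab-separated with the token in the
    first column. Returns the new lines and the number of replacements.
    """
    out, count = [], 0
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        token = columns[0]
        if token.lower() in BINARY_PRONOUNS:
            columns[0] = "Hen" if token[0].isupper() else "hen"
            count += 1
        out.append("\t".join(columns))
    return out, count


if __name__ == "__main__":
    # Made-up rows with placeholder tags, for illustration only.
    sample = ["Hon\tPN", "läser\tVB", "", "han\tPN", "Hans\tPM"]
    new_lines, replaced = replace_pronouns(sample)
    print(replaced)
    print(new_lines)
```

The returned count can be compared against the 9 instances mentioned above; a different number suggests the column assumption does not hold for your file.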
This package is co-written with Henrik Björklund.