paper/paper.md (11 additions & 11 deletions)
@@ -18,40 +18,40 @@ bibliography: paper.bib
---
# Summary
-
Text augmentation - the process of generating new text samples by applying transformations to existing samples - is a useful tool for training [@wei-zou-2019-eda] and evaluating [@ribeiro-etal-2020-beyond] natural language processing (NLP) models and systems. Despite its utility, existing libraries are often limited in terms of functionality and flexibility. They are confined to basic tasks such as text-classification or by catering to specific downstream use-cases such as estimating robustness [@goel-etal-2021-robustness]. Recognizing these constraints, `Augmenty` is a tool for structured augmentation of text along with its annotations. `Augmenty` integrates seamlessly with the popular NLP library `spaCy` [@spacy] and seeks to be compatible with all models and tasks supported by `spaCy`. Augmenty provides a wide range of augmenters which can be combined in a flexible manner to create complex augmentation pipelines. It also includes a set of primitives that can be used to create custom augmenters such as word replacement augmenters. This functionality allows for augmentations within a range of applications such as named entity recognition (NER), part-of-speech tagging, and dependency parsing.
+
Text augmentation - the process of generating new text samples by applying transformations to existing samples - is a useful tool for training [@wei-zou-2019-eda] and evaluating [@ribeiro-etal-2020-beyond] natural language processing (NLP) models and systems. Despite its utility, existing libraries are often limited in terms of functionality and flexibility. They are confined to basic tasks such as text classification or catering to specific downstream use cases such as estimating robustness [@goel-etal-2021-robustness]. Recognizing these constraints, `Augmenty` is a tool for structured augmentation of text along with its annotations. `Augmenty` integrates seamlessly with the popular NLP library `spaCy` [@spacy] and seeks to be compatible with all models and tasks supported by `spaCy`. Augmenty provides a wide range of augmenters that can be combined flexibly to create complex augmentation pipelines. It also includes a set of primitives that can be used to create custom augmenters such as word replacement augmenters. This functionality allows for augmentations within a range of applications such as named entity recognition (NER), part-of-speech tagging, and dependency parsing.
# Statement of need
<!-- augmentation is useful -->
-
Augmentation has proven to be a powerful tool within disciplines such as computer vision [@wang2017effectiveness] and speech recognition [@Park2019SpecAugmentAS] where it is used for both training more robust models and for evaluating the ability of the models to handle pertubations. Within NLP augmentation has seen some uses as a tool for generating additional training data [@wei-zou-2019-eda], but has shined as a tool for model evaluation, such as estimating robustness [@goel-etal-2021-robustness] and bias [@lassen-etal-2023-detecting], or for creating novel datasets [@nielsen-2023-scandeval].
+
Augmentation has proven to be a powerful tool within disciplines such as computer vision [@wang2017effectiveness] and speech recognition [@Park2019SpecAugmentAS] where it is used for both training more robust models and for evaluating the ability of the models to handle perturbations. Within NLP augmentation has seen some uses as a tool for generating additional training data [@wei-zou-2019-eda], but has shined as a tool for model evaluation, such as estimating robustness [@goel-etal-2021-robustness] and bias [@lassen-etal-2023-detecting], or for creating novel datasets [@nielsen-2023-scandeval].
-
Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility. Commonly they only provide pure string augmentation which typically leads to the annotations becoming misaligned with the text. This has limited the use of augmentation to tasks such as text classification while neglecting structured prediction tasks such as named entity recognition (NER) or coreference resolution. This has limited the use of augmentation to a wide range of tasks both for training and evaluation.
+
Despite its utility, existing libraries for text augmentation often exhibit limitations in terms of functionality and flexibility. Commonly, they provide only pure string augmentation, which typically leads to the annotations becoming misaligned with the text. This has limited the use of augmentation to tasks such as text classification while neglecting structured prediction tasks such as named entity recognition (NER) or coreference resolution. As a result, augmentation remains underused across a wide range of tasks, both for training and for evaluation.
<!-- limitation of existing methods -->
-
Existing tools such as `textgenie`[@pandya_hetpandyatextgenie_2023], and `textaugment`[@marivate2020improving]implements powerful techniques such as backtranslation and paraprashing, which are useful augmentations for text-classification tasks. However, these tools neglect a category of tasks which require that the annotations are aligned with the augmentation of the text. For instance even simple augmentations such as replacing the named entity "Jane Doe" with "John" will lead to a misalignment of the NER annotation, part-of-speech tags, etc., which if not properly handled will lead to a misinterpretation of the model performance or generation of incorrect training samples.
+
Existing tools such as `textgenie` [@pandya_hetpandyatextgenie_2023] and `textaugment` [@marivate2020improving] implement powerful techniques such as backtranslation and paraphrasing, which are useful augmentations for text-classification tasks. However, these tools neglect a category of tasks that require the annotations to stay aligned with the augmented text. For instance, even a simple augmentation such as replacing the named entity "Jane Doe" with "John" will misalign the NER annotations, part-of-speech tags, etc., which, if not properly handled, will lead to a misinterpretation of model performance or the generation of incorrect training samples.
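The "Jane Doe" to "John" misalignment can be illustrated with a minimal sketch in plain Python. This is not how `Augmenty` or `spaCy` implement it; the `Span` class and the `replace_span` helper are hypothetical names introduced only for this illustration, and the sketch assumes character-offset annotations that do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical character-offset annotation (not a spaCy type)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def replace_span(text, spans, target, replacement):
    """Replace the text under `target` and shift later spans accordingly.

    Assumes the spans do not overlap each other.
    """
    new_text = text[:target.start] + replacement + text[target.end:]
    delta = len(replacement) - (target.end - target.start)
    new_spans = []
    for span in spans:
        if span == target:
            # The replaced span keeps its start but gets a new length.
            new_spans.append(Span(span.start, span.start + len(replacement), span.label))
        elif span.start >= target.end:
            # Spans after the replacement move by the length difference.
            new_spans.append(Span(span.start + delta, span.end + delta, span.label))
        else:
            # Spans before the replacement are unaffected.
            new_spans.append(span)
    return new_text, new_spans


text = "Jane Doe moved to Copenhagen."
spans = [Span(0, 8, "PERSON"), Span(18, 28, "GPE")]

# Pure string augmentation: the text changes, the offsets do not.
naive_text = text.replace("Jane Doe", "John")
stale_gpe = naive_text[spans[1].start:spans[1].end]  # no longer "Copenhagen"

# Structured augmentation: text and offsets are updated together.
new_text, new_spans = replace_span(text, spans, spans[0], "John")
aligned = [new_text[s.start:s.end] for s in new_spans]
```

With pure string replacement, the stale `GPE` offsets point at a fragment of the city name, whereas the shifted spans still cover "John" and "Copenhagen".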
-
Other tools for data augmentation focus on specific downstream application such as `textattack`[@morris2020textattack] which is useful for adversarial attacks of classification systems or `robustnessgym`[@goel-etal-2021-robustness] which is useful for evaluating robustness of classification systems.
+
Other tools for data augmentation focus on specific downstream applications such as `textattack`[@morris2020textattack] which is useful for adversarial attacks of classification systems or `robustnessgym`[@goel-etal-2021-robustness] which is useful for evaluating the robustness of classification systems.
-
`Augmenty` introduces a flexible and easy-to-use interface for structured text augmentation, seeking to augment the annotations along with the text. `Augmenty` is built to integrate well with `spaCy`[@spacy] and seeks to be compatible with the broad set of tasks supported by `spaCy`. Augmenty provides augmenters which take a spaCy `Doc`-object (but works just as well with `string`-objects) and return a new `Doc`-object with the augmentations applied. This allows for augmentations of both the text and the annotations present in the `Doc`-object.
-
`Augmenty` does not seek to replace useful tools such as `textattack`, but seeks to provide a generalpurpose tool for augmentation of both the text and its annotations. This allows for augmentations within a range of applications such as named entity recognition, part-of-speech tagging, and dependency parsing.
+
`Augmenty` introduces a flexible and easy-to-use interface for structured text augmentation, seeking to augment the annotations along with the text. `Augmenty` is built to integrate well with `spaCy`[@spacy] and seeks to be compatible with the broad set of tasks supported by `spaCy`. Augmenty provides augmenters that take a spaCy `Doc`-object (but works just as well with `string`-objects) and return a new `Doc`-object with the augmentations applied. This allows for augmentations of both the text and the annotations present in the `Doc`-object.
+
`Augmenty` does not seek to replace tools such as `textattack`, but seeks to provide a general-purpose tool for augmentation of both the text and its annotations. This allows for augmentations within a range of applications such as named entity recognition, part-of-speech tagging, and dependency parsing.
# Features & Functionality
`Augmenty` is a Python library that implements augmentations based on `spaCy`'s `Doc` object. `spaCy`'s `Doc` object is a container for a text and its annotations. This makes it easy to augment text and annotations simultaneously. The `Doc` object can easily be extended to include custom augmentation not available in `spaCy` by adding custom attributes to the `Doc` object. While `Augmenty` is built to augment `Doc`s the object is easily converted into strings, lists or other formats. The annotations within a `Doc` can be provided either by human annotations or using a trained model.
-
Augmenty implements a series of augmenters for token-, span- and sentence-level augmentation. These augmenters range from primitive augmentations such as word replacement to languagespecific augmenters such as keystroke error augmentations based on a French keyboard layout. Augmenty also integrates with other libraries such as `NLTK`[@bird2009natural] to allow for augmentations based on WordNet [@miller-1994-wordnet] and allows for specification of static word vectors [@pennington-etal-2014-glove] to allow for augmentations based on word similarity. Lastly, `augmenty` provides a set of utility functions for repeating augmentations, combining augmenters or adjust the percentage of documents that should be augmented. This allow for the flexible construction of augmentation pipelines specific to the task at hand.
+
Augmenty implements a series of augmenters for token-, span- and sentence-level augmentation. These augmenters range from primitive augmentations such as word replacement to language-specific augmenters such as keystroke error augmentations based on a French keyboard layout. Augmenty also integrates with other libraries such as `NLTK`[@bird2009natural] to allow for augmentations based on WordNet [@miller-1994-wordnet] and allows for specification of static word vectors [@pennington-etal-2014-glove] to allow for augmentations based on word similarity. Lastly, `augmenty` provides a set of utility functions for repeating augmentations, combining augmenters, or adjusting the percentage of documents that should be augmented. This allows for the flexible construction of augmentation pipelines specific to the task at hand.
# Example Use Cases
-
Augmenty has already seen used in a number of projects. The code base was initially developed for evaluating the robustness and bias of `DaCy`[@Enevoldsen_DaCy_A_Unified_2021], a state-of-the-art Danish NLP pipeline. It is also continually used to evaluate Danish NER systems for biases and robustness on the DaCy website.
+
Augmenty has already been used in a number of projects. The code base was initially developed for evaluating the robustness and bias of `DaCy`[@Enevoldsen_DaCy_A_Unified_2021], a state-of-the-art Danish NLP pipeline. It is also continually used to evaluate Danish NER systems for biases and robustness on the DaCy website.
Augmenty has also been used to detect intersectional biases [@lassen-etal-2023-detecting] and used within benchmarks of Danish language models [@sloth_dadebiasgenda-lens_2023].
-
Besides its existing use-cases `Augmenty` could for example also be used to a) upsample minority classes without duplicating samples, b) train less biased models by e.g. replacing names with names of minority groups c) train more robust models e.g. by augmenting with typos or d) generate pseudohistorical data by augmenting with known spelling variations of words.
+
Besides its existing use cases, `Augmenty` could, for example, also be used to a) upsample minority classes without duplicating samples, b) train less biased models, e.g. by replacing names with names of minority groups, c) train more robust models, e.g. by augmenting with typos, or d) generate pseudo-historical data by augmenting with known spelling variations of words.
# Target Audience
-
The package is mainly targeted at NLP researchers and practitioners who wish to augment their data for training or evaluation. The package is also targeted at researchers who wish to evaluate their models with augmentations or want to generate new datasets.
+
The package mainly targets NLP researchers and practitioners who wish to augment their data for training or evaluation. The package is also targeted at researchers who wish to evaluate their models with augmentations or want to generate new datasets.