This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about Data Augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters.
- Data Augmentation library for Text
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Does your NLP model able to prevent adversarial attack?
- Example of Augmentation for Textual Inputs
- Example of Augmentation for Spectrogram Inputs
- Example of Augmentation for Audio Inputs
- Example of Orchestrating Multiple Augmenters
- How to train TF-IDF model
- How to create custom augmentation
- API Documentation
| Pipeline | Description |
|---|---|
| Sequential | Apply list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
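
The difference between the two pipeline types can be sketched in a few lines of plain Python. This is only an illustration of the idea and does not use the library itself: the helper names `sequential` and `sometimes`, the toy augmenters `shout` and `exclaim`, and the per-augmenter probability `p` are all assumptions made for this example, and the library's own Flow classes may select augmenters differently.

```python
import random


def sequential(augmenters):
    """Return a function that applies every augmenter, in order."""
    def run(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters, p=0.5, rng=None):
    """Return a function that applies each augmenter with probability p."""
    rng = rng or random.Random()

    def run(text):
        for aug in augmenters:
            if rng.random() < p:
                text = aug(text)
        return text
    return run


# Two toy "augmenters" (hypothetical stand-ins for real ones).
def shout(text):
    return text.upper()


def exclaim(text):
    return text + "!"


always = sequential([shout, exclaim])
maybe = sometimes([shout, exclaim], p=0.5, rng=random.Random(0))

print(always("hello world"))  # HELLO WORLD!
print(maybe("hello world"))   # each step may or may not have been applied
```

Passing a seeded `random.Random` instance, as done for `maybe`, keeps a randomized pipeline reproducible between runs.
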
| Target | Augmenter | Action | Description |
|---|---|---|---|
| Character | RandomAug | insert | Insert character randomly |
| | | substitute | Substitute character randomly |
| | | swap | Swap character randomly |
| | | delete | Delete character randomly |
| | OcrAug | substitute | Simulate OCR engine error |
| | KeyboardAug | substitute | Simulate keyboard distance error |
| Word | RandomWordAug | swap | Swap word randomly |
| | | delete | Delete word randomly |
| | SpellingAug | substitute | Substitute word according to spelling mistake dictionary |
| | WordNetAug | substitute | Substitute word according to WordNet's synonym |
| | WordEmbsAug | insert | Insert word randomly from word2vec, GloVe or fasttext dictionary |
| | | substitute | Substitute word based on word2vec, GloVe or fasttext embeddings |
| | TfIdfAug | insert | Insert word randomly from trained TF-IDF model |
| | | substitute | Substitute word based on TF-IDF score |
| | ContextualWordEmbsAug | insert | Insert word by feeding surrounding words to BERT or XLNet language model |
| | | substitute | Substitute word by feeding surrounding words to BERT or XLNet language model |
| Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to GPT2 or XLNet prediction |
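
To make the four character-level actions in the table concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `random_char_aug`, the restriction to lowercase ASCII letters, and the choice to swap only adjacent characters are assumptions made for this example. The OCR, keyboard, and word-level augmenters rely on lookup tables or models and are not covered here.

```python
import random
import string


def random_char_aug(word, action, rng=None):
    """Apply one character-level action at a random position in `word`.

    An empty word is returned unchanged for every action.
    """
    rng = rng or random.Random()
    chars = list(word)
    if not chars:
        return word
    i = rng.randrange(len(chars))
    if action == "insert":
        # Insert a random lowercase letter before position i.
        chars.insert(i, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        # Replace the character at i (the new letter may equal the old one).
        chars[i] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        # Swap two adjacent characters; a single character has no neighbour.
        if len(chars) > 1:
            j = rng.randrange(len(chars) - 1)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
    elif action == "delete":
        del chars[i]
    else:
        raise ValueError("unknown action: " + action)
    return "".join(chars)


rng = random.Random(42)
for action in ("insert", "substitute", "swap", "delete"):
    print(action, "->", random_char_aug("augment", action, rng))
```
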
| Target | Augmenter | Action | Description |
|---|---|---|---|
| Audio | NoiseAug | substitute | Inject noise |
| | PitchAug | substitute | Adjust audio's pitch |
| | ShiftAug | substitute | Shift time dimension forward/backward |
| | SpeedAug | substitute | Adjust audio's speed |
| | CropAug | delete | Delete audio's segment |
| | LoudnessAug | substitute | Adjust audio's volume |
| | MaskAug | substitute | Mask audio's segment |
| Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
| | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
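
The two spectrogram rows describe the same operation along different axes. The sketch below illustrates that idea on a spectrogram stored as a plain list of rows; it does not call the library. The function name `mask_block`, the `width` parameter, and the layout (rows as frequency bins, columns as time steps) are assumptions made for this example.

```python
import random


def mask_block(spec, axis, width, rng=None):
    """Return a copy of `spec` with `width` consecutive rows or columns zeroed.

    `spec` is a list of rows: rows are frequency bins, columns are time steps.
    axis="frequency" zeroes whole rows; axis="time" zeroes whole columns.
    """
    if axis not in ("frequency", "time"):
        raise ValueError("axis must be 'frequency' or 'time'")
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    size = n_freq if axis == "frequency" else n_time
    width = min(width, size)
    # Pick a start index so the block fits entirely inside the axis.
    start = rng.randrange(size - width + 1)
    out = [row[:] for row in spec]  # copy, so the input is left untouched
    for f in range(n_freq):
        for t in range(n_time):
            idx = f if axis == "frequency" else t
            if start <= idx < start + width:
                out[f][t] = 0.0
    return out


spec = [[1.0] * 6 for _ in range(4)]  # 4 frequency bins x 6 time steps
freq_masked = mask_block(spec, "frequency", width=2, rng=random.Random(0))
time_masked = mask_block(spec, "time", width=3, rng=random.Random(0))

for row in freq_masked:
    print(row)
```
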
The library supports Python 3.5+ on Linux and Windows.
To install the library:

    pip install nlpaug numpy matplotlib python-dotenv

Or install the latest version (including BETA features) directly from GitHub:

    pip install git+https://github.com/makcedward/nlpaug.git

If you use ContextualWordEmbsAug, install the following dependencies as well:

    pip install "torch>=1.1.0" "pytorch_pretrained_bert>=1.1.0"

If you use WordNetAug, install the following dependency as well:

    pip install nltk

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

    from nlpaug.util.file.download import DownloadUtil
    DownloadUtil.download_word2vec(dest_dir='.')  # Download word2vec model
    DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')  # Download GloVe model
    DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')  # Download fasttext model

If you use any of the audio augmenters, install the following dependency as well:

    pip install librosa

BETA Sep, 2019
- Added Swap Mode (adjacent, middle and random) for RandomAug (character level)
- WordNetAug supports antonyms
0.0.8 Sep 4, 2019
- BertAug is replaced by ContextualWordEmbsAug
- Support GPU (for ContextualWordEmbsAug only) #26
- Upgraded pytorch_transformers to version 1.1.0 #33
- ContextualWordEmbsAug supports both BERT and XLNet models
- Removed librosa dependency
- Added ContextualWordEmbsForSentenceAug for generating the next sentence
- Fix sampling issue #38
See changelog for more details.
This library draws on data (e.g. captured from the internet), research (e.g. the ideas behind the augmenters) and models (e.g. pre-trained models). See data source for more details.