H20Watermelon/nlpaug

 
 

nlpaug

This Python library helps you augment NLP data for your machine learning projects. See this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Starter Guides

Flow

| Pipeline | Description |
|---|---|
| Sequential | Apply list of augmentation functions sequentially |
| Sometimes | Apply some augmentation functions randomly |
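
The difference between the two pipelines can be illustrated with a minimal, library-independent sketch. The classes below are conceptual stand-ins written for this example, not nlpaug's own implementation, and `drop_last_word` and `shout` are hypothetical toy augmenters. The sketch assumes that Sometimes decides independently for each augmenter whether to apply it, with probability `p`.

```python
import random


class Sequential:
    """Apply every augmenter in order (conceptual stand-in, not nlpaug's class)."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, text):
        for aug in self.augmenters:
            text = aug(text)
        return text


class Sometimes:
    """Apply each augmenter independently with probability p (an assumption)."""

    def __init__(self, augmenters, p=0.5, rng=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = rng if rng is not None else random.Random()

    def augment(self, text):
        for aug in self.augmenters:
            # random() returns a float in [0, 1), so p=0 never fires and p=1 always does.
            if self.rng.random() < self.p:
                text = aug(text)
        return text


# Hypothetical toy augmenters, for illustration only.
def drop_last_word(text):
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


def shout(text):
    return text.upper()


seq = Sequential([drop_last_word, shout])
print(seq.augment("the quick brown fox"))  # THE QUICK BROWN

maybe = Sometimes([drop_last_word, shout], p=0.5, rng=random.Random(0))
print(maybe.augment("the quick brown fox"))  # result depends on the random draws
```

Passing a seeded `random.Random` keeps a Sometimes-style pipeline reproducible, which is useful when comparing augmentation settings across training runs.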

Textual Augmenter

| Target | Augmenter | Action | Description |
|---|---|---|---|
| Character | RandomAug | insert | Insert character randomly |
| Character | RandomAug | substitute | Substitute character randomly |
| Character | RandomAug | swap | Swap character randomly |
| Character | RandomAug | delete | Delete character randomly |
| Character | OcrAug | substitute | Simulate OCR engine error |
| Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Word | RandomWordAug | swap | Swap word randomly |
| Word | RandomWordAug | delete | Delete word randomly |
| Word | SpellingAug | substitute | Substitute word according to a spelling mistake dictionary |
| Word | WordNetAug | substitute | Substitute word according to WordNet synonyms |
| Word | WordEmbsAug | insert | Insert word randomly from a word2vec, GloVe or fasttext dictionary |
| Word | WordEmbsAug | substitute | Substitute word based on word2vec, GloVe or fasttext embeddings |
| Word | TfIdfAug | insert | Insert word randomly using a trained TF-IDF model |
| Word | TfIdfAug | substitute | Substitute word based on TF-IDF score |
| Word | ContextualWordEmbsAug | insert | Insert word by feeding the surrounding words to a BERT or XLNet language model |
| Word | ContextualWordEmbsAug | substitute | Substitute word by feeding the surrounding words to a BERT or XLNet language model |
| Sentence | ContextualWordEmbsForSentenceAug | insert | Insert sentence according to GPT2 or XLNet prediction |
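
The four character-level actions listed in the table (insert, substitute, swap, delete) can be sketched in plain Python as follows. This is only an illustration of the idea under simple assumptions (lowercase ASCII replacement characters, swapping with an adjacent character). `augment_chars` is a hypothetical helper, not part of nlpaug.

```python
import random
import string


def augment_chars(word, action, rng=None):
    """Apply one random character-level edit to `word` (illustrative helper)."""
    rng = rng if rng is not None else random.Random()
    chars = list(word)
    if not chars:
        return word

    i = rng.randrange(len(chars))  # position of the edit
    if action == "insert":
        chars.insert(i, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        # The replacement may happen to equal the original character.
        chars[i] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        if len(chars) > 1:
            # Swap with the next character, or the previous one at the end.
            j = i + 1 if i + 1 < len(chars) else i - 1
            chars[i], chars[j] = chars[j], chars[i]
    elif action == "delete":
        del chars[i]
    else:
        raise ValueError("unknown action: " + action)
    return "".join(chars)


rng = random.Random(42)
for action in ("insert", "substitute", "swap", "delete"):
    print(action, "->", augment_chars("augment", action, rng))
```

Insert grows the word by one character, delete shrinks it by one, and substitute and swap keep the length unchanged.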

Signal Augmenter

| Target | Augmenter | Action | Description |
|---|---|---|---|
| Audio | NoiseAug | substitute | Inject noise |
| Audio | PitchAug | substitute | Adjust audio's pitch |
| Audio | ShiftAug | substitute | Shift time dimension forward/backward |
| Audio | SpeedAug | substitute | Adjust audio's speed |
| Audio | CropAug | delete | Delete audio's segment |
| Audio | LoudnessAug | substitute | Adjust audio's volume |
| Audio | MaskAug | substitute | Mask audio's segment |
| Spectrogram | FrequencyMaskingAug | substitute | Set block of values to zero according to frequency dimension |
| Spectrogram | TimeMaskingAug | substitute | Set block of values to zero according to time dimension |
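
A few of the signal operations above are simple enough to sketch with plain Python lists. The functions below are hypothetical helpers for illustration, not nlpaug's implementation. They assume audio is a list of float samples, a spectrogram is a list of rows with one row per frequency bin, and a shifted signal is padded with zeros (silence).

```python
import random


def inject_noise(samples, noise_factor=0.01, rng=None):
    """Add Gaussian noise scaled by noise_factor to every sample."""
    rng = rng if rng is not None else random.Random()
    return [s + noise_factor * rng.gauss(0.0, 1.0) for s in samples]


def shift(samples, offset):
    """Shift along the time axis by `offset` samples, padding with zeros.

    A positive offset delays the signal, a negative offset advances it.
    """
    n = len(samples)
    if offset >= 0:
        return [0.0] * min(offset, n) + samples[: max(n - offset, 0)]
    return samples[min(-offset, n):] + [0.0] * min(-offset, n)


def mask_frequency(spec, f0, width):
    """Zero out frequency rows f0 .. f0 + width - 1."""
    return [
        [0.0] * len(row) if f0 <= f < f0 + width else list(row)
        for f, row in enumerate(spec)
    ]


def mask_time(spec, t0, width):
    """Zero out time columns t0 .. t0 + width - 1 in every frequency row."""
    return [
        [0.0 if t0 <= t < t0 + width else v for t, v in enumerate(row)]
        for row in spec
    ]


spec = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(mask_frequency(spec, 1, 1))  # [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [7.0, 8.0, 9.0]]
print(mask_time(spec, 0, 2))       # [[0.0, 0.0, 3.0], [0.0, 0.0, 6.0], [0.0, 0.0, 9.0]]
print(shift([1.0, 2.0, 3.0, 4.0], 1))  # [0.0, 1.0, 2.0, 3.0]
```

In practice these operations would normally run on NumPy arrays. Lists are used here only to keep the sketch free of dependencies.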

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug numpy matplotlib python-dotenv

Or install the latest version (including BETA features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git

If you use ContextualWordEmbsAug, install the following dependencies as well

pip install "torch>=1.1.0" "pytorch_pretrained_bert>=1.1.0"

If you use WordNetAug, install the following dependencies as well

pip install nltk

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

If you use any of the audio augmenters, install the following dependency as well:

pip install librosa

Recent Changes

BETA Sep, 2019

  • Added Swap Mode (adjacent, middle and random) for RandomAug (character level)
  • WordNetAug supports antonyms

0.0.8 Sep 4, 2019

  • BertAug is replaced by ContextualWordEmbsAug
  • Added GPU support (for ContextualWordEmbsAug only) #26
  • Upgraded pytorch_transformer to 1.1.0 version #33
  • ContextualWordEmbsAug supports both BERT and XLNet models
  • Removed librosa dependency
  • Added ContextualWordEmbsForSentenceAug for generating the next sentence
  • Fixed sampling issue #38

See changelog for more details.

Source

This library uses data (e.g. captured from the internet), research (e.g. ideas behind the augmenters) and models (e.g. pre-trained models). See data source for more details.
