nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Starter Guides

Example of Augmentation for Textual Inputs
Example of Augmentation for Spectrogram Inputs
Example of Augmentation for Audio Inputs
Example of Orchestrating Multiple Augmenters
How to train a TF-IDF model
How to create a custom augmentation
API Documentation

Flow

Pipeline	Description
Sequential	Apply list of augmentation functions sequentially
Sometimes	Apply some augmentation functions randomly
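
To make the two pipeline types concrete, here is a minimal, self-contained sketch of the idea. It does not use nlpaug itself: the class names `SequentialFlow` and `SometimesFlow`, the `augment` method and the probability parameter `p` are hypothetical stand-ins chosen for illustration, and the library's own classes may behave differently.

```python
import random

# Illustrative stand-ins only; these are NOT nlpaug's classes.
# An "augmenter" here is assumed to be any function mapping text to text.


class SequentialFlow:
    """Apply every augmenter in the list, in order."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, text):
        for aug in self.augmenters:
            text = aug(text)
        return text


class SometimesFlow:
    """Apply each augmenter independently with probability p."""

    def __init__(self, augmenters, p=0.5, rng=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = rng or random.Random()

    def augment(self, text):
        for aug in self.augmenters:
            # rng.random() is in [0, 1), so p=1.0 always applies, p=0.0 never
            if self.rng.random() < self.p:
                text = aug(text)
        return text


if __name__ == "__main__":
    flow = SequentialFlow([str.strip, str.upper])
    print(flow.augment("  quick brown fox  "))
```

Because a flow exposes the same `augment` method as a single step, one flow can in principle be nested inside another, for example a `SometimesFlow` used as one step of a `SequentialFlow` via its bound `augment` method.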

Textual Augmenter

Target	Augmenter	Action	Description
Character	RandomAug	insert	Insert character randomly
		substitute	Substitute character randomly
		swap	Swap character randomly
		delete	Delete character randomly
	OcrAug	substitute	Simulate OCR engine error
	KeyboardAug	substitute	Simulate keyboard distance error
Word	RandomWordAug	swap	Swap word randomly
		delete	Delete word randomly
	SpellingAug	substitute	Substitute word according to spelling mistake dictionary
	WordNetAug	substitute	Substitute word according to WordNet's synonym
	WordEmbsAug	insert	Insert word randomly from word2vec, GloVe or fasttext dictionary
		substitute	Substitute word based on word2vec, GloVe or fasttext embeddings
	TfIdfAug	insert	Insert word randomly based on trained TF-IDF model
		substitute	Substitute word based on TF-IDF score
	ContextualWordEmbsAug	insert	Insert word by feeding surrounding words to BERT or XLNet language model
		substitute	Substitute word by feeding surrounding words to BERT or XLNet language model
Sentence	ContextualWordEmbsForSentenceAug	insert	Insert sentence according to GPT2 or XLNet prediction
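
As an illustration of the four character-level actions in the table (insert, substitute, swap, delete), the sketch below applies one random edit to a string. It is a simplified stand-alone example, not the library's implementation: the function name `random_char_edit`, the lowercase-ASCII replacement alphabet and the adjacent-only swap are assumptions made for this example.

```python
import random
import string


def random_char_edit(text, action, rng=None):
    """Apply one random character-level edit to text (hypothetical helper)."""
    rng = rng or random.Random()
    if action not in ("insert", "substitute", "swap", "delete"):
        raise ValueError("unknown action: %r" % (action,))
    if not text:
        return text

    chars = list(text)
    i = rng.randrange(len(chars))  # position of the edit

    if action == "insert":
        # Assumption: new characters are drawn from lowercase ASCII letters
        chars.insert(i, rng.choice(string.ascii_lowercase))
    elif action == "substitute":
        chars[i] = rng.choice(string.ascii_lowercase)
    elif action == "swap":
        # Swap a character with its right-hand neighbour, if there is one
        if len(chars) > 1:
            j = rng.randrange(len(chars) - 1)
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
    else:  # delete
        del chars[i]
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same edits
    for action in ("insert", "substitute", "swap", "delete"):
        print(action, "->", random_char_edit("augmentation", action, rng))
```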

Signal Augmenter

Target	Augmenter	Action	Description
Audio	NoiseAug	substitute	Inject noise
	PitchAug	substitute	Adjust audio's pitch
	ShiftAug	substitute	Shift time dimension forward/backward
	SpeedAug	substitute	Adjust audio's speed
	CropAug	delete	Delete audio's segment
	LoudnessAug	substitute	Adjust audio's volume
	MaskAug	substitute	Mask audio's segment
Spectrogram	FrequencyMaskingAug	substitute	Set block of values to zero according to frequency dimension
	TimeMaskingAug	substitute	Set block of values to zero according to time dimension
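
The two spectrogram actions can be pictured with a small pure-Python sketch. It assumes the spectrogram is a list of rows, one row per frequency bin and one column per time frame. The function names `frequency_mask` and `time_mask` are hypothetical, and the block position is passed in explicitly here so the result is easy to check, rather than being chosen by the augmenter.

```python
def frequency_mask(spec, start, width):
    """Return a copy of spec with frequency rows [start, start + width) zeroed."""
    return [
        [0.0] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


def time_mask(spec, start, width):
    """Return a copy of spec with time columns [start, start + width) zeroed."""
    return [
        [0.0 if start <= t < start + width else value
         for t, value in enumerate(row)]
        for row in spec
    ]


if __name__ == "__main__":
    # Toy 3 x 3 spectrogram: rows are frequency bins, columns are time frames
    spec = [
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0],
    ]
    print(frequency_mask(spec, start=1, width=1))
    print(time_mask(spec, start=0, width=2))
```

Both helpers return new lists and leave the input untouched, which keeps the original sample available for further augmentation.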

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug numpy matplotlib python-dotenv

Or install the latest version (including BETA features) directly from GitHub:

pip install git+https://github.com/makcedward/nlpaug.git

If you use ContextualWordEmbsAug, install the following dependencies as well (the version specifiers are quoted so the shell does not treat > as a redirect):

pip install "torch>=1.1.0" "pytorch_pretrained_bert>=1.1.0"

If you use WordNetAug, install the following dependency as well:

pip install nltk

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

If you use any of the audio augmenters, install the following dependency as well:

pip install librosa

Recent Changes

BETA Sep, 2019

Added Swap Mode (adjacent, middle and random) for RandomAug (character level)
WordNetAug supports antonyms

0.0.8 Sep 4, 2019

BertAug is replaced by ContextualWordEmbsAug
Support GPU (for ContextualWordEmbsAug only) #26
Upgraded pytorch_transformer to 1.1.0 version #33
ContextualWordEmbsAug supports both BERT and XLNet models
Removed librosa dependency
Added ContextualWordEmbsForSentenceAug for generating next sentence
Fixed sampling issue #38

See changelog for more details.

Source

This library uses data (e.g. captured from the internet), research (e.g. ideas behind the augmenters) and models (e.g. pre-trained models) from other sources. See data source for more details.
