This repository provides a suite of text preprocessing and cleaning tools that produce high-quality inputs for NLP analysis and modeling.
Transform raw documents into clean token streams:
- Convert PDFs to text
- Parse text documents and perform sentence tokenization
- Lemmatize tokens and remove stop words
- Filter out non-alphabetical tokens
- Apply spell checking to correct misspelled words
- Normalize tokens to lowercase
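The cleaning steps above can be sketched as a small pipeline. This is a minimal, self-contained illustration: the stop-word list, sentence splitter, and suffix-stripping "lemmatizer" are stand-ins for the real components (e.g., spaCy's lemmatizer and a full stop-word list), not the repository's actual implementation.

```python
import re

# Tiny stop-word list for illustration; a real pipeline would use a
# fuller list such as spaCy's or NLTK's.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "are", "in"}

# Naive suffix-stripping stand-in for a real lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_lemma(token: str) -> str:
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def clean(text: str) -> list[str]:
    # Sentence tokenization: naive split on sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tokens = []
    for sent in sentences:
        for raw in re.findall(r"[A-Za-z]+", sent):  # keep alphabetic tokens only
            tok = raw.lower()                        # normalize to lowercase
            if tok in STOP_WORDS:                    # drop stop words
                continue
            tokens.append(naive_lemma(tok))          # lemmatize
    return tokens

print(clean("The models are cleaning texts."))  # → ['model', 'clean', 'text']
```

In practice the same steps run with library components; the structure (tokenize, filter, normalize, lemmatize) stays the same.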
Group tokens into meaningful multi-word units:
- Discover multi-word phrases (collocations) whose meaning goes beyond their individual tokens
- Leverage Gensim and spaCy to improve phrase detection
Detect and expand acronyms effectively:
- Identify and replace acronyms with their full expansions
- Disambiguate acronyms with multiple senses using context prototypes (e.g., PPP -> public-private partnership or purchasing power parity)
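One way to resolve an ambiguous acronym is to attach a set of context keywords (a "prototype") to each candidate expansion and pick the expansion whose keywords overlap the surrounding text most. The table and keyword sets below are illustrative, not the repository's actual data.

```python
# Hypothetical expansion table: each acronym maps to candidate expansions
# paired with context keywords used to pick the right sense.
EXPANSIONS = {
    "PPP": [
        ("public-private partnership", {"infrastructure", "government", "contract"}),
        ("purchasing power parity", {"exchange", "gdp", "currency", "economy"}),
    ],
}

def expand_acronyms(tokens):
    out = []
    context = {t.lower() for t in tokens}
    for tok in tokens:
        candidates = EXPANSIONS.get(tok)
        if not candidates:
            out.append(tok)
            continue
        # Choose the expansion whose keyword prototype overlaps the context most.
        best, _ = max(candidates, key=lambda c: len(c[1] & context))
        out.append(best)
    return out

print(expand_acronyms(["PPP", "adjusts", "gdp", "for", "currency", "differences"]))
# → ['purchasing power parity', 'adjusts', 'gdp', 'for', 'currency', 'differences']
```

The same text with words like "infrastructure" or "government" nearby would resolve PPP to "public-private partnership" instead.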
Easily configure the cleaning and preprocessing pipeline through YAML files, e.g., configs/cleaning/default.yml.
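A config along these lines might look as follows; the field names here are hypothetical, since the actual schema of `configs/cleaning/default.yml` is not shown:

```yaml
# Illustrative cleaning config; field names are assumptions, not the real schema.
cleaning:
  lowercase: true
  remove_stopwords: true
  lemmatize: true
  spell_check: true
  keep_alpha_only: true
phrases:
  min_count: 5
  threshold: 10.0
acronyms:
  expand: true
```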
Contributions are welcome: fork the repository and submit a pull request with your improvements.