This repository provides a suite of text preprocessing and cleaning tools that produce high-quality inputs for NLP analysis and modeling.
Transform raw documents into clean token streams:
- Convert PDFs to text
- Parse text documents and perform sentence tokenization
- Lemmatize tokens and remove stop words
- Filter out non-alphabetical tokens
- Apply spell checking to correct misspelled words
- Normalize tokens to lowercase
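The cleaning steps above can be sketched as a small pipeline. This is a minimal, self-contained illustration: the stop-word list, sentence splitter, and suffix-stripping "lemmatizer" are stand-ins for the real components (e.g., spaCy's lemmatizer and a full stop-word list), not the repository's actual implementation.

```python
import re

# Tiny stop-word list for illustration; a real pipeline would use a
# fuller list such as spaCy's or NLTK's.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "are", "in"}

# Naive suffix-stripping stand-in for a real lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_lemma(token: str) -> str:
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def clean(text: str) -> list[str]:
    # Sentence tokenization: naive split on sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tokens = []
    for sent in sentences:
        for raw in re.findall(r"[A-Za-z]+", sent):  # keep alphabetic tokens only
            tok = raw.lower()                        # normalize to lowercase
            if tok in STOP_WORDS:                    # drop stop words
                continue
            tokens.append(naive_lemma(tok))          # lemmatize
    return tokens

print(clean("The models are cleaning texts."))  # → ['model', 'clean', 'text']
```

In practice the same steps run with library components; the structure (tokenize, filter, normalize, lemmatize) stays the same.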
Group tokens into meaningful multi-word units:
- Discover multi-word phrases (collocations) whose meaning goes beyond their individual tokens
- Leverage Gensim and spaCy to improve phrase detection
Detect and expand acronyms effectively:
- Identify and replace acronyms with their full expansions
- Disambiguate acronyms with multiple senses using context prototypes (e.g., PPP -> public-private partnership or purchasing power parity)
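One way to resolve an ambiguous acronym is to attach a set of context keywords (a "prototype") to each candidate expansion and pick the expansion whose keywords overlap the surrounding text most. The table and keyword sets below are illustrative, not the repository's actual data.

```python
# Hypothetical expansion table: each acronym maps to candidate expansions
# paired with context keywords used to pick the right sense.
EXPANSIONS = {
    "PPP": [
        ("public-private partnership", {"infrastructure", "government", "contract"}),
        ("purchasing power parity", {"exchange", "gdp", "currency", "economy"}),
    ],
}

def expand_acronyms(tokens):
    out = []
    context = {t.lower() for t in tokens}
    for tok in tokens:
        candidates = EXPANSIONS.get(tok)
        if not candidates:
            out.append(tok)
            continue
        # Choose the expansion whose keyword prototype overlaps the context most.
        best, _ = max(candidates, key=lambda c: len(c[1] & context))
        out.append(best)
    return out

print(expand_acronyms(["PPP", "adjusts", "gdp", "for", "currency", "differences"]))
# → ['purchasing power parity', 'adjusts', 'gdp', 'for', 'currency', 'differences']
```

The same text with words like "infrastructure" or "government" nearby would resolve PPP to "public-private partnership" instead.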
Easily configure the cleaning and preprocessing pipeline through YAML files, e.g., configs/cleaning/default.yml.
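A config along these lines might look as follows; the field names here are hypothetical, since the actual schema of `configs/cleaning/default.yml` is not shown:

```yaml
# Illustrative cleaning config; field names are assumptions, not the real schema.
cleaning:
  lowercase: true
  remove_stopwords: true
  lemmatize: true
  spell_check: true
  keep_alpha_only: true
phrases:
  min_count: 5
  threshold: 10.0
acronyms:
  expand: true
```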
Contributions are welcome: fork the repository and submit a pull request with your improvements.