jingweili87/NLP-Cleaning-Pipeline

NLP Cleaning Pipeline

This repository provides a suite of text preprocessing and cleaning tools that produce high-quality inputs for NLP analysis and modeling.

Features

Document Preprocessing and Cleaning

Turn raw documents into clean token sequences:

  • Convert PDFs to text
  • Parse text documents and perform sentence tokenization
  • Lemmatize tokens and remove stop words
  • Filter out non-alphabetical tokens
  • Apply spell check to recover misspelled words
  • Normalize tokens to lowercase
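
The token-level steps above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function names, the regex-based sentence splitter, and the tiny stop-word list are all assumptions, and PDF conversion, lemmatization, and spell checking are left out because they need external libraries.

```python
from __future__ import annotations

import re

# Tiny illustrative stop-word list (an assumption); a real pipeline would
# load a much fuller list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "for", "on"}


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clean_sentence(sentence: str) -> list[str]:
    """Lowercase, keep purely alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]


def clean_document(text: str) -> list[list[str]]:
    """Return one list of cleaned tokens per sentence."""
    return [clean_sentence(s) for s in split_sentences(text)]


if __name__ == "__main__":
    print(clean_document("The GDP grew 3 percent in 2020. Growth is slowing!"))
```

Keeping one token list per sentence preserves sentence boundaries, which later steps such as phrase detection can rely on.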

Phrase Detection

Group tokens that belong together into phrases:

  • Discover multi-word phrases that carry a meaning of their own
  • Use Gensim and spaCy for detection
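
To make the idea of phrase detection concrete, here is a small count-based sketch in plain Python. It is similar in spirit to the collocation scoring that phrase-detection tools use, but it does not call Gensim or spaCy and is not the repository's implementation; `find_phrases`, `merge_phrases`, and the default parameters are hypothetical.

```python
from collections import Counter


def find_phrases(sentences, min_count=1, threshold=1.0):
    """Return the set of adjacent token pairs that co-occur unusually often.

    A pair (a, b) is scored as
        (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    and kept when the score exceeds the threshold.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))

    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def merge_phrases(sentence, phrases):
    """Join detected pairs into single tokens, e.g. machine_learning."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


if __name__ == "__main__":
    corpus = [
        ["machine", "learning", "improves", "models"],
        ["machine", "learning", "needs", "data"],
        ["machine", "learning", "models", "need", "data"],
    ]
    phrases = find_phrases(corpus)
    print([merge_phrases(s, phrases) for s in corpus])
```

Dividing by the individual token counts penalizes pairs made of very frequent words, so a pair is promoted only when it appears together more often than its parts would suggest.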

Acronym Detection

Detect and expand acronyms:

  • Identify acronyms and replace them with their full expansions
  • Disambiguate acronyms that have several expansions by using prototypes that encode the typical context of each one (e.g., PPP -> private-public partnership or purchasing power parity)
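
One way to picture context-based expansion is to give each candidate expansion a prototype bag of context words and pick the expansion whose prototype is closest to the words around the acronym. The sketch below does this with cosine similarity over word counts. It is only an illustration under stated assumptions: the `PROTOTYPES` table, its context words, the window size, and the function names are invented here, and the repository may represent prototypes quite differently (for example as embeddings).

```python
import math
from collections import Counter

# Hypothetical prototypes: for each acronym, each expansion maps to a bag of
# words that typically surround it. The words are illustrative only.
PROTOTYPES = {
    "ppp": {
        "private-public partnership": Counter(
            ["infrastructure", "contract", "government", "private", "investment"]
        ),
        "purchasing power parity": Counter(
            ["gdp", "exchange", "rate", "prices", "currency"]
        ),
    }
}


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_acronyms(tokens, prototypes=PROTOTYPES, window=5):
    """Replace known acronyms with the expansion that best fits the context."""
    expanded = []
    for i, token in enumerate(tokens):
        senses = prototypes.get(token.lower())
        if not senses:
            expanded.append(token)
            continue
        # Context = up to `window` tokens on each side of the acronym.
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        context = Counter(t.lower() for t in neighbours)
        best = max(senses, key=lambda sense: cosine(context, senses[sense]))
        # Leave the acronym untouched when no prototype matches the context.
        expanded.append(best if cosine(context, senses[best]) > 0 else token)
    return expanded


if __name__ == "__main__":
    print(expand_acronyms("gdp measured at PPP exchange rate".split()))
    print(expand_acronyms("government signed a PPP contract for infrastructure".split()))
```

Leaving the acronym unexpanded when nothing in the context matches avoids guessing an expansion at random.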

Configuration

Configure the cleaning and preprocessing pipeline through YAML files, e.g., configs/cleaning/default.yml.
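
A config-driven pipeline can be sketched as a function that turns settings into a chain of cleaning steps. The keys below (`lowercase`, `remove_non_alpha`, `min_token_length`) are hypothetical and are not taken from configs/cleaning/default.yml; the dictionary stands in for whatever a YAML loader such as PyYAML's `yaml.safe_load` would return for the real file.

```python
# Hypothetical settings, shaped like a parsed YAML mapping. The key names
# are invented for illustration and may not match the repository's schema.
CONFIG = {
    "lowercase": True,
    "remove_non_alpha": True,
    "min_token_length": 3,
}


def build_cleaner(config):
    """Build a token-cleaning function from a settings dictionary."""
    steps = []
    if config.get("lowercase"):
        steps.append(lambda tokens: [t.lower() for t in tokens])
    if config.get("remove_non_alpha"):
        steps.append(lambda tokens: [t for t in tokens if t.isalpha()])
    min_len = config.get("min_token_length", 0)
    if min_len:
        steps.append(lambda tokens: [t for t in tokens if len(t) >= min_len])

    def clean(tokens):
        # Apply the enabled steps in the order they were registered.
        for step in steps:
            tokens = step(tokens)
        return tokens

    return clean


if __name__ == "__main__":
    clean = build_cleaner(CONFIG)
    print(clean(["The", "GDP", "rose", "3%", "in", "2020"]))
```

Keeping every switch in one mapping means a run can be reproduced by saving the config file alongside its output.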

Contributions that extend this NLP cleaning suite are welcome. Fork the repository and submit your improvements.