A simple text extractor for various file types. Includes a core library for extracting text from files, a command-line interface, a RESTful API, and Python bindings. The project is a work in progress.
There are four main ways to use textractor:
- Command-line interface
- Python bindings
- RESTful API
- Core functionality
Install the CLI with cargo:
cargo install --git https://github.com/nleroy917/textractor
Then run the CLI with:
textractor <file>
The Python bindings are not yet available on PyPI, but you can install them from source. First, clone this repository:
git clone https://github.com/nleroy917/textractor
Then install the Python bindings with:
cd textractor/textractor-py
make install
The install step requires the maturin package, so make sure it is installed beforehand. You can install it with:
pip install maturin
There is also a web server built with axum that can be run with:
cd textractor-web
cargo run --release
Finally, you can use the core functionality in your own Rust project. Add the following to your Cargo.toml:
[dependencies]
textractor = { git = "https://github.com/nleroy917/textractor" }
Then you can use the library in your project with:
use std::io::Read; // brings read_to_end into scope
use textractor::extraction::extract;

// This example also uses the anyhow crate for error handling.
fn main() -> anyhow::Result<()> {
    let path = std::path::Path::new("path/to/file");
    let file = std::fs::File::open(path)?;
    let mut reader = std::io::BufReader::new(file);

    // Read the whole file into memory; extract works on a byte slice.
    let mut data = Vec::new();
    reader.read_to_end(&mut data)?;

    // extract yields None when the file type is not supported.
    let text = match extract(&data)? {
        Some(text) => text,
        None => return Err(anyhow::anyhow!("Unsupported file type")),
    };
    println!("{}", text);
    Ok(())
}
I am prioritizing PPTX and XLSX support, as well as improving text extraction for PDFs.
- Text (txt)
- Word (docx)
- PowerPoint (pptx)
- Excel (xlsx)
- Images (png, jpg, etc)
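To illustrate how the file types listed above can be told apart before extraction, here is a minimal sketch that guesses a file's category from its leading bytes. It is not textractor's own detection logic, which this README does not describe; `FileKind` and `sniff_kind` are hypothetical names introduced only for this example. The signatures used are the standard ones for PDF, ZIP (the container format behind docx, pptx, and xlsx), PNG, and JPEG.

```rust
/// Coarse file categories (hypothetical type, not part of textractor).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FileKind {
    Pdf,
    /// ZIP container: docx, pptx, and xlsx all start with this signature.
    OfficeZip,
    Png,
    Jpeg,
    /// No known signature, but the bytes are valid UTF-8.
    Text,
    Unknown,
}

/// Guess the file kind from its leading "magic" bytes.
/// Hypothetical helper for illustration only.
pub fn sniff_kind(data: &[u8]) -> FileKind {
    const PDF: &[u8] = b"%PDF-";
    const ZIP: &[u8] = b"PK\x03\x04";
    const PNG: &[u8] = b"\x89PNG\r\n\x1a\n";
    const JPEG: &[u8] = &[0xFF, 0xD8, 0xFF];

    if data.starts_with(PDF) {
        FileKind::Pdf
    } else if data.starts_with(ZIP) {
        FileKind::OfficeZip
    } else if data.starts_with(PNG) {
        FileKind::Png
    } else if data.starts_with(JPEG) {
        FileKind::Jpeg
    } else if !data.is_empty() && std::str::from_utf8(data).is_ok() {
        // Fall back to plain text when the bytes decode as UTF-8.
        FileKind::Text
    } else {
        FileKind::Unknown
    }
}
```

A caller could run `sniff_kind(&data)` on the bytes read from disk and skip extraction when the result is `FileKind::Unknown`. Telling docx, pptx, and xlsx apart would require opening the ZIP archive and inspecting its contents, which this sketch does not attempt.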