Commit b6eecc1

feat: add document parsing functionality for various formats
- Implemented DOCX parser using docx_rs for extracting text from Microsoft Word documents.
- Added image parser utilizing Tesseract OCR for text extraction from images (PNG, JPEG, WebP).
- Created PDF parser using pdf_extract for extracting text from PDF documents.
- Developed PPTX parser for extracting text from Microsoft PowerPoint presentations.
- Introduced XLSX parser using calamine for extracting text from Excel spreadsheets.
- Added plain text parser for handling UTF-8 encoded text files, including TXT, CSV, and JSON formats.
- Established a web API using Actix for file parsing, supporting multipart file uploads.
- Implemented error handling for API responses with appropriate status codes.
- Added tests for all parsers and API endpoints to ensure functionality and correctness.
- Included assets for testing various file formats in the tests directory.
1 parent 6c64883 commit b6eecc1
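
The commit message lists one parser per format, which implies some step that decides which parser a given byte buffer should go to. The manifest lists the `infer` crate, which presumably does this in the real code; the sketch below is only an illustration of the idea using the standard library and well-known file signatures. The names `Format` and `detect_format` are hypothetical and do not come from the commit.

```rust
/// Coarse file categories matching the formats named in the commit message.
/// Hypothetical type for illustration; not taken from the crate.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Pdf,
    /// DOCX, XLSX and PPTX are all ZIP containers. Telling them apart
    /// requires looking at the archive entries, which this sketch skips.
    OfficeZip,
    Png,
    Jpeg,
    WebP,
    Text,
    Unknown,
}

/// Guesses the format of `data` from its leading bytes (magic numbers).
/// Anything without a known signature is treated as text if it is
/// non-empty, valid UTF-8, and as `Unknown` otherwise.
pub fn detect_format(data: &[u8]) -> Format {
    if data.starts_with(b"%PDF-") {
        Format::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        Format::OfficeZip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Format::Png
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Format::Jpeg
    } else if data.len() >= 12 && &data[0..4] == b"RIFF" && &data[8..12] == b"WEBP" {
        Format::WebP
    } else if !data.is_empty() && std::str::from_utf8(data).is_ok() {
        Format::Text
    } else {
        Format::Unknown
    }
}
```

A caller would match on the returned variant and hand the bytes to the corresponding parser. Checking binary signatures before the UTF-8 fallback matters because a PDF header is itself valid text.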

62 files changed (+265, -949 lines changed)

.dockerignore

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+.git
+
 /target
 
 .env

CLAUDE.md

Lines changed: 0 additions & 28 deletions
This file was deleted.

Cargo.lock

Lines changed: 9 additions & 53 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 36 additions & 21 deletions
@@ -1,36 +1,51 @@
-[workspace]
-members = ["crates/core", "crates/web", "crates/cli", "crates/test-utils"]
-resolver = "3"
-
-[workspace.package]
+[package]
+name = "parser"
 version = "0.1.7"
 edition = "2024"
 authors = ["Leonard Excoffier"]
 license = "MIT"
 repository = "https://github.com/excoffierleonard/parser"
+description = "A library and web API for extracting text from various file formats including PDF, DOCX, XLSX, PPTX, images via OCR, and more"
+readme = "README.md"
+keywords = ["parser", "pdf", "docx", "text-extraction", "ocr"]
+categories = ["text-processing", "parsing", "web-programming::http-server"]
 
-[workspace.dependencies]
-parser-core = { path = "crates/core", version = "0.1.3" }
-parser-test-utils = { path = "crates/test-utils" }
-actix-multipart = "0.7.2"
-actix-web = "4.9.0"
+[lib]
+name = "parser"
+path = "src/lib.rs"
+
+[[bin]]
+name = "parser-web"
+path = "src/main.rs"
+
+[dependencies]
+# Core parsing dependencies
 calamine = "0.26.1"
-clap = { version = "4.5.1", features = ["derive"] }
-criterion = "0.5"
 docx-rs = "0.4.17"
-dotenvy = "0.15.7"
-env_logger = "0.11.6"
-futures-util = "0.3.31"
 infer = "0.16.0"
 lazy_static = "1.4.0"
 mime = "0.3.17"
-mime_guess = "2.0.5"
-num_cpus = "1.16.0"
 pdf-extract = "0.8.0"
-rayon = "1.10.0"
 regex = "1.11.1"
-rust-embed = { version = "8.5.0", features = ["interpolate-folder-path"] }
-serde = { version = "1.0.217", features = ["derive"] }
-tesseract = "0.15.1"
 tempfile = "3.9.0"
+tesseract = "0.15.1"
 zip = "2.3.0"
+
+# Web API dependencies
+actix-web = "4.9.0"
+actix-multipart = "0.7.2"
+futures-util = "0.3.31"
+rayon = "1.10.0"
+serde = { version = "1.0.217", features = ["derive"] }
+mime_guess = "2.0.5"
+rust-embed = { version = "8.5.0", features = ["interpolate-folder-path"] }
+env_logger = "0.11.6"
+dotenvy = "0.15.7"
+
+[dev-dependencies]
+criterion = "0.5"
+num_cpus = "1.16.0"
+
+[[bench]]
+name = "function_parse"
+harness = false
File renamed without changes.

README.md

Lines changed: 26 additions & 134 deletions
@@ -1,153 +1,45 @@
 # Parser
 
-A Rust-based document parsing system that extracts text content from various file formats.
+A Rust library for extracting text from various document formats.
 
-[Live Demo](https://parser.excoffierleonard.com) | [API Endpoint](https://parser.excoffierleonard.com/parse)
+[Website](https://parser.excoffierleonard.com)
 
 ![Website Preview](website_preview.png)
 
-## 📚 Overview
+## Features
 
-Parser is a modular Rust project that provides comprehensive document parsing capabilities through multiple interfaces:
+- PDF, DOCX, XLSX, PPTX documents
+- OCR for images (PNG, JPEG, WebP) with English and French support
+- Plain text formats (TXT, CSV, JSON)
 
-- **Core library**: The foundation providing parsing functionality for various file formats
-- **CLI tool**: Command-line interface for quick file parsing
-- **Web API**: REST service for parsing files via HTTP requests
-- **Web UI**: Simple interface for testing the parser functionality
+## Installation
 
-## 📦 Project Structure
-
-The project is organized as a Rust workspace with multiple crates:
-
-- **parser-core**: The core parsing engine
-- **parser-cli**: Command-line interface
-- **parser-web**: Web API and frontend
-- **test-utils**: Shared testing utilities
-
-## 📄 Supported File Types
-
-- **Documents**: PDF (`.pdf`), Word (`.docx`), PowerPoint (`.pptx`), Excel (`.xlsx`)
-- **Text**: Plain text (`.txt`), CSV, JSON, YAML, source code, and other text-based formats
-- **Images**: PNG, JPEG, WebP, and other image formats with OCR (Optical Character Recognition)
-
-The OCR functionality supports English and French languages.
-
-## 🛠️ Getting Started
-
-### Prerequisites
-
-- [Rust](https://www.rust-lang.org/learn/get-started) (latest stable)
-- OCR Dependencies:
-  - Tesseract development libraries
-  - Leptonica development libraries
-  - Clang development libraries
-
-#### Installing OCR Dependencies
-
-**Debian/Ubuntu:**
-
-```bash
-sudo apt install libtesseract-dev libleptonica-dev libclang-dev
-```
-
-**macOS:**
-
-```bash
-brew install tesseract
-```
-
-**Windows:**
-Follow the instructions at [Tesseract GitHub repository](https://github.com/tesseract-ocr/tesseract).
-
-### Building from Source
-
-```bash
-# Build all crates
-cargo build
-
-# Build in release mode
-cargo build --release
-```
-
-### Using the CLI
-
-```bash
-# Run directly with cargo
-cargo run -p parser-cli -- path/to/file1.pdf path/to/file2.docx
-
-# Or use the built binary
-./target/release/parser-cli path/to/file1.pdf path/to/file2.docx
-```
-
-### Running the Web Server
-
-```bash
-# Run the web server
-cargo run -p parser-web
-
-# With custom port
-PARSER_APP_PORT=9000 cargo run -p parser-web
-
-# With file serving enabled (for frontend)
-ENABLE_FILE_SERVING=true cargo run -p parser-web
-```
-
-## 🚀 Deployment
-
-The easiest way to deploy the service is using Docker:
-
-```bash
-curl -o compose.yaml https://raw.githubusercontent.com/excoffierleonard/parser/refs/heads/main/compose.yaml && \
-docker compose up -d
-```
-
-### Environment Variables
-
-- `PARSER_APP_PORT`: The port on which the web service listens (default: 8080)
-- `ENABLE_FILE_SERVING`: Enable serving frontend files (default: false)
-
-## 🧪 Development
-
-### Testing
-
-```bash
-# Run all tests
-cargo test --workspace
-
-# Run specific test
-cargo test test_name
+```toml
+[dependencies]
+parser = "0.1"
 ```
 
-### Benchmarking
+## Usage
 
-```bash
-# Run benchmarks
-cargo bench --workspace
+```rust
+use parser::parse;
 
-# Run benchmark script
-./scripts/benchmark.sh
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    let data = std::fs::read("document.pdf")?;
+    let text = parse(&data)?;
+    println!("{}", text);
+    Ok(())
+}
 ```
 
-### Code Quality
+## System Dependencies
 
-```bash
-# Run linter
-cargo clippy --workspace -- -D warnings
+Requires Tesseract OCR libraries:
 
-# Format code
-cargo fmt --all
-```
-
-### Building with Scripts
-
-```bash
-# Full build script
-./scripts/build.sh
-
-# Deployment tests
-./scripts/deploy-tests.sh
-```
+- **Debian/Ubuntu:** `sudo apt install libtesseract-dev libleptonica-dev libclang-dev`
+- **macOS:** `brew install tesseract`
+- **Windows:** Follow the instructions at [Tesseract GitHub repository](https://github.com/tesseract-ocr/tesseract)
 
-## 📜 License
+## License
 
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+MIT
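
The diff removes the `crates/cli` workspace member and the README's CLI section, so parsing several files at once now appears to be left to the caller. The sketch below shows one way a caller might do that with the library. `parse_files` and `FileResult` are hypothetical names, and the parser is passed in as a closure because the diff shows neither the exact signature nor the error type of `parse`. The README's use of `?` with `Box<dyn std::error::Error>` suggests the error implements `Display`, which is the only assumption made here.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Outcome for one file: its path plus either the extracted text or an
/// error message. Hypothetical type for illustration.
pub type FileResult = (PathBuf, Result<String, String>);

/// Reads each path and passes its bytes to `parse_fn`.
/// `parse_fn` stands in for the crate's `parse`; I/O and parse errors
/// are both stringified so one bad file does not abort the batch.
pub fn parse_files<P, F, E>(paths: &[P], parse_fn: F) -> Vec<FileResult>
where
    P: AsRef<Path>,
    F: Fn(&[u8]) -> Result<String, E>,
    E: std::fmt::Display,
{
    paths
        .iter()
        .map(|p| {
            let path = p.as_ref().to_path_buf();
            let result = fs::read(&path)
                .map_err(|e| e.to_string())
                .and_then(|bytes| parse_fn(&bytes).map_err(|e| e.to_string()));
            (path, result)
        })
        .collect()
}
```

Assuming `parser::parse` accepts a byte slice as in the README example, a call might look like `parse_files(&paths, |bytes| parser::parse(bytes))`. Results come back in input order, so the caller can print successes and report failures per file.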
