Commit 2ef1550

Update README.md to enhance clarity and branding
- Revised the README.md to improve the description of docproc's functionality and features, emphasizing its role as a document-to-markdown extraction engine.
- Added a new logo and updated the layout for better visual appeal and branding consistency.
- Expanded the features section to detail the capabilities of converting various document types into structured markdown.
- Included an example of usage to illustrate the command-line interface functionality more clearly.
- Removed outdated sections to streamline the document and focus on current features and usage.

These changes aim to provide clearer information about docproc and improve the overall user experience for new users.
1 parent: 1330b16 · commit: 2ef1550

File tree

2 files changed: +73, −42 lines changed


README.md

Lines changed: 65 additions & 42 deletions
@@ -1,60 +1,95 @@
 # docproc

-docproc turns documents into markdown. Give it a PDF, DOCX, PPTX, or XLSX; you get clean text and every image (equations, diagrams, labels) explained by a vision model. It’s CLI only. Works with OpenAI, Azure, Anthropic, Ollama, or LiteLLM.
+<p align="center">
+  <img src="assets/logo.svg" width="160" alt="docproc logo">
+</p>

-The **docproc // edu** demo in [demo/](demo/) is a full study workspace: upload docs, chat over them, generate notes and flashcards, create and take assessments. That app is written in Go and calls this CLI when a document is uploaded; it does grading itself.
+<p align="center">
+  <b>docproc</b><br>
+  Turn messy documents into clean markdown for AI pipelines.
+</p>
+
+<p align="center">
+  Document → Markdown → AI
+</p>

 ---

-## What the CLI does
+docproc is a document-to-markdown extraction engine. It converts PDFs, DOCX, PPTX, and XLSX into clean structured markdown while preserving equations, figures, and embedded images. It is designed to power LLM pipelines, RAG systems, and document processing workflows.

-**Extract.** `docproc --file input.pdf -o output.md` — Pulls text from the native layer and runs vision on every embedded image. Optional extra pass: tidy markdown, LaTeX math, strip boilerplate (see `ingest.use_llm_refine` in config).
+## Features

-**Config.** `docproc.yaml` holds AI providers and ingest options. No database or server needed for extract. Use `docproc init-config --env .env` once to generate a starter config from your `.env`.
+- **PDF → Markdown** — Native text extraction plus vision-based handling of embedded images
+- **DOCX → Markdown** — Full document structure and formatting
+- **PPTX → Markdown** — Slides to structured content
+- **XLSX → Markdown** — Spreadsheets to readable tables
+- **Equation preservation** — LaTeX and math kept intact (with optional LLM refinement)
+- **Figure extraction** — Every image, diagram, and label described by a vision model
+- **Clean structured output** — Ready for LLMs, RAG, and downstream pipelines

-## Quick start
+## Example

-```bash
-git clone https://github.com/rithulkamesh/docproc.git && cd docproc
-uv sync --python 3.12
+**Before:** A PDF with mixed text, equations, and diagrams.
+
+**After:** A single `.md` file with extracted text, LaTeX math blocks, and every figure explained by the vision model—ready to embed, chunk, or feed into an LLM.

-uv run docproc init-config --env .env   # one-time
-uv run docproc --file input.pdf -o output.md
+```bash
+docproc --file paper.pdf -o paper.md
 ```

-## Demo (docproc // edu)
+## Installation

-See [demo/README.md](demo/README.md). From `demo/`, run `docker compose up -d` (stack name: **docproc-edu**). Then start the Go API and worker from `demo/go/`, and the React app from `demo/web/`. The worker runs the docproc CLI on each uploaded document.
+```bash
+pip install git+https://github.com/rithulkamesh/docproc.git
+```

-## Configuration
+Or with [uv](https://github.com/astral-sh/uv):

-Create `docproc.yaml` or generate from `.env` with `init-config`. For both the CLI and the demo, the bits that matter are AI providers and ingest:
+```bash
+uv tool install git+https://github.com/rithulkamesh/docproc.git
+```

-```yaml
-ai_providers:
-  - provider: openai   # or azure, anthropic, ollama, litellm
-primary_ai: openai
+From source:

-ingest:
-  use_vision: true
-  use_llm_refine: true
+```bash
+git clone https://github.com/rithulkamesh/docproc.git && cd docproc
+uv sync --python 3.12
 ```

-Secrets go in the environment or `.env`. Full schema: [docs/CONFIGURATION.md](docs/CONFIGURATION.md).
+## Usage

-## Install
+One-time config (generates `docproc.yaml` from your `.env`):

 ```bash
-uv tool install git+https://github.com/rithulkamesh/docproc.git
-# or: pip install git+https://github.com/rithulkamesh/docproc.git
+docproc init-config --env .env
 ```

-From source: `uv sync --python 3.12` then `uv run docproc --file input.pdf -o output.md`.
+Extract a document to markdown:

-## Usage
+```bash
+docproc --file input.pdf -o output.md
+```
+
+Optional: `--config path`, `-v` for verbose output. Shell completions: `docproc completions bash` or `docproc completions zsh`.
+
+## Why docproc?
+
+Naive PDF parsers often drop equations, misread layouts, and leave images as black boxes. docproc uses native extractors where possible (PyMuPDF, python-docx, etc.) and runs a vision model on every embedded image—so diagrams, charts, and equations become text or LaTeX that your AI stack can actually use. Optional LLM refinement cleans markdown and normalizes math. The result is document content that fits cleanly into RAG pipelines and LLM context windows instead of noisy, incomplete text.
+
+## Architecture
+
+docproc is **CLI-only**: no server, no database. The pipeline is:
+
+1. **Load** — Read the file (PDF/DOCX/PPTX/XLSX) and extract full text from the native layer.
+2. **Vision** — For PDFs, run a vision model on every embedded image; get descriptions, LaTeX, or structured captions.
+3. **Refine** (optional) — LLM pass to tidy markdown, normalize LaTeX, and strip boilerplate.
+4. **Sanitize** — Dedupe and clean; write a single `.md` file.
+
+Configuration lives in `docproc.yaml` (or generated via `docproc init-config --env .env`). AI providers: OpenAI, Azure, Anthropic, Ollama, LiteLLM. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) and [docs/CONFIGURATION.md](docs/CONFIGURATION.md) for details.

-- **Extract:** `docproc --file input.pdf -o output.md` (optional `--config path`, `-v`).
-- **Completions:** `docproc completions bash` or `docproc completions zsh`.
+## Demo (docproc // edu)
+
+The [demo/](demo/) is a full study workspace: upload docs, chat over them, generate notes and flashcards, create and take assessments. It’s a separate Go + React app that calls this CLI when a document is uploaded. See [demo/README.md](demo/README.md).

 ## Docs

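The new Usage section stops at single-file conversion, but the documented `--file`/`-o` flags compose naturally into a batch loop. A minimal sketch, assuming only what the README documents; the `docproc_cmd` and `convert_dir` helper names are hypothetical, not part of the CLI:

```python
import subprocess
from pathlib import Path


def docproc_cmd(src: Path, out_dir: Path) -> list[str]:
    # Build the invocation documented in the README: docproc --file IN -o OUT
    return ["docproc", "--file", str(src), "-o", str(out_dir / (src.stem + ".md"))]


def convert_dir(src_dir: Path, out_dir: Path) -> None:
    # Hypothetical batch helper: run docproc over every supported document.
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.iterdir()):
        if src.suffix.lower() in {".pdf", ".docx", ".pptx", ".xlsx"}:
            subprocess.run(docproc_cmd(src, out_dir), check=True)


if __name__ == "__main__":
    convert_dir(Path("docs_in"), Path("docs_out"))
```

`check=True` makes a failed conversion raise immediately instead of silently producing a partial output directory.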

@@ -75,15 +110,3 @@ Pull requests welcome. Run the tests before sending.
 ## License

 MIT. See [LICENSE.md](LICENSE.md).
-
----
-
-## Why I built this
-
-I learn by asking questions. Not surface-level ones—the deep "why"s that most materials never answer. When my peers studied from slides and PDFs, I got stuck. I couldn’t absorb content I wasn’t allowed to interrogate. Documents don’t talk back. They don’t explain the intuition or the connections. Tools like NotebookLM didn’t help: they don’t understand images in the source, so those parts showed up blank. Most of my slides were visual or screenshots. I had nothing to work with.
-
-So I built something for myself. A way to pull content out of any document—slides, papers, textbooks—and ask AI the questions I needed. *Why does this work? What’s the reasoning here? How does this connect to what we did last week?* It grew from "extract and query" into a full study environment: chat over the corpus, generate notes and flashcards, create and take assessments with automatic grading. For the first time I could learn from static documents by *conversing*, *noting*, and *testing*—not just re-reading.
-
-I’m open-sourcing it because I’m probably not the only one who learns this way.
-
-[hi@rithul.dev](mailto:hi@rithul.dev)
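The Sanitize step in the new Architecture section (dedupe and clean, then write a single `.md` file) can be pictured with a toy pass. This is a sketch of the idea, not docproc's actual implementation: it drops exact repeats of non-blank lines, such as running page headers, and collapses runs of blank lines.

```python
def sanitize(markdown: str) -> str:
    # Toy sanitize pass: drop exact-duplicate non-blank lines, collapse blank runs.
    # (A real pass would be fence-aware; repeated "```" lines are legitimate.)
    seen: set[str] = set()
    out: list[str] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            if out and out[-1] == "":  # already have one blank line; skip the rest
                continue
            out.append("")
            continue
        if stripped in seen:  # exact duplicate, e.g. a header repeated on every page
            continue
        seen.add(stripped)
        out.append(line)
    return "\n".join(out).strip() + "\n"
```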

assets/logo.svg

Lines changed: 8 additions & 0 deletions

0 commit comments
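The README's claim that output is "ready for LLMs, RAG, and downstream pipelines" usually means the emitted markdown splits cleanly on headings before embedding. A minimal downstream sketch; `chunk_by_heading` is a hypothetical consumer-side helper, not part of docproc, and it ignores the fence-aware edge case of `#` inside code blocks:

```python
def chunk_by_heading(markdown: str) -> list[str]:
    # Start a new chunk at every markdown heading line ("#", "##", ...).
    chunks: list[list[str]] = []
    for line in markdown.splitlines():
        if line.startswith("#") or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    # Join each chunk and drop any that are only whitespace.
    return [joined for c in chunks if (joined := "\n".join(c).strip())]
```

Each returned chunk keeps its heading attached, which is the common shape for embedding in a RAG index.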
