- Revised the README.md to improve the description of docproc's functionality and features, emphasizing its role as a document-to-markdown extraction engine.
- Added a new logo and updated the layout for better visual appeal and branding consistency.
- Expanded the features section to detail the capabilities of converting various document types into structured markdown.
- Included an example of usage to illustrate the command-line interface functionality more clearly.
- Removed outdated sections to streamline the document and focus on current features and usage.
These changes aim to provide clearer information about docproc and improve the overall user experience for new users.
# docproc
<p align="center">
  <b>docproc</b><br>
  Turn messy documents into clean markdown for AI pipelines.
</p>

<p align="center">
  Document → Markdown → AI
</p>

---
docproc is a document-to-markdown extraction engine. It converts PDFs, DOCX, PPTX, and XLSX into clean structured markdown while preserving equations, figures, and embedded images. It is designed to power LLM pipelines, RAG systems, and document processing workflows.
## Features
- **PDF → Markdown** — Native text extraction plus vision-based handling of embedded images
- **DOCX → Markdown** — Full document structure and formatting
- **PPTX → Markdown** — Slides to structured content
- **XLSX → Markdown** — Spreadsheets to readable tables
- **Equation preservation** — LaTeX and math kept intact (with optional LLM refinement)
- **Figure extraction** — Every image, diagram, and label described by a vision model
- **Clean structured output** — Ready for LLMs, RAG, and downstream pipelines

**Before:** A PDF with mixed text, equations, and diagrams.
**After:** A single `.md` file with extracted text, LaTeX math blocks, and every figure explained by the vision model—ready to embed, chunk, or feed into an LLM.
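
Purely as an illustration (invented content, not actual docproc output), the result has roughly this shape:

```markdown
## Gradient descent

The update rule moves the parameters against the gradient:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

> **Figure 2 (vision model description):** Contour plot of the loss
> surface with arrows tracing successive update steps toward the minimum.
```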
```bash
docproc --file paper.pdf -o paper.md
```
## Installation
From source: `uv sync --python 3.12` then `uv run docproc --file input.pdf -o output.md`.

## Usage

Extract a document to markdown:

```bash
docproc --file input.pdf -o output.md
```
Optional: `--config path`, `-v` for verbose output. Shell completions: `docproc completions bash` or `docproc completions zsh`.
## Why docproc?
Naive PDF parsers often drop equations, misread layouts, and leave images as black boxes. docproc uses native extractors where possible (PyMuPDF, python-docx, etc.) and runs a vision model on every embedded image—so diagrams, charts, and equations become text or LaTeX that your AI stack can actually use. Optional LLM refinement cleans markdown and normalizes math. The result is document content that fits cleanly into RAG pipelines and LLM context windows instead of noisy, incomplete text.
## Architecture
docproc is **CLI-only**: no server, no database. The pipeline is:
1. **Load** — Read the file (PDF/DOCX/PPTX/XLSX) and extract full text from the native layer.
2. **Vision** — For PDFs, run a vision model on every embedded image; get descriptions, LaTeX, or structured captions.
3. **Refine** (optional) — LLM pass to tidy markdown, normalize LaTeX, and strip boilerplate.
4. **Sanitize** — Dedupe and clean; write a single `.md` file.
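
The four stages can be sketched as a simple function pipeline. The function bodies below are toy stand-ins (assumptions, not docproc's real implementation); only the stage order mirrors the list above.

```python
# Toy stand-ins for the four stages; real docproc logic differs.
def load(path: str) -> str:
    # Pretend native-layer extraction that yields duplicated text.
    return f"contents of {path}\ncontents of {path}"

def vision(text: str) -> str:
    # Pretend vision pass: append a described figure.
    return text + "\n> Figure: a diagram, described by the vision model"

def refine(text: str) -> str:
    # Pretend LLM refinement: normalize whitespace on each line.
    return "\n".join(line.strip() for line in text.splitlines())

def sanitize(text: str) -> str:
    # Dedupe repeated lines while preserving order, then emit the markdown.
    seen, out = set(), []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            out.append(line)
    return "\n".join(out)

markdown = sanitize(refine(vision(load("input.pdf"))))
print(markdown)
```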
87
+
88
+
Configuration lives in `docproc.yaml` (or generated via `docproc init-config --env .env`). AI providers: OpenAI, Azure, Anthropic, Ollama, LiteLLM. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) and [docs/CONFIGURATION.md](docs/CONFIGURATION.md) for details.
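
A sketch of what `docproc.yaml` might look like. The key names below are illustrative assumptions, except `ingest.use_llm_refine`, which is one of the project's ingest options; see [docs/CONFIGURATION.md](docs/CONFIGURATION.md) for the real schema.

```yaml
# Illustrative sketch only; consult docs/CONFIGURATION.md for the actual keys.
provider:
  name: openai          # or azure, anthropic, ollama, litellm
  model: gpt-4o         # hypothetical model id

ingest:
  use_llm_refine: true  # optional LLM pass to tidy markdown and normalize LaTeX
```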
## Demo (docproc // edu)
The [demo/](demo/) is a full study workspace: upload docs, chat over them, generate notes and flashcards, create and take assessments. It’s a separate Go + React app that calls this CLI when a document is uploaded. See [demo/README.md](demo/README.md).
## Docs

Pull requests welcome. Run the tests before sending.

## License
MIT. See [LICENSE.md](LICENSE.md).