- Revised the README.md to improve the description of docproc's functionality and features, emphasizing its role as a document-to-markdown extraction engine.
- Added a new logo and updated the layout for better visual appeal and branding consistency.
- Expanded the features section to detail the capabilities of converting various document types into structured markdown.
- Included an example of usage to illustrate the command-line interface functionality more clearly.
- Removed outdated sections to streamline the document and focus on current features and usage.
These changes aim to provide clearer information about docproc and improve the overall user experience for new users.
# docproc
<p align="center">
  <b>docproc</b><br>
  Turn messy documents into clean markdown for AI pipelines.
</p>

<p align="center">
  Document → Markdown → AI
</p>

---
docproc is a document-to-markdown extraction engine. It converts PDFs, DOCX, PPTX, and XLSX into clean structured markdown while preserving equations, figures, and embedded images. It is designed to power LLM pipelines, RAG systems, and document processing workflows.
## Features
- **PDF → Markdown** — Native text extraction plus vision-based handling of embedded images
- **DOCX → Markdown** — Full document structure and formatting
- **PPTX → Markdown** — Slides to structured content
- **XLSX → Markdown** — Spreadsheets to readable tables
- **Equation preservation** — LaTeX and math kept intact (with optional LLM refinement)
- **Figure extraction** — Every image, diagram, and label described by a vision model
- **Clean structured output** — Ready for LLMs, RAG, and downstream pipelines

**Before:** A PDF with mixed text, equations, and diagrams.
**After:** A single `.md` file with extracted text, LaTeX math blocks, and every figure explained by the vision model—ready to embed, chunk, or feed into an LLM.
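
Purely as an illustration (invented content, not actual docproc output), the result has roughly this shape:

```markdown
## Gradient descent

The update rule moves the parameters against the gradient:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

> **Figure 2 (vision model description):** Contour plot of the loss
> surface with arrows tracing successive update steps toward the minimum.
```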
```bash
docproc --file paper.pdf -o paper.md
```
## Installation
From source: `uv sync --python 3.12` then `uv run docproc --file input.pdf -o output.md`.

## Usage

Extract a document to markdown:

```bash
docproc --file input.pdf -o output.md
```
Optional: `--config path`, `-v` for verbose output. Shell completions: `docproc completions bash` or `docproc completions zsh`.
## Why docproc?
Naive PDF parsers often drop equations, misread layouts, and leave images as black boxes. docproc uses native extractors where possible (PyMuPDF, python-docx, etc.) and runs a vision model on every embedded image—so diagrams, charts, and equations become text or LaTeX that your AI stack can actually use. Optional LLM refinement cleans markdown and normalizes math. The result is document content that fits cleanly into RAG pipelines and LLM context windows instead of noisy, incomplete text.
## Architecture
docproc is **CLI-only**: no server, no database. The pipeline is:
1. **Load** — Read the file (PDF/DOCX/PPTX/XLSX) and extract full text from the native layer.
2. **Vision** — For PDFs, run a vision model on every embedded image; get descriptions, LaTeX, or structured captions.
3. **Refine** (optional) — LLM pass to tidy markdown, normalize LaTeX, and strip boilerplate.
4. **Sanitize** — Dedupe and clean; write a single `.md` file.
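
The four stages can be sketched as a simple function pipeline. The function bodies below are toy stand-ins (assumptions, not docproc's real implementation); only the stage order mirrors the list above.

```python
# Toy stand-ins for the four stages; real docproc logic differs.
def load(path: str) -> str:
    # Pretend native-layer extraction that yields duplicated text.
    return f"contents of {path}\ncontents of {path}"

def vision(text: str) -> str:
    # Pretend vision pass: append a described figure.
    return text + "\n> Figure: a diagram, described by the vision model"

def refine(text: str) -> str:
    # Pretend LLM refinement: normalize whitespace on each line.
    return "\n".join(line.strip() for line in text.splitlines())

def sanitize(text: str) -> str:
    # Dedupe repeated lines while preserving order, then emit the markdown.
    seen, out = set(), []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            out.append(line)
    return "\n".join(out)

markdown = sanitize(refine(vision(load("input.pdf"))))
print(markdown)
```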
87
+
88
+
Configuration lives in `docproc.yaml` (or generated via `docproc init-config --env .env`). AI providers: OpenAI, Azure, Anthropic, Ollama, LiteLLM. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) and [docs/CONFIGURATION.md](docs/CONFIGURATION.md) for details.
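
A sketch of what `docproc.yaml` might look like. The key names below are illustrative assumptions, except `ingest.use_llm_refine`, which is one of the project's ingest options; see [docs/CONFIGURATION.md](docs/CONFIGURATION.md) for the real schema.

```yaml
# Illustrative sketch only; consult docs/CONFIGURATION.md for the actual keys.
provider:
  name: openai          # or azure, anthropic, ollama, litellm
  model: gpt-4o         # hypothetical model id

ingest:
  use_llm_refine: true  # optional LLM pass to tidy markdown and normalize LaTeX
```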
## Demo (docproc // edu)
The [demo/](demo/) is a full study workspace: upload docs, chat over them, generate notes and flashcards, create and take assessments. It’s a separate Go + React app that calls this CLI when a document is uploaded. See [demo/README.md](demo/README.md).
## Docs

Pull requests welcome. Run the tests before sending.

## License
MIT. See [LICENSE.md](LICENSE.md).