Local pdf processing skill (#10)

rodneykinney · web-flow · commit 409d46606a1e · 2026-03-09T13:32:11.000-07:00
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -30,7 +30,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10", "3.11", "3.12"]
+        python-version: ["3.11", "3.12", "3.13"]
 
     steps:
       - uses: actions/checkout@v4
diff --git a/DEVELOPER.md b/DEVELOPER.md
@@ -62,7 +62,7 @@ hooks/                         # Claude Code permission hooks
 
 ### Prerequisites
 
-- Python 3.10+
+- Python 3.11+
 - `uv` (for running commands)
 - `make` (for development tasks)
 
diff --git a/README.md b/README.md
@@ -19,6 +19,7 @@ npx skills add add allenai/asta-plugins/skills
 - **Literature Report Generation** - Comprehensive report writing with synthesis
 - **Semantic Scholar Lookup** - Quick paper/author lookups and metadata queries
 - **Document Management** - Local document metadata index for tracking and searching papers
+- **PDF Text Extraction** - Extract structured text from PDF files with advanced layout detection
 - **Run Experiment** - Computational experiments with automated report generation
 
 Example user requests that would trigger these skills:
@@ -27,6 +28,7 @@ Example user requests that would trigger these skills:
 - "Get details for arXiv:2005.14165"
 - "What papers cite the GPT-3 paper?"
 - "Store this paper in Asta" / "Search my Asta documents for transformers"
+- "Extract text from this PDF" / "Convert PDF to markdown"
 - "Run an experiment to test GPT-4 translation quality"
 
 
diff --git a/pyproject.toml b/pyproject.toml
@@ -7,12 +7,15 @@ name = "asta"
 version = "0.3.0"
 description = "Asta CLI for scientific literature review"
 readme = "README.md"
-requires-python = ">=3.10"
+requires-python = ">=3.11"
 license = "Apache-2.0"
 dependencies = [
     "click>=8.0",
     "pydantic>=2.0",
     "pyhocon>=0.3.60",
+    "pymupdf>=1.27.0",
+    "pymupdf-layout>=1.27.0",
+    "pymupdf4llm>=0.3.0",
 ]
 
 [project.scripts]
@@ -31,7 +34,7 @@ test = [
 packages = ["src/asta"]
 
 [tool.ruff]
-target-version = "py310"
+target-version = "py311"
 
 [tool.ruff.lint]
 select = ["E", "F", "I", "UP"]
diff --git a/skills/asta-documents/SKILL.md b/skills/asta-documents/SKILL.md
@@ -47,7 +47,7 @@ asta documents add "file://${REPORT_PATH}" \
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta documents --help`
 
diff --git a/skills/experiment/SKILL.md b/skills/experiment/SKILL.md
@@ -19,7 +19,7 @@ This skill can also be used to analyze experimental data and generate a research
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta experiment --help`
 
diff --git a/skills/find-literature/SKILL.md b/skills/find-literature/SKILL.md
@@ -19,7 +19,7 @@ questions to clarify the topic and refine the query before running the search.
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta literature --help`
 
diff --git a/skills/pdf-extract/SKILL.md b/skills/pdf-extract/SKILL.md
@@ -0,0 +1,148 @@
+---
+name: PDF Text Extraction
+description: This skill should be used when the user asks to "extract text from PDF", "convert PDF to text", "read PDF", "parse PDF", "get text from PDF", or needs to process PDF documents.
+allowed-tools:
+  - Bash(asta pdf *)
+  - Read
+  - Write
+---
+
+# PDF Text Extraction
+
+Extract structured text from PDF files with layout detection. Extraction uses PyMuPDF Layout, which combines heuristics with machine learning, to preserve document structure and formatting.
+
+## Installation
+
+If the `asta` command is not available, install it using:
+```bash
+uv tool install git+ssh://git@github.com/allenai/asta-plugins.git
+```
+
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
+
+Verify installation with:
+```bash
+asta pdf --help
+```
+
+## Usage
+
+### Single PDF Extraction
+
+Extract to markdown (default, best for preserving structure):
+```bash
+asta pdf to-text document.pdf
+```
+
+Extract to JSON (structured output):
+```bash
+asta pdf to-text document.pdf --format json
+```
+
+Extract to plain text:
+```bash
+asta pdf to-text document.pdf --format text
+```
+
+### Batch Processing Multiple PDFs
+
+Process multiple PDFs in parallel for improved performance:
+```bash
+# Extract all PDFs in current directory
+asta pdf batch-extract *.pdf -o ./extracted
+
+# Extract specific PDFs
+asta pdf batch-extract file1.pdf file2.pdf file3.pdf -o ./output
+
+# Use JSON format
+asta pdf batch-extract *.pdf -o ./output --format json
+
+# Control number of worker processes
+asta pdf batch-extract *.pdf -o ./output --workers 4
+```
+
+### Save to File
+
+```bash
+asta pdf to-text document.pdf -o output.md
+asta pdf to-text document.pdf --format json -o output.json
+```
+
+### Extract Specific Pages
+
+```bash
+# Page range
+asta pdf to-text document.pdf --pages 1-5
+
+# Specific pages
+asta pdf to-text document.pdf --pages 1,3,5,10
+```
+
+### Control Output Format
+
+```bash
+# Without page chunks (single continuous text)
+asta pdf to-text document.pdf --no-page-chunks
+```
+
+## Output Formats
+
+### Markdown (Default)
+- Preserves document structure (headings, lists, tables)
+- Best for readability and further processing
+- Can be chunked by page or continuous
+
+### JSON
+- Structured data with metadata
+- Always uses page chunks
+- Suitable for programmatic processing
+
+### Text
+- Plain text without formatting
+- Simplest output
+- Good for basic text analysis
+
+## Common Workflows
+
+### Extract and Index
+```bash
+# Extract PDF to markdown
+asta pdf to-text research_paper.pdf -o paper.md
+
+# Index in document database
+asta documents add file://paper.md --name="Research Paper" --summary="..."
+```
+
+### Batch Processing with Multiprocessing
+```bash
+# Faster: Use built-in batch processing (2x+ speedup)
+asta pdf batch-extract *.pdf -o ./extracted
+
+# Alternative: Shell loop (slower, sequential)
+for pdf in *.pdf; do
+  asta pdf to-text "$pdf" -o "${pdf%.pdf}.md"
+done
+```
+
+### Extract for Analysis
+```bash
+# Extract to JSON for structured analysis
+asta pdf to-text document.pdf --format json | jq '.content'
+```
+
+## Tips
+
+- Use markdown format for best structure preservation
+- Use page chunks when working with large documents
+- Use JSON format when you need structured data for processing
+- Specify page ranges to extract only relevant sections
+- Use `batch-extract` for processing 10+ PDFs (2x or better speedup)
+- By default, the batch command uses all CPU cores for parallel processing; pass `--workers` to limit it
+
+## Supported Features
+
+- Multi-column layouts
+- Tables and lists
+- Figure captions
+- Mathematical equations (extracted as text)
+- Complex document structures
diff --git a/skills/semantic-scholar/SKILL.md b/skills/semantic-scholar/SKILL.md
@@ -23,7 +23,7 @@ Fast, targeted lookups of paper metadata, citations, and authors using the Seman
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta papers --help`
 
diff --git a/src/asta/cli.py b/src/asta/cli.py
@@ -10,6 +10,7 @@
 from asta.papers.citations import citations
 from asta.papers.get import get
 from asta.papers.search import search
+from asta.pdf import batch_extract, to_text
 
 
 @click.group()
@@ -34,6 +35,12 @@ def papers():
     pass
 
 
+@cli.group()
+def pdf():
+    """PDF processing commands"""
+    pass
+
+
 # Register passthrough commands
 cli.add_command(documents)
 cli.add_command(experiment)
@@ -47,6 +54,10 @@ def papers():
 papers.add_command(citations)
 papers.add_command(author)
 
+# Register pdf subcommands
+pdf.add_command(to_text)
+pdf.add_command(batch_extract)
+
 
 if __name__ == "__main__":
     cli()
diff --git a/src/asta/pdf/__init__.py b/src/asta/pdf/__init__.py
@@ -0,0 +1,6 @@
+"""PDF processing commands"""
+
+from asta.pdf.batch_extract import batch_extract
+from asta.pdf.to_text import to_text
+
+__all__ = ["to_text", "batch_extract"]
diff --git a/src/asta/pdf/batch_extract.py b/src/asta/pdf/batch_extract.py
diff --git a/src/asta/pdf/to_text.py b/src/asta/pdf/to_text.py
diff --git a/tests/test_pdf.py b/tests/test_pdf.py
diff --git a/uv.lock b/uv.lock
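
The new SKILL.md documents a `--pages` option that accepts ranges (`1-5`) and comma-separated lists (`1,3,5,10`). The body of `to_text.py` is not shown in this diff, so the following is only a minimal sketch of how such a spec could be parsed. `parse_pages` is a hypothetical helper, not the PR's implementation, and it assumes page numbers are 1-based and ranges are inclusive.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-5" or "1,3,5,10" into sorted, unique page numbers.

    Hypothetical helper for illustration only; assumes 1-based pages
    and inclusive ranges, which the diff does not confirm.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        if "-" in part:
            # A range such as "4-6" covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1-5"))
    print(parse_pages("1,3,5,10"))
```

Returning a sorted, de-duplicated list keeps overlapping entries such as `2,4-6,4` from extracting a page twice. If the extraction backend indexes pages from zero, each value would need to be reduced by one before use.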