Commit 454adec

Local pdf processing skill
1 parent e508a01 commit 454adec

15 files changed, +1202 -116 lines changed
.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10", "3.11", "3.12"]
+        python-version: ["3.11", "3.12", "3.13"]
 
     steps:
       - uses: actions/checkout@v4

DEVELOPER.md

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ hooks/ # Claude Code permission hooks
 
 ### Prerequisites
 
-- Python 3.10+
+- Python 3.11+
 - `uv` (for running commands)
 - `make` (for development tasks)
 

README.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,7 @@ npx skills add add allenai/asta-plugins/skills
 - **Literature Report Generation** - Comprehensive report writing with synthesis
 - **Semantic Scholar Lookup** - Quick paper/author lookups and metadata queries
 - **Document Management** - Local document metadata index for tracking and searching papers
+- **PDF Text Extraction** - Extract structured text from PDF files with advanced layout detection
 - **Run Experiment** - Computational experiments with automated report generation
 
 Example user requests that would trigger these skills:
@@ -27,6 +28,7 @@ Example user requests that would trigger these skills:
 - "Get details for arXiv:2005.14165"
 - "What papers cite the GPT-3 paper?"
 - "Store this paper in Asta" / "Search my Asta documents for transformers"
+- "Extract text from this PDF" / "Convert PDF to markdown"
 - "Run an experiment to test GPT-4 translation quality"
 
 

pyproject.toml

Lines changed: 5 additions & 2 deletions
@@ -7,12 +7,15 @@ name = "asta"
 version = "0.3.0"
 description = "Asta CLI for scientific literature review"
 readme = "README.md"
-requires-python = ">=3.10"
+requires-python = ">=3.11"
 license = "Apache-2.0"
 dependencies = [
     "click>=8.0",
     "pydantic>=2.0",
     "pyhocon>=0.3.60",
+    "pymupdf>=1.27.0",
+    "pymupdf-layout>=1.27.0",
+    "pymupdf4llm>=0.3.0",
 ]
 
 [project.scripts]
@@ -31,7 +34,7 @@ test = [
 packages = ["src/asta"]
 
 [tool.ruff]
-target-version = "py310"
+target-version = "py311"
 
 [tool.ruff.lint]
 select = ["E", "F", "I", "UP"]

skills/asta-documents/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ asta documents add "file://${REPORT_PATH}" \
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta documents --help`
 

skills/experiment/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ This skill can also be used to analyze experimental data and generate a research
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta experiment --help`
 

skills/find-literature/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ questions to clarify the topic and refine the query before running the search.
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta literature --help`
 

skills/pdf-extract/SKILL.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
+---
+name: PDF Text Extraction
+description: This skill should be used when the user asks to "extract text from PDF", "convert PDF to text", "read PDF", "parse PDF", "get text from PDF", or needs to process PDF documents.
+allowed-tools:
+  - Bash(asta pdf *)
+  - Read
+  - Write
+---
+
+# PDF Text Extraction
+
+Extract structured text from PDF files using advanced layout detection. The extraction preserves document structure and formatting using PyMuPDF Layout, which combines heuristics with machine learning for improved accuracy.
+
+## Installation
+
+If `asta` command is not available, install it using:
+```bash
+uv tool install git+ssh://git@github.com/allenai/asta-plugins.git
+```
+
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
+
+Verify installation with:
+```bash
+asta pdf --help
+```
+
+## Usage
+
+### Single PDF Extraction
+
+Extract to markdown (default, best for preserving structure):
+```bash
+asta pdf to-text document.pdf
+```
+
+Extract to JSON (structured output):
+```bash
+asta pdf to-text document.pdf --format json
+```
+
+Extract to plain text:
+```bash
+asta pdf to-text document.pdf --format text
+```
+
+### Batch Processing Multiple PDFs
+
+Process multiple PDFs in parallel for improved performance:
+```bash
+# Extract all PDFs in current directory
+asta pdf batch-extract *.pdf -o ./extracted
+
+# Extract specific PDFs
+asta pdf batch-extract file1.pdf file2.pdf file3.pdf -o ./output
+
+# Use JSON format
+asta pdf batch-extract *.pdf -o ./output --format json
+
+# Control number of worker processes
+asta pdf batch-extract *.pdf -o ./output --workers 4
+```
+
+### Save to File
+
+```bash
+asta pdf to-text document.pdf -o output.md
+asta pdf to-text document.pdf --format json -o output.json
+```
+
+### Extract Specific Pages
+
+```bash
+# Page range
+asta pdf to-text document.pdf --pages 1-5
+
+# Specific pages
+asta pdf to-text document.pdf --pages 1,3,5,10
+```
+
+### Control Output Format
+
+```bash
+# Without page chunks (single continuous text)
+asta pdf to-text document.pdf --no-page-chunks
+```
+
+## Output Formats
+
+### Markdown (Default)
+- Preserves document structure (headings, lists, tables)
+- Best for readability and further processing
+- Can be chunked by page or continuous
+
+### JSON
+- Structured data with metadata
+- Always uses page chunks
+- Suitable for programmatic processing
+
+### Text
+- Plain text without formatting
+- Simplest output
+- Good for basic text analysis
+
+## Common Workflows
+
+### Extract and Index
+```bash
+# Extract PDF to markdown
+asta pdf to-text research_paper.pdf -o paper.md
+
+# Index in document database
+asta documents add file://paper.md --name="Research Paper" --summary="..."
+```
+
+### Batch Processing with Multiprocessing
+```bash
+# Faster: Use built-in batch processing (2x+ speedup)
+asta pdf batch-extract *.pdf -o ./extracted
+
+# Alternative: Shell loop (slower, sequential)
+for pdf in *.pdf; do
+  asta pdf to-text "$pdf" -o "${pdf%.pdf}.md"
+done
+```
+
+### Extract for Analysis
+```bash
+# Extract to JSON for structured analysis
+asta pdf to-text document.pdf --format json | jq '.content'
+```
+
+## Tips
+
+- Use markdown format for best structure preservation
+- Use page chunks when working with large documents
+- Use JSON format when you need structured data for processing
+- Specify page ranges to extract only relevant sections
+- Use `batch-extract` for processing 10+ PDFs (2x or better speedup)
+- The batch command automatically uses all CPU cores for parallel processing
+
+## Supported Features
+
+- Multi-column layouts
+- Tables and lists
+- Figure captions
+- Mathematical equations (extracted as text)
+- Complex document structures
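
Note on the `--pages` option in the skill above: it is shown with a range (`1-5`) and with a comma-separated list (`1,3,5,10`). The sketch below illustrates one way such a spec could be parsed. `parse_page_spec` is a hypothetical helper, not code from this commit, and it assumes that pages are 1-based on the command line and are converted to 0-based indices before being handed to a PDF library.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Turn a spec like "1-3,7" into sorted, de-duplicated 0-based page indices.

    Hypothetical helper: the parsing used by `asta pdf to-text` is not shown
    in this diff, so the 1-based input convention is an assumption.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Pages start..end (1-based, inclusive) become indices start-1..end-1.
        pages.update(range(start - 1, end))
    return sorted(pages)
```

Under these assumptions, `parse_page_spec("1-5")` yields the first five page indices and `parse_page_spec("1,3,5,10")` yields `[0, 2, 4, 9]`; overlapping entries are merged rather than repeated.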

skills/semantic-scholar/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ Fast, targeted lookups of paper metadata, citations, and authors using the Seman
 
 If `asta` command is not available install it using `uv tool install git+ssh://git@github.com/allenai/asta-plugins.git`
 
-**Prerequisites:** Python 3.10+ and [uv package manager](https://docs.astral.sh/uv/)
+**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/)
 
 Verify installation with `asta papers --help`
 

src/asta/cli.py

Lines changed: 11 additions & 0 deletions
@@ -10,6 +10,7 @@
 from asta.papers.citations import citations
 from asta.papers.get import get
 from asta.papers.search import search
+from asta.pdf import batch_extract, to_text
 
 
 @click.group()
@@ -34,6 +35,12 @@ def papers():
     pass
 
 
+@cli.group()
+def pdf():
+    """PDF processing commands"""
+    pass
+
+
 # Register passthrough commands
 cli.add_command(documents)
 cli.add_command(experiment)
@@ -47,6 +54,10 @@ def papers():
 papers.add_command(citations)
 papers.add_command(author)
 
+# Register pdf subcommands
+pdf.add_command(to_text)
+pdf.add_command(batch_extract)
+
 
 if __name__ == "__main__":
     cli()
cli()