| 1 | +--- |
| 2 | +name: PDF Text Extraction |
| 3 | +description: This skill should be used when the user asks to "extract text from PDF", "convert PDF to text", "read PDF", "parse PDF", "get text from PDF", or needs to process PDF documents. |
| 4 | +allowed-tools: |
| 5 | + - Bash(asta pdf *) |
| 6 | + - Read |
| 7 | + - Write |
| 8 | +--- |
| 9 | + |
| 10 | +# PDF Text Extraction |
| 11 | + |
| 12 | +Extract structured text from PDF files using PyMuPDF Layout, which combines heuristics with machine learning to detect page layout. The extraction preserves document structure and formatting. |
| 13 | + |
| 14 | +## Installation |
| 15 | + |
| 16 | +If the `asta` command is not available, install it with: |
| 17 | +```bash |
| 18 | +uv tool install git+ssh://git@github.com/allenai/asta-plugins.git |
| 19 | +``` |
| 20 | + |
| 21 | +**Prerequisites:** Python 3.11+ and [uv package manager](https://docs.astral.sh/uv/) |
| 22 | + |
| 23 | +Verify installation with: |
| 24 | +```bash |
| 25 | +asta pdf --help |
| 26 | +``` |
| 27 | + |
| 28 | +## Usage |
| 29 | + |
| 30 | +### Single PDF Extraction |
| 31 | + |
| 32 | +Extract to markdown (default, best for preserving structure): |
| 33 | +```bash |
| 34 | +asta pdf to-text document.pdf |
| 35 | +``` |
| 36 | + |
| 37 | +Extract to JSON (structured output): |
| 38 | +```bash |
| 39 | +asta pdf to-text document.pdf --format json |
| 40 | +``` |
| 41 | + |
| 42 | +Extract to plain text: |
| 43 | +```bash |
| 44 | +asta pdf to-text document.pdf --format text |
| 45 | +``` |
| 46 | + |
| 47 | +### Batch Processing Multiple PDFs |
| 48 | + |
| 49 | +Process multiple PDFs in parallel for improved performance: |
| 50 | +```bash |
| 51 | +# Extract all PDFs in current directory |
| 52 | +asta pdf batch-extract *.pdf -o ./extracted |
| 53 | + |
| 54 | +# Extract specific PDFs |
| 55 | +asta pdf batch-extract file1.pdf file2.pdf file3.pdf -o ./output |
| 56 | + |
| 57 | +# Use JSON format |
| 58 | +asta pdf batch-extract *.pdf -o ./output --format json |
| 59 | + |
| 60 | +# Control number of worker processes |
| 61 | +asta pdf batch-extract *.pdf -o ./output --workers 4 |
| 62 | +``` |
| 63 | + |
| 64 | +### Save to File |
| 65 | + |
| 66 | +```bash |
| 67 | +asta pdf to-text document.pdf -o output.md |
| 68 | +asta pdf to-text document.pdf --format json -o output.json |
| 69 | +``` |
| 70 | + |
| 71 | +### Extract Specific Pages |
| 72 | + |
| 73 | +```bash |
| 74 | +# Page range |
| 75 | +asta pdf to-text document.pdf --pages 1-5 |
| 76 | + |
| 77 | +# Specific pages |
| 78 | +asta pdf to-text document.pdf --pages 1,3,5,10 |
| 79 | +``` |
| 80 | + |
| 81 | +### Control Output Format |
| 82 | + |
| 83 | +```bash |
| 84 | +# Without page chunks (single continuous text) |
| 85 | +asta pdf to-text document.pdf --no-page-chunks |
| 86 | +``` |
| 87 | + |
| 88 | +## Output Formats |
| 89 | + |
| 90 | +### Markdown (Default) |
| 91 | +- Preserves document structure (headings, lists, tables) |
| 92 | +- Best for readability and further processing |
| 93 | +- Can be chunked by page or emitted as continuous text |
| 94 | + |
| 95 | +### JSON |
| 96 | +- Structured data with metadata |
| 97 | +- Always uses page chunks |
| 98 | +- Suitable for programmatic processing |
| 99 | + |
| 100 | +### Text |
| 101 | +- Plain text without formatting |
| 102 | +- Simplest output |
| 103 | +- Good for basic text analysis |
| 104 | + |
| 105 | +## Common Workflows |
| 106 | + |
| 107 | +### Extract and Index |
| 108 | +```bash |
| 109 | +# Extract PDF to markdown |
| 110 | +asta pdf to-text research_paper.pdf -o paper.md |
| 111 | + |
| 112 | +# Index in document database |
| 113 | +asta documents add file://paper.md --name="Research Paper" --summary="..." |
| 114 | +``` |
| 115 | + |
| 116 | +### Batch Processing with Multiprocessing |
| 117 | +```bash |
| 118 | +# Faster: Use built-in batch processing (2x+ speedup) |
| 119 | +asta pdf batch-extract *.pdf -o ./extracted |
| 120 | + |
| 121 | +# Alternative: Shell loop (slower, sequential) |
| 122 | +for pdf in *.pdf; do |
| 123 | + asta pdf to-text "$pdf" -o "${pdf%.pdf}.md" |
| 124 | +done |
| 125 | +``` |
| 126 | + |
| 127 | +### Extract for Analysis |
| 128 | +```bash |
| 129 | +# Extract to JSON for structured analysis |
| 130 | +asta pdf to-text document.pdf --format json | jq '.content' |
| 131 | +``` |
| 132 | + |
| 133 | +## Tips |
| 134 | + |
| 135 | +- Use markdown format for best structure preservation |
| 136 | +- Use page chunks when working with large documents |
| 137 | +- Use JSON format when you need structured data for processing |
| 138 | +- Specify page ranges to extract only relevant sections |
| 139 | +- Use `batch-extract` for processing 10+ PDFs (2x or better speedup) |
| 140 | +- By default, the batch command uses all CPU cores for parallel processing; use `--workers` to set the number of worker processes |
| 141 | + |
| 142 | +## Supported Features |
| 143 | + |
| 144 | +- Multi-column layouts |
| 145 | +- Tables and lists |
| 146 | +- Figure captions |
| 147 | +- Mathematical equations (extracted as text) |
| 148 | +- Complex document structures |
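
The sequential loop under "Batch Processing with Multiprocessing" derives each output name with the `${pdf%.pdf}.md` suffix-removal expansion. The following is a minimal sketch of that naming step in isolation, with a guard for directories that contain no PDFs. `md_name` is a hypothetical helper introduced here for illustration and is not part of `asta`; the `asta` call is left commented out so the snippet runs without the tool installed.

```shell
# Hypothetical helper (not part of asta): map "name.pdf" to "name.md".
# ${1%.pdf} removes a trailing ".pdf" if present and leaves other names unchanged.
md_name() {
  printf '%s\n' "${1%.pdf}.md"
}

for pdf in *.pdf; do
  # If nothing matches, the glob stays as the literal string "*.pdf"; skip it.
  [ -e "$pdf" ] || continue
  out=$(md_name "$pdf")
  echo "$pdf -> $out"
  # asta pdf to-text "$pdf" -o "$out"   # command from the skill file, disabled here
done
```

Quoting `"$pdf"` keeps file names with spaces intact, and the `[ -e "$pdf" ]` check avoids invoking the extractor on a nonexistent file when the directory has no PDFs.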