Commit d7d396d

uv workspace layout, broke out PyMuPDF (#1021)
1 parent f9da499 commit d7d396d

61 files changed, 989 additions and 174 deletions

.github/workflows/build.yml

Lines changed: 19 additions & 3 deletions
@@ -10,13 +10,29 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
-      - id: build
+      - id: build-paper-qa-pymupdf
         uses: hynek/build-and-inspect-python-package@v2
-      - name: Download built artifact to dist/
+        with:
+          path: packages/paper-qa-pymupdf
+          upload-name-suffix: -paper-qa-pymupdf
+      - name: Download built paper-qa-pymupdf artifact to dist/
+        uses: actions/download-artifact@v4
+        with:
+          name: ${{ steps.build-paper-qa-pymupdf.outputs.artifact-name }}
+          path: dist
+      - name: Clean up paper-qa-pymupdf build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
+        run: rm -r ${{ steps.build-paper-qa-pymupdf.outputs.dist }}
+      - id: build-paper-qa
+        uses: hynek/build-and-inspect-python-package@v2
+        with:
+          upload-name-suffix: -paper-qa
+      - name: Download built paper-qa artifact to dist/
         uses: actions/download-artifact@v4
         with:
-          name: ${{ steps.build.outputs.artifact-name }}
+          name: ${{ steps.build-paper-qa.outputs.artifact-name }}
           path: dist
+      - name: Clean up paper-qa build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
+        run: rm -r ${{ steps.build-paper-qa.outputs.dist }}
       - uses: pypa/gh-action-pypi-publish@release/v1
         with:
           password: ${{ secrets.PYPI_API_TOKEN }}

.github/workflows/tests.yml

Lines changed: 20 additions & 2 deletions
@@ -34,9 +34,27 @@ jobs:
         with:
           enable-cache: true
       - run: uv python pin ${{ matrix.python-version }}
-      - uses: hynek/build-and-inspect-python-package@v2
+      - name: Check paper-qa-pymupdf build
+        id: build-paper-qa-pymupdf
+        if: matrix.python-version == '3.11'
+        uses: hynek/build-and-inspect-python-package@v2
+        with:
+          path: packages/paper-qa-pymupdf
+          upload-name-suffix: -paper-qa-pymupdf
+      - name: Clean up paper-qa-pymupdf build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
+        if: matrix.python-version == '3.11'
+        run: rm -r ${{ steps.build-paper-qa-pymupdf.outputs.dist }}
+      - name: Check paper-qa build
+        id: build-paper-qa
+        if: matrix.python-version == '3.11'
+        uses: hynek/build-and-inspect-python-package@v2
+        with:
+          upload-name-suffix: -paper-qa
+      - name: Clean up paper-qa build # Work around https://github.com/hynek/build-and-inspect-python-package/issues/174
+        if: matrix.python-version == '3.11'
+        run: rm -r ${{ steps.build-paper-qa.outputs.dist }}
       - run: uv sync --python-preference=only-managed
-      - run: uv run pylint paperqa
+      - run: uv run pylint src packages
       - uses: suzuki-shunsuke/[email protected]
   test:
     runs-on: ubuntu-latest

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -311,4 +311,4 @@ tests/example2.*
 !tests/stub_data/.DS_Store
 
 # Client data
-paperqa/clients/client_data/retractions.csv
+src/paperqa/clients/client_data/retractions.csv

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ repos:
       - id: check-added-large-files
         exclude: |
           (?x)^(
-          paperqa/clients/client_data.*|
+          src/paperqa/clients/client_data.*|
           tests/stub_data.*
           )$
       - id: check-byte-order-marker

README.md

Lines changed: 67 additions & 67 deletions

packages/paper-qa-pymupdf/LICENSE

Lines changed: 661 additions & 0 deletions
Large diffs are not rendered by default.

packages/paper-qa-pymupdf/README.md

Lines changed: 10 additions & 0 deletions
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+[build-system]
+build-backend = "setuptools.build_meta"
+requires = ["setuptools>=64", "setuptools_scm>=8"]
+
+[project]
+authors = [
+  {email = "[email protected]", name = "FutureHouse technical staff"},
+]
+classifiers = [
+  "Intended Audience :: Developers",
+  "License :: OSI Approved :: GNU Affero General Public License v3",
+  "Operating System :: OS Independent",
+  "Programming Language :: Python :: 3 :: Only",
+  "Programming Language :: Python :: 3.11",
+  "Programming Language :: Python :: 3.12",
+  "Programming Language :: Python :: 3.13",
+  "Programming Language :: Python",
+  "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+dependencies = [
+  "PyMuPDF>=1.24.12",  # For pymupdf.set_messages addition
+  "paper-qa",
+]
+description = "PaperQA readers implemented using PyMuPDF"
+dynamic = ["version"]
+license = {file = "LICENSE"}
+maintainers = [
+  {email = "[email protected]", name = "James Braza"},
+  {email = "[email protected]", name = "Michael Skarlinski"},
+  {email = "[email protected]", name = "Andrew White"},
+]
+name = "paper-qa-pymupdf"
+readme = "README.md"
+requires-python = ">=3.11"
+
+[tool.ruff]
+extend = "../../pyproject.toml"
+
+[tool.setuptools.packages.find]
+where = ["src"]
+
+[tool.setuptools_scm]
+root = "../.."
+version_file = "src/paperqa_pymupdf/version.py"
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+from .reader import BLOCK_TEXT_INDEX, parse_pdf_to_pages, setup_pymupdf_python_logging
+
+__all__ = [
+    "BLOCK_TEXT_INDEX",
+    "parse_pdf_to_pages",
+    "setup_pymupdf_python_logging",
+]
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+import os
+
+import pymupdf
+from paperqa.types import ParsedMetadata, ParsedText
+from paperqa.utils import ImpossibleParsingError
+from paperqa.version import __version__ as pqa_version
+
+
+def setup_pymupdf_python_logging() -> None:
+    """
+    Configure PyMuPDF to use Python logging.
+
+    SEE: https://pymupdf.readthedocs.io/en/latest/app3.html#diagnostics
+    """
+    pymupdf.set_messages(pylogging=True)
+
+
+BLOCK_TEXT_INDEX = 4
+
+
+def parse_pdf_to_pages(
+    path: str | os.PathLike,
+    page_size_limit: int | None = None,
+    use_block_parsing: bool = False,
+    **_,
+) -> ParsedText:
+
+    with pymupdf.open(path) as file:
+        pages: dict[str, str] = {}
+        total_length = 0
+
+        for i in range(file.page_count):
+            try:
+                page = file.load_page(i)
+            except pymupdf.mupdf.FzErrorFormat as exc:
+                raise ImpossibleParsingError(
+                    f"Page loading via {pymupdf.__name__} failed on page {i} of"
+                    f" {file.page_count} for the PDF at path {path}, likely this PDF"
+                    " file is corrupt."
+                ) from exc
+
+            if use_block_parsing:
+                # NOTE: this block-based parsing appears to be better, but until
+                # fully validated on 1+ benchmarks, it's considered experimental
+
+                # Extract text blocks from the page
+                # Note: sort=False is important to preserve the order of text blocks
+                # as they appear in the PDF
+                blocks = page.get_text("blocks", sort=False)
+
+                # Concatenate text blocks into a single string
+                text = "\n".join(
+                    block[BLOCK_TEXT_INDEX]
+                    for block in blocks
+                    if len(block) > BLOCK_TEXT_INDEX
+                )
+            else:
+                text = page.get_text("text", sort=True)
+
+            if page_size_limit and len(text) > page_size_limit:
+                raise ImpossibleParsingError(
+                    f"The text in page {i} of {file.page_count} was {len(text)} chars"
+                    f" long, which exceeds the {page_size_limit} char limit for the PDF"
+                    f" at path {path}."
+                )
+            pages[str(i + 1)] = text
+            total_length += len(text)
+
+    metadata = ParsedMetadata(
+        parsing_libraries=[f"pymupdf ({pymupdf.__version__})"],
+        paperqa_version=pqa_version,
+        total_parsed_text_length=total_length,
+        parse_type="pdf",
+    )
+    return ParsedText(content=pages, metadata=metadata)
