Commit c297ad0

Merge pull request #1209 from kedro-org/feature/pdfdataset
feat(datasets): Added the Experimental pypdf.PDFDataset
2 parents ccc8077 + e28fe59 commit c297ad0

9 files changed, +352 -0 lines changed

kedro-datasets/RELEASE.md

Lines changed: 8 additions & 0 deletions
@@ -12,6 +12,14 @@
 ## Bug fixes and other changes
 - Add HTMLPreview type.

+## Major features and improvements
+
+- Added the following new experimental datasets:
+
+| Type               | Description                                              | Location                            |
+|--------------------|----------------------------------------------------------|-------------------------------------|
+| `pypdf.PDFDataset` | A dataset to read PDF files and extract text using pypdf | `kedro_datasets_experimental.pypdf` |
+
 # Release 8.1.0
 ## Major features and improvements

kedro-datasets/docs/api/kedro_datasets_experimental/index.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ Name | Description
 [langchain.OpenAIEmbeddingsDataset](langchain.OpenAIEmbeddingsDataset.md) | ``OpenAIEmbeddingsDataset`` loads a OpenAIEmbeddings `langchain` model.
 [langchain.LangChainPromptDataset](langchain.LangChainPromptDataset.md) | ``LangChainPromptDataset`` loads a `langchain` prompt template.
 [netcdf.NetCDFDataset](netcdf.NetCDFDataset.md) | ``NetCDFDataset`` loads/saves data from/to a NetCDF file using an underlying filesystem (e.g.: local, S3, GCS). It uses xarray to handle the NetCDF file.
+[pypdf.PDFDataset](pypdf.PDFDataset.md) | ``PDFDataset`` loads data from PDF files using pypdf to extract text from pages. Read-only dataset.
 [polars.PolarsDatabaseDataset](polars.PolarsDatabaseDataset.md) | ``PolarsDatabaseDataset`` implementation to access databases as Polars DataFrames. It supports reading from a SQL query and writing to a database table.
 [prophet.ProphetModelDataset](prophet.ProphetModelDataset.md) | ``ProphetModelDataset`` loads/saves Facebook Prophet models to a JSON file using an underlying filesystem (e.g., local, S3, GCS). It uses Prophet's built-in serialisation to handle the JSON file.
 [pytorch.PyTorchDataset](pytorch.PyTorchDataset.md) | ``PyTorchDataset`` loads and saves PyTorch models' `state_dict` using PyTorch's recommended zipfile serialization protocol. To avoid security issues with Pickle.

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+::: kedro_datasets_experimental.pypdf.PDFDataset
+    options:
+      members: true
+      show_source: true

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+"""``AbstractDataset`` implementation to load data from PDF files using pypdf."""
+
+from typing import Any
+
+import lazy_loader as lazy
+
+try:
+    from .pdf_dataset import PDFDataset
+except (ImportError, RuntimeError):
+    # For documentation builds that might fail due to dependency issues
+    # https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
+    PDFDataset: Any
+
+__getattr__, __dir__, __all__ = lazy.attach(
+    __name__, submod_attrs={"pdf_dataset": ["PDFDataset"]}
+)
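
The `lazy.attach` call above defers importing `pdf_dataset` until `PDFDataset` is first accessed. As a rough illustration of that idea only, here is a standard-library sketch built on a module-level `__getattr__` (PEP 562). It is not how `lazy_loader` is implemented; `attach_lazy` and the `demo` module are hypothetical names introduced for this example, and `json` stands in for the real submodule.

```python
import importlib
import json
import types


def attach_lazy(module, attr_to_module):
    """Resolve selected attributes of `module` on first access (hypothetical helper)."""

    def __getattr__(name):
        if name not in attr_to_module:
            raise AttributeError(
                f"module {module.__name__!r} has no attribute {name!r}"
            )
        # The import only happens here, when the attribute is first requested.
        target = importlib.import_module(attr_to_module[name])
        value = getattr(target, name)
        setattr(module, name, value)  # cache so later lookups bypass __getattr__
        return value

    # Python consults a module's own __getattr__ when normal lookup fails.
    module.__getattr__ = __getattr__
    return module


demo = attach_lazy(types.ModuleType("demo"), {"JSONDecoder": "json"})

assert "JSONDecoder" not in vars(demo)   # nothing resolved yet
assert demo.JSONDecoder is json.JSONDecoder
assert "JSONDecoder" in vars(demo)       # cached after the first access
```

The practical effect for a dataset package is that importing the package does not pull in a heavy optional dependency until a dataset class is actually used.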
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
+"""``PDFDataset`` loads data from PDF files using an underlying
+filesystem (e.g.: local, S3, GCS). It uses pypdf to read and extract text from PDF files.
+"""
+from __future__ import annotations
+
+from copy import deepcopy
+from pathlib import PurePosixPath
+from typing import Any, NoReturn
+
+import fsspec
+import pypdf
+from kedro.io.core import (
+    AbstractDataset,
+    DatasetError,
+    get_filepath_str,
+    get_protocol_and_path,
+)
+
+
+class PDFDataset(AbstractDataset[NoReturn, list[str]]):
+    """``PDFDataset`` loads data from PDF files using an underlying
+    filesystem (e.g.: local, S3, GCS). It uses pypdf to read and extract text from PDF files.
+
+    This is a read-only dataset - saving is not supported.
+
+    Examples:
+        Using the [YAML API](https://docs.kedro.org/en/stable/catalog-data/data_catalog_yaml_examples/):
+
+        ```yaml
+        my_pdf_document:
+          type: pypdf.PDFDataset
+          filepath: data/01_raw/document.pdf
+
+        password_protected_pdf:
+          type: pypdf.PDFDataset
+          filepath: data/01_raw/protected.pdf
+          load_args:
+            password: "pass123" # pragma: allowlist secret
+
+        s3_pdf:
+          type: pypdf.PDFDataset
+          filepath: s3://your_bucket/document.pdf
+          credentials: dev_s3
+        ```
+
+        Using the [Python API](https://docs.kedro.org/en/stable/catalog-data/advanced_data_catalog_usage/):
+
+        >>> from kedro_datasets_experimental.pypdf import PDFDataset
+        >>>
+        >>> dataset = PDFDataset(filepath="data/document.pdf")
+        >>> pages = dataset.load()
+        >>> # pages is a list of strings, one per page
+        >>> assert isinstance(pages, list)
+        >>> assert all(isinstance(page, str) for page in pages)
+
+    """
+
+    DEFAULT_LOAD_ARGS: dict[str, Any] = {"strict": False}
+
+    def __init__(
+        self,
+        *,
+        filepath: str,
+        load_args: dict[str, Any] | None = None,
+        credentials: dict[str, Any] | None = None,
+        fs_args: dict[str, Any] | None = None,
+        metadata: dict[str, Any] | None = None,
+    ) -> None:
+        """Creates a new instance of ``PDFDataset`` pointing to a concrete PDF file
+        on a specific filesystem.
+
+        Args:
+            filepath: Filepath in POSIX format to a PDF file prefixed with a protocol like `s3://`.
+                If prefix is not provided, `file` protocol (local filesystem) will be used.
+                The prefix should be any protocol supported by ``fsspec``.
+            load_args: Pypdf options for loading PDF files (arguments passed
+                into ``pypdf.PdfReader``). Here you can find all available arguments:
+                https://pypdf.readthedocs.io/en/stable/modules/PdfReader.html
+                All defaults are preserved, except "strict", which is set to False.
+                Common options include:
+                - password (str): Password for encrypted PDFs
+                - strict (bool): Whether to raise errors on malformed PDFs (default: False)
+            credentials: Credentials required to get access to the underlying filesystem.
+                E.g. for ``GCSFileSystem`` it should look like `{"token": None}`.
+            fs_args: Extra arguments to pass into underlying filesystem class constructor
+                (e.g. `{"project": "my-project"}` for ``GCSFileSystem``), as well as
+                to pass to the filesystem's `open` method through nested keys
+                `open_args_load` and `open_args_save`.
+                Here you can find all available arguments for `open`:
+                https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open
+                All defaults are preserved.
+            metadata: Any arbitrary metadata.
+                This is ignored by Kedro, but may be consumed by users or external plugins.
+        """
+        _fs_args = deepcopy(fs_args) or {}
+        _fs_open_args_load = _fs_args.pop("open_args_load", {})
+        _credentials = deepcopy(credentials) or {}
+
+        super().__init__()
+
+        protocol, path = get_protocol_and_path(filepath)
+        if protocol == "file":
+            _fs_args.setdefault("auto_mkdir", True)
+
+        self._protocol = protocol
+        self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)
+        self._filepath = PurePosixPath(path)
+        self.metadata = metadata
+
+        # Handle default load and fs arguments
+        self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}
+        self._fs_open_args_load = _fs_open_args_load or {}
+
+    def _describe(self) -> dict[str, Any]:
+        return {
+            "filepath": self._filepath,
+            "protocol": self._protocol,
+            "load_args": self._load_args,
+        }
+
+    def load(self) -> list[str]:
+        """Loads data from a PDF file.
+
+        Returns:
+            list[str]: A list of strings, where each string contains the text extracted from one page.
+        """
+        load_path = get_filepath_str(self._filepath, self._protocol)
+
+        with self._fs.open(load_path, mode="rb", **self._fs_open_args_load) as fs_file:
+            pdf_reader = pypdf.PdfReader(stream=fs_file, **self._load_args)
+            pages = []
+            for page in pdf_reader.pages:
+                pages.append(page.extract_text())
+            return pages
+
+    def save(self, data: NoReturn) -> None:
+        """Saving to PDFDataset is not supported.
+
+        Args:
+            data: Data to save.
+
+        Raises:
+            DatasetError: Always raised as saving is not supported.
+        """
+        raise DatasetError("Saving to PDFDataset is not supported.")
+
+    def _exists(self) -> bool:
+        """Check if the PDF file exists.
+
+        Returns:
+            bool: True if the file exists, False otherwise.
+        """
+        load_path = get_filepath_str(self._filepath, self._protocol)
+        return self._fs.exists(load_path)
+
+    def _release(self) -> None:
+        """Release any cached filesystem information."""
+        self._invalidate_cache()
+
+    def _invalidate_cache(self) -> None:
+        """Invalidate underlying filesystem caches."""
+        filepath = get_filepath_str(self._filepath, self._protocol)
+        self._fs.invalidate_cache(filepath)
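
The constructor in this file does three things with its arguments: it copies `fs_args` before mutating it, pops the nested `open_args_load` key out of that copy, and layers user `load_args` over `DEFAULT_LOAD_ARGS`. The sketch below isolates just that handling with the standard library so the behaviour can be checked without `kedro`, `fsspec`, or `pypdf` installed. `resolve_args` is a hypothetical helper written for this illustration, and the `block_size` and `password` values are made-up inputs.

```python
from copy import deepcopy

DEFAULT_LOAD_ARGS = {"strict": False}


def resolve_args(load_args=None, fs_args=None):
    """Mirror the constructor's argument handling (hypothetical helper)."""
    # Work on a copy so the caller's dictionary is never mutated.
    fs_copy = deepcopy(fs_args) or {}
    # Arguments for the filesystem's open() are split off from constructor arguments.
    open_args_load = fs_copy.pop("open_args_load", {})
    # User-supplied load_args take precedence over the defaults.
    merged_load_args = {**DEFAULT_LOAD_ARGS, **(load_args or {})}
    return merged_load_args, fs_copy, open_args_load


user_fs_args = {"project": "my-project", "open_args_load": {"block_size": 1024}}
load_kwargs, fs_kwargs, open_kwargs = resolve_args({"password": "secret"}, user_fs_args)

assert load_kwargs == {"strict": False, "password": "secret"}
assert fs_kwargs == {"project": "my-project"}
assert open_kwargs == {"block_size": 1024}
assert "open_args_load" in user_fs_args  # the caller's dict is left untouched

# An explicit strict=True overrides the dataset default.
assert resolve_args({"strict": True})[0] == {"strict": True}
```

One consequence worth noting when reading `load` in the diff: it passes `mode="rb"` explicitly alongside `**self._fs_open_args_load`, so a `mode` key inside `open_args_load` would be supplied twice to `open`.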

kedro-datasets/kedro_datasets_experimental/tests/pypdf/__init__.py

Whitespace-only changes.
Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
+import shutil
+from pathlib import PurePosixPath
+
+import pypdf
+import pytest
+from fsspec.implementations.http import HTTPFileSystem
+from fsspec.implementations.local import LocalFileSystem
+from gcsfs import GCSFileSystem
+from kedro.io.core import PROTOCOL_DELIMITER, DatasetError
+from reportlab.lib.pagesizes import letter
+from reportlab.pdfgen import canvas
+from s3fs.core import S3FileSystem
+
+from kedro_datasets_experimental.pypdf import PDFDataset
+
+
+@pytest.fixture
+def filepath_pdf(tmp_path):
+    return (tmp_path / "test.pdf").as_posix()
+
+
+@pytest.fixture
+def pdf_dataset(filepath_pdf, load_args, fs_args):
+    return PDFDataset(filepath=filepath_pdf, load_args=load_args, fs_args=fs_args)
+
+
+@pytest.fixture
+def dummy_pdf_data(tmp_path):
+    """Create a simple PDF file for testing."""
+    filepath = tmp_path / "test_dummy.pdf"
+
+    # Create a simple PDF with pypdf
+    writer = pypdf.PdfWriter()
+
+    # Add page 1
+    page1 = pypdf.PageObject.create_blank_page(width=200, height=200)
+    writer.add_page(page1)
+
+    # Add page 2
+    page2 = pypdf.PageObject.create_blank_page(width=200, height=200)
+    writer.add_page(page2)
+
+    # Write to file
+    with open(filepath, "wb") as f:
+        writer.write(f)
+
+    return filepath
+
+
+@pytest.fixture
+def dummy_pdf_with_text(tmp_path):
+    """Create a PDF with actual text content."""
+    filepath = tmp_path / "test_with_text.pdf"
+
+    # Create PDF with reportlab
+    c = canvas.Canvas(str(filepath), pagesize=letter)
+
+    # Page 1
+    c.drawString(100, 750, "This is page 1")
+    c.drawString(100, 730, "Hello World")
+    c.showPage()
+
+    # Page 2
+    c.drawString(100, 750, "This is page 2")
+    c.drawString(100, 730, "Testing PDF Dataset")
+    c.showPage()
+
+    c.save()
+
+    return filepath
+
+
+class TestPDFDataset:
+    def test_save_raises_error(self, pdf_dataset):
+        """Test that saving raises an error."""
+        pattern = r"Saving to PDFDataset is not supported\."
+        with pytest.raises(DatasetError, match=pattern):
+            pdf_dataset.save(["some", "data"])
+
+    def test_load_pdf(self, dummy_pdf_data):
+        """Test loading a PDF file."""
+        dataset = PDFDataset(filepath=str(dummy_pdf_data))
+        pages = dataset.load()
+
+        assert isinstance(pages, list)
+        assert len(pages) == 2 # Two pages created in dummy_pdf_data
+        assert all(isinstance(page, str) for page in pages)
+
+    def test_load_pdf_with_text(self, dummy_pdf_with_text):
+        """Test loading a PDF with actual text content."""
+        dataset = PDFDataset(filepath=str(dummy_pdf_with_text))
+        pages = dataset.load()
+
+        assert len(pages) == 2
+        assert "page 1" in pages[0].lower()
+        assert "page 2" in pages[1].lower()
+
+    def test_exists(self, pdf_dataset, dummy_pdf_data):
+        """Test `exists` method invocation for both existing and
+        nonexistent dataset."""
+        assert not pdf_dataset.exists()
+
+        # Copy dummy PDF to the expected filepath
+        shutil.copy(dummy_pdf_data, pdf_dataset._filepath)
+
+        assert pdf_dataset.exists()
+
+    @pytest.mark.parametrize("load_args", [{"strict": True}], indirect=True)
+    def test_load_extra_params(self, pdf_dataset, load_args):
+        """Test overriding the default load arguments."""
+        for key, value in load_args.items():
+            assert pdf_dataset._load_args[key] == value
+
+    @pytest.mark.parametrize(
+        "fs_args",
+        [{"open_args_load": {"mode": "rb", "compression": "gzip"}}],
+        indirect=True,
+    )
+    def test_open_extra_args(self, pdf_dataset, fs_args):
+        assert pdf_dataset._fs_open_args_load == fs_args["open_args_load"]
+
+    def test_load_missing_file(self, pdf_dataset):
+        """Check the error when trying to load missing file."""
+        pattern = r"Failed while loading data from dataset kedro_datasets_experimental.pypdf.pdf_dataset.PDFDataset\(.*\)"
+        with pytest.raises(DatasetError, match=pattern):
+            pdf_dataset.load()
+
+    @pytest.mark.parametrize(
+        "filepath,instance_type",
+        [
+            ("s3://bucket/file.pdf", S3FileSystem),
+            ("file:///tmp/test.pdf", LocalFileSystem),
+            ("/tmp/test.pdf", LocalFileSystem), # nosec
+            ("gcs://bucket/file.pdf", GCSFileSystem),
+            ("https://example.com/file.pdf", HTTPFileSystem),
+        ],
+    )
+    def test_protocol_usage(self, filepath, instance_type):
+        dataset = PDFDataset(filepath=filepath)
+        assert isinstance(dataset._fs, instance_type)
+
+        path = filepath.split(PROTOCOL_DELIMITER, 1)[-1]
+
+        assert str(dataset._filepath) == path
+        assert isinstance(dataset._filepath, PurePosixPath)
+
+    def test_catalog_release(self, mocker):
+        fs_mock = mocker.patch("fsspec.filesystem").return_value
+        filepath = "test.pdf"
+        dataset = PDFDataset(filepath=filepath)
+        dataset.release()
+        fs_mock.invalidate_cache.assert_called_once_with(filepath)
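
`test_protocol_usage` above derives its expected path by splitting the input on `PROTOCOL_DELIMITER` and keeping what follows. The sketch below reproduces that expectation for the same five inputs using only the standard library. It assumes the delimiter is `"://"` and that a missing prefix means the `file` protocol, as the constructor docstring states; `split_protocol` is a hypothetical, simplified stand-in and not Kedro's `get_protocol_and_path`, which may treat some inputs differently.

```python
from pathlib import PurePosixPath

PROTOCOL_DELIMITER = "://"  # assumed value of kedro.io.core.PROTOCOL_DELIMITER


def split_protocol(filepath):
    """Split 'protocol://path' into its parts (hypothetical, simplified helper)."""
    if PROTOCOL_DELIMITER in filepath:
        protocol, path = filepath.split(PROTOCOL_DELIMITER, 1)
        return protocol, PurePosixPath(path)
    # No prefix: fall back to the local filesystem.
    return "file", PurePosixPath(filepath)


cases = {
    "s3://bucket/file.pdf": "s3",
    "file:///tmp/test.pdf": "file",
    "/tmp/test.pdf": "file",
    "gcs://bucket/file.pdf": "gcs",
    "https://example.com/file.pdf": "https",
}

for filepath, expected_protocol in cases.items():
    protocol, path = split_protocol(filepath)
    assert protocol == expected_protocol
    # Same expectation the test encodes: everything after the delimiter.
    assert str(path) == filepath.split(PROTOCOL_DELIMITER, 1)[-1]
```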

kedro-datasets/mkdocs.yml

Lines changed: 3 additions & 0 deletions
@@ -176,6 +176,7 @@ plugins:
 Experimental Specialized Formats:
   - api/kedro_datasets_experimental/prophet.ProphetModelDataset.md: Time series with Prophet
   - api/kedro_datasets_experimental/video.VideoDataset.md: Video file processing
+  - api/kedro_datasets_experimental/pypdf.PDFDataset.md: PDF file text extraction
   - api/kedro_datasets_experimental/netcdf.NetCDFDataset.md: NetCDF scientific data
   - api/kedro_datasets_experimental/rioxarray.GeoTIFFDataset.md: GeoTIFF raster data
   - api/kedro_datasets_experimental/polars.PolarsDatabaseDataset.md: Polars database connector
@@ -326,6 +327,8 @@ nav:
     - langchain.LangChainPromptDataset: api/kedro_datasets_experimental/langchain.LangChainPromptDataset.md
   - NetCDF:
     - netcdf.NetCDFDataset: api/kedro_datasets_experimental/netcdf.NetCDFDataset.md
+  - PyPDF:
+    - pypdf.PDFDataset: api/kedro_datasets_experimental/pypdf.PDFDataset.md
   - Polars:
     - polars.PolarsDatabaseDataset: api/kedro_datasets_experimental/polars.PolarsDatabaseDataset.md
   - Prophet:
