
Commit f91c835

Merge branch 'main' into fix-plotly-utf8

2 parents 5d3af7a + 170cc68

File tree: 19 files changed (+1256, −14 lines)

kedro-airflow/pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -98,5 +98,5 @@ select = [
 ]
 ignore = ["E501"]  # Black takes care of line-too-long
 
-[tool.ruff.per-file-ignores]
+[tool.ruff.lint.per-file-ignores]
 "{tests,features}/*" = ["T201", "PLR2004", "PLR0915", "PLW1510"]
```
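For context (not part of the diff): Ruff moved lint-specific settings under the `[tool.ruff.lint]` namespace and deprecated the old top-level tables, so `per-file-ignores` now lives one level deeper. A sketch of the renamed table as it would appear in `pyproject.toml`:

```toml
# Old (deprecated): [tool.ruff.per-file-ignores]
# New namespace for lint settings:
[tool.ruff.lint.per-file-ignores]
"{tests,features}/*" = ["T201", "PLR2004", "PLR0915", "PLW1510"]
```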

kedro-datasets/RELEASE.md

Lines changed: 18 additions & 2 deletions
```diff
@@ -1,11 +1,26 @@
 # Upcoming Release
+
 ## Major features and improvements
 
 - Group datasets documentation according to the dependencies to clean up the nav bar.
 
+- Added the following new **experimental** datasets:
+
+| Type                                | Description                                               | Location                                |
+| ----------------------------------- | --------------------------------------------------------- | --------------------------------------- |
+| `langchain.LangChainPromptDataset`  | Kedro dataset for loading LangChain prompts               | `kedro_datasets_experimental.langchain` |
+
 ## Bug fixes and other changes
 - Add HTMLPreview type.
 
+## Major features and improvements
+
+- Added the following new experimental datasets:
+
+| Type               | Description                                              | Location                            |
+|--------------------|----------------------------------------------------------|-------------------------------------|
+| `pypdf.PDFDataset` | A dataset to read PDF files and extract text using pypdf | `kedro_datasets_experimental.pypdf` |
+
 # Release 8.1.0
 ## Major features and improvements
 
@@ -15,14 +30,16 @@
 | ------------------------------ | ------------------------------------------------------------- | ------------------------------------ |
 | `polars.PolarsDatabaseDataset` | A dataset to load and save data to a SQL backend using Polars | `kedro_datasets_experimental.polars` |
 
+- Added `mode` save argument to `ibis.TableDataset`, supporting "append", "overwrite", "error"/"errorifexists", and "ignore" save modes. The deprecated `overwrite` save argument is mapped to `mode` for backward compatibility and will be removed in a future release. Specifying both `mode` and `overwrite` results in an error.
+
 ## Bug fixes and other changes
 
 - Added primary key constraint to BaseTable.
 - Added save/load with `use_pyarrow=True` save_args for LazyPolarsDataset partitioned parquet files.
 - Updated the json schema for Kedro 1.0.0.
 
-## Breaking Changes
 ## Community contributions
+
 - [Minura Punchihewa](https://github.com/MinuraPunchihewa)
 - [gitgud5000](https://github.com/gitgud5000)
 
@@ -56,7 +73,6 @@ Many thanks to the following Kedroids for contributing PRs to this release:
 - [Seohyun Park](https://github.com/soyamimi)
 - [Daniel Russell-Brain](https://github.com/killerfridge)
 
-
 # Release 7.0.0
 
 ## Major features and improvements
```

kedro-datasets/docs/api/kedro_datasets_experimental/index.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -11,7 +11,9 @@ Name | Description
 [langchain.ChatCohereDataset](langchain.ChatCohereDataset.md) | ``ChatCohereDataset`` loads a ChatCohere `langchain` model.
 [langchain.ChatOpenAIDataset](langchain.ChatOpenAIDataset.md) | OpenAI dataset used to access credentials at runtime.
 [langchain.OpenAIEmbeddingsDataset](langchain.OpenAIEmbeddingsDataset.md) | ``OpenAIEmbeddingsDataset`` loads a OpenAIEmbeddings `langchain` model.
+[langchain.LangChainPromptDataset](langchain.LangChainPromptDataset.md) | ``LangChainPromptDataset`` loads a `langchain` prompt template.
 [netcdf.NetCDFDataset](netcdf.NetCDFDataset.md) | ``NetCDFDataset`` loads/saves data from/to a NetCDF file using an underlying filesystem (e.g.: local, S3, GCS). It uses xarray to handle the NetCDF file.
+[pypdf.PDFDataset](pypdf.PDFDataset.md) | ``PDFDataset`` loads data from PDF files using pypdf to extract text from pages. Read-only dataset.
 [polars.PolarsDatabaseDataset](polars.PolarsDatabaseDataset.md) | ``PolarsDatabaseDataset`` implementation to access databases as Polars DataFrames. It supports reading from a SQL query and writing to a database table.
 [prophet.ProphetModelDataset](prophet.ProphetModelDataset.md) | ``ProphetModelDataset`` loads/saves Facebook Prophet models to a JSON file using an underlying filesystem (e.g., local, S3, GCS). It uses Prophet's built-in serialisation to handle the JSON file.
 [pytorch.PyTorchDataset](pytorch.PyTorchDataset.md) | ``PyTorchDataset`` loads and saves PyTorch models' `state_dict` using PyTorch's recommended zipfile serialization protocol. To avoid security issues with Pickle.
```
Lines changed: 4 additions & 0 deletions
```diff
@@ -0,0 +1,4 @@
+::: kedro_datasets_experimental.langchain.LangChainPromptDataset
+    options:
+        members: true
+        show_source: true
```
Lines changed: 4 additions & 0 deletions
```diff
@@ -0,0 +1,4 @@
+::: kedro_datasets_experimental.pypdf.PDFDataset
+    options:
+        members: true
+        show_source: true
```

kedro-datasets/kedro_datasets/ibis/table_dataset.py

Lines changed: 70 additions & 5 deletions
```diff
@@ -1,18 +1,37 @@
 """Provide data loading and saving functionality for Ibis's backends."""
 from __future__ import annotations
 
+import sys
 from copy import deepcopy
+from enum import auto
 from typing import TYPE_CHECKING, Any, ClassVar
+from warnings import warn
+
+if sys.version_info >= (3, 11):
+    from enum import StrEnum  # pragma: no cover
+else:
+    from backports.strenum import StrEnum  # pragma: no cover
 
 import ibis.expr.types as ir
-from kedro.io import AbstractDataset
+from kedro.io import AbstractDataset, DatasetError
 
+from kedro_datasets import KedroDeprecationWarning
 from kedro_datasets._utils import ConnectionMixin
 
 if TYPE_CHECKING:
     from ibis import BaseBackend
 
 
+class SaveMode(StrEnum):
+    """`SaveMode` is used to specify the expected behavior of saving a table."""
+
+    APPEND = auto()
+    OVERWRITE = auto()
+    ERROR = auto()
+    ERRORIFEXISTS = auto()
+    IGNORE = auto()
+
+
 class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]):
     """`TableDataset` loads/saves data from/to Ibis table expressions.
```
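Not part of the diff — a minimal standalone sketch of why the new `SaveMode` enum is built on `StrEnum`: with `auto()` each member's value is its lowercase name, and members compare equal to plain strings, which is what lets the dataset construct `SaveMode(mode)` from a user-supplied string and later test `self._mode == "append"` directly. The fallback class here merely stands in for `backports.strenum` on Python < 3.11.

```python
import sys
from enum import Enum, auto

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    # Stand-in for backports.strenum on older interpreters (sketch only).
    class StrEnum(str, Enum):
        @staticmethod
        def _generate_next_value_(name, start, count, last_values):
            return name.lower()


class SaveMode(StrEnum):
    """Expected behavior when saving a table."""

    APPEND = auto()
    OVERWRITE = auto()
    ERROR = auto()
    ERRORIFEXISTS = auto()
    IGNORE = auto()


# auto() turns each member name into its lowercase string value, so members
# round-trip from user-supplied strings and compare equal to plain strings.
print(SaveMode("overwrite") is SaveMode.OVERWRITE)  # True
print(SaveMode.APPEND == "append")                  # True
```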
```diff
@@ -28,14 +47,18 @@ class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]):
             database: company.db
           save_args:
             materialized: table
+            mode: append
 
     motorbikes:
       type: ibis.TableDataset
       table_name: motorbikes
       connection:
         backend: duckdb
         database: company.db
-    ```
+      save_args:
+        materialized: view
+        mode: overwrite
+    ```
 
     Using the [Python API](https://docs.kedro.org/en/stable/catalog-data/advanced_data_catalog_usage/):
 
```
```diff
@@ -62,7 +85,7 @@ class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]):
     DEFAULT_LOAD_ARGS: ClassVar[dict[str, Any]] = {}
     DEFAULT_SAVE_ARGS: ClassVar[dict[str, Any]] = {
         "materialized": "view",
-        "overwrite": True,
+        "mode": "overwrite",
     }
 
     _CONNECTION_GROUP: ClassVar[str] = "ibis"
```
```diff
@@ -109,7 +132,12 @@ def __init__(  # noqa: PLR0913
                 `create_{materialized}` method. By default, ``ir.Table``
                 objects are materialized as views. To save a table using
                 a different materialization strategy, supply a value for
-                `materialized` in `save_args`.
+                `materialized` in `save_args`. The `mode` parameter controls
+                the behavior when saving data:
+                - _"overwrite"_: Overwrite existing data in the table.
+                - _"append"_: Append contents of the new data to the existing table (does not overwrite).
+                - _"error"_ or _"errorifexists"_: Throw an exception if the table already exists.
+                - _"ignore"_: Silently ignore the operation if the table already exists.
             metadata: Any arbitrary metadata. This is ignored by Kedro,
                 but may be consumed by users or external plugins.
         """
```
```diff
@@ -134,6 +162,28 @@ def __init__(  # noqa: PLR0913
 
         self._materialized = self._save_args.pop("materialized")
 
+        # Handle mode/overwrite conflict.
+        if save_args and "mode" in save_args and "overwrite" in self._save_args:
+            raise ValueError("Cannot specify both 'mode' and deprecated 'overwrite'.")
+
+        # Map legacy overwrite if present.
+        if "overwrite" in self._save_args:
+            warn(
+                "'overwrite' is deprecated and will be removed in a future release. "
+                "Please use 'mode' instead.",
+                KedroDeprecationWarning,
+                stacklevel=2,
+            )
+            legacy = self._save_args.pop("overwrite")
+            # Remove any lingering 'mode' key from defaults to avoid
+            # leaking into writer kwargs.
+            del self._save_args["mode"]
+            mode = "overwrite" if legacy else "error"
+        else:
+            mode = self._save_args.pop("mode")
+
+        self._mode = SaveMode(mode)
+
     def _connect(self) -> BaseBackend:
         import ibis  # noqa: PLC0415
```
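The backward-compatibility rule in the hunk above can be summarised as: `overwrite=True` maps to `mode="overwrite"`, `overwrite=False` maps to `mode="error"`, and supplying both keys is rejected. A minimal sketch of that mapping in isolation; `resolve_mode` is a hypothetical helper name, not part of the diff:

```python
def resolve_mode(save_args: dict) -> str:
    """Collapse the deprecated boolean 'overwrite' flag into a 'mode' string.

    Mirrors the constructor logic: giving both keys is an error;
    overwrite=True -> "overwrite", overwrite=False -> "error";
    otherwise 'mode' wins, defaulting to "overwrite".
    """
    if "mode" in save_args and "overwrite" in save_args:
        raise ValueError("Cannot specify both 'mode' and deprecated 'overwrite'.")
    if "overwrite" in save_args:
        return "overwrite" if save_args.pop("overwrite") else "error"
    return save_args.pop("mode", "overwrite")


print(resolve_mode({"overwrite": True}))   # overwrite
print(resolve_mode({"overwrite": False}))  # error
print(resolve_mode({"mode": "append"}))    # append
```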

```diff
@@ -151,7 +201,21 @@ def load(self) -> ir.Table:
 
     def save(self, data: ir.Table) -> None:
         writer = getattr(self.connection, f"create_{self._materialized}")
-        writer(self._table_name, data, **self._save_args)
+        if self._mode == "append":
+            if not self._exists():
+                writer(self._table_name, data, overwrite=False, **self._save_args)
+            elif hasattr(self.connection, "insert"):
+                self.connection.insert(self._table_name, data, **self._save_args)
+            else:
+                raise DatasetError(
+                    f"The {self.connection.name} backend for Ibis does not support inserts."
+                )
+        elif self._mode == "overwrite":
+            writer(self._table_name, data, overwrite=True, **self._save_args)
+        elif self._mode in {"error", "errorifexists"}:
+            writer(self._table_name, data, overwrite=False, **self._save_args)
+        elif self._mode == "ignore" and not self._exists():
+            writer(self._table_name, data, overwrite=False, **self._save_args)
 
     def _describe(self) -> dict[str, Any]:
         load_args = deepcopy(self._load_args)
```
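The four-way dispatch in `save` above can be exercised without Ibis at all. The sketch below replays the same branching against a toy in-memory backend (`InMemoryBackend` and this standalone `save` are illustrative stand-ins, not code from the diff): append creates the table on first save and inserts afterwards, overwrite replaces, error/errorifexists refuse an existing table, and ignore is a silent no-op.

```python
class InMemoryBackend:
    """Toy stand-in for an Ibis backend: a table is just a list of rows."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, rows, overwrite=False):
        if name in self.tables and not overwrite:
            raise ValueError(f"table {name!r} already exists")
        self.tables[name] = list(rows)

    def insert(self, name, rows):
        self.tables[name].extend(rows)


def save(backend, name, rows, mode):
    """Dispatch on mode the same way TableDataset.save does."""
    if mode == "append":
        if name not in backend.tables:
            backend.create_table(name, rows, overwrite=False)
        else:
            backend.insert(name, rows)
    elif mode == "overwrite":
        backend.create_table(name, rows, overwrite=True)
    elif mode in {"error", "errorifexists"}:
        backend.create_table(name, rows, overwrite=False)
    elif mode == "ignore" and name not in backend.tables:
        backend.create_table(name, rows, overwrite=False)


db = InMemoryBackend()
save(db, "cars", [1, 2], mode="append")  # table absent: created
save(db, "cars", [3], mode="append")     # table present: rows inserted
save(db, "cars", [9], mode="ignore")     # table present: silent no-op
print(db.tables["cars"])                 # [1, 2, 3]
save(db, "cars", [9], mode="overwrite")
print(db.tables["cars"])                 # [9]
```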
```diff
@@ -165,6 +229,7 @@ def _describe(self) -> dict[str, Any]:
             "load_args": load_args,
             "save_args": save_args,
             "materialized": self._materialized,
+            "mode": self._mode,
         }
 
     def _exists(self) -> bool:
```

kedro-datasets/kedro_datasets_experimental/langchain/__init__.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -7,19 +7,23 @@
     from ._anthropic import ChatAnthropicDataset
     from ._cohere import ChatCohereDataset
     from ._openai import ChatOpenAIDataset, OpenAIEmbeddingsDataset
+    from .langchain_prompt_dataset import LangChainPromptDataset
+
 except (ImportError, RuntimeError):
     # For documentation builds that might fail due to dependency issues
     # https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
     ChatAnthropicDataset: Any
     ChatOpenAIDataset: Any
     OpenAIEmbeddingsDataset: Any
     ChatCohereDataset: Any
+    LangChainPromptDataset: Any
 
 __getattr__, __dir__, __all__ = lazy.attach(
     __name__,
     submod_attrs={
         "_openai": ["ChatOpenAIDataset", "OpenAIEmbeddingsDataset"],
         "_anthropic": ["ChatAnthropicDataset"],
         "_cohere": ["ChatCohereDataset"],
+        "langchain_prompt_dataset": ["LangChainPromptDataset"],
     },
 )
```
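The `lazy.attach` call in the diff above comes from the scientific-python `lazy_loader` package: it installs a module-level `__getattr__` (PEP 562) so a submodule is only imported the first time one of its attributes is accessed. A simplified sketch of that mechanism, demonstrated against the stdlib `email` package since `kedro_datasets_experimental` may not be installed; `make_lazy_getattr` is a hypothetical name, not the library's API:

```python
import importlib


def make_lazy_getattr(package: str, submod_attrs: dict[str, list[str]]):
    """Build a PEP 562-style module __getattr__ that imports submodules lazily."""
    # Invert the mapping: attribute name -> submodule that defines it.
    attr_to_mod = {attr: mod for mod, attrs in submod_attrs.items() for attr in attrs}

    def __getattr__(name: str):
        if name in attr_to_mod:
            # Import happens here, on first attribute access, not at package import.
            module = importlib.import_module(f"{package}.{attr_to_mod[name]}")
            return getattr(module, name)
        raise AttributeError(f"module {package!r} has no attribute {name!r}")

    return __getattr__


# Demonstrate against the stdlib 'email' package.
lazy_getattr = make_lazy_getattr("email", {"utils": ["formatdate"]})
from email.utils import formatdate
print(lazy_getattr("formatdate") is formatdate)  # True
```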
