Commit 1ca77c1

Finishing docs and tests
1 parent 555459b commit 1ca77c1

4 files changed: 146 additions and 16 deletions
doc/source/development/extending.rst

Lines changed: 63 additions & 0 deletions
@@ -489,6 +489,69 @@ registers the default "matplotlib" backend as follows.
 More information on how to implement a third-party plotting backend can be found at
 https://github.com/pandas-dev/pandas/blob/main/pandas/plotting/__init__.py#L1.

+.. _extending.io-engines:
+
+IO engines
+----------
+
+pandas provides several IO connectors such as :func:`read_csv` or
+:meth:`DataFrame.to_parquet`, and many of those support multiple engines. For example,
+:func:`read_csv` supports the ``python``, ``c`` and ``pyarrow`` engines, each with its own
+advantages and disadvantages, making each more appropriate for certain use cases.
+
+Third-party package developers can implement engines for any of the pandas readers and writers.
+When a ``pandas.read_*`` function or ``DataFrame.to_*`` method is called with an ``engine="<name>"``
+that is not known to pandas, pandas will look into the entry points registered in the group
+``pandas.io_engine`` by the packages in the environment, and will call the corresponding method.
+
+An engine is a simple Python class which implements one or more of the pandas readers and writers
+as class methods:
+
+.. code-block:: python
+
+    class EmptyDataEngine:
+        @classmethod
+        def read_json(cls, path_or_buf=None, **kwargs):
+            return pd.DataFrame()
+
+        @classmethod
+        def to_json(cls, path_or_buf=None, **kwargs):
+            with open(path_or_buf, "w") as f:
+                f.write("")
+
+        @classmethod
+        def read_clipboard(cls, sep='\\s+', dtype_backend=None, **kwargs):
+            return pd.DataFrame()
+
+A single engine can support multiple readers and writers. When possible, it is good practice for
+an engine to provide both a reader and a writer for the supported formats. But it is possible to
+provide just one of them.
+
+The package implementing the engine needs to create an entry point for pandas to be able to discover
+it. This is done in ``pyproject.toml``:
+
+.. code-block:: toml
+
+    [project.entry-points."pandas.io_engine"]
+    empty = "empty_data:EmptyDataEngine"
+
+The first line should always be the same, creating the entry point in the ``pandas.io_engine`` group.
+In the second line, ``empty`` is the name of the engine, and ``empty_data:EmptyDataEngine`` is where
+to find the engine class in the package (``empty_data`` is the module name in this case).
+
+If a user has the example package installed, then it would be possible to use:
+
+.. code-block:: python
+
+    pd.read_json("myfile.json", engine="empty")
+
+When pandas detects that no ``empty`` engine exists for the ``read_json`` reader in pandas, it will
+look at the entry points, find the ``EmptyDataEngine`` engine, and call the ``read_json``
+method on it with the arguments provided by the user (except the ``engine`` parameter).
+
+To avoid conflicts in the names of engines, we keep an "IO engines" section in our
+`Ecosystem page <https://pandas.pydata.org/community/ecosystem.html#io-engines>`_.
+
 .. _extending.pandas_priority:

 Arithmetic with 3rd party types
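
For illustration only, and not part of this commit: a minimal sketch of the dispatch the new documentation describes, where the user's arguments are forwarded to the engine's class method without the ``engine`` parameter. The ``REGISTERED_ENGINES`` dict and the ``dispatch_reader`` helper are hypothetical names standing in for the entry point lookup, and the engine returns a plain dict rather than a ``DataFrame`` so the snippet runs without pandas installed.

```python
class EmptyDataEngine:
    """Stand-in engine that echoes what it received instead of reading data."""

    @classmethod
    def read_json(cls, path_or_buf=None, **kwargs):
        return {"source": path_or_buf, "options": kwargs}


# Hypothetical registry standing in for the "pandas.io_engine" entry point group.
REGISTERED_ENGINES = {"empty": EmptyDataEngine}


def dispatch_reader(func_name, *args, engine, **kwargs):
    """Call ``func_name`` on the named engine, dropping the ``engine`` argument."""
    try:
        engine_cls = REGISTERED_ENGINES[engine]
    except KeyError as err:
        raise ValueError(f"'{engine}' is not a known engine.") from err

    method = getattr(engine_cls, func_name, None)
    if method is None:
        raise ValueError(f"'{engine}' does not provide a '{func_name}'")

    # ``engine`` was captured by the keyword-only parameter, so it is not forwarded.
    return method(*args, **kwargs)


result = dispatch_reader("read_json", "myfile.json", engine="empty", lines=True)
print(result)
```

The two error messages follow the wording matched by the tests in this commit; everything else in the sketch is an assumption about how such a dispatcher could look.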

pandas/io/common.py

Lines changed: 9 additions & 11 deletions
@@ -1340,11 +1340,10 @@ def _get_io_engine(name: str):
         for entry_point in entry_points().select(group="pandas.io_engine"):
             package_name = entry_point.dist.metadata["Name"]
             if entry_point.name in _io_engines:
-                _io_engines[entry_point.name]._other_providers.append(package_name)
+                _io_engines[entry_point.name]._packages.append(package_name)
             else:
                 _io_engines[entry_point.name] = entry_point.load()
-                _io_engines[entry_point.name]._provider_name = package_name
-                _io_engines[entry_point.name]._other_providers = []
+                _io_engines[entry_point.name]._packages = [package_name]

     try:
         engine = _io_engines[name]
@@ -1354,23 +1353,22 @@ def _get_io_engine(name: str):
             "after installing the package that provides them."
         ) from err

-    if engine._other_providers:
+    if len(engine._packages) > 1:
         msg = (
             f"The engine '{name}' has been registered by the package "
-            f"'{engine._provider_name}' and will be used. "
+            f"'{engine._packages[0]}' and will be used. "
         )
-        if len(engine._other_providers):
+        if len(engine._packages) == 2:
             msg += (
-                "The package '{engine._other_providers}' also tried to register "
+                f"The package '{engine._packages[1]}' also tried to register "
                 "the engine, but it couldn't because it was already registered."
             )
         else:
             msg += (
-                "Other packages that tried to register the engine, but they couldn't "
-                "because it was already registered are: "
-                f"{str(engine._other_providers)[1:-1]}."
+                f"The packages {str(engine._packages[1:])[1:-1]} also tried to register "
+                "the engine, but they couldn't because it was already registered."
             )
-        warnings.warn(RuntimeWarning, msg, stacklevel=find_stack_level())
+        warnings.warn(msg, RuntimeWarning, stacklevel=find_stack_level())

     return engine
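
For illustration only, and not part of this commit: a self-contained sketch of the registration logic in the hunks above, where the first package to register a name wins and later packages are only recorded so a warning can name them. ``fake_entry_point``, ``build_registry`` and ``get_engine`` are hypothetical names, the entry points are faked with ``SimpleNamespace`` (as the tests in this commit do), and the warning text is shortened.

```python
import warnings
from types import SimpleNamespace


def fake_entry_point(name, package):
    """Hypothetical stand-in for an ``importlib.metadata.EntryPoint``."""
    return SimpleNamespace(
        name=name,
        dist=SimpleNamespace(metadata={"Name": package}),
        load=lambda: SimpleNamespace(read_foo=lambda fname: package),
    )


def build_registry(entry_points):
    """Load the first engine per name and record every package that claims it."""
    registry = {}
    for entry_point in entry_points:
        package_name = entry_point.dist.metadata["Name"]
        if entry_point.name in registry:
            registry[entry_point.name]._packages.append(package_name)
        else:
            engine = entry_point.load()
            engine._packages = [package_name]
            registry[entry_point.name] = engine
    return registry


def get_engine(registry, name):
    """Return the engine, warning when several packages registered its name."""
    engine = registry[name]
    if len(engine._packages) > 1:
        msg = (
            f"The engine '{name}' has been registered by the package "
            f"'{engine._packages[0]}' and will be used. "
        )
        if len(engine._packages) == 2:
            msg += f"The package '{engine._packages[1]}' also tried to register it."
        else:
            # str(list)[1:-1] drops the surrounding brackets of the list repr.
            msg += (
                f"The packages {str(engine._packages[1:])[1:-1]} "
                "also tried to register it."
            )
        warnings.warn(msg, RuntimeWarning, stacklevel=2)
    return engine


registry = build_registry(
    [
        fake_entry_point("duplicate", "package1"),
        fake_entry_point("duplicate", "package2"),
        fake_entry_point("duplicate", "package3"),
    ]
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    engine = get_engine(registry, "duplicate")
```

With three packages claiming the same name, the sketch takes the ``else`` branch, which is the one that depends on the slice and the f-string prefix being in the right place.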

pandas/tests/io/test_io_engines.py

Lines changed: 62 additions & 5 deletions
@@ -1,16 +1,57 @@
+from types import SimpleNamespace
+
 import pytest

+import pandas._testing as tm
+
 from pandas.io import common


+class _MockIoEngine:
+    @classmethod
+    def read_foo(cls, fname):
+        return "third-party"
+
+
 @pytest.fixture
 def patch_engine(monkeypatch):
-    class MockIoEngine:
-        @classmethod
-        def read_foo(cls, fname):
-            return "third-party"
+    monkeypatch.setattr(common, "_get_io_engine", lambda name: _MockIoEngine)
+
+
+@pytest.fixture
+def patch_entry_points(monkeypatch):
+    class MockEntryPoint:
+        name = "myengine"
+        dist = SimpleNamespace(metadata={"Name": "mypackage"})
+
+        @staticmethod
+        def load():
+            return _MockIoEngine

-    monkeypatch.setattr(common, "_get_io_engine", lambda name: MockIoEngine)
+    class MockDuplicate1:
+        name = "duplicate"
+        dist = SimpleNamespace(metadata={"Name": "package1"})
+
+        @staticmethod
+        def load():
+            return SimpleNamespace(read_foo=lambda fname: "dup1")
+
+    class MockDuplicate2:
+        name = "duplicate"
+        dist = SimpleNamespace(metadata={"Name": "package2"})
+
+        @staticmethod
+        def load():
+            return SimpleNamespace(read_foo=lambda fname: "dup1")
+
+    monkeypatch.setattr(common, "_io_engines", None)
+    monkeypatch.setattr(
+        common,
+        "entry_points",
+        lambda: SimpleNamespace(
+            select=lambda group: [MockEntryPoint, MockDuplicate1, MockDuplicate2]
+        ),
+    )


 class TestIoEngines:
@@ -46,3 +87,19 @@ def read_bar(fname, engine=None):
         msg = "'third-party' does not provide a 'read_bar'"
         with pytest.raises(ValueError, match=msg):
             read_bar("myfile.foo", engine="third-party")
+
+    def test_correct_io_engine(self, patch_entry_points):
+        result = common._get_io_engine("myengine")
+        assert result is _MockIoEngine
+
+    def test_unknown_io_engine(self, patch_entry_points):
+        with pytest.raises(ValueError, match="'unknown' is not a known engine"):
+            common._get_io_engine("unknown")
+
+    def test_duplicate_engine(self, patch_entry_points):
+        with tm.assert_produces_warning(
+            RuntimeWarning,
+            match="'duplicate' has been registered by the package 'package1'",
+        ):
+            result = common._get_io_engine("duplicate")
+        assert hasattr(result, "read_foo")

web/pandas/community/ecosystem.md

Lines changed: 12 additions & 0 deletions
@@ -712,6 +712,18 @@ authors to coordinate on the namespace.
 | [staircase](https://www.staircase.dev/) | `sc` | `Series`, `DataFrame` |
 | [woodwork](https://github.com/alteryx/woodwork) | `slice` | `Series`, `DataFrame` |

+## IO engines
+
+The following table lists the third-party [IO engines](https://pandas.pydata.org/docs/development/extending.html#io-engines)
+available to `read_*` functions and `DataFrame.to_*` methods.
+
+| Engine name | Library | Supported formats |
+| ----------- | ------- | ----------------- |
+|             |         |                   |
+
+IO engines can be used by specifying the engine when calling a reader or writer
+(e.g. `pd.read_csv("myfile.csv", engine="myengine")`).
+
 ## Development tools

 ### [pandas-stubs](https://github.com/VirtusLab/pandas-stubs)
