Commit d318d0f

Add section on data sources
1 parent e8a2004 commit d318d0f

8 files changed (+199 −2 lines)


docs/source/index.rst

Lines changed: 2 additions & 1 deletion
@@ -71,9 +71,10 @@ Example

    user-guide/introduction
    user-guide/basics
-   user-guide/configuration
+   user-guide/data-sources
    user-guide/common-operations/index
    user-guide/io/index
+   user-guide/configuration
    user-guide/sql


docs/source/user-guide/common-operations/index.rst

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@
 Common Operations
 =================

+This section is designed to guide new users through how to use DataFusion.
+
 .. toctree::
    :maxdepth: 2

docs/source/user-guide/data-sources.rst

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _user_guide_data_sources:

Data Sources
============

DataFusion provides a wide variety of ways to get data into a DataFrame to perform operations.

Local file
----------

DataFusion has the ability to read from a variety of popular file formats, such as :ref:`Parquet <io_parquet>`,
:ref:`CSV <io_csv>`, :ref:`JSON <io_json>`, and :ref:`AVRO <io_avro>`.

.. ipython:: python

    from datafusion import SessionContext

    ctx = SessionContext()
    df = ctx.read_csv("pokemon.csv")
    df.show()
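
The other formats work the same way through their corresponding reader functions. A minimal
sketch, assuming local ``pokemon.parquet``, ``pokemon.json``, and ``pokemon.avro`` files exist
alongside the CSV:

.. code-block:: python

    # Each reader returns a DataFrame, just like read_csv above.
    df_parquet = ctx.read_parquet("pokemon.parquet")
    df_json = ctx.read_json("pokemon.json")
    df_avro = ctx.read_avro("pokemon.avro")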

Create in-memory
----------------

Sometimes it can be convenient to create a small DataFrame from a Python list or dictionary object.
To do this in DataFusion, you can use one of three functions:
:py:func:`~datafusion.context.SessionContext.from_pydict`,
:py:func:`~datafusion.context.SessionContext.from_pylist`, or
:py:func:`~datafusion.context.SessionContext.create_dataframe`.

As their names suggest, ``from_pydict`` and ``from_pylist`` create DataFrames from Python
dictionary and list objects, respectively. ``create_dataframe`` expects a list of lists of
`PyArrow Record Batches <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_.

The following three examples will all create identical DataFrames:

.. ipython:: python

    import pyarrow as pa

    ctx.from_pylist([
        {"a": 1, "b": 10.0, "c": "alpha"},
        {"a": 2, "b": 20.0, "c": "beta"},
        {"a": 3, "b": 30.0, "c": "gamma"},
    ]).show()

    ctx.from_pydict({
        "a": [1, 2, 3],
        "b": [10.0, 20.0, 30.0],
        "c": ["alpha", "beta", "gamma"],
    }).show()

    batch = pa.RecordBatch.from_arrays(
        [
            pa.array([1, 2, 3]),
            pa.array([10.0, 20.0, 30.0]),
            pa.array(["alpha", "beta", "gamma"]),
        ],
        names=["a", "b", "c"],
    )

    ctx.create_dataframe([[batch]]).show()

Object Store
------------

DataFusion supports multiple storage options in addition to local files.
The example below requires an appropriate S3 account with access credentials.

The supported object stores are:

- :py:class:`~datafusion.object_store.AmazonS3`
- :py:class:`~datafusion.object_store.GoogleCloud`
- :py:class:`~datafusion.object_store.Http`
- :py:class:`~datafusion.object_store.LocalFileSystem`
- :py:class:`~datafusion.object_store.MicrosoftAzure`

.. code-block:: python

    import os

    from datafusion.object_store import AmazonS3

    region = "us-east-1"
    bucket_name = "yellow-trips"

    s3 = AmazonS3(
        bucket_name=bucket_name,
        region=region,
        access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
        secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    )

    path = f"s3://{bucket_name}/"
    ctx.register_object_store("s3://", s3, None)

    ctx.register_parquet("trips", path)

    ctx.table("trips").show()
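
The other stores follow the same register-then-read pattern. A sketch for an HTTP(S) server,
assuming ``Http`` takes the base URL and that a ``data.parquet`` file is hosted at the
hypothetical endpoint below:

.. code-block:: python

    from datafusion.object_store import Http

    # Hypothetical endpoint; substitute a real server hosting Parquet files.
    http_store = Http("https://example.com/")
    ctx.register_object_store("https://", http_store, None)

    ctx.register_parquet("web_data", "https://example.com/data.parquet")
    ctx.table("web_data").show()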

Other DataFrame Libraries
-------------------------

DataFusion can import DataFrames directly from other libraries, such as
`Polars <https://pola.rs/>`_ and `Pandas <https://pandas.pydata.org/>`_.
Since DataFusion version 42.0.0, any DataFrame library that supports the Arrow FFI PyCapsule
interface can be imported into DataFusion using the
:py:func:`~datafusion.context.SessionContext.from_arrow` function. Older versions of Polars may
not support the Arrow interface. In those cases, you can still import via the
:py:func:`~datafusion.context.SessionContext.from_polars` function.

.. ipython:: python

    import pandas as pd

    data = {"a": [1, 2, 3], "b": [10.0, 20.0, 30.0], "c": ["alpha", "beta", "gamma"]}
    pandas_df = pd.DataFrame(data)

    datafusion_df = ctx.from_arrow(pandas_df)
    datafusion_df.show()

    import polars as pl

    polars_df = pl.DataFrame(data)

    datafusion_df = ctx.from_arrow(polars_df)
    datafusion_df.show()
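
For a Polars version that predates PyCapsule support, the ``from_polars`` fallback mentioned
above is a one-line change; a sketch, reusing ``polars_df`` from the example above:

.. code-block:: python

    # Dedicated conversion path for older Polars versions.
    datafusion_df = ctx.from_polars(polars_df)
    datafusion_df.show()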

Delta Lake
----------

DataFusion 43.0.0 and later support registering table providers from sources such as
Delta Lake. This requires a recent version of
`deltalake <https://delta-io.github.io/delta-rs/>`_ to provide the necessary interfaces.

.. code-block:: python

    from deltalake import DeltaTable

    delta_table = DeltaTable("path_to_table")
    ctx.register_table_provider("my_delta_table", delta_table)
    df = ctx.table("my_delta_table")
    df.show()
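
Once registered, the table is also reachable by name from the SQL interface; a small
follow-on sketch:

.. code-block:: python

    # Registered table providers can be queried with SQL as well.
    ctx.sql("SELECT COUNT(*) FROM my_delta_table").show()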

On older versions of ``deltalake`` (prior to 0.22) you can use the
`Arrow DataSet <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html>`_
interface to import to DataFusion, but this does not support features such as filter pushdown,
which can lead to a significant performance difference.

.. code-block:: python

    from deltalake import DeltaTable

    delta_table = DeltaTable("path_to_table")
    ctx.register_dataset("my_delta_table", delta_table.to_pyarrow_dataset())
    df = ctx.table("my_delta_table")
    df.show()

Iceberg
-------

Coming soon!

Custom Table Provider
---------------------

You can implement a custom table provider in Rust and expose it to DataFusion through the
interface described in the :ref:`Custom Table Provider <io_custom_table_provider>`
section. This is an advanced topic, but a
`user example <https://github.com/apache/datafusion-python/tree/main/examples/ffi-table-provider>`_
is provided in the DataFusion repository.

docs/source/user-guide/io/avro.rst

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.

+.. _io_avro:
+
 Avro
 ====

docs/source/user-guide/io/csv.rst

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.

+.. _io_csv:
+
 CSV
 ===

docs/source/user-guide/io/json.rst

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.

+.. _io_json:
+
 JSON
 ====
 `JSON <https://www.json.org/json-en.html>`_ (JavaScript Object Notation) is a lightweight data-interchange format.

docs/source/user-guide/io/parquet.rst

Lines changed: 2 additions & 1 deletion
@@ -15,14 +15,15 @@
 .. specific language governing permissions and limitations
 .. under the License.

+.. _io_parquet:
+
 Parquet
 =======

 It is quite simple to read a parquet file using the :py:func:`~datafusion.context.SessionContext.read_parquet` function.

 .. code-block:: python

-
     from datafusion import SessionContext

     ctx = SessionContext()

docs/source/user-guide/io/table_provider.rst

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@
 .. specific language governing permissions and limitations
 .. under the License.

+.. _io_custom_table_provider:
+
 Custom Table Provider
 =====================
