.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _user_guide_data_sources:

Data Sources
============

DataFusion provides a wide variety of ways to get data into a DataFrame to perform operations.

Local file
----------

DataFusion can read from a variety of popular file formats, such as :ref:`Parquet <io_parquet>`,
:ref:`CSV <io_csv>`, :ref:`JSON <io_json>`, and :ref:`AVRO <io_avro>`.

.. ipython:: python

    from datafusion import SessionContext
    ctx = SessionContext()
    df = ctx.read_csv("pokemon.csv")
    df.show()

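The other file formats follow the same pattern. A minimal sketch, assuming Parquet, JSON, and
AVRO files with these hypothetical names exist locally:

.. code-block:: python

    # File names are placeholders; point these at any local files
    df = ctx.read_parquet("pokemon.parquet")
    df = ctx.read_json("pokemon.json")
    df = ctx.read_avro("pokemon.avro")
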
Create in-memory
----------------

Sometimes it can be convenient to create a small DataFrame from a Python list or dictionary object.
To do this in DataFusion, you can use one of the three functions
:py:func:`~datafusion.context.SessionContext.from_pydict`,
:py:func:`~datafusion.context.SessionContext.from_pylist`, or
:py:func:`~datafusion.context.SessionContext.create_dataframe`.

As their names suggest, ``from_pydict`` and ``from_pylist`` create DataFrames from Python
dictionary and list objects, respectively. ``create_dataframe`` expects a list of lists of
`PyArrow Record Batches <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_.

The following three examples all create identical DataFrames:

.. ipython:: python

    import pyarrow as pa

    ctx.from_pylist([
        { "a": 1, "b": 10.0, "c": "alpha" },
        { "a": 2, "b": 20.0, "c": "beta" },
        { "a": 3, "b": 30.0, "c": "gamma" },
    ]).show()

    ctx.from_pydict({
        "a": [1, 2, 3],
        "b": [10.0, 20.0, 30.0],
        "c": ["alpha", "beta", "gamma"],
    }).show()

    batch = pa.RecordBatch.from_arrays(
        [
            pa.array([1, 2, 3]),
            pa.array([10.0, 20.0, 30.0]),
            pa.array(["alpha", "beta", "gamma"]),
        ],
        names=["a", "b", "c"],
    )

    ctx.create_dataframe([[batch]]).show()

Object Store
------------

DataFusion supports multiple storage options in addition to local files.
The example below requires an appropriate S3 account with access credentials.

The supported object stores are:

- :py:class:`~datafusion.object_store.AmazonS3`
- :py:class:`~datafusion.object_store.GoogleCloud`
- :py:class:`~datafusion.object_store.Http`
- :py:class:`~datafusion.object_store.LocalFileSystem`
- :py:class:`~datafusion.object_store.MicrosoftAzure`

.. code-block:: python

    import os

    from datafusion.object_store import AmazonS3

    region = "us-east-1"
    bucket_name = "yellow-trips"

    s3 = AmazonS3(
        bucket_name=bucket_name,
        region=region,
        access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
        secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    )

    path = f"s3://{bucket_name}/"
    ctx.register_object_store("s3://", s3, None)

    ctx.register_parquet("trips", path)

    ctx.table("trips").show()

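Once registered, the table can also be queried with SQL through
:py:func:`~datafusion.context.SessionContext.sql`; a minimal sketch:

.. code-block:: python

    # Equivalent to reading the registered "trips" table via the DataFrame API
    ctx.sql("SELECT COUNT(*) FROM trips").show()
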
Other DataFrame Libraries
-------------------------

DataFusion can import DataFrames directly from other libraries, such as
`Polars <https://pola.rs/>`_ and `Pandas <https://pandas.pydata.org/>`_.
Since DataFusion version 42.0.0, any DataFrame library that supports the Arrow FFI PyCapsule
interface can be imported into DataFusion using the
:py:func:`~datafusion.context.SessionContext.from_arrow` function. Older versions of Polars may
not support the Arrow interface. In those cases, you can still import via the
:py:func:`~datafusion.context.SessionContext.from_polars` function, as shown after the
examples below.

.. ipython:: python

    import pandas as pd

    data = { "a": [1, 2, 3], "b": [10.0, 20.0, 30.0], "c": ["alpha", "beta", "gamma"] }
    pandas_df = pd.DataFrame(data)

    datafusion_df = ctx.from_arrow(pandas_df)
    datafusion_df.show()

    import polars as pl
    polars_df = pl.DataFrame(data)

    datafusion_df = ctx.from_arrow(polars_df)
    datafusion_df.show()

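With an older Polars release, the fallback is a one-line change; a minimal sketch using the
``polars_df`` from above:

.. code-block:: python

    # Converts the Polars DataFrame directly instead of going through
    # the Arrow PyCapsule interface
    datafusion_df = ctx.from_polars(polars_df)
    datafusion_df.show()
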
Delta Lake
----------

DataFusion 43.0.0 and later support registering table providers from sources such
as Delta Lake. This requires a recent version of
`deltalake <https://delta-io.github.io/delta-rs/>`_ to provide the required interfaces.

.. code-block:: python

    from deltalake import DeltaTable

    delta_table = DeltaTable("path_to_table")
    ctx.register_table_provider("my_delta_table", delta_table)
    df = ctx.table("my_delta_table")
    df.show()

On older versions of ``deltalake`` (prior to 0.22) you can instead use the
`Arrow Dataset <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html>`_
interface to import into DataFusion. This approach does not support features such as filter
pushdown, which can lead to a significant performance difference.

.. code-block:: python

    from deltalake import DeltaTable

    delta_table = DeltaTable("path_to_table")
    ctx.register_dataset("my_delta_table", delta_table.to_pyarrow_dataset())
    df = ctx.table("my_delta_table")
    df.show()

Iceberg
-------

Coming soon!

Custom Table Provider
---------------------

You can implement a custom data provider in Rust and expose it to DataFusion through
the interface described in the :ref:`Custom Table Provider <io_custom_table_provider>`
section. This is an advanced topic, but a
`user example <https://github.com/apache/datafusion-python/tree/main/examples/ffi-table-provider>`_
is provided in the DataFusion repository.

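Once such a provider is built as a Python extension module, registering it looks like the
Delta Lake case above. A minimal sketch, where ``my_provider`` and ``MyTableProvider`` are
hypothetical names standing in for your own compiled module:

.. code-block:: python

    # Hypothetical module built from a Rust crate exposing the FFI table
    # provider interface; substitute your own names
    from my_provider import MyTableProvider

    provider = MyTableProvider()
    ctx.register_table_provider("my_custom_table", provider)
    ctx.table("my_custom_table").show()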