|
20 | 20 | Concepts |
21 | 21 | ======== |
22 | 22 |
|
23 | | -In this section, we will cover a basic example to introduce a few key concepts. |
| 23 | +In this section, we will cover a basic example to introduce a few key concepts. We will use the same |
| 24 | +source file as described in the :ref:`Introduction <guide>`, the Pokemon data set. |
24 | 25 |
|
25 | | -.. code-block:: python |
| 26 | +.. ipython:: python |
26 | 27 |
|
27 | | - import datafusion |
28 | | - from datafusion import col |
29 | | - import pyarrow |
| 28 | + from datafusion import SessionContext, col, functions as F |
30 | 29 |
|
31 | | - # create a context |
32 | | - ctx = datafusion.SessionContext() |
| 30 | + ctx = SessionContext() |
33 | 31 |
|
34 | | - # create a RecordBatch and a new DataFrame from it |
35 | | - batch = pyarrow.RecordBatch.from_arrays( |
36 | | - [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])], |
37 | | - names=["a", "b"], |
38 | | - ) |
39 | | - df = ctx.create_dataframe([[batch]]) |
| 32 | + df = ctx.read_csv("pokemon.csv") |
40 | 33 |
|
41 | | - # create a new statement |
42 | 34 | df = df.select( |
43 | | - col("a") + col("b"), |
44 | | - col("a") - col("b"), |
| 35 | + '"Name"', |
| 36 | + (col('"Attack"') - col('"Defense"')).alias("delta"), |
| 37 | + col('"Speed"') |
45 | 38 | ) |
46 | 39 |
|
47 | | - # execute and collect the first (and only) batch |
48 | | - result = df.collect()[0] |
| 40 | + df.show() |
49 | 41 |
|
50 | | -The first statement group: |
| 42 | +Session Context |
| 43 | +--------------- |
| 44 | + |
| 45 | +The first statement group creates a :py:class:`~datafusion.context.SessionContext`. |
51 | 46 |
|
52 | 47 | .. code-block:: python |
53 | 48 |
|
54 | 49 | # create a context |
55 | 50 | ctx = SessionContext() |
56 | 51 |
|
57 | | -creates a :py:class:`~datafusion.context.SessionContext`, that is, the main interface for executing queries with DataFusion. It maintains the state |
58 | | -of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality: |
| 52 | +A Session Context is the main interface for executing queries with DataFusion. It maintains the state |
| 53 | +of the connection between a user and an instance of the DataFusion engine. Additionally, it provides |
| 54 | +the following functionality: |
59 | 55 |
|
60 | | -- Create a DataFrame from a CSV or Parquet data source. |
61 | | -- Register a CSV or Parquet data source as a table that can be referenced from a SQL query. |
62 | | -- Register a custom data source that can be referenced from a SQL query. |
| 56 | +- Create a DataFrame from a data source. |
| 57 | +- Register a data source as a table that can be referenced from a SQL query. |
63 | 58 | - Execute a SQL query |
64 | 59 |
|
| 60 | +DataFrame |
| 61 | +--------- |
| 62 | + |
65 | 63 | The second statement group creates a :code:`DataFrame`: |
66 | 64 |
|
67 | 65 | .. code-block:: python |
68 | 66 |
|
69 | | - # create a RecordBatch and a new DataFrame from it |
70 | | - batch = pyarrow.RecordBatch.from_arrays( |
71 | | - [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])], |
72 | | - names=["a", "b"], |
73 | | - ) |
74 | | - df = ctx.create_dataframe([[batch]]) |
| 67 | + # Create a DataFrame from a file |
| 68 | + df = ctx.read_csv("pokemon.csv") |
75 | 69 |
|
76 | 70 | A DataFrame refers to a (logical) set of rows that share the same column names, similar to a `Pandas DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_. |
77 | 71 | DataFrames are typically created by calling a method on :py:class:`~datafusion.context.SessionContext`, such as :code:`read_csv`, and can then be modified by |
78 | 72 | calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`, |
79 | 73 | and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition. |
80 | 74 |
|
| 75 | +Expressions |
| 76 | +----------- |
| 77 | + |
81 | 78 | The third statement uses :code:`Expressions` to build up a query definition. |
82 | 79 |
|
83 | 80 | .. code-block:: python |
84 | 81 |
|
85 | 82 | df = df.select( |
86 | | - col("a") + col("b"), |
87 | | - col("a") - col("b"), |
| 83 | + '"Name"', |
| 84 | + (col('"Attack"') - col('"Defense"')).alias("delta"), |
| 85 | + col('"Speed"') |
88 | 86 | ) |
89 | 87 |
|
90 | | -Finally the :py:func:`~datafusion.dataframe.DataFrame.collect` method converts the logical plan represented by the DataFrame into a physical plan and execute it, |
91 | | -collecting all results into a list of `RecordBatch <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_. |
| 88 | +Finally, the :py:func:`~datafusion.dataframe.DataFrame.show` method converts the logical plan |
| 89 | +represented by the DataFrame into a physical plan and executes it, collecting all results and |
| 90 | +displaying them to the user. Note that DataFusion evaluates DataFrames lazily: until you call |
| 91 | +a method such as :py:func:`~datafusion.dataframe.DataFrame.show` |
| 92 | +or :py:func:`~datafusion.dataframe.DataFrame.collect`, DataFusion will not perform the query. |