Commit cdfb5a8

Update basics doc to be a little more straightforward
1 parent 7fe5779 commit cdfb5a8

1 file changed

+35
-34
lines changed

docs/source/user-guide/basics.rst

Lines changed: 35 additions & 34 deletions
@@ -20,72 +20,73 @@
2020
Concepts
2121
========
2222

23-
In this section, we will cover a basic example to introduce a few key concepts.
23+
In this section, we will cover a basic example to introduce a few key concepts. We will use the same
24+
source file as described in the :ref:`Introduction <guide>`, the Pokemon data set.
2425

25-
.. code-block:: python
26+
.. ipython:: python
2627
27-
import datafusion
28-
from datafusion import col
29-
import pyarrow
28+
from datafusion import SessionContext, col, functions as F
3029
31-
# create a context
32-
ctx = datafusion.SessionContext()
30+
ctx = SessionContext()
3331
34-
# create a RecordBatch and a new DataFrame from it
35-
batch = pyarrow.RecordBatch.from_arrays(
36-
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
37-
names=["a", "b"],
38-
)
39-
df = ctx.create_dataframe([[batch]])
32+
df = ctx.read_csv("pokemon.csv")
4033
41-
# create a new statement
4234
df = df.select(
43-
col("a") + col("b"),
44-
col("a") - col("b"),
35+
'"Name"',
36+
(col('"Attack"') - col('"Defense"')).alias("delta"),
37+
col('"Speed"')
4538
)
4639
47-
# execute and collect the first (and only) batch
48-
result = df.collect()[0]
40+
df.show()
4941
50-
The first statement group:
42+
Session Context
43+
---------------
44+
45+
The first statement group creates a :py:class:`~datafusion.context.SessionContext`.
5146

5247
.. code-block:: python
5348
5449
# create a context
5550
ctx = SessionContext()
5651
57-
creates a :py:class:`~datafusion.context.SessionContext`, that is, the main interface for executing queries with DataFusion. It maintains the state
58-
of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality:
52+
A Session Context is the main interface for executing queries with DataFusion. It maintains the state
53+
of the connection between a user and an instance of the DataFusion engine. Additionally, it provides
54+
the following functionality:
5955

60-
- Create a DataFrame from a CSV or Parquet data source.
61-
- Register a CSV or Parquet data source as a table that can be referenced from a SQL query.
62-
- Register a custom data source that can be referenced from a SQL query.
56+
- Create a DataFrame from a data source.
57+
- Register a data source as a table that can be referenced from a SQL query.
6358
- Execute a SQL query.
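
The three capabilities listed above can be sketched roughly as follows. This is a minimal
illustration, not part of the original guide: the file ``sample.csv``, its rows, and the table
name ``sample`` are made up so that the snippet runs without the Pokemon data set.

.. code-block:: python

    import csv
    import os
    import tempfile

    from datafusion import SessionContext

    # Write a tiny stand-in CSV file; these rows are invented for illustration.
    tmp_dir = tempfile.mkdtemp()
    csv_path = os.path.join(tmp_dir, "sample.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Attack", "Defense", "Speed"])
        writer.writerow(["Alpha", 50, 40, 60])
        writer.writerow(["Beta", 30, 70, 20])

    ctx = SessionContext()

    # 1. Create a DataFrame directly from a data source.
    df = ctx.read_csv(csv_path)

    # 2. Register the same source as a table that SQL can reference.
    ctx.register_csv("sample", csv_path)

    # 3. Execute a SQL query; the result is again a DataFrame.
    #    Mixed-case column names are double-quoted to preserve their case.
    sql_df = ctx.sql('SELECT "Name", "Attack" FROM sample WHERE "Attack" > 40')
    names = sql_df.to_pydict()["Name"]

Here ``names`` holds the ``Name`` values of the rows that satisfy the filter.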
6459

60+
DataFrame
61+
---------
62+
6563
The second statement group creates a :code:`DataFrame`:
6664

6765
.. code-block:: python
6866
69-
# create a RecordBatch and a new DataFrame from it
70-
batch = pyarrow.RecordBatch.from_arrays(
71-
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
72-
names=["a", "b"],
73-
)
74-
df = ctx.create_dataframe([[batch]])
67+
# Create a DataFrame from a file
68+
df = ctx.read_csv("pokemon.csv")
7569
7670
A DataFrame refers to a (logical) set of rows that share the same column names, similar to a `Pandas DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_.
7771
DataFrames are typically created by calling a method on :py:class:`~datafusion.context.SessionContext`, such as :code:`read_csv`, and can then be modified by
7872
calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
7973
and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.
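
As a rough sketch of how these transformation methods chain together, the snippet below builds
two small queries. The in-memory data, the lowercase column names, and the use of
``from_pydict`` to construct the DataFrame are assumptions made for this example only.

.. code-block:: python

    from datafusion import SessionContext, col, lit, functions as F

    ctx = SessionContext()

    # Illustrative in-memory data (not the Pokemon data set).
    df = ctx.from_pydict(
        {
            "name": ["a", "b", "c", "d"],
            "kind": ["x", "x", "y", "y"],
            "attack": [10, 30, 20, 50],
            "defense": [5, 10, 25, 20],
        }
    )

    # filter -> select -> limit: each call returns a new DataFrame.
    query = (
        df.filter(col("attack") > lit(15))
        .select(col("name"), (col("attack") - col("defense")).alias("delta"))
        .limit(2)
    )
    rows = query.to_pylist()

    # aggregate: group by "kind" and sum "attack" within each group.
    totals = df.aggregate(
        [col("kind")], [F.sum(col("attack")).alias("total")]
    ).to_pydict()
    total_by_kind = dict(zip(totals["kind"], totals["total"]))

Each method returns a new DataFrame, so the intermediate steps can also be assigned to
variables and reused in more than one query.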
8074

75+
Expressions
76+
-----------
77+
8178
The third statement uses :code:`Expressions` to build up a query definition.
8279

8380
.. code-block:: python
8481
8582
df = df.select(
86-
col("a") + col("b"),
87-
col("a") - col("b"),
83+
'"Name"',
84+
(col('"Attack"') - col('"Defense"')).alias("delta"),
85+
col('"Speed"')
8886
)
8987
90-
Finally the :py:func:`~datafusion.dataframe.DataFrame.collect` method converts the logical plan represented by the DataFrame into a physical plan and execute it,
91-
collecting all results into a list of `RecordBatch <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_.
88+
Finally, the :py:func:`~datafusion.dataframe.DataFrame.show` method converts the logical plan
89+
represented by the DataFrame into a physical plan and executes it, collecting all results and
90+
displaying them to the user. It is important to note that DataFusion performs lazy evaluation
91+
of the DataFrame. Until you call a method such as :py:func:`~datafusion.dataframe.DataFrame.show`
92+
or :py:func:`~datafusion.dataframe.DataFrame.collect`, DataFusion will not perform the query.
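
A minimal sketch of this lazy behavior, using a small in-memory DataFrame that is assumed here
purely for illustration:

.. code-block:: python

    from datafusion import SessionContext, col

    ctx = SessionContext()
    df = ctx.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Building the query only records a plan; no data is processed yet.
    plan = df.select((col("a") + col("b")).alias("total"))

    # Execution happens when a method such as collect() or show() is called.
    # collect() returns a list of pyarrow RecordBatch objects.
    batches = plan.collect()
    total = [value for batch in batches for value in batch.column(0).to_pylist()]

Use ``collect`` when the results are needed for further processing in Python, and ``show``
when you only want to display them.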
