Commit 584afd7

feat: add API reference documentation for DataFrame and index
1 parent 278a33e commit 584afd7

3 files changed

+403
-0
lines changed

docs/source/api/dataframe.rst

Lines changed: 374 additions & 0 deletions
@@ -0,0 +1,374 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

=================
DataFrame API
=================

Overview
--------

The ``DataFrame`` class is the core abstraction in DataFusion that represents tabular data and operations
on that data. DataFrames provide a flexible API for transforming data through various operations such as
filtering, projection, aggregation, joining, and more.

A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
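The lazy-evaluation idea can be illustrated without DataFusion at all. The sketch below is a toy stand-in, not DataFusion's implementation: ``LazyPipeline`` and ``is_adult`` are hypothetical names, and the "plan" is just a list of recorded steps. It shows that building the pipeline touches no rows, and that work happens only when ``collect()`` runs.

```python
from typing import Callable, Iterable, List, Optional


class LazyPipeline:
    """Toy lazily evaluated plan (hypothetical; not DataFusion's implementation)."""

    def __init__(self, rows: Iterable[dict], steps: Optional[List[Callable]] = None):
        self._rows = list(rows)
        self._steps = list(steps or [])

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyPipeline":
        # Record the step and return a new pipeline; no row is inspected yet.
        step = lambda rows: [r for r in rows if predicate(r)]
        return LazyPipeline(self._rows, self._steps + [step])

    def collect(self) -> List[dict]:
        # Terminal operation: run every recorded step in order.
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
        return rows


calls = []  # records every predicate invocation


def is_adult(row: dict) -> bool:
    calls.append(row["age"])
    return row["age"] > 25


pipeline = LazyPipeline([{"age": 20}, {"age": 30}, {"age": 40}]).filter(is_adult)
calls_before_collect = len(calls)  # nothing has been evaluated at this point
result = pipeline.collect()        # evaluation happens here
```

A real DataFrame behaves analogously at a much larger scale: transformations extend the logical plan, and only a terminal operation triggers execution.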

Creating DataFrames
-------------------

DataFrames can be created in several ways:

* From SQL queries via a ``SessionContext``:

  .. code-block:: python

      from datafusion import SessionContext

      ctx = SessionContext()
      df = ctx.sql("SELECT * FROM your_table")

* From registered tables:

  .. code-block:: python

      df = ctx.table("your_table")

* From various data sources:

  .. code-block:: python

      # From CSV files
      df = ctx.read_csv("path/to/data.csv")

      # From Parquet files
      df = ctx.read_parquet("path/to/data.parquet")

      # From JSON files
      df = ctx.read_json("path/to/data.json")

      # From Pandas DataFrame
      import pandas as pd
      pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
      df = ctx.from_pandas(pandas_df)

      # From Arrow data
      import pyarrow as pa
      batch = pa.RecordBatch.from_arrays(
          [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
          names=["a", "b"]
      )
      df = ctx.from_arrow(batch)

Common DataFrame Operations
---------------------------

DataFusion's DataFrame API offers a wide range of operations:

.. code-block:: python

    from datafusion import column, literal
    from datafusion import functions as f

    # Select specific columns
    df = df.select("col1", "col2")

    # Select with expressions
    df = df.select(column("a") + column("b"), column("a") - column("b"))

    # Filter rows
    df = df.filter(column("age") > literal(25))

    # Add computed columns (concat() joins string expressions)
    df = df.with_column(
        "full_name",
        f.concat(column("first_name"), literal(" "), column("last_name")),
    )

    # Multiple column additions
    df = df.with_columns(
        (column("a") + column("b")).alias("sum"),
        (column("a") * column("b")).alias("product")
    )

    # Sort data
    df = df.sort(column("age").sort(ascending=False))

    # Join DataFrames
    df = df1.join(df2, on="user_id", how="inner")

    # Aggregate data
    df = df.aggregate(
        [],  # Group by columns (empty for global aggregation)
        [f.sum(column("amount")).alias("total_amount")]
    )

    # Limit rows
    df = df.limit(100)

    # Drop columns
    df = df.drop("temporary_column")

Terminal Operations
-------------------

To materialize the results of your DataFrame operations:

.. code-block:: python

    # Collect all data as PyArrow RecordBatches
    result_batches = df.collect()

    # Convert to various formats
    pandas_df = df.to_pandas()         # Pandas DataFrame
    polars_df = df.to_polars()         # Polars DataFrame
    arrow_table = df.to_arrow_table()  # PyArrow Table
    py_dict = df.to_pydict()           # Python dictionary
    py_list = df.to_pylist()           # Python list of dictionaries

    # Display results
    df.show()                          # Print tabular format to console

    # Count rows
    count = df.count()

HTML Rendering in Jupyter
-------------------------

When working in Jupyter notebooks or other environments that support rich HTML display,
DataFusion DataFrames automatically render as formatted HTML tables. This functionality
is provided by the ``_repr_html_`` method, which Jupyter calls automatically.

Basic HTML Rendering
~~~~~~~~~~~~~~~~~~~~

In a Jupyter environment, simply displaying a DataFrame object will trigger HTML rendering:

.. code-block:: python

    # Will display as HTML table in Jupyter
    df

    # Explicit display also uses HTML rendering
    display(df)

HTML Rendering Customization
----------------------------

DataFusion provides extensive customization options for HTML table rendering through the
``datafusion.html_formatter`` module.

Configuring the HTML Formatter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can customize how DataFrames are rendered by configuring the formatter:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter

    configure_formatter(
        max_cell_length=30,                # Maximum length of cell content before truncation
        max_width=800,                     # Maximum width of table in pixels
        max_height=400,                    # Maximum height of table in pixels
        max_memory_bytes=2 * 1024 * 1024,  # Maximum memory used for rendering (2MB)
        min_rows_display=10,               # Minimum rows to display
        repr_rows=20,                      # Number of rows to display in representation
        enable_cell_expansion=True,        # Allow cells to be expandable on click
        custom_css=None,                   # Custom CSS to apply
        show_truncation_message=True,      # Show message when data is truncated
        style_provider=None,               # Custom style provider class
        use_shared_styles=True             # Share styles across tables to reduce duplication
    )

Custom Style Providers
~~~~~~~~~~~~~~~~~~~~~~

For advanced styling needs, you can create a custom style provider class:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter

    class CustomStyleProvider:
        def get_cell_style(self) -> str:
            return "background-color: #f5f5f5; color: #333; padding: 8px; border: 1px solid #ddd;"

        def get_header_style(self) -> str:
            return "background-color: #4285f4; color: white; font-weight: bold; padding: 10px;"

    # Apply custom styling
    configure_formatter(style_provider=CustomStyleProvider())

Custom Type Formatters
~~~~~~~~~~~~~~~~~~~~~~

You can register custom formatters for specific data types:

.. code-block:: python

    import datetime

    from datafusion.html_formatter import get_formatter

    formatter = get_formatter()

    # Format integers with color based on value
    def format_int(value):
        return f'<span style="color: {"red" if value > 100 else "blue"}">{value}</span>'

    formatter.register_formatter(int, format_int)

    # Format date values
    def format_date(value):
        return f'<span class="date-value">{value.isoformat()}</span>'

    formatter.register_formatter(datetime.date, format_date)
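Because type formatters are plain callables that return an HTML string, they can be checked on sample values before being registered. The sketch below restates the two formatters from this section in standalone form (no DataFusion import is needed) and calls them directly; the sample values are illustrative only.

```python
import datetime


def format_int(value: int) -> str:
    # Values above 100 are rendered red, everything else blue.
    color = "red" if value > 100 else "blue"
    return f'<span style="color: {color}">{value}</span>'


def format_date(value: datetime.date) -> str:
    # ISO 8601 output (YYYY-MM-DD) wrapped in a span for CSS targeting.
    return f'<span class="date-value">{value.isoformat()}</span>'


small = format_int(7)
large = format_int(250)
day = format_date(datetime.date(2024, 1, 31))
```

Note that ``datetime.date`` must be imported before it is passed to ``register_formatter``.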

Custom Cell Builders
~~~~~~~~~~~~~~~~~~~~

For complete control over cell rendering:

.. code-block:: python

    from datafusion.html_formatter import get_formatter

    formatter = get_formatter()

    def custom_cell_builder(value, row, col, table_id):
        try:
            num_value = float(value)
            if num_value > 0:  # Positive values get green
                return f'<td style="background-color: #d9f0d3">{value}</td>'
            if num_value < 0:  # Negative values get red
                return f'<td style="background-color: #f0d3d3">{value}</td>'
        except (ValueError, TypeError):
            pass

        # Default styling for non-numeric or zero values
        return f'<td style="border: 1px solid #ddd">{value}</td>'

    formatter.set_custom_cell_builder(custom_cell_builder)
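A cell builder returns raw HTML, so values containing markup characters may need escaping. The following is a minimal sketch of one way to do that with the standard library's ``html.escape``. It keeps the ``(value, row, col, table_id)`` signature shown above; ``escaped_cell_builder`` is a hypothetical name, and whether the formatter passes raw or pre-formatted values is an assumption to verify against your installed version.

```python
import html


def escaped_cell_builder(value, row, col, table_id):
    # Escape the displayed text so characters such as "<" cannot break the table markup.
    text = html.escape(str(value))

    try:
        num_value = float(value)
    except (ValueError, TypeError):
        num_value = None  # non-numeric content falls through to the default style

    if num_value is not None and num_value > 0:  # positive values get green
        return f'<td style="background-color: #d9f0d3">{text}</td>'
    if num_value is not None and num_value < 0:  # negative values get red
        return f'<td style="background-color: #f0d3d3">{text}</td>'

    # Default styling for non-numeric or zero values
    return f'<td style="border: 1px solid #ddd">{text}</td>'


positive = escaped_cell_builder(5, 0, 0, "t1")
negative = escaped_cell_builder(-1.5, 0, 1, "t1")
markup = escaped_cell_builder("<b>x</b>", 1, 0, "t1")
```

The function can be passed to ``set_custom_cell_builder`` in the same way as ``custom_cell_builder``.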

Custom Header Builders
~~~~~~~~~~~~~~~~~~~~~~

Similarly, you can customize the rendering of table headers:

.. code-block:: python

    from datafusion.html_formatter import get_formatter

    formatter = get_formatter()

    def custom_header_builder(field):
        tooltip = f"Type: {field.type}"
        return f'<th style="background-color: #333; color: white" title="{tooltip}">{field.name}</th>'

    formatter.set_custom_header_builder(custom_header_builder)
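Type names and column names can contain characters such as ``<`` or ``"`` that would break a ``title`` attribute. The sketch below shows one possible way to guard against that with ``html.escape``. ``Field`` is a hypothetical stand-in exposing only the ``name`` and ``type`` attributes used above, so the example runs without PyArrow or DataFusion.

```python
import html
from collections import namedtuple

# Hypothetical stand-in for the field object passed to a header builder;
# only the two attributes used in this section are modelled.
Field = namedtuple("Field", ["name", "type"])


def escaped_header_builder(field):
    # html.escape also escapes quotes by default, which keeps the title attribute intact.
    tooltip = html.escape(f"Type: {field.type}")
    name = html.escape(str(field.name))
    return f'<th style="background-color: #333; color: white" title="{tooltip}">{name}</th>'


header = escaped_header_builder(Field(name='a"b', type="list<item: int64>"))
```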

Managing Formatter State
------------------------

The HTML formatter maintains global state that can be managed:

.. code-block:: python

    from datafusion.html_formatter import reset_formatter, reset_styles_loaded_state, get_formatter

    # Reset the formatter to default settings
    reset_formatter()

    # Reset only the styles loaded state (useful when styles were loaded but need reloading)
    reset_styles_loaded_state()

    # Get the current formatter instance to make changes
    formatter = get_formatter()
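Since the formatter state is global, a temporary configuration is easy to leave behind by accident. One possible pattern, sketched here, is a small context manager that configures on entry and resets on exit. ``temporary_formatter`` is a hypothetical helper, not part of DataFusion; it takes the configure and reset callables as arguments, so the example below runs with stand-in functions.

```python
from contextlib import contextmanager


@contextmanager
def temporary_formatter(configure, reset, **options):
    """Apply formatter options for the duration of a ``with`` block (hypothetical helper)."""
    configure(**options)
    try:
        yield
    finally:
        # Runs even if rendering raises, so global state is not left modified.
        reset()


# Stand-in callables that record what happened, used here instead of the real formatter.
events = []


def fake_configure(**options):
    events.append(("configure", options))


def fake_reset():
    events.append(("reset", None))


with temporary_formatter(fake_configure, fake_reset, max_cell_length=50):
    events.append(("render", None))
```

In real use, ``configure_formatter`` and ``reset_formatter`` would be passed in place of the stand-ins. Note that the reset step restores the defaults, not whatever configuration was active before the block.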

Advanced Example: Dashboard-Style Formatting
--------------------------------------------

This example shows how to create a dashboard-like styling for your DataFrames:

.. code-block:: python

    from datafusion.html_formatter import configure_formatter, get_formatter

    # Define custom CSS
    custom_css = """
    .datafusion-table {
        font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
        border-collapse: collapse;
        width: 100%;
        box-shadow: 0 2px 3px rgba(0,0,0,0.1);
    }
    .datafusion-table th {
        position: sticky;
        top: 0;
        z-index: 10;
    }
    .datafusion-table tr:hover td {
        background-color: #f1f7fa !important;
    }
    .datafusion-table .numeric-positive {
        color: #0a7c00;
    }
    .datafusion-table .numeric-negative {
        color: #d13438;
    }
    """

    class DashboardStyleProvider:
        def get_cell_style(self) -> str:
            return "padding: 8px 12px; border-bottom: 1px solid #e0e0e0;"

        def get_header_style(self) -> str:
            return ("background-color: #0078d4; color: white; font-weight: 600; "
                    "padding: 12px; text-align: left; border-bottom: 2px solid #005a9e;")

    # Apply configuration
    configure_formatter(
        max_height=500,
        enable_cell_expansion=True,
        custom_css=custom_css,
        style_provider=DashboardStyleProvider(),
        max_cell_length=50
    )

    # Add custom formatters for numbers
    formatter = get_formatter()

    def format_number(value):
        try:
            num = float(value)
            cls = "numeric-positive" if num > 0 else "numeric-negative" if num < 0 else ""
            return f'<span class="{cls}">{value:,}</span>' if cls else f'{value:,}'
        except (ValueError, TypeError):
            return str(value)

    formatter.register_formatter(int, format_number)
    formatter.register_formatter(float, format_number)
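To see what ``format_number`` produces, it can be called directly on a few sample values. The sketch below copies the function unchanged so it runs on its own; the sample inputs are illustrative. It shows the thousands separator, the CSS class chosen by sign, and the plain-string fallback for values that are not numeric.

```python
def format_number(value):
    try:
        num = float(value)
        cls = "numeric-positive" if num > 0 else "numeric-negative" if num < 0 else ""
        return f'<span class="{cls}">{value:,}</span>' if cls else f'{value:,}'
    except (ValueError, TypeError):
        # Non-numeric input (or a value that rejects the "," format spec) is shown as-is.
        return str(value)


big = format_number(1234567)
negative = format_number(-42)
zero = format_number(0)
fraction = format_number(1234.5)
text = format_number("n/a")
```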

Best Practices
--------------

1. **Memory Management**: For large datasets, set ``max_memory_bytes`` to cap the memory used for rendering.

2. **Responsive Design**: Set reasonable ``max_width`` and ``max_height`` values to ensure tables display well on different screens.

3. **Style Optimization**: Use ``use_shared_styles=True`` to avoid duplicate style definitions when displaying multiple tables.

4. **Reset When Needed**: Call ``reset_formatter()`` when you want to start fresh with default settings.

5. **Cell Expansion**: Use ``enable_cell_expansion=True`` when cells might contain longer content that users may want to see in full.

Additional Resources
--------------------

* `DataFusion User Guide <../user-guide/dataframe.html>`_ - Complete guide to using DataFrames
* `API Reference <https://arrow.apache.org/datafusion-python/api/index.html>`_ - Full API reference
