Commit a9ad2a9

docs: add configuration tips for maximizing CPU usage and new benchmark script
1 parent d6d6ea6 commit a9ad2a9

3 files changed: +128 −0 lines changed


README.md (4 additions, 0 deletions)

@@ -42,6 +42,10 @@ DataFusion's Python bindings can be used as a foundation for building new data s
 - Serialize and deserialize query plans in Substrait format.
 - Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.
 
+For tips on tuning parallelism, see
+[Maximizing CPU Usage](docs/source/user-guide/configuration.rst#maximizing-cpu-usage)
+in the configuration guide.
+
 ## Example Usage
 
 The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results

benchmarks/max_cpu_usage.py (new file, 76 additions, 0 deletions)

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""Benchmark script showing how to maximize CPU usage."""

from __future__ import annotations

import argparse
import multiprocessing
import time

import pyarrow as pa
from datafusion import SessionConfig, SessionContext, col
from datafusion import functions as f


def main(num_rows: int, partitions: int) -> None:
    """Run a simple aggregation after repartitioning."""
    # Create some example data
    array = pa.array(range(num_rows))
    batch = pa.record_batch([array], names=["a"])

    # Configure the session to use a higher target partition count and
    # enable automatic repartitioning.
    config = (
        SessionConfig()
        .with_target_partitions(partitions)
        .with_repartition_joins(enabled=True)
        .with_repartition_aggregations(enabled=True)
        .with_repartition_windows(enabled=True)
    )
    ctx = SessionContext(config)

    # Register the input data and repartition manually to ensure that all
    # partitions are used.
    df = ctx.create_dataframe([[batch]]).repartition(partitions)

    start = time.time()
    df = df.aggregate([], [f.sum(col("a"))])
    df.collect()
    end = time.time()

    print(
        f"Processed {num_rows} rows using {partitions} partitions in {end - start:.3f}s"
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--rows",
        type=int,
        default=1_000_000,
        help="Number of rows in the generated dataset",
    )
    parser.add_argument(
        "--partitions",
        type=int,
        default=multiprocessing.cpu_count(),
        help="Target number of partitions to use",
    )
    args = parser.parse_args()
    main(args.rows, args.partitions)
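The script's `--partitions` flag feeds `DataFrame.repartition`, which spreads rows evenly across partitions so every core has work. The idea can be sketched with the standard library alone; this is a conceptual illustration (the `round_robin_partition` helper is hypothetical, not DataFusion's actual implementation):

```python
# Conceptual sketch of even (round-robin) repartitioning: spread a
# stream of rows across N partitions so no core sits idle.
# Stdlib illustration only; DataFusion partitions Arrow record batches
# internally, not Python lists.


def round_robin_partition(rows: list[int], num_partitions: int) -> list[list[int]]:
    """Distribute rows across partitions in round-robin order."""
    partitions: list[list[int]] = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


parts = round_robin_partition(list(range(10)), 4)
# Partition sizes differ by at most one row.
print([len(p) for p in parts])  # [3, 3, 2, 2]
```

With sizes balanced to within one row, the aggregation above finishes roughly when the largest partition does, which is why raising the partition count toward the core count improves wall-clock time.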

docs/source/user-guide/configuration.rst (48 additions, 0 deletions)

@@ -46,6 +46,54 @@ a :py:class:`~datafusion.context.SessionConfig` and :py:class:`~datafusion.conte
     ctx = SessionContext(config, runtime)
     print(ctx)
 
+Maximizing CPU Usage
+--------------------
+
+DataFusion uses partitions to parallelize work. For small queries the
+default configuration (number of CPU cores) is often sufficient, but to
+fully utilize available hardware you can tune how many partitions are
+created and when DataFusion will repartition data automatically.
+
+Configure a ``SessionContext`` with a higher partition count:
+
+.. code-block:: python
+
+    from datafusion import SessionConfig, SessionContext
+
+    # allow up to 16 concurrent partitions
+    config = SessionConfig().with_target_partitions(16)
+    ctx = SessionContext(config)
+
+Automatic repartitioning for joins, aggregations, window functions, and
+other operations can be enabled to increase parallelism:
+
+.. code-block:: python
+
+    config = (
+        SessionConfig()
+        .with_target_partitions(16)
+        .with_repartition_joins(True)
+        .with_repartition_aggregations(True)
+        .with_repartition_windows(True)
+    )
+
+Manual repartitioning is available on DataFrames when you need precise
+control:
+
+.. code-block:: python
+
+    from datafusion import col
+
+    df = ctx.read_parquet("data.parquet")
+
+    # Evenly divide into 16 partitions
+    df = df.repartition(16)
+
+    # Or partition by the hash of a column
+    df = df.repartition_by_hash(col("a"), num=16)
+
+    result = df.collect()
+
 
 You can read more about available :py:class:`~datafusion.context.SessionConfig` options in the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
 and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.
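The guide's `repartition_by_hash` routes each row by the hash of a key column, so equal keys always land in the same partition, which is what lets joins and aggregations run partition-locally. That semantics can be sketched with the standard library; this is a conceptual illustration only (the `hash_partition` helper is hypothetical, and DataFusion uses its own hash function, not Python's `hash`):

```python
# Conceptual sketch of hash partitioning: rows with equal keys always
# map to the same partition index, so per-key work never crosses
# partition boundaries. Stdlib illustration, not DataFusion internals.


def hash_partition(keys: list[str], num_partitions: int) -> list[list[str]]:
    """Assign each key to a partition based on its hash."""
    partitions: list[list[str]] = [[] for _ in range(num_partitions)]
    for key in keys:
        # Python seeds str hashing per process; indices are stable
        # within one run, which is all a single query needs.
        partitions[hash(key) % num_partitions].append(key)
    return partitions


parts = hash_partition(["a", "b", "a", "c", "b", "a"], 4)
# Every occurrence of a given key ends up in exactly one partition.
for key in {"a", "b", "c"}:
    holders = [i for i, p in enumerate(parts) if key in p]
    assert len(holders) == 1
```

Unlike the even split from `repartition(16)`, hash partitioning can produce skewed partition sizes when key frequencies are skewed; the trade-off is that grouped operations need no further shuffling.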
