Commit c6d2da1

revert branch UNPICK
1 parent ba11206 commit c6d2da1

3 files changed: +1 -176 lines changed

README.md

Lines changed: 0 additions & 4 deletions
@@ -42,10 +42,6 @@ DataFusion's Python bindings can be used as a foundation for building new data s
 - Serialize and deserialize query plans in Substrait format.
 - Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF.
 
-For tips on tuning parallelism, see
-[Maximizing CPU Usage](docs/source/user-guide/configuration.rst#maximizing-cpu-usage)
-in the configuration guide.
-
 ## Example Usage
 
 The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results

benchmarks/max_cpu_usage.py

Lines changed: 0 additions & 76 deletions
This file was deleted.

docs/source/user-guide/configuration.rst

Lines changed: 1 addition & 96 deletions
@@ -46,101 +46,6 @@ a :py:class:`~datafusion.context.SessionConfig` and :py:class:`~datafusion.conte
     ctx = SessionContext(config, runtime)
     print(ctx)
 
-Maximizing CPU Usage
---------------------
 
-DataFusion uses partitions to parallelize work. For small queries the
-default configuration (number of CPU cores) is often sufficient, but to
-fully utilize available hardware you can tune how many partitions are
-created and when DataFusion will repartition data automatically.
-
-Configure a ``SessionContext`` with a higher partition count:
-
-.. code-block:: python
-
-    from datafusion import SessionConfig, SessionContext
-
-    # allow up to 16 concurrent partitions
-    config = SessionConfig().with_target_partitions(16)
-    ctx = SessionContext(config)
-
-Automatic repartitioning for joins, aggregations, window functions and
-other operations can be enabled to increase parallelism:
-
-.. code-block:: python
-
-    config = (
-        SessionConfig()
-        .with_target_partitions(16)
-        .with_repartition_joins(True)
-        .with_repartition_aggregations(True)
-        .with_repartition_windows(True)
-    )
-
-Manual repartitioning is available on DataFrames when you need precise
-control:
-
-.. code-block:: python
-
-    from datafusion import col
-
-    df = ctx.read_parquet("data.parquet")
-
-    # Evenly divide into 16 partitions
-    df = df.repartition(16)
-
-    # Or partition by the hash of a column
-    df = df.repartition_by_hash(col("a"), num=16)
-
-    result = df.collect()
-
-
-Benchmark Example
-^^^^^^^^^^^^^^^^^
-
-The repository includes a benchmark script that demonstrates how to maximize CPU usage
-with DataFusion. The :code:`benchmarks/max_cpu_usage.py` script shows a practical example
-of configuring DataFusion for optimal parallelism.
-
-You can run the benchmark script to see the impact of different configuration settings:
-
-.. code-block:: bash
-
-    # Run with default settings (uses all CPU cores)
-    python benchmarks/max_cpu_usage.py
-
-    # Run with specific number of rows and partitions
-    python benchmarks/max_cpu_usage.py --rows 5000000 --partitions 16
-
-    # See all available options
-    python benchmarks/max_cpu_usage.py --help
-
-Here's an example showing the performance difference between single and multiple partitions:
-
-.. code-block:: bash
-
-    # Single partition - slower processing
-    $ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 1
-    Processed 10000000 rows using 1 partitions in 0.107s
-
-    # Multiple partitions - faster processing
-    $ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 10
-    Processed 10000000 rows using 10 partitions in 0.038s
-
-This example demonstrates nearly 3x performance improvement (0.107s vs 0.038s) when using
-10 partitions instead of 1, showcasing how proper partitioning can significantly improve
-CPU utilization and query performance.
-
-The script demonstrates several key optimization techniques:
-
-1. **Higher target partition count**: Uses :code:`with_target_partitions()` to set the number of concurrent partitions
-2. **Automatic repartitioning**: Enables repartitioning for joins, aggregations, and window functions
-3. **Manual repartitioning**: Uses :code:`repartition()` to ensure all partitions are utilized
-4. **CPU-intensive operations**: Performs aggregations that can benefit from parallelization
-
-The benchmark creates synthetic data and measures the time taken to perform a sum aggregation
-across the specified number of partitions. This helps you understand how partition configuration
-affects performance on your specific hardware.
-
-For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
+You can read more about available :py:class:`~datafusion.context.SessionConfig` options in the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
 and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.
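
Note on the removed benchmark text: it quotes 0.107s for 1 partition and 0.038s for 10 partitions and describes the difference as "nearly 3x". As a rough check of that figure, the sketch below recomputes the speedup and the per-partition efficiency from those two timings. The helper names ``speedup`` and ``parallel_efficiency`` are hypothetical and are not part of DataFusion or of the deleted script. The timings appear to come from a single run, so treat the result as indicative only.

```python
def speedup(baseline_s: float, parallel_s: float) -> float:
    """Return how many times faster the parallel run is than the baseline."""
    if baseline_s <= 0 or parallel_s <= 0:
        raise ValueError("timings must be positive")
    return baseline_s / parallel_s


def parallel_efficiency(baseline_s: float, parallel_s: float, partitions: int) -> float:
    """Return the speedup divided by the partition count (1.0 means linear scaling)."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return speedup(baseline_s, parallel_s) / partitions


# Timings quoted in the removed documentation (10,000,000 rows).
SINGLE_PARTITION_S = 0.107
TEN_PARTITIONS_S = 0.038

if __name__ == "__main__":
    s = speedup(SINGLE_PARTITION_S, TEN_PARTITIONS_S)
    e = parallel_efficiency(SINGLE_PARTITION_S, TEN_PARTITIONS_S, 10)
    print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")
```

With the quoted timings this gives a speedup of about 2.82x, which is consistent with "nearly 3x". It also shows scaling is well short of linear at 10 partitions (roughly 28% efficiency), which is plausible for a run that completes in a fraction of a second, where fixed overhead is a large share of the total.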
