docs/source/user-guide/configuration.rst
Lines changed: 96 additions & 1 deletion
@@ -46,6 +46,101 @@ a :py:class:`~datafusion.context.SessionConfig` and :py:class:`~datafusion.conte
     ctx = SessionContext(config, runtime)
     print(ctx)
 
+Maximizing CPU Usage
+--------------------
 
-You can read more about available :py:class:`~datafusion.context.SessionConfig` options in the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
+DataFusion uses partitions to parallelize work. For small queries the
+default configuration (number of CPU cores) is often sufficient, but to
+fully utilize available hardware you can tune how many partitions are
+created and when DataFusion will repartition data automatically.
+
+Configure a ``SessionContext`` with a higher partition count:
+
+.. code-block:: python
+
+    from datafusion import SessionConfig, SessionContext
+    Processed 10000000 rows using 10 partitions in 0.038s
+
+This example shows a roughly 2.8x speedup (0.107s with 1 partition vs 0.038s with
+10 partitions), illustrating how proper partitioning can significantly improve
+CPU utilization and query performance.
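The speedup quoted above follows directly from the two reported timings. A minimal sketch of the arithmetic is shown here; the "parallel efficiency" figure (speedup divided by partition count) is an added illustration and is not stated in the original text:

```python
# Timings reported by the example run, in seconds.
single_partition_s = 0.107
ten_partition_s = 0.038
partitions = 10

# Speedup: how many times faster the 10-partition run was.
speedup = single_partition_s / ten_partition_s

# Parallel efficiency: fraction of the ideal 10x speedup actually achieved.
# This metric is an illustrative addition, not part of the original benchmark.
efficiency = speedup / partitions

print(f"speedup: {speedup:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
```

With these timings the speedup comes out at about 2.8x. The gain is well short of 10x, which is expected for a run this short, where fixed overheads are a large share of the total time.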
+
+The script demonstrates several key optimization techniques:
+
+1. **Higher target partition count**: Uses :code:`with_target_partitions()` to set the number of concurrent partitions
+2. **Automatic repartitioning**: Enables repartitioning for joins, aggregations, and window functions
+3. **Manual repartitioning**: Uses :code:`repartition()` to ensure all partitions are utilized
+4. **CPU-intensive operations**: Performs aggregations that can benefit from parallelization
+
+The benchmark creates synthetic data and measures the time taken to perform a sum aggregation
+across the specified number of partitions. This helps you understand how partition configuration
+affects performance on your specific hardware.
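The shape of such a benchmark can be sketched without DataFusion at all. The standard-library example below is not the script from this change, and every function name in it is hypothetical. It splits synthetic rows into partitions, sums each partition independently, combines the partial sums, and times the run. It uses threads, so it shows the structure of a partitioned aggregation; it is not expected to reproduce DataFusion's speedups.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partitioned_sum(values, num_partitions):
    """Sum each partition independently, then combine the partial sums."""
    chunks = split_into_partitions(values, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)


def time_partitioned_sum(values, num_partitions):
    """Return (total, elapsed_seconds) for one partitioned sum."""
    start = time.perf_counter()
    total = partitioned_sum(values, num_partitions)
    return total, time.perf_counter() - start


if __name__ == "__main__":
    data = list(range(1_000_000))  # synthetic rows
    for n in (1, 10):
        total, elapsed = time_partitioned_sum(data, n)
        print(f"Processed {len(data)} rows using {n} partitions in {elapsed:.3f}s")
```

Running the same loop with different partition counts on your own machine gives a feel for how the measurement is taken. Timings will vary by hardware, and CPython threads will not show real parallel gains for this workload.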
+
+For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
 and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.