Add benchmark script and documentation for maximizing CPU usage in DataFusion Python (#1216)
* docs: add configuration tips for maximizing CPU usage and new benchmark script
* docs: enhance benchmark example for maximizing CPU usage in DataFusion
* docs: enhance benchmark script and configuration guide for maximizing CPU usage
docs/source/user-guide/configuration.rst (136 additions & 1 deletion)
@@ -46,6 +46,141 @@ a :py:class:`~datafusion.context.SessionConfig` and :py:class:`~datafusion.conte
ctx = SessionContext(config, runtime)
print(ctx)

+Maximizing CPU Usage
+--------------------

-You can read more about available :py:class:`~datafusion.context.SessionConfig` options in the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
+DataFusion uses partitions to parallelize work. For small queries the
+default configuration (number of CPU cores) is often sufficient, but to
+fully utilize available hardware you can tune how many partitions are
+created and when DataFusion will repartition data automatically.
+
+Configure a ``SessionContext`` with a higher partition count:
+
+.. code-block:: python
+
+    from datafusion import SessionConfig, SessionContext
+- **CPU architecture**: Different processors have varying parallel processing capabilities
+- **Available memory**: Limited RAM may require different optimization strategies
+- **System load**: Other applications competing for resources affect DataFusion performance
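
Because these factors differ from machine to machine, it can help to record them next to any benchmark numbers. The following is a minimal sketch using only the Python standard library; the function name ``describe_environment`` is hypothetical and not part of DataFusion. Available memory is not captured here, since the standard library offers no portable call for it.

```python
import os
import platform


def describe_environment() -> dict:
    """Collect basic facts about the machine a benchmark runs on."""
    info = {
        "machine": platform.machine(),    # CPU architecture string
        "logical_cpus": os.cpu_count(),   # may be None if it cannot be determined
        "python": platform.python_version(),
    }
    # os.getloadavg() exists only on Unix-like systems, so guard the call.
    if hasattr(os, "getloadavg"):
        try:
            info["load_avg_1m"] = os.getloadavg()[0]
        except OSError:
            info["load_avg_1m"] = None
    return info


print(describe_environment())
```

A high load average before the benchmark starts suggests that other processes are competing for CPU, so timings from that run may not be representative.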
+
+**Recommendations for Production Use:**
+
+To optimize DataFusion for your specific use case, it is strongly recommended to:
+
+1. **Create custom benchmarks** using your actual data sources, formats, and query patterns
+2. **Test with representative data volumes** that match your production workloads
+3. **Measure end-to-end performance** including data loading, processing, and result handling
+4. **Evaluate different configuration combinations** for your specific hardware and workload
+5. **Monitor resource utilization** (CPU, memory, I/O) to identify bottlenecks in your environment
+
+This approach will provide more accurate insights into how DataFusion configuration options
+will impact your particular applications and infrastructure.
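
The recommendations above could be sketched as a small harness like the one below. This is an illustration under stated assumptions, not the benchmark script added in this commit: ``measure``, ``compare_partition_counts`` and ``make_datafusion_run`` are hypothetical helper names, and the Parquet path, table name ``t`` and query are placeholders you would replace with your own. The timing helpers use only the standard library; the DataFusion import happens inside ``make_datafusion_run`` so the harness can be reused with any callable.

```python
import time
from typing import Callable, Dict, Iterable


def measure(run: Callable[[], object], repeats: int = 3) -> Dict[str, float]:
    """Time `run` several times and keep the fastest repetition.

    time.process_time() sums CPU time over all threads of this process,
    so cpu_s / wall_s above 1.0 suggests more than one core was busy.
    """
    best = None
    for _ in range(repeats):
        wall_start, cpu_start = time.perf_counter(), time.process_time()
        run()
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        if best is None or wall < best["wall_s"]:
            best = {"wall_s": wall, "cpu_s": cpu}
    return best


def compare_partition_counts(
    make_run: Callable[[int], Callable[[], object]],
    partition_counts: Iterable[int],
) -> Dict[int, Dict[str, float]]:
    """Measure one workload for each candidate partition count."""
    return {n: measure(make_run(n)) for n in partition_counts}


def make_datafusion_run(partitions: int, parquet_path: str, query: str):
    """Build an end-to-end workload: create context, register data, run query.

    Assumption: `parquet_path` and `query` are placeholders for your own
    data and query; the table is registered under the name "t".
    """
    from datafusion import SessionConfig, SessionContext

    def run():
        config = SessionConfig().with_target_partitions(partitions)
        ctx = SessionContext(config)
        ctx.register_parquet("t", parquet_path)
        return ctx.sql(query).collect()

    return run
```

A call might look like ``compare_partition_counts(lambda n: make_datafusion_run(n, "data.parquet", "SELECT COUNT(*) FROM t"), [1, 4, 8])``, where the file name and query are again placeholders. Comparing ``cpu_s`` with ``wall_s`` for each partition count gives a rough indication of whether additional partitions actually kept more cores busy.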
+
+For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.