@gabotechs
Collaborator

The original DataFusion repo can run TPCH benchmarks with the TPCH dataset pre-loaded in memory. This is useful because it takes disk I/O out of the picture during benchmarks.

Unfortunately, this project cannot serialize/deserialize in-memory nodes, as that would mean sending all of the data loaded in memory over the wire.

This PR introduces a new node that works around this by storing the data in a global CACHE variable rather than in the node itself. That way, the node can be serialized and sent over the wire without embedding the in-memory data. Upon receiving a request, the node loads the data from the global cache, which works whether the node was deserialized or built natively.
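As a rough illustration of the idea, here is a minimal standard-library-only sketch. It is not the code from this PR: `Rows`, `preload`, `CachedScanNode`, `to_bytes`, and `from_bytes` are hypothetical names, and a plain `Vec<i64>` stands in for the real in-memory data. The node carries only a cache key, so its serialized form stays small, and execution resolves the key against a process-wide cache.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Stand-in for the real in-memory data (assumption: the real node holds
/// something richer, such as record batches).
type Rows = Vec<i64>;

/// Process-wide cache keyed by table name. Hypothetical layout; the actual
/// CACHE variable in the PR may look different.
static CACHE: OnceLock<Mutex<HashMap<String, Arc<Rows>>>> = OnceLock::new();

fn cache() -> &'static Mutex<HashMap<String, Arc<Rows>>> {
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Warm-up step: load a table into the global cache before running queries.
fn preload(table: &str, rows: Rows) {
    cache()
        .lock()
        .unwrap()
        .insert(table.to_string(), Arc::new(rows));
}

/// A node that stores only the cache key, never the data itself.
#[derive(Debug, Clone, PartialEq)]
struct CachedScanNode {
    table: String,
}

impl CachedScanNode {
    /// Serialize the node: only the key goes over the wire.
    fn to_bytes(&self) -> Vec<u8> {
        self.table.as_bytes().to_vec()
    }

    /// Rebuild the node on the receiving side from the key alone.
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        String::from_utf8(bytes.to_vec())
            .ok()
            .map(|table| Self { table })
    }

    /// Resolve the data from the global cache at execution time.
    /// Returns None if this process never preloaded the table.
    fn execute(&self) -> Option<Arc<Rows>> {
        cache().lock().unwrap().get(&self.table).cloned()
    }
}
```

In this sketch the cache is local to each process, so a deserialized node only finds data if the receiving process has run its own warm-up (`preload`) for the same key.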

An in-memory benchmark can be executed with the -m flag:

cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --path benchmarks/data/tpch_sf1

NGA-TRAN previously approved these changes Sep 9, 2025
Collaborator

@NGA-TRAN NGA-TRAN left a comment

The caching and warming up are good. I have a few questions to understand when things happen.

Collaborator

@NGA-TRAN NGA-TRAN left a comment

LGTM

@gabotechs gabotechs merged commit 7d2488f into main Sep 11, 2025
3 of 4 checks passed
@gabotechs gabotechs deleted the gabrielmusat/in-memory-tpch branch September 11, 2025 14:47

3 participants