Conversation

Contributor
@adriangb adriangb commented Oct 26, 2025

Closes #206

Open to opinions, but I think there's little point in limiting message sizes in this application. Should we just go with usize::MAX as the recommendation?

I don't think we can set a default value like usize::MAX at the library level, because if you don't also configure your gRPC client you'd run into serious issues that surface seemingly at random (i.e. possibly not until you put it into prod).
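To illustrate the failure mode, here is a minimal sketch of the client-side configuration that would have to match the server (the address handling and the usize::MAX choice are illustrative, not the crate's actual setup code):

    use arrow_flight::flight_service_client::FlightServiceClient;
    use tonic::transport::Channel;

    async fn connect(
        addr: &'static str,
    ) -> Result<FlightServiceClient<Channel>, tonic::transport::Error> {
        let channel = Channel::from_static(addr).connect().await?;
        // tonic clients reject decoded messages larger than 4 MB by default,
        // so a server encoding huge Flight data would fail here unless the
        // client limit is raised to match.
        Ok(FlightServiceClient::new(channel)
            .max_decoding_message_size(usize::MAX)
            .max_encoding_message_size(usize::MAX))
    }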

Comment on lines 140 to 146
        encoder_builder = encoder_builder.with_max_flight_data_size(max_message_size);
    }

    let stream =
        encoder_builder
            .with_max_flight_data_size(usize::MAX)
            .build(stream.map_err(|err| {
Collaborator

🤔 I think there's something wrong here. See how on line 140 you set the user-provided max_message_size, only for it to be unconditionally replaced with usize::MAX on line 145.

Collaborator

What do you think about leaving this as:

        let stream = FlightDataEncoderBuilder::new()
            .with_schema(stream.schema().clone())
            .with_max_flight_data_size(self.max_message_size)
            .build(stream.map_err(|err| {
                FlightError::Tonic(Box::new(datafusion_error_to_tonic_status(&err)))
            }));

And instead of self.max_message_size being an Option<usize>, make it a reasonable non-optional usize default? Maybe 128 * 1024 * 1024? (2 * 1024 * 1024 would match the current FlightDataEncoderBuilder default, but I imagine we probably need something bigger.)
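For illustration, a sketch of the non-optional shape that suggestion implies (the constant, field layout, and builder method are assumptions, not the crate's actual API):

    /// Assumed default, per the 128 * 1024 * 1024 suggestion above.
    pub const DEFAULT_MAX_MESSAGE_SIZE: usize = 128 * 1024 * 1024;

    pub struct ArrowFlightEndpoint {
        /// Always a concrete limit instead of Option<usize>, so the encoder
        /// builder can apply it unconditionally.
        max_message_size: usize,
        // ... remaining fields elided ...
    }

    impl ArrowFlightEndpoint {
        /// Hypothetical override for callers who know their payload sizes.
        pub fn with_max_message_size(mut self, size: usize) -> Self {
            self.max_message_size = size;
            self
        }
    }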

Collaborator

If we choose to have a higher default, it would probably also make sense to add a

    impl ArrowFlightEndpoint {
        fn into_flight_server(self) -> FlightServiceServer<Self>
    }

Where we drive the message sizes ourselves.
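For illustration, a rough sketch of that idea, reusing the assumed DEFAULT_MAX_MESSAGE_SIZE constant from the sketch above and assuming ArrowFlightEndpoint implements the generated FlightService trait:

    use arrow_flight::flight_service_server::FlightServiceServer;

    impl ArrowFlightEndpoint {
        /// Wraps the endpoint in a tonic service with both message-size
        /// limits pre-applied, so callers can't accidentally ship the 4 MB
        /// decoding default.
        fn into_flight_server(self) -> FlightServiceServer<Self> {
            FlightServiceServer::new(self)
                .max_decoding_message_size(DEFAULT_MAX_MESSAGE_SIZE)
                .max_encoding_message_size(DEFAULT_MAX_MESSAGE_SIZE)
        }
    }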

What do you think?

Contributor Author

I thought about that; it feels to me like it pushes datafusion-distributed to be more of a "package" than a "library". It's also a footgun as long as you can still create your own incorrectly configured FlightServiceServer. So I'm split. If you want to go that way I'm happy to do so, but let me fix up the current thing first.

Collaborator

Let me do some quick benchmarks. If I see that it's pretty obvious that saner defaults improve performance, let's go for it.

Collaborator

==== Comparison with the previous benchmark from 2025-10-27 15:32:35 UTC ====
os:        macos
arch:      aarch64
cpu cores: 16
threads:   2 -> 2
workers:   8 -> 8
=============================================================================
 Query 1: prev=2672 ms, new=2658 ms, diff=1.01 faster ✔
 Query 2: prev= 359 ms, new= 383 ms, diff=1.07 slower ✖
 Query 3: prev=1067 ms, new=1088 ms, diff=1.02 slower ✖
 Query 4: prev=1126 ms, new=1010 ms, diff=1.11 faster ✔
 Query 5: prev=1587 ms, new=1553 ms, diff=1.02 faster ✔
 Query 6: prev=1127 ms, new=1200 ms, diff=1.06 slower ✖
 Query 7: prev=2717 ms, new=2532 ms, diff=1.07 faster ✔
 Query 8: prev=2486 ms, new=2495 ms, diff=1.00 slower ✖
 Query 9: prev=3106 ms, new=2963 ms, diff=1.05 faster ✔
Query 10: prev=1589 ms, new=1513 ms, diff=1.05 faster ✔
Query 11: prev= 585 ms, new= 491 ms, diff=1.19 faster ✔
Query 12: prev=1674 ms, new=1641 ms, diff=1.02 faster ✔
Query 13: prev=2273 ms, new=1927 ms, diff=1.18 faster ✔
Query 14: prev=1528 ms, new=1154 ms, diff=1.32 faster ✅
Query 15: prev=1323 ms, new=1176 ms, diff=1.12 faster ✔
Query 16: prev= 258 ms, new= 256 ms, diff=1.01 faster ✔
Query 17: prev=4174 ms, new=3533 ms, diff=1.18 faster ✔
Query 18: prev=4411 ms, new=4304 ms, diff=1.02 faster ✔
Query 19: prev=2539 ms, new=2760 ms, diff=1.09 slower ✖
Query 20: prev=1558 ms, new=2006 ms, diff=1.29 slower ❌
Query 21: prev=4545 ms, new=4137 ms, diff=1.10 faster ✔
Query 22: prev= 385 ms, new= 457 ms, diff=1.19 slower ✖

It has always been a bit flaky to run these benchmarks on our own laptops, so there might be a fair amount of noise here. However, I do see far more green checks than red ones, so I'm inclined to go for higher defaults.

    -        .add_service(FlightServiceServer::new(endpoint))
    +        .add_service(
    +            FlightServiceServer::new(endpoint)
    +                .max_decoding_message_size(MAX_MESSAGE_SIZE)
Collaborator

🤔 I see that this value is 4 MB by default, which seems fine for your normal day-to-day CRUD API, but is a fairly low default for a distributed query engine.

I'd advocate for taking this PR one step further and actually providing more reasonable defaults, in addition to letting people configure them.
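As a usage sketch, this is how both limits get raised when wiring the service into a tonic server (the address and the MAX_MESSAGE_SIZE value are placeholders, not the crate's actual constants):

    use arrow_flight::flight_service_server::FlightServiceServer;
    use tonic::transport::Server;

    // Placeholder value for illustration.
    const MAX_MESSAGE_SIZE: usize = 128 * 1024 * 1024;

    async fn serve(endpoint: ArrowFlightEndpoint) -> Result<(), tonic::transport::Error> {
        let addr = "127.0.0.1:50051".parse().expect("valid socket address");
        Server::builder()
            .add_service(
                FlightServiceServer::new(endpoint)
                    // Lift both limits well past tonic's 4 MB decoding default.
                    .max_decoding_message_size(MAX_MESSAGE_SIZE)
                    .max_encoding_message_size(MAX_MESSAGE_SIZE),
            )
            .serve(addr)
            .await
    }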

Contributor Author

I do think that if we have a default, it should be very high, if not unlimited. I'll let you make a decision based on #207 (comment), and then I'm happy to implement either way.

Collaborator

I'd say let's go for it

Contributor Author

let me know what you think of b653b64

Collaborator
@gabotechs gabotechs left a comment

💯 thanks for all the extensive docs and comments!

@gabotechs gabotechs merged commit 6f69516 into datafusion-contrib:main Oct 27, 2025
4 checks passed
@adriangb adriangb deleted the message-sizes branch October 27, 2025 18:34


Linked issue: Consider increasing IPC message size