
Conversation


@sebbegg sebbegg commented Dec 18, 2025

Which issue does this PR close?

Closes #1349

Rationale for this change

See #1349. Having a proxy on the scheduler makes it easier to e.g. expose this in docker-compose or kubernetes.

Making this a draft first, let me know what you think.

What changes are included in this PR?

This adds a "flight proxy" service to the scheduler that's optionally started when advertise_flight_sql_endpoint is set.
It implements only do_get and simply proxies the requests to the actual executor.
I reasoned that having this as a separate service (instead of another method in the scheduler gRPC) makes this more flexible, since the code on the client side remains almost unchanged, except for the logic to pick the scheduler or executor host as the endpoint.

Are there any user-facing changes?

Kind of: using advertise_flight_sql_endpoint now actually has an effect.

@milenkovicm
Contributor

Thanks @sebbegg, I'm a bit stuck time-wise; I'll try to have a look tomorrow, and if not I'll follow up over the holiday period.

@milenkovicm
Contributor

Perhaps @martin-g may be able to help with a quick review, I'd be very thankful.

Member

@martin-g martin-g left a comment


There are no new tests

))
})?;
let flight_client = FlightServiceClient::new(connection)
.max_decoding_message_size(16 * 1024 * 1024)
Member


Can/should we use config.grpc_server_max_encoding_message_size or a new setting?
Same for min below.

Author


Certainly.

No idea on whether this should be a new setting 🤔
I'd guess starting with the current one might be fine and if there should be need for a dedicated setting one could revisit?

info!("Built-in arrow flight server proxy listening on: {address:?} max_encoding_size: {max_encoding_message_size} max_decoding_size: {max_decoding_message_size}");

let grpc_server_config = GrpcServerConfig::default();
let server_future = create_grpc_server(&grpc_server_config)
Member


There is no authentication layer.
But there is no authentication for the main service either, so this is not required at the moment.

Contributor

@milenkovicm milenkovicm left a comment


It's a good start, I just think we could remove the additional request that checks for the proxy endpoint.

let duration = Duration::from_millis(duration);

info!("Job {job_id} finished executing in {duration:?} ");
let FlightEndpointInfo {
Contributor


I don't think we should do a round trip to fetch endpoint info. Could we add an optional response parameter in message SuccessfulJob, and if it is present, do a proxy request?

Author


Changing SuccessfulJob would require a few changes in more modules… not sure it's the right place?
An alternative would be to add it as a second field to GetJobStatusResult?
That would limit the impact to the scheduler gRPC server only.

Contributor


It gives you back partition locations, which indicate where the data is. So by adding an optional proxy parameter you could ask it "give me the data from partition_location"; otherwise you just fetch it as you do at the moment (if no proxy is provided).

Author

@sebbegg sebbegg Dec 19, 2025


I get that, my worry was about where to fill in the information.
The SuccessfulJob and PartitionLocation objects are all created in execution_graph.rs, it seems. It appears weird to forward the proxy information all the way into the execution graph just to be able to fill a new field like SuccessfulJob.flight_endpoint.

The alternative could be to clone & update the SuccessfulJob in the grpc endpoint:

Ok(status) => Ok(Response::new(GetJobStatusResult { status })),

pseudo:

fn get_job_status(job_id) {
    let mut job_status = task_manager.get_job_status(job_id);
    if job_status.status is SuccessfulJob {
        job_status.status.flight_endpoint = self.state.config.advertise_flight_sql_endpoint;
    }
    job_status
}

@milenkovicm milenkovicm changed the title Add arrow flight proxy feat: Add arrow flight proxy to scheduler Jan 3, 2026
@milenkovicm
Contributor

Hey @sebbegg, is there anything I can do to help you with this PR?

@sebbegg
Author

sebbegg commented Jan 6, 2026 via email

# Conflicts:
#	ballista/core/src/execution_plans/distributed_query.rs
#	ballista/scheduler/src/lib.rs
#	ballista/scheduler/src/scheduler_process.rs
#	ballista/scheduler/src/scheduler_server/grpc.rs
@sebbegg
Author

sebbegg commented Jan 8, 2026

@milenkovicm Feel free to have another look - made some updates:

  • The GetJobStatusResult now has the flight_endpoint - so there's no extra request involved to fetch this information.
  • As @martin-g suggested, the proxy now checks that the requested host:port belongs to an active executor
  • Used tokio::select! to exit in case the flight proxy panics

Unfortunately this is still missing tests.
A proper test would probably be some sort of integration test with scheduler and at least one executor.
I looked through the test-utils, but I'm not sure there's something that could be used for that...

@milenkovicm
Contributor

thanks @sebbegg will have a look today/tomorrow

@milenkovicm milenkovicm marked this pull request as ready for review January 10, 2026 17:20
Contributor

@milenkovicm milenkovicm left a comment


This looks good to me.

The main problem I have with this approach is that the scheduler may be overloaded with data transport, which could affect scheduling.

But I also find this approach valuable, as Ballista can open a single port towards the clients. It does make sense to me that the "proxy" can be on a different address/port.

  • if the proxy is not configured, it should not listen for connections.
  • If the proxy is configured without a specific ip/port, I'd suggest binding it to the same port as the scheduler, as I believe it would be a sensible default and would simplify deployment.
  • If the proxy is configured with a specific ip/port, we could treat it as an external process.

what do you think @sebbegg ?

Also, it would be great if we could add a test or two.


message GetJobStatusResult {
JobStatus status = 1;
optional string flight_endpoint = 2;
Contributor


Would it make sense to name this "flight proxy" or similar?

let GetJobStatusResult { status } = scheduler
let GetJobStatusResult {
status,
flight_endpoint,
Contributor


Would it make sense to support Some("") in which case the client should use the scheduler address and port? This way the scheduler would not really care about its public port.

Contributor


Not sure whether we should use Some("") or a proto enum to represent the proxy cases.

match config.advertise_flight_sql_endpoint {
Some(_) => {
info!("Starting flight proxy");
let flight_proxy = start_flight_proxy_server(config, scheduler.state.clone());
Contributor


Would it make sense to run the proxy as a service on the same port as the scheduler service? It would simplify configuration.

Contributor

@milenkovicm milenkovicm left a comment


thanks @sebbegg,

just to clarify, we can have three configuration options:

  • proxy not configured: the client needs to fetch data from the executors
  • proxy configured, no ip address or port provided: the scheduler needs to start the proxy on the same port (within the process)
  • proxy configured, ip/port provided: the scheduler considers this an external process running the proxy; it just needs to put that value in the response and will not start the proxy itself. The client needs to use that ip/port combination to connect.
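The three options above could be sketched on the scheduler side roughly like this std-only snippet. The `ProxyMode` enum and `proxy_mode` helper are illustrative names, not actual Ballista code; the sketch only assumes the convention discussed here (unset = no proxy, empty string = in-process, non-empty = external address):

```rust
/// Hypothetical interpretation of the advertise_flight_sql_endpoint setting;
/// names are illustrative, not actual Ballista code.
#[derive(Debug, PartialEq)]
enum ProxyMode {
    /// Proxy not configured: clients fetch data directly from executors.
    None,
    /// Configured without an address: the scheduler starts the proxy in-process.
    InProcess,
    /// Configured with an address: an external process runs the proxy; the
    /// scheduler only advertises the address and does not start anything.
    External(String),
}

fn proxy_mode(advertise_flight_sql_endpoint: Option<&str>) -> ProxyMode {
    match advertise_flight_sql_endpoint {
        None => ProxyMode::None,
        Some("") => ProxyMode::InProcess,
        Some(addr) => ProxyMode::External(addr.to_string()),
    }
}

fn main() {
    assert_eq!(proxy_mode(None), ProxyMode::None);
    assert_eq!(proxy_mode(Some("")), ProxyMode::InProcess);
    assert_eq!(
        proxy_mode(Some("localhost:50040")),
        ProxyMode::External("localhost:50040".to_string())
    );
}
```

Modeling this as an enum early keeps the "is the string empty?" check out of the rest of the scheduler and client code.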

config.advertise_flight_sql_endpoint
);
match config.advertise_flight_sql_endpoint.clone() {
Some(s) if s != "" => {
Contributor


Sorry, I might have been unclear. If we specify a different port (or ip and port), that would mean there is an external process running the proxy, not the scheduler process.

So we have three configuration options:

  • no proxy
  • in-process (no need to specify ip/port; the client should use the scheduler's ip and port)
  • external process (ip/port specified); the client should use the given ip/port

Author


That’s what happens though?
This just puts the logic to use the scheduler host:port in the scheduler rather than the client.

Contributor


Yes, but you need to specify the advertising address and configure it correctly, which may be tricky in docker containers.

The suggestion would eliminate that, as the client already knows the scheduler address.


message GetJobStatusResult {
JobStatus status = 1;
optional string flight_proxy = 2;
Contributor


Can we make this a oneof to represent the proxy statuses:

  • no proxy
  • in-process (no need to specify ip/port; the client should use the scheduler's ip and port)
  • external process (ip/port specified); the client should use the given ip/port

That would remove the check for an empty string on the client side.

Author


E.g. like

oneof flight_proxy {
    bool no_proxy = 2;
    bool in_scheduler = 3;
    string external_address = 4;
}

?

Contributor


Yes, something like that. Perhaps:

oneof flight_proxy {
    bool local = 1; 
    string external = 4;
}

something like that
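For illustration, a client interpreting such a oneof might pick the flight endpoint like the sketch below. The `FlightProxy` enum stands in for what prost would generate from the suggested message; all names here are hypothetical, not the actual generated Ballista code:

```rust
/// Hypothetical mirror of the prost-generated oneof from the suggestion above.
#[derive(Debug)]
enum FlightProxy {
    /// Proxy runs inside the scheduler: reuse the scheduler address.
    Local(bool),
    /// External proxy process: use the advertised address.
    External(String),
}

/// Decide which address the client should dial for flight results.
fn flight_endpoint(
    proxy: Option<&FlightProxy>,
    scheduler_addr: &str,
    executor_addr: &str,
) -> String {
    match proxy {
        // No proxy advertised: fetch directly from the executor, as today.
        None => executor_addr.to_string(),
        // In-process proxy: the client already knows the scheduler address.
        Some(FlightProxy::Local(_)) => scheduler_addr.to_string(),
        // External proxy: use the address the scheduler put in the response.
        Some(FlightProxy::External(addr)) => addr.clone(),
    }
}

fn main() {
    let ext = FlightProxy::External("proxy:50040".to_string());
    assert_eq!(flight_endpoint(None, "sched:50050", "exec:50051"), "exec:50051");
    assert_eq!(
        flight_endpoint(Some(&FlightProxy::Local(true)), "sched:50050", "exec:50051"),
        "sched:50050"
    );
    assert_eq!(flight_endpoint(Some(&ext), "sched:50050", "exec:50051"), "proxy:50040");
}
```

With the oneof, the empty-string sentinel disappears and the three cases are exhaustively matched.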

.advertise_flight_sql_endpoint
.clone()
.map(|s| match s {
s if s.is_empty() => format!(
Contributor


I guess the same thing here: if the configuration is empty, the client should fall back to the scheduler address/port.

Author


That's what happens; it just felt like the "switch" was easier to implement on the scheduler side.
This way there's a bit less logic on the client side. Can move this, though.

#[arg(
long,
help = "Route for proxying flight results via scheduler. Should be of the form 'IP:PORT"
help = "Route for proxying flight results via scheduler. Should be of the form 'IP:PORT'"
Contributor


Same comment regarding the empty address: if the address/port is not specified, the client needs to fall back to the scheduler address.

use std::sync::Arc;
use tonic::{Request, Response, Status, Streaming};

/// Service implementing a proxy from scheduler to executor Apache Arrow Flight Protocol
Contributor


It would be great if we could add more comments here, describing how the proxy can be configured.

@sebbegg
Author

sebbegg commented Jan 13, 2026

thanks @sebbegg,

just to clarify, we can have three configuration options:

  • proxy not configured: the client needs to fetch data from the executors
  • proxy configured, no ip address or port provided: the scheduler needs to start the proxy on the same port (within the process)
  • proxy configured, ip/port provided: the scheduler considers this an external process running the proxy; it just needs to put that value in the response and will not start the proxy itself. The client needs to use that ip/port combination to connect.

If I get this right, the last variant would mean we don't need this block, right?

https://github.com/sebbegg/datafusion-ballista/blob/5022263904c37d660bc77e3f5c065206b6720d20/ballista/scheduler/src/scheduler_process.rs#L202-L212

How would you then start this external process?
I guess we could add another crate/binary at ballista/flight-proxy?

Starting a cluster could then look like:

  • ./ballista-flight-proxy --bind-host localhost --bind-port 50040
  • ./ballista-scheduler --advertise-flight-sql-endpoint localhost:50040
  • ./ballista-executor --scheduler-host localhost --scheduler-port 50050

I guess it's smart because this way all services can be run independently.

As far as I can tell, all the scheduler state is in-memory, right?
So in this setup we could not, e.g., perform the check that the requested data / executor host is actually alive and belongs to the cluster.
On the other hand, it would make the proxy stateless, which is probably a good thing.

I wonder though, whether it's worthwhile to add the possibility (and hence the complexity in the cli & protobuf) of running the flight-proxy "embedded" in the scheduler?

@milenkovicm
Contributor

thanks @sebbegg,
just to clarify, we can have three configuration options:

  • proxy not configured: the client needs to fetch data from the executors
  • proxy configured, no ip address or port provided: the scheduler needs to start the proxy on the same port (within the process)
  • proxy configured, ip/port provided: the scheduler considers this an external process running the proxy; it just needs to put that value in the response and will not start the proxy itself. The client needs to use that ip/port combination to connect.

If I get this right, the last variant would mean we don't need this block, right?

https://github.com/sebbegg/datafusion-ballista/blob/5022263904c37d660bc77e3f5c065206b6720d20/ballista/scheduler/src/scheduler_process.rs#L202-L212

Yes, we don't start an in-process proxy on a different port.

How would you then start this external process? I guess we could add another crate/binary at ballista/flight-proxy?

We can provide a new library, or users can create their own based on the proxy you have created.

Starting a cluster could then look like:

  • ./ballista-flight-proxy --bind-host localhost --bind-port 50040
  • ./ballista-scheduler --advertise-flight-sql-endpoint localhost:50040
  • ./ballista-executor --scheduler-host localhost --scheduler-port 50050

I guess it's smart because like this all services can be run independently.

Yes, we offload the scheduler process from proxying data and leave it in charge of orchestration only.

As far as I can tell all the scheduler-state is in-memory right? So in this setup we could e.g. not perform the check whether the requested data / executor-host is actually alive and belongs to the cluster. On the other hand, it would make the proxy stateless, which is probably a good thing.

Maybe we could relax this requirement; perhaps I should have spoken up earlier. Why do we need to check if the executor is there? There are no corrective actions we can take.

I wonder though, whether it's worthwhile to add the possibility (and hence the complexity in the cli & protobuf) of running the flight-proxy "embedded" in the scheduler?

I'm not sure I understand, we still have the option to run it "embedded":

* `./ballista-scheduler --advertise-flight-sql-endpoint`

should listen "embedded".
Please let me know if I got you wrong.

@sebbegg
Author

sebbegg commented Jan 14, 2026

Maybe we could relax this requirement; perhaps I should have spoken up earlier. Why do we need to check if the executor is there? There are no corrective actions we can take.

That was a comment on the PR - but sure, we can drop this.

we can provide new library, or users create their own based on proxy you have created

Ok, so for the scope of this PR, should we add the extra proxy as an additional executable?
A minimalistic approach could be to only implement the embedded variant and leave an external flight-proxy executable up to users.

@milenkovicm
Contributor

Ok, so for the scope of this PR, should we add the extra proxy as an additional executable? A minimalistic approach could be to only implement the embedded variant and leave an external flight-proxy executable up to users.

I agree

@milenkovicm
Contributor

I'll try to review changes tomorrow

@milenkovicm
Contributor

I apologise @sebbegg, I'm catching up with reviews

Contributor

@milenkovicm milenkovicm left a comment


Thanks @sebbegg,
I think this can be merged; I just have a few minor comments and one case to be fixed.

running scheduler with:

cargo run --bin ballista-scheduler -- --advertise-flight-sql-endpoint 

will return error

error: a value is required for '--advertise-flight-sql-endpoint <ADVERTISE_FLIGHT_SQL_ENDPOINT>' but none was supplied

Not sure how to configure the local proxy to test this.

#[command(version, about, long_about = None)]
pub struct Config {
/// Route for proxying flight results via scheduler (IP:PORT format).
#[arg(
Contributor


#[arg(
        long,
        num_args = 0..=1,
        default_missing_value = "",
        help = "Route for proxying flight results via scheduler. Use 'HOST:PORT' to let clients fetch results from the specified address. If empty a flight proxy will be started on the scheduler host and port."
    )]

max_decoding_message_size: usize,
max_encoding_message_size: usize,
) -> Result<FlightServiceClient<tonic::transport::channel::Channel>, BallistaError> {
let addr = format!("http://{host}:{port}");
Contributor


we should not assume http here

Author


Hm, other usages of create_grpc_client_connection follow the same pattern:

let addr = format!("http://{host}:{port}");
let grpc_config = GrpcClientConfig::default();
debug!("BallistaClient connecting to {addr}");
let connection = create_grpc_client_connection(addr.clone(), &grpc_config)

let scheduler_url = format!("http://{scheduler_host}:{scheduler_port}");

It's somewhat inconsistent that some parts of the code use host+port while other places require URLs or URL-like strings.


Development

Successfully merging this pull request may close these issues.

Clarify usage of advertise_flight_sql_endpoint
