feat: Add arrow flight proxy to scheduler #1351
Conversation
Thanks @sebbegg, I'm a bit stuck time-wise. I'll try to have a look tomorrow; if not, I'll follow up over the holiday period.

Perhaps @martin-g may be able to help with a quick review; I'd be very thankful.
martin-g left a comment:
There are no new tests
    ))
})?;
let flight_client = FlightServiceClient::new(connection)
    .max_decoding_message_size(16 * 1024 * 1024)
Can/should we use config.grpc_server_max_encoding_message_size or a new setting?
Same for the decoding size below.
Certainly.
No idea whether this should be a new setting 🤔
I'd guess starting with the current one is fine, and if there turns out to be a need for a dedicated setting we could revisit?
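A minimal sketch of that suggestion, assuming the surrounding function has access to the scheduler config (the field names below mirror the comment and may differ from the actual code):

// Reuse the configured gRPC message size limits instead of hard-coding 16 MiB.
let flight_client = FlightServiceClient::new(connection)
    .max_decoding_message_size(config.grpc_server_max_decoding_message_size as usize)
    .max_encoding_message_size(config.grpc_server_max_encoding_message_size as usize);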
| info!("Built-in arrow flight server proxy listening on: {address:?} max_encoding_size: {max_encoding_message_size} max_decoding_size: {max_decoding_message_size}"); | ||
|
|
||
| let grpc_server_config = GrpcServerConfig::default(); | ||
| let server_future = create_grpc_server(&grpc_server_config) |
There is no authentication layer.
But there is no authentication for the main service either, so this is not required at the moment.
milenkovicm left a comment:
It's a good start, I just think we could remove the additional request to check for the proxy endpoint.
let duration = Duration::from_millis(duration);

info!("Job {job_id} finished executing in {duration:?} ");
let FlightEndpointInfo {
I don't think we should do a round trip to fetch the endpoint info. Could we add an optional response parameter to message SuccessfulJob, and if it is present, do a proxy request?
Changing SuccessfulJob would require a few changes across more modules… not sure it's the right place?
An alternative would be to add it as a second field to GetJobStatusResult?
That would limit the impact to the scheduler grpc server only.
partition_location,
gives you back the partition locations, which indicate where the data is. So by adding an optional proxy parameter you could ask it "give me the data from partition_location"; otherwise you just fetch it as it is at the moment (if no proxy is provided).
I get that, my worry was about where to fill the information.
The SuccessfulJob and PartitionLocation objects are all created in execution_graph.rs it seems. It appears weird to somehow forward the proxy information all the way into the execution graph in order to be able to fill a new field like SuccessfulJob.flight_endpoint
The alternative could be to clone & update the SuccessfulJob in the grpc endpoint:
Ok(status) => Ok(Response::new(GetJobStatusResult { status })),
pseudo:
fn get_job_status(job_id) {
    job_status = task_manager.get_job_status(job_id)
    if job_status.status is SuccessfulJob {
        job_status.status.flight_endpoint = self.state.config.advertise_flight_sql_endpoint
    }
    return job_status
}
Hey @sebbegg, is there anything I can do to help you with this PR?
Didn't find the time I hoped for during the holidays… I'll try to get back to it this week.
# Conflicts:
#   ballista/core/src/execution_plans/distributed_query.rs
#   ballista/scheduler/src/lib.rs
#   ballista/scheduler/src/scheduler_process.rs
#   ballista/scheduler/src/scheduler_server/grpc.rs
@milenkovicm Feel free to have another look - made some updates.
Unfortunately this is still missing tests.
Thanks @sebbegg, will have a look today/tomorrow.
This looks good to me.
The main problem I have with this approach is that the scheduler may be overloaded with data transport, which could affect scheduling.
But I also find this approach valuable, as Ballista can open a single port towards the clients. It does make sense to me that the "proxy" can be on a different address/port:
- If the proxy is not configured, it should not listen for connections.
- If the proxy is configured without a specific ip/port, I'd suggest binding it to the same port as the scheduler, as I believe that would be a sensible default and would simplify deployment.
- If the proxy is configured with a specific ip/port, we could treat it as an external process.
What do you think @sebbegg?
Also, it would be great if we could add a test or two.
ballista/core/proto/ballista.proto (outdated)
message GetJobStatusResult {
  JobStatus status = 1;
  optional string flight_endpoint = 2;
Would it make sense to name this "flight proxy" or similar?
let GetJobStatusResult { status } = scheduler
let GetJobStatusResult {
    status,
    flight_endpoint,
Would it make sense to support Some("") in which case the client should use the scheduler address and port? This way the scheduler would not really care about its public port.
Not sure whether we should use Some("") or have proto enums to represent the proxy cases.
match config.advertise_flight_sql_endpoint {
    Some(_) => {
        info!("Starting flight proxy");
        let flight_proxy = start_flight_proxy_server(config, scheduler.state.clone());
Would it make sense to run the proxy as a service on the same port as the scheduler service? It would simplify configuration.
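A rough sketch of what "same port" could look like with tonic, where the flight proxy is just another service registered on the scheduler's existing gRPC server (the type names SchedulerGrpcServer and FlightProxyService, and the variables scheduler_grpc, state and addr, are illustrative, not the actual ones in this PR):

use arrow_flight::flight_service_server::FlightServiceServer;
use tonic::transport::Server;

// Serve the scheduler gRPC API and the flight proxy on one address/port.
Server::builder()
    .add_service(SchedulerGrpcServer::new(scheduler_grpc))
    .add_service(FlightServiceServer::new(FlightProxyService::new(state)))
    .serve(addr)
    .await?;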
milenkovicm left a comment:
Thanks @sebbegg,
just to clarify, we can have three configuration options:
- proxy not configured: the client needs to fetch data from the executors
- proxy configured, no ip address or port provided: the scheduler needs to start the proxy on the same port (within the process)
- proxy configured, ip/port provided: the scheduler considers this an external process running the proxy and just needs to put that value in the response; the scheduler will not start the proxy, and the client needs to use that ip/port combination to connect to the process
    config.advertise_flight_sql_endpoint
);
match config.advertise_flight_sql_endpoint.clone() {
    Some(s) if s != "" => {
Sorry, I might be unclear. If we specify a different port (or ip:port) that would mean there is an external process running the proxy, not the scheduler process.
So we have three configuration options:
- no proxy
- in process (no need to specify ip/port; the client should use the scheduler ip/port)
- external process (ip/port specified; the client should use the given ip/port)
That’s what happens though?
This just puts the logic to use the scheduler host:port in the scheduler rather than the client.
Yes, but you need to specify the advertised address and configure it correctly, which may be tricky in Docker containers.
The suggestion would eliminate that, as the client already knows the scheduler address.
ballista/core/proto/ballista.proto (outdated)
message GetJobStatusResult {
  JobStatus status = 1;
  optional string flight_proxy = 2;
Can we make this a oneof to represent the proxy statuses:
- no proxy
- in process (no need to specify ip/port; the client should use the scheduler ip/port)
- external process (ip/port specified; the client should use the given ip/port)
That would remove the check for an empty string on the client side.
E.g. like
oneof flight_proxy {
  bool no_proxy = 2;
  bool in_scheduler = 3;
  string external_address = 4;
}?
Yes, something like that. Perhaps:
oneof flight_proxy {
  bool local = 1;
  string external = 4;
}
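For illustration, this is roughly how a client might consume such a oneof (prost generates a Rust enum for it; the module and variant names below just follow the sketch above, and result, scheduler_host, scheduler_port and executor_address are placeholders):

// Decide which Flight endpoint to dial based on the advertised proxy mode.
let endpoint = match result.flight_proxy {
    // in-process proxy: reuse the scheduler host/port the client already knows
    Some(get_job_status_result::FlightProxy::Local(true)) => {
        format!("{scheduler_host}:{scheduler_port}")
    }
    // external proxy: use the advertised address as-is
    Some(get_job_status_result::FlightProxy::External(address)) => address,
    // no proxy: fetch directly from the executor, as today
    _ => executor_address,
};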
    .advertise_flight_sql_endpoint
    .clone()
    .map(|s| match s {
        s if s.is_empty() => format!(
I guess the same thing here: if the configuration is empty, the client should dial back on the scheduler address/port.
That's what happens - I just felt that the "switch" was easier to implement on the scheduler side.
This way there's a bit less logic on the client side. Can move this, though.
ballista/scheduler/src/config.rs (outdated)
  #[arg(
      long,
-     help = "Route for proxying flight results via scheduler. Should be of the form 'IP:PORT"
+     help = "Route for proxying flight results via scheduler. Should be of the form 'IP:PORT'"
Same comment regarding the empty address: if the address/port is not specified, the client needs to dial back on the scheduler address.
use std::sync::Arc;
use tonic::{Request, Response, Status, Streaming};

/// Service implementing a proxy from scheduler to executor Apache Arrow Flight Protocol
It would be great if we could add more comments here describing how it can be configured.
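One possible expanded doc comment along those lines (the wording is only a suggestion, and FlightSqlProxyService is a placeholder name for the struct in this PR):

/// Service implementing a proxy from the scheduler to the executors'
/// Apache Arrow Flight protocol.
///
/// Controlled via `--advertise-flight-sql-endpoint`:
/// - flag absent: no proxy, clients fetch results directly from the executors
/// - flag set but empty: the proxy is served in-process on the scheduler host/port
/// - flag set to `HOST:PORT`: an external proxy is assumed at that address; the
///   scheduler only advertises it and does not start one itself
pub struct FlightSqlProxyService { /* ... */ }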
If I get this right, the last variant would mean we don't need this block, right? How would you then start this external process? Starting a cluster could then look like:
I guess it's smart, because like this all services can be run independently. As far as I can tell all the scheduler state is in-memory, right? I wonder, though, whether it's worthwhile to add the possibility (and hence the complexity in the CLI & protobuf) of running the flight-proxy "embedded" in the scheduler?
Yes, we don't start an in-process proxy on a different port.
We can provide a new library, or users can create their own based on the proxy you have created.
Yes, we offload the scheduler process from proxying data and leave it in charge of orchestration only.
Maybe we could relax this requirement; perhaps I should have spoken up earlier. Why do we need to check if the executor is there? There are no corrective actions we can take.
I'm not sure I understand; we still have the option to run it "embedded".
That was a comment on the PR - but sure, we can drop this.
Ok, so for the scope of this PR, should we add the extra proxy as an additional executable?
I agree.

I'll try to review the changes tomorrow.

I apologise @sebbegg, I'm catching up with reviews.
milenkovicm left a comment:
Thanks @sebbegg,
I think this can be merged; I just have a few minor comments and one case to be fixed.
Running the scheduler with:
cargo run --bin ballista-scheduler -- --advertise-flight-sql-endpoint
will return an error:
error: a value is required for '--advertise-flight-sql-endpoint <ADVERTISE_FLIGHT_SQL_ENDPOINT>' but none was supplied
Not sure how to configure the local proxy to test this.
#[command(version, about, long_about = None)]
pub struct Config {
    /// Route for proxying flight results via scheduler (IP:PORT format).
    #[arg(
#[arg(
    long,
    num_args = 0..=1,
    default_missing_value = "",
    help = "Route for proxying flight results via scheduler. Use 'HOST:PORT' to let clients fetch results from the specified address. If empty a flight proxy will be started on the scheduler host and port."
)]
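For context, that attribute would sit on the existing Option&lt;String&gt; config field (sketch below; only the attribute is new). With num_args = 0..=1 and default_missing_value = "", passing the bare flag parses as Some(""), which selects the in-process proxy case.

// Bare `--advertise-flight-sql-endpoint` now parses as Some("") instead of erroring.
#[arg(long, num_args = 0..=1, default_missing_value = "")]
pub advertise_flight_sql_endpoint: Option<String>,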
    max_decoding_message_size: usize,
    max_encoding_message_size: usize,
) -> Result<FlightServiceClient<tonic::transport::channel::Channel>, BallistaError> {
    let addr = format!("http://{host}:{port}");
We should not assume http here.
Hm, other usages of create_grpc_client_connection follow the same pattern:
datafusion-ballista/ballista/core/src/client.rs, lines 73 to 76 in caea929:
let addr = format!("http://{host}:{port}");
let grpc_config = GrpcClientConfig::default();
debug!("BallistaClient connecting to {addr}");
let connection = create_grpc_client_connection(addr.clone(), &grpc_config)
and
let scheduler_url = format!("http://{scheduler_host}:{scheduler_port}");
It's somewhat inconsistent that some parts of the code use host+port while other places require URLs or URL-like strings.
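One way to stop assuming http, as a sketch only: keep an explicit scheme if the configured value already carries one and only default to http otherwise.

// Only prepend "http://" when the configured host does not already include a scheme.
let addr = if host.contains("://") {
    format!("{host}:{port}")
} else {
    format!("http://{host}:{port}")
};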
Which issue does this PR close?
Closes #1349
Rationale for this change
See #1349. Having a proxy on the scheduler makes it easier to e.g. expose this in docker-compose or kubernetes.
Making this a draft first, let me know what you think.
What changes are included in this PR?
This adds a "flight proxy" service to the scheduler that is optionally started when advertise_flight_sql_endpoint is set. This implements only do_get and simply proxies the requests to the actual executor. I reasoned that having this as a separate service (instead of another method in the scheduler grpc) makes this more flexible, since the code on the client side remains almost unchanged - except for the logic to pick the scheduler or executor host as the endpoint.
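For readers unfamiliar with Arrow Flight, the do_get proxy has roughly this shape (a sketch only, not the exact code in this PR; how the executor is resolved from the ticket, and the executor_flight_client helper, are placeholders, and the other FlightService methods are elided):

async fn do_get(
    &self,
    request: Request<Ticket>,
) -> Result<Response<Self::DoGetStream>, Status> {
    let ticket = request.into_inner();
    // Look up which executor serves the partition encoded in the ticket ...
    let mut client = self.executor_flight_client(&ticket).await?;
    // ... then forward the call and hand the executor's stream straight back to the client.
    let response = client.do_get(Request::new(ticket)).await?;
    Ok(Response::new(Box::pin(response.into_inner()) as Self::DoGetStream))
}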
Are there any user-facing changes?
Kind of - using advertise_flight_sql_endpoint now actually has an effect.