Commit 067fd2a

zeroshade, kou, ianmcook, and raulcd authored
GH-46193: [Flight][Format] Extend Flight Location URI Semantics (#46194)
### Rationale for this change

Updating the documentation in Flight.proto and Flight.rst to extend the semantics of the allowed Flight location URIs.

### What changes are included in this PR?

Just documentation changes. Currently, none of the Arrow Flight implementations actually handle the returned URIs beyond possibly parsing them and wrapping them in a `Location` structure. It is left to the consumer to decide whether to reuse the same client or spin up a new client for the new location when performing the `DoGet` request. As such, there was no need to make any code/library changes to accommodate this as part of this PR.

* GitHub Issue: #46193

---------

Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Ian Cook <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
1 parent 53ef438 commit 067fd2a

2 files changed, +88 -2 lines changed

docs/source/format/Flight.rst

Lines changed: 53 additions & 0 deletions
@@ -333,6 +333,13 @@ schemes for the given transports:
 +----------------------------+--------------------------------+
 | (reuse connection)         | arrow-flight-reuse-connection: |
 +----------------------------+--------------------------------+
+| HTTP (1)                   | http: or https:                |
++----------------------------+--------------------------------+
+
+Notes:
+
+* \(1) See :ref:`flight-extended-uris` for semantics when using
+  http/https as the transport. It should be accessible via a GET request.

 Connection Reuse
 ----------------
@@ -360,6 +367,52 @@ string, so the obvious candidates are not compatible. The chosen
 representation can be parsed by both implementations, as well as Go's
 ``net/url`` and Python's ``urllib.parse``.

+.. _flight-extended-uris:
+
+Extended Location URIs
+----------------------
+
+In addition to alternative transports, a server may also return
+URIs that reference an external service or object storage location.
+This can be useful in cases where intermediate data is cached as
+Apache Parquet files on cloud storage or is otherwise accessible
+via an HTTP service. In these scenarios, it is more efficient to be
+able to provide a URI where the client may simply download the data
+directly, rather than requiring a Flight service to read it back into
+memory and serve it from a ``DoGet`` request.
+
+To avoid the complexities of Flight clients having to implement support
+for multiple different cloud storage vendors (e.g. AWS S3, Google Cloud),
+we extend the URIs to only allow an HTTP/HTTPS URI where the client can
+perform a simple GET request to download the data. Authentication can be
+handled either by negotiating externally to the Flight protocol or by the
+server sending a presigned URL that the client can make a GET request to.
+This should be supported by all current major cloud storage vendors, meaning
+only the server needs to know the semantics of the underlying object store APIs.
+
+When using an extended location URI, the client should ignore any
+value in the ``Ticket`` field of the ``FlightEndpoint``. The
+``Ticket`` is only used for identifying data in the context of a
+Flight service, and is not needed when the client is directly
+downloading data from an external service.
+
+Clients should assume that, unless otherwise specified, the data is
+being returned using the :ref:`format-ipc` just as it would
+via a ``DoGet`` call. If the returned ``Content-Type`` header is a generic
+media type such as ``application/octet-stream``, the client should still assume
+it is an Arrow IPC stream. For other media types, such as Apache Parquet,
+the server should use the appropriate IANA Media Type that a client
+would recognize.
+
+Finally, the server may also allow the client to choose what format the
+data is returned in by respecting the ``Accept`` header in the request.
+If multiple formats are requested and supported, the choice of which to
+use is server-specific. If none of the requested content-types are
+supported, the server may respond with either 406 (Not Acceptable),
+415 (Unsupported Media Type), or successfully respond with a different
+format that it does support, along with the correct ``Content-Type``
+header.
+
 Error Handling
 ==============

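The `Content-Type` rules added above can be illustrated with a minimal Python sketch. It assumes a client that only needs to decide which decoder to use for a downloaded body. The function name `resolve_format` and the `GENERIC_TYPES` set are hypothetical and not part of any Arrow library; the two vendor media types are the ones named in the accompanying `Flight.proto` comment.

```python
from typing import Optional

# Media types named in the Flight.proto comment of this commit.
ARROW_STREAM = "application/vnd.apache.arrow.stream"
PARQUET = "application/vnd.apache.parquet"

# Assumption: only this generic type is treated as "unspecified".
GENERIC_TYPES = {"application/octet-stream"}


def resolve_format(content_type: Optional[str]) -> str:
    """Return the media type a client should decode the response body as.

    A missing or generic Content-Type falls back to the Arrow IPC stream
    format; any other media type is returned unchanged (lowercased, with
    parameters such as charset removed) for the caller to dispatch on.
    """
    if not content_type:
        return ARROW_STREAM
    media_type = content_type.split(";", 1)[0].strip().lower()
    if not media_type or media_type in GENERIC_TYPES:
        return ARROW_STREAM
    return media_type
```

A caller might compare the result against `ARROW_STREAM` or `PARQUET` and reject anything else, since the documentation leaves the handling of unrecognized media types to the client.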
format/Flight.proto

Lines changed: 35 additions & 2 deletions
@@ -426,8 +426,41 @@ message Ticket {
 }

 /*
- * A location where a Flight service will accept retrieval of a particular
- * stream given a ticket.
+ * A location to retrieve a particular stream from. This URI should be one of
+ * the following:
+ *  - An empty string or the string 'arrow-flight-reuse-connection://?':
+ *    indicating that the ticket can be redeemed on the service where the
+ *    ticket was generated via a DoGet request.
+ *  - A valid grpc URI (grpc://, grpc+tls://, grpc+unix://, etc.):
+ *    indicating that the ticket can be redeemed on the service at the given
+ *    URI via a DoGet request.
+ *  - A valid HTTP URI (http://, https://, etc.):
+ *    indicating that the client should perform a GET request against the
+ *    given URI to retrieve the stream. The ticket should be empty
+ *    in this case and should be ignored by the client. Cloud object storage
+ *    can be utilized by presigned URLs or mediating the auth separately and
+ *    returning the full URL (e.g. https://amzn-s3-demo-bucket.s3.us-west-2.amazonaws.com/...).
+ *
+ * We allow non-Flight URIs for the purpose of allowing Flight services to indicate that
+ * results can be downloaded in formats other than Arrow (such as Parquet) or to allow
+ * direct fetching of results from a URI to reduce excess copying and data movement.
+ * In these cases, the following conventions should be followed by servers and clients:
+ *
+ *  - Unless otherwise specified by the 'Content-Type' header of the response,
+ *    a client should assume the response is using the Arrow IPC Streaming format.
+ *    Usage of an IANA media type like 'application/octet-stream' should be assumed to
+ *    be using the Arrow IPC Streaming format.
+ *  - The server may allow the client to choose a specific response format by
+ *    specifying an 'Accept' header in the request, such as 'application/vnd.apache.parquet'
+ *    or 'application/vnd.apache.arrow.stream'. If multiple types are requested and
+ *    supported by the server, the choice of which to use is server-specific. If
+ *    none of the requested content-types are supported, the server may respond with
+ *    either 406 (Not Acceptable) or 415 (Unsupported Media Type), or successfully
+ *    respond with a different format that it does support along with the correct
+ *    'Content-Type' header.
+ *
+ * Note: new schemes may be proposed in the future to allow for more flexibility based
+ * on community requests.
  */
 message Location {
   string uri = 1;
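
Since the commit message notes that handling the returned URI is left to the consumer, the three cases listed in the `Location` comment could be dispatched roughly as in the Python sketch below. The function name `fetch_strategy` and the returned labels are hypothetical. Matching on the URI scheme alone, rather than on the exact string `arrow-flight-reuse-connection://?`, is an assumption made for brevity.

```python
from urllib.parse import urlsplit


def fetch_strategy(location_uri: str) -> str:
    """Classify a Flight Location URI into one of the documented cases.

    The returned labels are illustrative only:
      "doget-same-service"  -> redeem the ticket on the originating service
      "doget-other-service" -> redeem the ticket on the service at the URI
      "http-get"            -> ignore the ticket and issue a plain GET
      "unsupported"         -> scheme not covered by the documented cases
    """
    if location_uri == "":
        return "doget-same-service"
    scheme = urlsplit(location_uri).scheme.lower()
    if scheme == "arrow-flight-reuse-connection":
        return "doget-same-service"
    # Covers grpc://, grpc+tls://, grpc+unix://, and similar variants.
    if scheme == "grpc" or scheme.startswith("grpc+"):
        return "doget-other-service"
    if scheme in ("http", "https"):
        return "http-get"
    return "unsupported"
```

Under these assumptions, a vendor-specific URI such as `s3://bucket/key` is classified as `"unsupported"`, which matches the intent of exposing object storage only through HTTP/HTTPS URLs.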
