This uses the new Source V2 interface to connect to Apache Arrow Flight endpoints. It is a prototype of what is possible with Arrow Flight. The prototype has achieved a 50x speedup compared to a serial JDBC driver, and it scales with the number of Flight endpoints/Spark executors being run in parallel.
It currently supports:
- Columnar Batch reading
- Reading many Flight endpoints in parallel as Spark partitions (see the sketch after this list)
- Filter and projection pushdown
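
The read parallelism is determined server-side: each Flight endpoint returned by `GetFlightInfo` becomes one Spark partition. As a hedged illustration of that mapping, here is a sketch using the plain `pyarrow.flight` client rather than this connector; the location, query, and lack of authentication are all placeholder assumptions:

```python
import pyarrow.flight as flight

# Hypothetical, unauthenticated Flight endpoint and query.
client = flight.FlightClient("grpc://host:port")
sql = "SELECT * FROM example"

# GetFlightInfo returns one endpoint per parallel stream; the connector
# reads each endpoint as its own Spark partition.
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql))
print(f"{len(info.endpoints)} endpoints -> {len(info.endpoints)} Spark partitions")
```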
It currently lacks:
- Support for all Spark/Arrow data types and filters
- A write interface that uses `DoPut` to write Spark dataframes back to an Arrow Flight endpoint, leveraging the transactional capabilities of the Spark Source V2 interface (a hypothetical sketch follows this list)
- Published benchmark tests
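
A future write path could build on Flight's `DoPut` call. Purely as illustration (this is not part of the connector), here is a minimal sketch using the plain `pyarrow.flight` client; the endpoint location, the descriptor path, and the absence of authentication are all assumptions:

```python
import pyarrow as pa
import pyarrow.flight as flight

# Hypothetical, unauthenticated Flight endpoint.
client = flight.FlightClient("grpc://host:port")

# Any Arrow table; in a real connector this would come from Spark partitions.
table = pa.table({"id": [1, 2, 3]})

# Open a DoPut stream under a hypothetical path and send the table.
descriptor = flight.FlightDescriptor.for_path("spark-output")
writer, _ = client.do_put(descriptor, table.schema)
writer.write_table(table)
writer.close()
```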
```python
# Ensure Spark knows of the connector jar:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().set("spark.jars", "/home/hadoop/flight-spark-source-1.0-SNAPSHOT-shaded.jar")
conf.set("spark.sql.execution.arrow.enabled", "true")
...
# Standard context setup, assumed by the snippet below:
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# host, port, username, password, parallel, and sql are user-supplied.
reader = sqlContext.read.format("cdap.org.apache.arrow.flight.spark")
df = (reader.option("port", port)
            .option("host", host)
            .option("username", username)
            .option("password", password)
            .option("parallel", parallel)
            .load(sql))
```
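
Because filters and projections are pushed down, trimming the dataframe happens at the Flight endpoint rather than after the data reaches Spark. A small illustrative follow-on (the column name is a placeholder):

```python
# The projection and filter below are pushed to the Flight endpoint,
# so only the matching rows and columns travel over the wire.
# "trip_distance" is a hypothetical column name.
result = df.select("trip_distance").filter(df.trip_distance > 10)
result.show(5)
```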