Support Spark Connect server #248

@EnricoMi

Description

With the introduction of the Spark Connect server, a Spark application can run against a remote server via the Spark Connect protocol: https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/
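For context, connecting to such a server from PySpark looks roughly like this (the host, port, and session below are placeholders; 15002 is the default Spark Connect port):

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of starting a local JVM
# driver; "sc://host:15002" is a placeholder endpoint.
spark = SparkSession.builder.remote("sc://host:15002").getOrCreate()

df = spark.range(10)  # DataFrame operations are sent to the server over gRPC
```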

This new mode removes PySpark's direct access to the JVM via py4j, and almost all features of the spark-extension PySpark package rely on py4j.
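To illustrate what breaks, this is the py4j-style access pattern that a classic PySpark session allows and a Spark Connect session does not (the commented call is a hypothetical sketch of the shape of calls a wrapper library makes, not the library's actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Classic PySpark: every DataFrame wraps a JVM Dataset, reachable via py4j.
jvm = spark._jvm   # py4j gateway into the driver JVM
jdf = df._jdf      # the underlying JVM Dataset[Row]

# Wrapper libraries call JVM-side classes through this gateway, e.g.
# (hypothetical shape): jvm.uk.co.gresearch.spark.diff.Differ(options)

# Under Spark Connect there is no driver JVM in the Python process, so
# neither spark._jvm nor df._jdf is available and such calls fail.
```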

The Spark Connect protocol supports plugins for Relations (DataFrames), Commands (side-effect actions that return no data) and Expressions. These plugins can be used to regain access to JVM-side classes and instances: https://semyonsinchenko.github.io/ssinchenko/post/extending-spark-connect/
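A minimal client-side sketch of that approach, following the linked post and assuming PySpark 3.5 internals (`pyspark.sql.connect.plan`) plus a hypothetical message class generated from the plugin's .proto file:

```python
from pyspark.sql.connect.plan import LogicalPlan
from pyspark.sql.connect.proto import Relation
from pyspark.sql.connect.client import SparkConnectClient

# from my_plugin_pb2 import MyRelationProto  # generated protobuf (assumed)


class MyRelation(LogicalPlan):
    """Wraps the plugin's custom protobuf message as a Relation extension."""

    def __init__(self, message) -> None:
        super().__init__(None)  # no child plan
        self._message = message

    def plan(self, session: SparkConnectClient) -> Relation:
        rel = self._create_proto_relation()
        # Pack the custom message into the Relation's google.protobuf.Any
        # extension field; the server-side RelationPlugin unpacks it.
        rel.extension.Pack(self._message)
        return rel
```

Turning such a plan into a DataFrame, and the server-side RelationPlugin that unpacks the message, follow the linked post; the details vary by PySpark version.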

Alternatively, any logic based on the Scala Dataset API can be rewritten purely in the PySpark DataFrame API. However, this duplicates the logic across both languages, and the two implementations then have to be tested against each other for equivalent behaviour.
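As an illustration of such a port, here is a much simplified pure-PySpark sketch in the spirit of the library's diff transformation (full outer join on id columns, classifying rows as inserted / deleted / changed / unchanged); the real spark-extension implementation supports many more options and is not this code:

```python
from pyspark.sql import DataFrame, functions as F


def diff(left: DataFrame, right: DataFrame, *id_columns: str) -> DataFrame:
    """Classify rows as 'N' (unchanged), 'C' (changed), 'D' (deleted), 'I' (inserted)."""
    value_columns = [c for c in left.columns if c not in id_columns]

    lhs = left.select(
        *id_columns,
        F.lit(True).alias("_in_left"),
        *[F.col(c).alias(f"left_{c}") for c in value_columns],
    )
    rhs = right.select(
        *id_columns,
        F.lit(True).alias("_in_right"),
        *[F.col(c).alias(f"right_{c}") for c in value_columns],
    )
    joined = lhs.join(rhs, on=list(id_columns), how="full_outer")

    # Null-safe comparison per value column; True if any column differs.
    any_changed = F.lit(False)
    for c in value_columns:
        any_changed = any_changed | ~F.col(f"left_{c}").eqNullSafe(F.col(f"right_{c}"))

    diff_col = (
        F.when(F.col("_in_left").isNull(), F.lit("I"))   # id only in right
        .when(F.col("_in_right").isNull(), F.lit("D"))   # id only in left
        .when(any_changed, F.lit("C"))
        .otherwise(F.lit("N"))
    )
    return joined.select(
        diff_col.alias("diff"),
        *id_columns,
        *[f"left_{c}" for c in value_columns],
        *[f"right_{c}" for c in value_columns],
    )
```

Such a function runs unchanged against both a classic session and a Spark Connect session, e.g. `diff(left_df, right_df, "id").show()`.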

Making Scala classes available through Spark Connect plugins also requires duplicating classes in Python and Protobuf. Additionally, such plugins need extra configuration on the Spark Connect server in order to be loaded.
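For reference, that configuration uses Spark's documented extension settings when starting the Connect server; the package coordinates and class names below are hypothetical placeholders:

```sh
# Put the plugin jar on the server classpath and register the plugin classes.
./sbin/start-connect-server.sh \
  --packages com.example:my-spark-connect-plugin_2.12:1.0.0 \
  --conf spark.connect.extensions.relation.classes=com.example.MyRelationPlugin \
  --conf spark.connect.extensions.command.classes=com.example.MyCommandPlugin \
  --conf spark.connect.extensions.expression.classes=com.example.MyExpressionPlugin
```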
