Support Spark Connect server #248

@EnricoMi

Description

With the introduction of the Spark Connect server, a Spark application can run against a remote server via the Spark Connect protocol: https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/
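For context, connecting to such a server from PySpark looks roughly like this (the host, port, and session below are placeholders; 15002 is the default Spark Connect port):

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of starting a local JVM
# driver; "sc://host:15002" is a placeholder endpoint.
spark = SparkSession.builder.remote("sc://host:15002").getOrCreate()

df = spark.range(10)  # DataFrame operations are sent to the server over gRPC
```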

This new mode removes PySpark's direct access to the JVM via py4j, and almost all features of the spark-extension PySpark package rely on py4j.
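To illustrate what breaks, this is the py4j-style access pattern that a classic PySpark session allows and a Spark Connect session does not (the commented call is a hypothetical sketch of the shape of calls a wrapper library makes, not the library's actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Classic PySpark: every DataFrame wraps a JVM Dataset, reachable via py4j.
jvm = spark._jvm   # py4j gateway into the driver JVM
jdf = df._jdf      # the underlying JVM Dataset[Row]

# Wrapper libraries call JVM-side classes through this gateway, e.g.
# (hypothetical shape): jvm.uk.co.gresearch.spark.diff.Differ(options)

# Under Spark Connect there is no driver JVM in the Python process, so
# neither spark._jvm nor df._jdf is available and such calls fail.
```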

The Spark Connect protocol supports plugins for Relations (DataFrames), Commands (side-effect actions that return no data) and Expressions. These plugins can be used to regain access to JVM-side classes and instances: https://semyonsinchenko.github.io/ssinchenko/post/extending-spark-connect/
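A minimal client-side sketch of that approach, following the linked post and assuming PySpark 3.5 internals (`pyspark.sql.connect.plan`) plus a hypothetical message class generated from the plugin's .proto file:

```python
from pyspark.sql.connect.plan import LogicalPlan
from pyspark.sql.connect.proto import Relation
from pyspark.sql.connect.client import SparkConnectClient

# from my_plugin_pb2 import MyRelationProto  # generated protobuf (assumed)


class MyRelation(LogicalPlan):
    """Wraps the plugin's custom protobuf message as a Relation extension."""

    def __init__(self, message) -> None:
        super().__init__(None)  # no child plan
        self._message = message

    def plan(self, session: SparkConnectClient) -> Relation:
        rel = self._create_proto_relation()
        # Pack the custom message into the Relation's google.protobuf.Any
        # extension field; the server-side RelationPlugin unpacks it.
        rel.extension.Pack(self._message)
        return rel
```

Turning such a plan into a DataFrame, and the server-side RelationPlugin that unpacks the message, follow the linked post; the details vary by PySpark version.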

Alternatively, any logic based on the Scala Dataset API can be rewritten purely in the PySpark DataFrame API. However, this duplicates the logic across both languages, and the two implementations then have to be tested against each other for equivalent behaviour.
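As an illustration of such a port, here is a much simplified pure-PySpark sketch in the spirit of the library's diff transformation (full outer join on id columns, classifying rows as inserted / deleted / changed / unchanged); the real spark-extension implementation supports many more options and is not this code:

```python
from pyspark.sql import DataFrame, functions as F


def diff(left: DataFrame, right: DataFrame, *id_columns: str) -> DataFrame:
    """Classify rows as 'N' (unchanged), 'C' (changed), 'D' (deleted), 'I' (inserted)."""
    value_columns = [c for c in left.columns if c not in id_columns]

    lhs = left.select(
        *id_columns,
        F.lit(True).alias("_in_left"),
        *[F.col(c).alias(f"left_{c}") for c in value_columns],
    )
    rhs = right.select(
        *id_columns,
        F.lit(True).alias("_in_right"),
        *[F.col(c).alias(f"right_{c}") for c in value_columns],
    )
    joined = lhs.join(rhs, on=list(id_columns), how="full_outer")

    # Null-safe comparison per value column; True if any column differs.
    any_changed = F.lit(False)
    for c in value_columns:
        any_changed = any_changed | ~F.col(f"left_{c}").eqNullSafe(F.col(f"right_{c}"))

    diff_col = (
        F.when(F.col("_in_left").isNull(), F.lit("I"))   # id only in right
        .when(F.col("_in_right").isNull(), F.lit("D"))   # id only in left
        .when(any_changed, F.lit("C"))
        .otherwise(F.lit("N"))
    )
    return joined.select(
        diff_col.alias("diff"),
        *id_columns,
        *[f"left_{c}" for c in value_columns],
        *[f"right_{c}" for c in value_columns],
    )
```

Such a function runs unchanged against both a classic session and a Spark Connect session, e.g. `diff(left_df, right_df, "id").show()`.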

Making Scala classes available through Spark Connect plugins also requires duplicating classes in Python and Protobuf. Additionally, such plugins need extra configuration on the Spark Connect server in order to be loaded.
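For reference, that configuration uses Spark's documented extension settings when starting the Connect server; the package coordinates and class names below are hypothetical placeholders:

```sh
# Put the plugin jar on the server classpath and register the plugin classes.
./sbin/start-connect-server.sh \
  --packages com.example:my-spark-connect-plugin_2.12:1.0.0 \
  --conf spark.connect.extensions.relation.classes=com.example.MyRelationPlugin \
  --conf spark.connect.extensions.command.classes=com.example.MyCommandPlugin \
  --conf spark.connect.extensions.expression.classes=com.example.MyExpressionPlugin
```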
