With the introduction of the Spark Connect server, a Spark application can connect to a remote Spark cluster via the Spark Connect protocol: https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/
This new architecture removes PySpark's direct access to the JVM via py4j. Almost all features of the spark-extension PySpark package rely on py4j.
The Spark Connect protocol supports plugins for Relations (DataFrames), Commands (actions with side effects that return no data), and Expressions. Such plugins can be used to regain access to JVM-side classes and instances, as sketched below: https://semyonsinchenko.github.io/ssinchenko/post/extending-spark-connect/
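A minimal sketch of the server-side part of such a relation plugin, assuming the Spark 3.5.x `RelationPlugin` trait (its signature has changed between Spark versions). `CustomRelation` and `buildPlan` are hypothetical: the former stands for a message class generated from a `.proto` file shipped with the plugin, the latter for the mapping onto the existing JVM-side implementation.

```scala
import com.google.protobuf.{Any => ProtoAny}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.connect.planner.SparkConnectPlanner
import org.apache.spark.sql.connect.plugin.RelationPlugin

class CustomRelationPlugin extends RelationPlugin {
  override def transform(
      relation: ProtoAny,
      planner: SparkConnectPlanner): Option[LogicalPlan] = {
    if (relation.is(classOf[CustomRelation])) {
      // Unpack the custom message and translate it into a Catalyst plan by
      // calling into the JVM-side classes that py4j used to reach directly.
      Some(buildPlan(relation.unpack(classOf[CustomRelation]), planner))
    } else {
      None // not our message type, leave it to other plugins
    }
  }

  // Hypothetical helper: maps the message onto the existing Scala logic.
  private def buildPlan(
      message: CustomRelation,
      planner: SparkConnectPlanner): LogicalPlan = ???
}
```

On the client side, the PySpark package would have to send the matching Protobuf message packed into the `extension` field of a Spark Connect `Relation`.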
Alternatively, any logic based on the Scala Dataset API can be rewritten purely in the PySpark DataFrame API, as illustrated below. However, this duplicates the logic, and the two implementations would have to be tested against each other for identical behavior (A/B testing).
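As an illustration, here is a heavily simplified sketch of a diff between two DataFrames, assumed to have the columns `id` and `value`. This is not spark-extension's actual diff implementation; it only shows that the N/C/D/I classification can be expressed with joins and Column expressions, which map one-to-one onto the PySpark DataFrame API and are supported over Spark Connect:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Simplified diff of two DataFrames with columns (id, value).
def simpleDiff(left: DataFrame, right: DataFrame): DataFrame = {
  val l = left.select(col("id"), col("value").as("left_value"), lit(true).as("in_left"))
  val r = right.select(col("id"), col("value").as("right_value"), lit(true).as("in_right"))
  l.join(r, Seq("id"), "full_outer")
    .withColumn("diff",
      when(col("in_left").isNull, "I")                       // inserted: only in right
        .when(col("in_right").isNull, "D")                   // deleted: only in left
        .when(col("left_value") <=> col("right_value"), "N") // no change
        .otherwise("C"))                                     // changed
    .select("diff", "id", "left_value", "right_value")
}
```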
Making Scala classes available through Spark Connect plugins also requires duplicating classes in Python and Protobuf. In addition, such plugins require extra configuration on the Spark Connect server.
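For example, a relation plugin is registered with the server via the `spark.connect.extensions.relation.classes` option (with analogous options for command and expression plugins), and the plugin jar, including the generated Protobuf classes, has to be on the server's classpath.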