
SparkDataset failing with databricks-connect serverless cluster #1038

@star-yar

Description

SparkDataset isn't working with a serverless cluster obtained via databricks-connect. Is there a way to override how the Spark session is created in the dataset class?

Context

I'm trying to run a simple Kedro pipeline locally. It uses a catalog entry of type SparkDataset. In this project I create the Spark session via databricks-connect as:

DatabricksSession.builder.profile("profile_name").serverless(enabled=True).getOrCreate()

But the get_spark function inside the dataset does:

DatabricksSession.builder.getOrCreate()

And my pipeline fails with: Cluster id or serverless are required but were not specified
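One possible answer to the question above, as a sketch only: create the serverless session eagerly in a project hook, on the assumption (not verified against databricks-connect 16.x) that DatabricksSession.builder.getOrCreate() will reuse an already active session. The hook name and the profile name are illustrative:

```python
# settings.py -- a workaround sketch, not a verified fix.
# Assumption: DatabricksSession.builder.getOrCreate() returns an already
# active session, so building the serverless session before the pipeline
# runs lets SparkDataset work unchanged.
from databricks.connect import DatabricksSession
from kedro.framework.hooks import hook_impl


class ServerlessSparkHook:
    @hook_impl
    def after_context_created(self, context):
        # Eagerly build the serverless session from the named profile so the
        # dataset's later bare getOrCreate() resolves to this session.
        (
            DatabricksSession.builder.profile("profile_name")
            .serverless(enabled=True)
            .getOrCreate()
        )


HOOKS = (ServerlessSparkHook(),)
```

If getOrCreate() does not reuse the session, the cleaner route would be a dedicated dataset subclass, but that depends on kedro-datasets internals that may change between releases.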

Steps to Reproduce

  1. Install databricks-connect
  2. Authenticate to the workspace
  3. Create a serverless executor in the Databricks workspace
  4. Add a catalog entry "dataset" of type SparkDataset
  5. Define a simple pipeline: pipeline([node(lambda x: x, inputs="dataset", outputs="dataset")])
  6. Run the kedro pipeline
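An alternative sketch that avoids touching the dataset at all: recent databricks-connect releases can pick the serverless target up from the DATABRICKS_SERVERLESS_COMPUTE_ID environment variable; whether 16.1.0 honours it is an assumption here, not something verified. Set it before the pipeline runs, e.g. near the top of the project's settings.py:

```python
import os

# Assumption: databricks-connect reads this variable when no cluster id is
# configured, so the dataset's bare getOrCreate() targets serverless compute.
os.environ["DATABRICKS_SERVERLESS_COMPUTE_ID"] = "auto"
```

The same value can reportedly live in the .databrickscfg profile, which would keep the credentials and the compute choice in one place.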

Expected Result

The pipeline runs without issues.

Actual Result

The pipeline fails with: Cluster id or serverless are required but were not specified

Your Environment

Relevant details about the environment in which I experienced the bug:

Operating system and version: Windows 10

python==3.10.10
Kedro==0.19.11
kedro-datasets==6.0.0
databricks-connect==16.1.0
databricks-sdk==0.40.0

Related to #700
