Labels: help wanted (Contribution task, outside help would be appreciated!)
Description
SparkDataset isn't working with a serverless cluster obtained via databricks-connect. Is there a way to override how the Spark session is created in the dataset class?
Context
I'm trying to run a simple Kedro pipeline locally. It uses a catalog item of type SparkDataset. In this project I create the Spark session via databricks-connect as:

```python
DatabricksSession.builder.profile("profile_name").serverless(enabled=True).getOrCreate()
```

But the `get_spark` function inside the dataset does:

```python
DatabricksSession.builder.getOrCreate()
```

And my pipeline fails with: `Cluster id or serverless are required but were not specified`
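A possible direction (a sketch only, not a tested fix; the profile name is a placeholder, and the exact module path of the session helper differs between kedro-datasets versions) would be to swap the dataset's session factory for one that builds a serverless session before the pipeline runs:

```python
# Untested workaround sketch: provide a session factory that builds a
# serverless DatabricksSession, and patch it over the dataset's default
# factory (e.g. from settings.py or a before_pipeline_run hook).
# The profile name "profile_name" is a placeholder.

def make_serverless_session():
    # Import inside the function so the module can be loaded even where
    # databricks-connect is not installed.
    from databricks.connect import DatabricksSession

    return (
        DatabricksSession.builder
        .profile("profile_name")       # placeholder profile
        .serverless(enabled=True)
        .getOrCreate()
    )

# Where to patch depends on the kedro-datasets version; roughly:
# from kedro_datasets.spark import spark_dataset
# spark_dataset._get_spark = make_serverless_session
```

Whether this is the right extension point is exactly the question raised above; a supported way to inject the session builder would be cleaner than monkey-patching.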
Steps to Reproduce
- Install databricks-connect
- Authenticate in the workspace
- Create a serverless executor in the Databricks workspace
- Add a data catalog item `"dataset"` to the catalog, with type `SparkDataset`
- Define a simple pipeline: `pipeline([node(lambda x: x, inputs="dataset", outputs="dataset")])`
- Try to run the kedro pipeline
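For reference, the catalog entry from the steps above might look like the following (an illustrative fragment; the file path and format are placeholders, not taken from the report):

```yaml
# conf/base/catalog.yml -- illustrative entry, filepath is a placeholder
dataset:
  type: spark.SparkDataset
  filepath: data/01_raw/example.parquet
  file_format: parquet
```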
Expected Result
The pipeline runs without issues.
Actual Result
The pipeline fails with `Cluster id or serverless are required but were not specified`.
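As a possible interim workaround (an assumption based on Databricks Connect's documented serverless configuration options, not verified against this setup), the plain `DatabricksSession.builder.getOrCreate()` inside the dataset can reportedly resolve serverless compute from the environment:

```python
import os

# Untested workaround sketch: point the default session builder at
# serverless compute via environment variables, so the dataset's plain
# getOrCreate() can resolve it without a cluster id.
# Assumption: these variable names follow Databricks Connect's config docs;
# "profile_name" is a placeholder.
os.environ["DATABRICKS_CONFIG_PROFILE"] = "profile_name"
os.environ["DATABRICKS_SERVERLESS_COMPUTE_ID"] = "auto"
```

Equivalently, `serverless_compute_id = auto` may be set in the `.databrickscfg` profile itself, which would avoid touching the environment from code.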
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Operating system and version: Windows 10
- `python==3.10.10`
- `kedro==0.19.11`
- `kedro-datasets==6.0.0`
- `databricks-connect==16.1.0`
- `databricks-sdk==0.40.0`
Related to #700