[BUG] User cannot deploy Merlin image >=23.04 on Azure Databricks #1055

@rnyak

Bug description

The user reported the error below when trying to deploy a merlin-tensorflow image, version 23.04 or later, on Azure Databricks. They are able to deploy the merlin-tensorflow:23.02 image there. One main difference between these images is the CUDA version.

Spark driver could not be reached on startup. This issue can be caused by invalid Spark configurations or malfunctioning [init scripts](https://docs.microsoft.com/azure/databricks/clusters/init-scripts#global-and-cluster-named-init-script-logs). Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.

Internal error message: Spark failed to start: Could not connect to driver instance. Possible reason: network misconfiguration.

Steps/Code to reproduce bug

Expected behavior

Environment details

  • Merlin version:
  • Platform:
  • Python version:
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?):

Additional context

An engineer from the RAPIDS team did some debugging of the Spark cluster issue that this user is facing with the merlin-tensorflow:23.04 image. The engineer spent some time converting the instructions from https://docs.databricks.com/clusters/custom-containers.html#option-2-build-your-own-docker-base into tests that we can run with Container Canary:

https://github.com/NVIDIA/container-canary/blob/main/examples/databricks.yaml

Here are some quick notes on running the test:

https://gist.github.com/jacobtomlinson/73f30f5657a370e7ed2a559b0eb7123f
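
For anyone who wants to compare the working and failing images side by side, here is a minimal sketch that wraps the `canary validate --file <checks> <image>` command in Python. Several things here are assumptions and not taken from the notes above: the raw-file form of the databricks.yaml URL, the `nvcr.io/nvidia/merlin/merlin-tensorflow` image path, and the helper names `canary_command` and `validate`, which are hypothetical. It also assumes the `canary` binary is already installed and on `PATH`.

```python
import shutil
import subprocess

# Assumption: raw-file form of the databricks.yaml example linked above.
DATABRICKS_CHECKS = (
    "https://github.com/NVIDIA/container-canary/raw/main/examples/databricks.yaml"
)


def canary_command(image, checks=DATABRICKS_CHECKS):
    """Build the argument list for validating `image` against a check file."""
    return ["canary", "validate", "--file", checks, image]


def validate(image):
    """Run the checks against `image` and return the CLI's exit code."""
    if shutil.which("canary") is None:
        raise RuntimeError("canary CLI not found on PATH")
    # check=False so a failing image reports its exit code instead of raising.
    return subprocess.run(canary_command(image), check=False).returncode


if __name__ == "__main__":
    # Assumption: the images are pulled from NGC under this path.
    for tag in ("23.02", "23.04"):
        image = f"nvcr.io/nvidia/merlin/merlin-tensorflow:{tag}"
        print(image, "exit code:", validate(image))
```

Running the same checks against both tags should help narrow down which Databricks container requirement the newer image no longer meets.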

Metadata

Assignees

No one assigned

Labels

P1 (Priority 1), bug (Something isn't working)
