Description
Bug description
A user reported the error below when trying to deploy a merlin-tensorflow image >= 23.04 on Azure Databricks. They are able to deploy the merlin-tensorflow:23.02 image there. One main difference between these images is the CUDA version.
Spark driver could not be reached on startup. This issue can be caused by invalid Spark configurations or malfunctioning [init scripts](https://docs.microsoft.com/azure/databricks/clusters/init-scripts#global-and-cluster-named-init-script-logs). Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark failed to start: Could not connect to driver instance. Possible reason: network misconfiguration.
Steps/Code to reproduce bug
Expected behavior
Environment details
- Merlin version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
Additional context
An engineer from the RAPIDS team did some debugging of the Spark cluster issue that this user is facing with the merlin-tensorflow:23.04 image. The RAPIDS engineer spent some time converting the instructions from https://docs.databricks.com/clusters/custom-containers.html#option-2-build-your-own-docker-base into tests that we can run with Container Canary:
https://github.com/NVIDIA/container-canary/blob/main/examples/databricks.yaml
Here are some quick notes on running the test:
https://gist.github.com/jacobtomlinson/73f30f5657a370e7ed2a559b0eb7123f
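For comparing the working and failing images, a small wrapper around the Container Canary CLI may help. This is only a sketch, not the procedure from the gist: it assumes the `canary` binary is installed and on `PATH`, that `canary validate --file <checks> <image>` is the invocation, that the raw-file URL below resolves to the same `databricks.yaml` linked above, and that the images live under `nvcr.io/nvidia/merlin/`. The helper names are hypothetical.

```python
import shutil
import subprocess

# Assumed raw-file URL for the databricks.yaml checks linked above.
DATABRICKS_CHECKS = (
    "https://raw.githubusercontent.com/NVIDIA/container-canary/"
    "main/examples/databricks.yaml"
)


def build_canary_command(image, checks=DATABRICKS_CHECKS):
    """Return the argv for `canary validate --file <checks> <image>`."""
    return ["canary", "validate", "--file", checks, image]


def run_canary(image, checks=DATABRICKS_CHECKS):
    """Run the checks against `image`.

    Returns the process exit code, or None if `canary` is not installed.
    """
    if shutil.which("canary") is None:
        return None
    return subprocess.run(build_canary_command(image, checks)).returncode


# Example (assumed registry path; pulls the images, so run deliberately):
# for tag in ("23.02", "23.04"):
#     print(tag, run_canary(f"nvcr.io/nvidia/merlin/merlin-tensorflow:{tag}"))
```

Running both tags side by side should show which of the Databricks container requirements, if any, the 23.04 image no longer meets.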