
Checkpointing on Databricks #995

@jrbourbeau

Description


While attempting to run multi-node model training on Databricks, I ran into some issues around model checkpointing.

There are a few different places one can write files to on Databricks:

  • Local VM filesystem. This is a POSIX filesystem, but is both ephemeral (goes away when VM shuts down) and not globally accessible (each VM used in training writes separately to their own local filesystem).
  • Workspace files (docs). This is persistent and globally accessible (🎉) but has a 500 MB file size limit, so it can't really be used for saving model weights in practice.
  • Unity Catalog volumes (docs). This is persistent and globally accessible (🎉) but has some relevant limitations:
    • Doesn't support symlinks (which are used in checkpointing) because it's backed by cloud blob storage
    • Direct-append or non-sequential (random) writes aren't supported (docs)
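The symlink limitation can be detected at runtime. Here's a minimal sketch (the helper name is mine, not an existing API) that probes whether a directory supports symlink creation, so checkpointing code could fall back to copying files when pointed at a Unity Catalog volume:

```python
import os
import uuid


def supports_symlinks(directory: str) -> bool:
    """Probe whether `directory` supports symlink creation.

    Blob-backed filesystems (e.g. Unity Catalog volumes) reject
    os.symlink, so checkpoint code can use a probe like this to
    fall back to copying files instead of linking them.
    """
    target = os.path.join(directory, f".symlink_probe_{uuid.uuid4().hex}")
    link = target + ".link"
    try:
        with open(target, "w") as f:
            f.write("probe")
        os.symlink(target, link)
        return True
    except OSError:
        return False
    finally:
        for path in (link, target):
            try:
                os.remove(path)
            except OSError:
                pass
```

A local POSIX filesystem should report `True`; a volume that rejects symlinks should report `False` without raising.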

A couple of different options come to mind. The first is to see if there's a way to avoid some of the Databricks-specific limitations (e.g. symlinks) while also not breaking things on other systems. The second is to have each worker VM write to its own local filesystem and then have some final step that aggregates those results and saves them to a persistent location.
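The second option could look roughly like the following sketch (function names and the shard layout are hypothetical, just to illustrate the shape): each worker writes its shard to the ephemeral local disk, and a final step copies everything to persistent storage using plain sequential writes, which avoids both the symlink and the append/random-write restrictions:

```python
import os
import shutil


def save_checkpoint_locally(state: bytes, local_dir: str, rank: int) -> str:
    """Each worker writes its shard to the VM-local POSIX filesystem."""
    os.makedirs(local_dir, exist_ok=True)
    path = os.path.join(local_dir, f"shard_rank{rank}.pt")
    with open(path, "wb") as f:
        f.write(state)
    return path


def aggregate_to_persistent(local_dir: str, persistent_dir: str) -> None:
    """Final step: copy local shards to persistent storage (e.g. a
    Unity Catalog volume) with whole-file sequential writes only."""
    os.makedirs(persistent_dir, exist_ok=True)
    for name in os.listdir(local_dir):
        shutil.copyfile(
            os.path.join(local_dir, name),
            os.path.join(persistent_dir, name),
        )
```

In practice the aggregation step would need some coordination (e.g. run only after all ranks have finished writing), but the key point is that nothing in this path requires symlinks or appends.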

I'm sure there are other options too; those are just the first things that come to mind for me.

cc @HuiyingLi for visibility

Metadata

Labels: bug (Something isn't working)
