Open
Labels
bug: Something isn't working
Description
While attempting to run multi-node model training on Databricks, I ran into some issues around model checkpointing.
There are a few different places one can write files to on Databricks:
- Local VM filesystem. This is a POSIX filesystem, but it is both ephemeral (goes away when the VM shuts down) and not globally accessible (each VM used in training writes separately to its own local filesystem).
- Workspace files (docs). This is persistent and globally accessible but has a 500 MB file size limit, so it can't really be used for saving model weights in practice.
- Unity Catalog volumes (docs). This is persistent and globally accessible but has some relevant limitations:
  - Doesn't support symlinks (which are used in checkpointing) because it's backed by cloud blob storage
  - Direct-append or non-sequential (random) writes aren't supported (docs)
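To make the symlink limitation concrete, here is a minimal sketch of a runtime probe that checks whether a target directory accepts symlinks. The function name `supports_symlinks` is hypothetical and not part of any library, and I have not verified how a Unity Catalog volume reports the failure; the sketch assumes it surfaces as an `OSError`.

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Return True if a symlink can be created inside `directory`.

    Hypothetical helper: assumes an unsupported filesystem raises OSError
    (or NotImplementedError) from os.symlink.
    """
    # The probe directory and everything in it is removed on exit.
    with tempfile.TemporaryDirectory(dir=directory) as probe_dir:
        target = os.path.join(probe_dir, "target")
        link = os.path.join(probe_dir, "link")
        with open(target, "w") as f:
            f.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            return False
        return os.path.islink(link)
```

A checkpointing routine could call this once on the output directory and, when it returns `False`, fall back to copying files instead of linking them.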
A couple of different options come to mind. The first is to see if there's a way to avoid some of the Databricks-specific limitations (e.g. symlinks) while also not breaking things on other systems. The second is to have each worker VM write to its own local filesystem and then have some final step that aggregates those results and saves them to some persistent location.
I'm sure there are other options too; those are just the first things that come to mind for me.
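As a rough illustration of the second option, here is a minimal sketch that writes a checkpoint to local disk first and then copies it to a persistent directory. `save_then_publish` and `write_checkpoint` are hypothetical names, and this has not been tested against a Unity Catalog volume. It relies on `shutil.copytree(..., symlinks=False)`, which copies the files that symlinks point to rather than the links themselves, so the destination ends up with plain files written in one sequential pass.

```python
import os
import shutil
import tempfile


def save_then_publish(write_checkpoint, persistent_dir: str, name: str) -> str:
    """Write a checkpoint locally, then copy it to persistent storage.

    `write_checkpoint` is a hypothetical callable that takes a directory
    path and writes checkpoint files (possibly including symlinks) into it.
    """
    destination = os.path.join(persistent_dir, name)
    with tempfile.TemporaryDirectory() as local_dir:
        staging = os.path.join(local_dir, name)
        os.makedirs(staging)
        # All symlinks and random writes happen on the local POSIX filesystem.
        write_checkpoint(staging)
        # symlinks=False copies the linked-to file contents, so no symlinks
        # are created at the destination. copyfile copies data only and
        # skips per-file metadata. The destination must not already exist.
        shutil.copytree(
            staging,
            destination,
            symlinks=False,
            copy_function=shutil.copyfile,
        )
    return destination
```

In a multi-node run, each worker could call this with a distinct `name` (for example, one that includes its rank) so that the copies do not collide. How the per-worker outputs should then be aggregated is left open here.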
cc @HuiyingLi for visibility