
Checkpointing on Databricks #995

@jrbourbeau

Description


While attempting to run multi-node model training on Databricks, I ran into some issues around model checkpointing.

There are a few different places one can write files to on Databricks:

  • Local VM filesystem. This is a POSIX filesystem, but is both ephemeral (goes away when VM shuts down) and not globally accessible (each VM used in training writes separately to their own local filesystem).
  • Workspace files (docs). This is persistent and globally accessible (🎉) but has a 500 MB file size limit, so it can't really be used for saving model weights in practice.
  • Unity Catalog volumes (docs). This is persistent and globally accessible (🎉) but has some relevant limitations:
    • Doesn't support symlinks (which are used in checkpointing) because it's backed by cloud blob storage
    • Direct-append or non-sequential (random) writes aren't supported (docs)
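The symlink limitation can be detected at runtime. Here's a minimal sketch (the helper name is mine, not an existing API) that probes whether a directory supports symlink creation, so checkpointing code could fall back to copying files when pointed at a Unity Catalog volume:

```python
import os
import uuid


def supports_symlinks(directory: str) -> bool:
    """Probe whether `directory` supports symlink creation.

    Blob-backed filesystems (e.g. Unity Catalog volumes) reject
    os.symlink, so checkpoint code can use a probe like this to
    fall back to copying files instead of linking them.
    """
    target = os.path.join(directory, f".symlink_probe_{uuid.uuid4().hex}")
    link = target + ".link"
    try:
        with open(target, "w") as f:
            f.write("probe")
        os.symlink(target, link)
        return True
    except OSError:
        return False
    finally:
        for path in (link, target):
            try:
                os.remove(path)
            except OSError:
                pass
```

A local POSIX filesystem should report `True`; a volume that rejects symlinks should report `False` without raising.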

A couple of different options come to mind. The first is to see if there's a way to avoid some of the Databricks-specific limitations (e.g. symlinks) while also not breaking things on other systems. The second is to have each worker VM write to its own local filesystem and then have some final step that aggregates those results and saves them to a persistent location.
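The second option could look roughly like the following sketch (function names and the shard layout are hypothetical, just to illustrate the shape): each worker writes its shard to the ephemeral local disk, and a final step copies everything to persistent storage using plain sequential writes, which avoids both the symlink and the append/random-write restrictions:

```python
import os
import shutil


def save_checkpoint_locally(state: bytes, local_dir: str, rank: int) -> str:
    """Each worker writes its shard to the VM-local POSIX filesystem."""
    os.makedirs(local_dir, exist_ok=True)
    path = os.path.join(local_dir, f"shard_rank{rank}.pt")
    with open(path, "wb") as f:
        f.write(state)
    return path


def aggregate_to_persistent(local_dir: str, persistent_dir: str) -> None:
    """Final step: copy local shards to persistent storage (e.g. a
    Unity Catalog volume) with whole-file sequential writes only."""
    os.makedirs(persistent_dir, exist_ok=True)
    for name in os.listdir(local_dir):
        shutil.copyfile(
            os.path.join(local_dir, name),
            os.path.join(persistent_dir, name),
        )
```

In practice the aggregation step would need some coordination (e.g. run only after all ranks have finished writing), but the key point is that nothing in this path requires symlinks or appends.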

I'm sure there are other options too; those are just the first things that come to mind for me.

cc @HuiyingLi for visibility

Metadata

Labels: bug (Something isn't working)
