Run a custom model with Petals
To run Petals servers with your own model, you need to convert the model weights into a Petals-compatible format. The conversion stores each transformer block in a separate branch, so each peer can download only the layers it needs instead of the entire 350 GB model.
You can convert BLOOM models using the following script:
```bash
# convert model from HF hub to a distributed format (can take hours depending on your connection!)
MY_WRITE_TOKEN=TODO_WRITE_TOKEN_FROM_https://huggingface.co/settings/token
python -m cli.convert_model --model bigscience/bloom-6b3 \
  --output_path ./converted_model --output_repo bigscience/test-bloomd-6b3 \
  --use_auth_token $MY_WRITE_TOKEN  # ^-- TODO: replace the output repo with one you have access to
```

If you want to run a non-BLOOM model (e.g. OPT or YALM), you will need to edit the code a bit.
Currently, Petals uses a vanilla implementation of BLOOM in `src/bloom`, so it is possible to replace it with other models from Hugging Face transformers.
Assuming your model is already compatible with Hugging Face transformers, you will need three extra steps:
- Edit `cli/convert_model.py` to partition your model checkpoint into individual blocks and non-transformer layers. Once you are done, run this script to convert your model and upload it to Hugging Face. If your model is private, you can use your internal storage instead (see the next step).
- In `src/bloom/from_pretrained.py`, edit `load_pretrained_block` to load a single block of your custom model. Your block should be able to run `.forward(hidden_states=..., use_cache=true_or_false, layer_past=optional_tensors)` (see the first sketch after this list). After this step, you should be able to launch a server with the new model name.
- Open `src/client/remote_model.py` and change `DistributedBloomModel` to load the model of your choice. Create non-transformer layers (e.g. embeddings and logits) as usual. Instead of loading transformer blocks, create a `RemoteSequential` instance (see the second sketch after this list).
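As a rough illustration of the interface described in the second step, here is a minimal sketch of an adapter that exposes one block of a custom model through the expected call signature. Only the `.forward(hidden_states=..., use_cache=..., layer_past=...)` contract comes from the step above; the class name and the assumption that the wrapped block accepts the same keyword arguments are hypothetical.

```python
import torch
from torch import nn


class WrappedBlock(nn.Module):
    """Hypothetical adapter: exposes a single transformer block of a custom model
    through the call signature a Petals server expects (see step 2 above)."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block  # one transformer layer of your custom model

    def forward(self, hidden_states: torch.Tensor, use_cache: bool = False, layer_past=None):
        # Translate the Petals-style keyword arguments into whatever your block expects.
        # Assumption: the wrapped block already takes these names; if not, convert them here.
        # It should return the new hidden states (and, when use_cache=True, the key/value cache).
        return self.block(hidden_states, layer_past=layer_past, use_cache=use_cache)
```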
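The outline below sketches the kind of change the third step describes for the client-side model: keep embeddings, the final layer norm, and the logits head local, and route the transformer stack through a `RemoteSequential` instance. The class and layer names are placeholders, and constructing the `RemoteSequential` itself is left to the code in `src/client/remote_model.py`; this is a sketch, not the actual Petals implementation.

```python
from torch import nn


class DistributedMyModel(nn.Module):
    """Hypothetical outline of step 3: non-transformer layers run locally,
    transformer blocks are executed remotely via a RemoteSequential instance."""

    def __init__(self, config, remote_blocks):
        super().__init__()
        # Create embeddings, the final layer norm, and the logits head locally, as usual.
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.ln_f = nn.LayerNorm(config.hidden_size)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        # Instead of instantiating transformer blocks locally, keep a RemoteSequential
        # (constructed as in src/client/remote_model.py) that forwards activations
        # through blocks hosted by Petals servers.
        self.h = remote_blocks

    def forward(self, input_ids):
        hidden_states = self.word_embeddings(input_ids)
        hidden_states = self.h(hidden_states)  # runs remotely, block by block
        hidden_states = self.ln_f(hidden_states)
        return self.lm_head(hidden_states)
```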
Once you are done, run `tests/test_full_model.py` to verify that your conversion went correctly.
In the future, we hope to streamline this process, making it possible to serve any language model available on Hugging Face. If you want this future to come sooner and are willing to work on a pull request, please contact us via issues.
If you encounter any issues or want to share feedback, please join the #running-a-server channel of our Discord.