Run a custom model with Petals
Starting with Petals 1.2.0, you don't have to convert a new model to a special Petals-compatible format; you can serve it directly from a Hugging Face Hub repository.
Still, Petals supports only a predefined set of model architectures, defined in the `petals.models` package. If you'd like to support a new architecture, you need to copy the `src/petals/models/bloom` or `src/petals/models/llama` directory and update all files to work with your new model.
We recommend doing that in the following order:
- Edit `config.py` and `__init__.py`:
  - Make sure that the config is correctly loaded from a Hugging Face Hub repo when using `AutoDistributedConfig.from_pretrained(...)` (see the first sketch after this list).
- Edit `block.py`:
  - Make sure that you can run a Petals server with your model's blocks.
  - Make sure the server returns correct results for forward and backward passes (the outputs are close to the ones of a locally hosted block).
    - Pay attention to the dimension order in attention caches (both keys and values), since many implementations use different dimension orders (e.g., see the dimension reordering code in `src/petals/models/llama/block.py` and the second sketch after this list).
  - Run the server with `--throughput eval` to test the inference code and check that you have no shape errors.
- Edit `model.py`:
  - Create distributed model wrappers using code from the 🤗 Transformers implementation.
  - Check that you can run a Petals client and get correct results for inference, forward, and backward passes with all model types (the outputs are close to those of a locally hosted model; see the third sketch after this list).
  - Check that `AutoDistributedModel.from_pretrained(...)`, `AutoDistributedModelForCausalLM.from_pretrained(...)`, and similar functions correctly load the model from the Hugging Face Hub.
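If the config wiring works, a quick sanity check for the first step could look like the sketch below. The repository name is a placeholder for your own model, and the attribute names follow whatever the underlying 🤗 Transformers config exposes; this is a minimal check, not part of the Petals codebase.

```python
# Minimal sanity check for step 1. "your-org/your-model" is a placeholder
# for the Hugging Face Hub repository of the model you are adding.
from petals import AutoDistributedConfig

config = AutoDistributedConfig.from_pretrained("your-org/your-model")

# The distributed config should expose the usual hyperparameters parsed
# from the Hub repo (attribute names depend on the underlying HF config).
print(type(config).__name__)
print(config.num_hidden_layers, config.hidden_size)
```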
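For the second step, the exact cache layout Petals expects is defined by the serving code, so the snippet below only illustrates the kind of key/value reordering the LLaMA block performs; the shapes and the target layout here are assumptions, and `src/petals/models/llama/block.py` contains the real conversion.

```python
# Illustration only: the shapes and the target layout are assumptions, not
# the exact layout Petals uses. See src/petals/models/llama/block.py for
# the real cache conversion.
import torch

batch_size, num_heads, seq_len, head_dim = 2, 8, 16, 64

# Many Transformers implementations store the key/value cache as
# (batch, num_heads, seq_len, head_dim):
key_native = torch.randn(batch_size, num_heads, seq_len, head_dim)

# Suppose the serving side expects a flattened
# (batch * num_heads, seq_len, head_dim) layout instead. The conversion and
# its inverse must round-trip exactly, or inference results will be wrong:
key_flat = key_native.reshape(batch_size * num_heads, seq_len, head_dim)
key_restored = key_flat.reshape(batch_size, num_heads, seq_len, head_dim)

assert torch.equal(key_restored, key_native)
```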
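For the third step, one rough way to compare the distributed model against a locally hosted one is to run the same input through both and check that the logits are close. The repository name, `initial_peers` address, and tolerance below are placeholders, and the sketch assumes a test swarm is already serving your model.

```python
# A rough consistency check, not a definitive test: the model name, peer
# address, and tolerance below are placeholders / assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "your-org/your-model"
INITIAL_PEERS = ["/ip4/127.0.0.1/tcp/31337/p2p/..."]  # address of your test swarm

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
input_ids = tokenizer("A quick consistency check", return_tensors="pt")["input_ids"]

local_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
distributed_model = AutoDistributedModelForCausalLM.from_pretrained(
    MODEL_NAME, initial_peers=INITIAL_PEERS
)

with torch.no_grad():
    local_logits = local_model(input_ids).logits
    remote_logits = distributed_model(input_ids).logits

# Exact bit-for-bit equality is not expected (servers may run different
# hardware and dtypes), but the outputs should be close:
print("Max abs difference:", (local_logits - remote_logits).abs().max().item())
assert torch.allclose(local_logits, remote_logits, atol=1e-3, rtol=0)
```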
If you encounter any issues, don't hesitate to ask in the #running-a-server channel of our Discord.