
Run a custom model with Petals


Starting with Petals 1.2.0, you don't have to convert a new model to a special Petals-compatible format: you can serve it directly from a Hugging Face Hub repository.
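For reference, this is roughly what loading an already supported model looks like on the client side. This is a minimal sketch: the repository name is a placeholder, the swarm you connect to must actually be serving that model, and recent Petals versions expose `AutoDistributedModelForCausalLM` (older versions use model-specific classes instead):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Placeholder repo: use any model whose architecture Petals supports
# and that is actually being served in the swarm you connect to.
MODEL_NAME = "your-org/your-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

# Run distributed generation: the client only holds the embeddings and the head,
# while the transformer blocks are executed by remote servers.
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```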

Still, Petals supports only the model architectures defined in the petals.models package. If you'd like to support a new architecture, you need to copy the src/petals/models/bloom or src/petals/models/llama directory and update all of its files to work with your new model.

We recommend doing that in the following order:

  1. Edit config.py and __init__.py, and make sure that the config is loaded correctly from a Hugging Face Hub repo.
  2. Edit server.py and make sure that you can run a Petals server with your model's blocks and that it returns correct results for forward and backward passes (compared to a locally hosted block). Pay attention to the dimension order in the attention caches (both keys and values), since different implementations use different dimension orders (e.g., see the dimension reordering code in src/petals/models/llama/block.py). You can run the server with --throughput eval to try running inference and check that there are no shape errors.
  3. Edit client.py, copying the model wrapper code (e.g., from the 🤗 Transformers implementation), and check that you can run a Petals client and that it gives correct results for inference, forward, and backward passes (see the sketch after this list).
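
As a sanity check for steps 2 and 3, it helps to compare the Petals client's forward pass against the same model run entirely locally with 🤗 Transformers. The sketch below is illustrative only: `REPO` is a placeholder, `DistributedYourModelForCausalLM` stands for whatever client wrapper you define in client.py, `initial_peers` must point to your test swarm, and the tolerance is arbitrary.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

REPO = "your-org/your-model"  # placeholder: your model's Hub repository

tokenizer = AutoTokenizer.from_pretrained(REPO)
inputs = tokenizer("Test prompt", return_tensors="pt")["input_ids"]

# Reference: the full model hosted locally via 🤗 Transformers.
ref_model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype=torch.float32)
with torch.no_grad():
    ref_logits = ref_model(inputs).logits

# Petals: the same model, with its blocks served by your test server.
# The class name below is hypothetical -- use the wrapper you defined in client.py.
# from petals.models.yourmodel import DistributedYourModelForCausalLM
# petals_model = DistributedYourModelForCausalLM.from_pretrained(REPO, initial_peers=[...])
# with torch.no_grad():
#     petals_logits = petals_model(inputs).logits
# assert torch.allclose(ref_logits, petals_logits, atol=5e-3), "outputs diverge"
```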

If you encounter any issues, don't hesitate to ask in the #running-a-server channel of our Discord.
