Run a custom model with Petals

Starting with Petals 1.2.0, you don't have to convert a new model to a special Petals-compatible format: you can serve it directly from a Hugging Face Hub repository.
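
For example, a server can be started directly on such a repo with `python -m petals.cli.run_server <repo_name>`, and a client can load it without any conversion step. Here is a minimal client-side sketch (the repo name below is just an example of a model with a supported architecture):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # any Hub repo with a supported architecture

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Blocks are executed by remote servers; only embeddings and the LM head run locally
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0]))
```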

Still, Petals supports only the model architectures defined in the petals.models package. If you'd like to support a new architecture, copy the src/petals/models/bloom or src/petals/models/llama directory and update all of its files to work with your new model.

We recommend doing this in the following order:

  1. Edit config.py and __init__.py:
     • Make sure that the config is correctly loaded from a Hugging Face Hub repo when using AutoDistributedConfig.from_pretrained(...) (see the config check sketch after this list).
  2. Edit block.py:
     • Make sure that you can run a Petals server with your model's blocks.
     • Make sure the server returns correct results for forward and backward passes (the outputs are close to the ones of a locally hosted block; see the output comparison sketch after this list).
     • Pay attention to the dimension order in attention caches (both keys and values), since many implementations use different dimension orders (e.g., see the dimension reordering code in src/petals/models/llama/block.py).
     • Run the server with --throughput eval to test the inference code and check that you get no shape errors.
  3. Edit model.py:
     • Create distributed model wrappers using code from the 🤗 Transformers implementation.
     • Check that you can run a Petals client and get correct results for inference, forward, and backward passes with all model types (the outputs are close to those of a locally hosted model; see the output comparison sketch after this list).
     • Check that AutoDistributedModel.from_pretrained(...), AutoDistributedModelForCausalLM.from_pretrained(...), and similar functions correctly load the model from the Hugging Face Hub.
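
For step 1, a quick sanity check is to load the config through the auto class. This is only a sketch: the repo name is a placeholder, and it assumes your new config class is registered and exported the same way as the existing bloom and llama ones:

```python
from petals import AutoDistributedConfig

# Should resolve to your new distributed config class and load without errors
# ("your-org/your-model" is a hypothetical repo name).
config = AutoDistributedConfig.from_pretrained("your-org/your-model")
print(type(config), config)
```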
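
For steps 2 and 3, one way to verify correctness is to compare the distributed model's forward outputs against the same model loaded locally with 🤗 Transformers. This sketch assumes the model is small enough to run locally and uses a placeholder repo name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "your-org/your-model"  # hypothetical repo name

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("A quick test", return_tensors="pt")["input_ids"]

# Local reference model (runs entirely on this machine)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
ref_logits = ref_model(inputs).logits

# Distributed model (blocks are executed by your Petals server)
dist_model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
dist_logits = dist_model(inputs).logits

# The outputs won't match bit-for-bit; the tolerance depends on the dtype
# and quantization your server uses.
print(torch.allclose(ref_logits, dist_logits, atol=1e-3, rtol=0))
```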

If you encounter any issues, don't hesitate to ask in the #running-a-server channel of our Discord.
