Run a custom model with Petals
To run Petals servers with your own model, you need to convert the model weights into a Petals-compatible format. This conversion splits each individual transformer block into a separate branch of the model repository, which allows each peer to download only the layers it needs instead of the entire 350 GB model.
For BLOOM models, you can convert them using the following script:

```bash
# convert model from HF hub to a distributed format (can take hours depending on your connection!)
MY_WRITE_TOKEN=TODO_WRITE_TOKEN_FROM_https://huggingface.co/settings/token
python -m petals.cli.convert_model --model bigscience/bloom-6b3 \
  --output_path ./converted_model --output_repo your-hf-name/test-bloomd-6b3 \
  --use_auth_token $MY_WRITE_TOKEN  # ^-- todo: replace output repo with something you have access to
```

If you want to run a non-BLOOM model (e.g. OPT or YALM), you will need to edit the code a bit. Currently, Petals uses a vanilla implementation of BLOOM, so it is possible to replace it with other models from Hugging Face transformers.
Assuming your model is already compatible with Hugging Face transformers, you can follow these steps:
This instruction will require you to change some of the internal code of Petals. We recommend forking the petals repository and working on your fork - but you can skip this step and this guide will still work.
Petals servers need a way to load a single transformer block, preferably without downloading all model weights. Furthermore, Petals clients need to download the "local" parts of the model: word embeddings and some extra layers.
Edit src/petals/cli/convert_model.py to partition your model checkpoint into individual blocks and non-transformer layers.
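If you are unsure where to start, the core of this step is just grouping the checkpoint's parameters by block index and saving each group separately. Below is a minimal sketch of that idea; the model name and the `transformer.h.` prefix are placeholders for your own model, and the real convert_model.py additionally pushes each block to a separate branch of the output repo (as described above):

```python
import os
import torch
from transformers import AutoModelForCausalLM

# Sketch only: split a Hugging Face checkpoint into per-block shards plus the
# "client-side" part (embeddings and other non-transformer layers).
# Replace the model name and the "transformer.h." prefix with your model's own.
model = AutoModelForCausalLM.from_pretrained("your-hf-name/your-model")
state_dict = model.state_dict()
block_prefix = "transformer.h."

blocks, client_side = {}, {}
for name, tensor in state_dict.items():
    if name.startswith(block_prefix):
        # e.g. "transformer.h.12.self_attention.query_key_value.weight" -> block 12
        block_index, _, param_name = name[len(block_prefix):].partition(".")
        blocks.setdefault(int(block_index), {})[param_name] = tensor
    else:
        # word embeddings, final layernorm, etc. - the part downloaded by clients
        client_side[name] = tensor

os.makedirs("converted_model", exist_ok=True)
for block_index, block_state in blocks.items():
    torch.save(block_state, f"converted_model/block_{block_index}.pt")
torch.save(client_side, "converted_model/client_side.pt")
```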
Once you are done, run python -m petals.cli.convert_model ... (see the full command in the first part of this guide, Ctrl+F petals.cli.convert_model) to convert your model and upload it to your Hugging Face account. If your model is private, you can use your internal storage instead (see the next step).
Open src/petals/client/remote_model.py and change some model-related classes:
You will need to change two things: the transformers config and the model itself.
Config: DistributedBloomConfig(BloomConfig) becomes DistributedYourModelConfig(transformers.YourModelConfig). The rest of the code for this class can stay the same.
Model: DistributedBloomModel becomes DistributedYourModel, but this time, you will need to modify some code.
You should create non-transformer layers (e.g. embeddings, input layernorm) as usual, but instead of loading transformer blocks,
create a RemoteSequential instance.
In the original DistributedBloomModel, we achieve this by creating a model with zero layers - a shortcut that creates every layer except the transformer blocks. However, you may choose to create all these layers manually instead. Either way, please make sure you use these layers in model.forward.
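To make this concrete, here is a rough sketch of what these two classes could look like. All YourModel* names are placeholders, and the exact RemoteSequential arguments vary between Petals versions - copy them from the BLOOM implementation in src/petals/client/remote_model.py rather than from this sketch:

```python
import transformers
from petals.client.remote_sequential import RemoteSequential  # import path may differ by version


class DistributedYourModelConfig(transformers.YourModelConfig):  # placeholder base class
    initial_peers = []   # multiaddrs of known Petals peers
    dht_prefix = None    # a unique name of this model in the DHT
    dht = None           # optionally, a pre-created hivemind.DHT instance


class DistributedYourModel(transformers.YourModel):  # placeholder base class
    config_class = DistributedYourModelConfig

    def __init__(self, config: DistributedYourModelConfig):
        n_layer = config.num_hidden_layers       # BLOOM calls this attribute n_layer
        config.num_hidden_layers = 0             # build embeddings/layernorms, but no local blocks
        super().__init__(config)
        config.num_hidden_layers = n_layer

        # the transformer blocks are replaced by a sequence of remote blocks served by Petals peers
        self.h = RemoteSequential(config, dht=config.dht)  # approximate call, check the BLOOM version

        # make sure your model.forward(...) passes hidden states through self.h
        # exactly where the local transformer blocks used to be
```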
After this part works, you may want to wrap task-specific models, e.g. DistributedBloomForCausalLM -> DistributedYourModelForCausalLM. These wrappers keep a DistributedYourModel as a sub-module and wrap it with custom local modules. Using DistributedYourModel inside DistributedYourModelForCausalLM (or ...ForSequenceClassification) requires the same code as with local YourModel instances.
Test yourself: after you finish this step, you should be able to create a model with zero transformer blocks. Temporarily replace RemoteSequential with a no-op and make sure that your model can be created (e.g. in a jupyter notebook) and runs model.forward(input_ids). Of course, this "fake model" is not equivalent to running the actual model - but it is a good sanity check that the model runs without errors.
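Continuing the hypothetical sketch above, such a sanity check might look like this (nn.Identity stands in for RemoteSequential; if your forward passes extra keyword arguments to the blocks, use a small wrapper module that ignores them instead):

```python
import torch
import torch.nn as nn

config = DistributedYourModelConfig.from_pretrained("your-hf-name/test-yourmodel")  # placeholder repo
model = DistributedYourModel(config)

model.h = nn.Identity()  # temporarily make the remote part a no-op

input_ids = torch.tensor([[1, 2, 3, 4]])
with torch.no_grad():
    outputs = model.forward(input_ids)  # should run without errors (outputs are meaningless)
```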
In src/petals/bloom/from_pretrained.py, edit load_pretrained_block to load a single block of your custom model.
Your block should be able to run block.forward(hidden_states=..., use_cache=true_or_false, layer_past=optional_tensors).
If your block needs some extra inputs to forward (e.g. an ALiBi mask), wrap the block module similarly to WrappedBloomBlock.
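Below is a simplified sketch of that pattern for BLOOM. The actual WrappedBloomBlock in Petals also converts the attention mask into the causal format that BloomAttention expects, so use this as a reference for the idea rather than the exact code:

```python
import torch
from transformers.models.bloom.modeling_bloom import BloomBlock, build_alibi_tensor


class WrappedBloomBlock(BloomBlock):
    """Compute the extra inputs (ALiBi bias, attention mask) that Petals does not pass explicitly."""

    def forward(self, hidden_states: torch.Tensor, *args, attention_mask=None, alibi=None, **kwargs):
        batch_size, seq_length = hidden_states.shape[:2]
        if attention_mask is None:
            attention_mask = torch.ones((batch_size, seq_length), device=hidden_states.device)
        if alibi is None:
            # BLOOM's attention needs an ALiBi bias tensor built from the attention mask
            alibi = build_alibi_tensor(attention_mask, self.num_heads, dtype=hidden_states.dtype)
        return super().forward(hidden_states, *args, attention_mask=attention_mask, alibi=alibi, **kwargs)
```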
Test yourself by calling load_pretrained_block in a notebook and check that the resulting block runs.
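A minimal check, assuming the test repo from the conversion command above (exact keyword arguments of load_pretrained_block may differ between Petals versions):

```python
import torch
from petals.bloom.from_pretrained import load_pretrained_block

block = load_pretrained_block("your-hf-name/test-bloomd-6b3", block_index=3, torch_dtype=torch.float32)

hidden_size = block.input_layernorm.weight.shape[0]  # or take hidden_size from your model config
dummy_inputs = torch.randn(1, 8, hidden_size)        # (batch_size, seq_len, hidden_size)
with torch.no_grad():
    outputs = block(dummy_inputs)
assert outputs[0].shape == dummy_inputs.shape
```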
First, you should simply go over all usages of the old model blocks and replace them:
- go over all usages of WrappedBloomBlock and replace them with your new block type (the type returned by load_pretrained_block)
- in src/petals/server.py, switch all uses of BloomConfig to your model's config type
Make forward/backward work: go to src/petals/utils/convert_block.py. Find a line tp_config=load_bloom_config(...) - it allows you to automatically parallelize your block across local GPUs. Simply set tp_config=None. This will make tensor_parallel create a config for you automatically. You can write a custom config, but that part is optional.
Test yourself: after this step, you should be able to start a petals server and run forward and backward directly - but not generation (yet). To test this, open tests/test_remote_sequential.py and modify def test_remote_sequential to test your model.
Make inference work: in regular transformers, fast autoregressive generation works by running a Hugging Face model on one token at a time and using layer_past for attention to past tokens. We need to wrap this layer_past for use in a Petals server. Unfortunately, different Hugging Face models may have different shapes for layer_past tensors.
Run output, layer_past = block.forward(some_dummy_inputs, use_cache=True) and take a look at layer_past. Your goal is to figure out how layer_past is organized. Typically, layer_past is a list/tuple that contains two tensors: keys and values. However, the shape of these tensors can be model-specific.
For instance, BLOOM's layer_past has the following structure: layer_past = (keys, values), where
- keys have shape (batch_size * num_heads, head_dim, max_length)
- values have shape (batch_size * num_heads, max_length, head_dim)

Note that in BLOOM, keys and values have different shapes. In contrast, in GPT-2, keys and values have the same shape.
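A quick way to check this for your own block (continuing the load_pretrained_block example above; the shapes in the comments are what you would see for BLOOM):

```python
import torch

batch_size, seq_len = 1, 8
hidden_size = block.input_layernorm.weight.shape[0]  # assumes `block` from load_pretrained_block above
dummy_inputs = torch.randn(batch_size, seq_len, hidden_size)

with torch.no_grad():
    output, layer_past = block(dummy_inputs, use_cache=True)

keys, values = layer_past
print(keys.shape)    # BLOOM: (batch_size * num_heads, head_dim, seq_len)
print(values.shape)  # BLOOM: (batch_size * num_heads, seq_len, head_dim)
```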
Once you have figured out how layer_past works for your model, go to src/petals/server/backend.py.
You will need to modify all or some of these methods (a rough sketch of the cache logic follows the list):
- get_inference_cache_descriptors - return TensorDescriptor-s of the right shape for your model.
- _reorder_cache_inplace - given a 1-dimensional tensor of indices (integers in range [0, batch_size - 1)), select layer caches for these indices. Used for beam search.
- _select_layer_past - select a prefix of the cache tensors up to the specified prefix length. This is needed because past keys/values are pre-allocated for the maximum length and then sliced to the actual length.
- _update_cache_inplace - modify the cache tensors after one inference step, i.e. write the new tokens' keys/values back into the attention cache.
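For illustration only, here is roughly what the slicing and update logic could look like for a GPT-2-style cache of shape (batch_size, num_heads, seq_len, head_dim). The real methods in src/petals/server/backend.py operate on pre-allocated hivemind cache tensors and, for BLOOM, on the flattened shapes shown above, so treat this as a sketch of the logic rather than a drop-in replacement:

```python
from typing import Tuple
import torch

Cache = Tuple[torch.Tensor, torch.Tensor]  # (keys, values), pre-allocated up to max_length


def select_layer_past(cache: Cache, prefix_length: int) -> Cache:
    """Return only the first `prefix_length` positions of the pre-allocated cache."""
    keys, values = cache
    return keys[:, :, :prefix_length, :], values[:, :, :prefix_length, :]


def update_cache_inplace(cache: Cache, new_kvs: Cache, prefix_length: int) -> int:
    """Write the keys/values produced by the latest forward pass back into the cache."""
    keys, values = cache
    new_keys, new_values = new_kvs            # contain the prefix plus the newly generated positions
    new_length = new_keys.shape[2]
    keys[:, :, prefix_length:new_length, :] = new_keys[:, :, prefix_length:, :]
    values[:, :, prefix_length:new_length, :] = new_values[:, :, prefix_length:, :]
    return new_length


def reorder_cache_inplace(cache: Cache, hypo_ids: torch.LongTensor) -> None:
    """Rearrange cache entries along the batch dimension, e.g. for beam search."""
    keys, values = cache
    keys[:] = keys[hypo_ids]
    values[:] = values[hypo_ids]
```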
Once you are done, run tests/test_full_model.py to verify that your conversion went correctly.
In the future, we hope to streamline this process, making it possible to serve any language model available on Hugging Face. If you wish for this future to come sooner and are willing to work on a pull request, please contact us via issues and/or Discord.
If you encounter any issues or want to share feedback, please join the #running-a-server channel of our Discord.