Description
Background
LLM Compressor uses sequential onloading to onload subgraphs of a model one at a time. Which subgraphs are created, and how, is determined by the sequential_targets argument, either passed by the user or inferred from the model definition (typically one DecoderLayer).
The choice of sequential targets comes with tradeoffs. Larger sequential targets are more runtime efficient but use more memory; smaller sequential targets use less memory but are less runtime efficient.
Currently, a user can only specify the sequential_targets argument. However, there are cases where a user will want to pack multiple sequential targets into a single subgraph (for example, onload two decoder layers per subgraph, rather than just one).
The logic for creating subgraphs from targets is implemented by the topological_partition function.
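The partitioning behavior can be sketched roughly as follows. This is an illustrative toy model, not the actual topological_partition implementation (which operates on traced graph nodes): it treats the model as a flat list of module names and starts a new subgraph each time a sequential target is reached.

```python
def partition_sketch(modules, sequential_targets):
    """Toy sketch: group modules into subgraphs, where each
    sequential target begins a new subgraph."""
    subgraphs = []
    current = []
    for module in modules:
        # flush the current subgraph whenever a new target starts
        if module in sequential_targets and current:
            subgraphs.append(current)
            current = []
        current.append(module)
    if current:
        subgraphs.append(current)
    return subgraphs


modules = ["embed", "layer0", "layer1", "layer2", "lm_head"]
targets = {"layer0", "layer1", "layer2"}
subgraphs = partition_sketch(modules, targets)
# with one DecoderLayer as the target, each layer lands in its own subgraph
```

With the default of one DecoderLayer per target, every decoder layer becomes its own subgraph, so only one layer's weights need to be onloaded at a time.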
Requested Changes
- Design an interface that allows users to pack multiple sequential targets into a single subgraph (for example, if your GPUs can fit 2-3 decoder layers)
- Modify topological_partition to allow multiple targets to be assigned to the same subgraph
- Test the feature. You should see higher VRAM usage and lower runtime
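One possible shape for the interface, sketched on the same toy model as above: a hypothetical targets_per_subgraph argument (name is an assumption, not an existing llm-compressor parameter) that only closes a subgraph after every Nth target.

```python
def partition_sketch(modules, sequential_targets, targets_per_subgraph=1):
    """Toy sketch: close the current subgraph only after every
    `targets_per_subgraph` sequential targets (hypothetical argument)."""
    subgraphs, current, seen = [], [], 0
    for module in modules:
        if module in sequential_targets:
            # only start a new subgraph once the previous one has
            # accumulated `targets_per_subgraph` targets
            if current and seen % targets_per_subgraph == 0:
                subgraphs.append(current)
                current = []
            seen += 1
        current.append(module)
    if current:
        subgraphs.append(current)
    return subgraphs


modules = ["embed", "layer0", "layer1", "layer2", "layer3", "lm_head"]
targets = {"layer0", "layer1", "layer2", "layer3"}
subgraphs = partition_sketch(modules, targets, targets_per_subgraph=2)
# two decoder layers are onloaded per subgraph instead of one
```

With targets_per_subgraph=2, pairs of decoder layers share a subgraph, trading higher peak VRAM for fewer onload/offload cycles and lower runtime.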