[Feature] Allow targeting multiples of sequential targets #2481

@kylesayrs

Description

Background

LLM Compressor uses sequential onloading to onload subgraphs of a model. How and which subgraphs are created is determined by the sequential_targets argument, either passed by the user or inferred from the model definition (typically one DecoderLayer).

The choice of sequential targets comes with tradeoffs. Larger sequential targets are more runtime-efficient but use more memory. Smaller sequential targets use less memory but are less runtime-efficient.
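As a rough illustration of this tradeoff, consider a toy model with a fixed number of identical decoder layers, where each subgraph holds a fixed number of layers. The sketch below only counts how many subgraphs (and therefore onload passes) result and how many layers are resident at once. The function name and the 32-layer figure are assumptions for illustration, not LLM Compressor APIs or measurements.

```python
import math


def subgraph_tradeoff(num_layers: int, layers_per_subgraph: int) -> tuple[int, int]:
    """Return (number of subgraphs, max layers resident at once) for a toy model."""
    if num_layers < 1 or layers_per_subgraph < 1:
        raise ValueError("num_layers and layers_per_subgraph must be >= 1")
    # Each subgraph is onloaded once, so fewer subgraphs means fewer onload passes.
    num_subgraphs = math.ceil(num_layers / layers_per_subgraph)
    # A subgraph can never hold more layers than the model has.
    resident_layers = min(layers_per_subgraph, num_layers)
    return num_subgraphs, resident_layers


# Hypothetical 32-layer model: packing more layers per subgraph
# reduces the subgraph count but raises the number of resident layers.
for k in (1, 2, 3):
    print(k, subgraph_tradeoff(32, k))
```

Real memory use and runtime depend on layer sizes, calibration data, and hardware, so these counts only indicate the direction of the tradeoff.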

Currently, a user can only specify the sequential_targets argument. However, there are cases where a user will want to pack multiple sequential targets into a single subgraph (for example, onload two decoder layers per subgraph, rather than just one).

The logic for creating subgraphs from targets is implemented by the topological_partition function.

Requested Changes

  1. Design an interface that allows users to target multiple sequential targets per subgraph (for example, if your GPUs can fit 2-3 decoder layers)
  2. Modify topological_partition to allow multiple targets to be assigned to the same subgraph
  3. Test the feature. You should see higher VRAM usage and lower runtime
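One possible shape for the grouping step is sketched below. It is a minimal sketch and does not reflect the actual topological_partition, which works on a traced model graph. The helper `group_targets`, the parameter name `targets_per_subgraph`, and the module names are all hypothetical. It assumes the matched target modules are already listed in execution order.

```python
from typing import Sequence


def group_targets(targets: Sequence[str], targets_per_subgraph: int) -> list[list[str]]:
    """Pack consecutive targets into groups of at most `targets_per_subgraph`."""
    if targets_per_subgraph < 1:
        raise ValueError("targets_per_subgraph must be >= 1")
    # Slice the ordered target list into consecutive chunks; the last
    # chunk may be shorter when the count does not divide evenly.
    return [
        list(targets[start:start + targets_per_subgraph])
        for start in range(0, len(targets), targets_per_subgraph)
    ]


# Hypothetical module names for a 5-layer model, two decoder layers per subgraph.
layers = [f"model.layers.{i}" for i in range(5)]
groups = group_targets(layers, 2)
print(groups)
```

With `targets_per_subgraph=1` this reduces to one target per subgraph, matching the behavior the issue describes as current.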

Metadata

Labels

enhancement (New feature or request), good first issue (A good first issue for users wanting to contribute), tracing (Issues related to model tracing)
