Description
Background
LLM Compressor uses sequential onloading to onload subgraphs of a model one at a time. Which subgraphs are created, and how, is determined by the sequential_targets argument, either passed by the user or inferred from the model definition (typically one DecoderLayer).
The choice of sequential targets comes with tradeoffs. Larger sequential targets are more runtime efficient but use more memory; smaller sequential targets use less memory but are less runtime efficient.
Currently, a user can only specify the sequential_targets argument. However, there are cases where a user will want to pack multiple sequential targets into a single subgraph (for example, onload two decoder layers per subgraph, rather than just one).
The logic for creating subgraphs from targets is implemented by the topological_partition function.
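The partitioning behavior can be sketched roughly as follows. This is an illustrative toy model, not the actual topological_partition implementation (which operates on traced graph nodes): it treats the model as a flat list of module names and starts a new subgraph each time a sequential target is reached.

```python
def partition_sketch(modules, sequential_targets):
    """Toy sketch: group modules into subgraphs, where each
    sequential target begins a new subgraph."""
    subgraphs = []
    current = []
    for module in modules:
        # flush the current subgraph whenever a new target starts
        if module in sequential_targets and current:
            subgraphs.append(current)
            current = []
        current.append(module)
    if current:
        subgraphs.append(current)
    return subgraphs


modules = ["embed", "layer0", "layer1", "layer2", "lm_head"]
targets = {"layer0", "layer1", "layer2"}
subgraphs = partition_sketch(modules, targets)
# with one DecoderLayer as the target, each layer lands in its own subgraph
```

With the default of one DecoderLayer per target, every decoder layer becomes its own subgraph, so only one layer's weights need to be onloaded at a time.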
Requested Changes
- Design an interface that allows users to pack multiple sequential targets into a single subgraph (for example, if your GPUs can fit 2-3 decoder layers)
- Modify topological_partition to allow multiple targets to be assigned to the same subgraph
- Test the feature. You should see higher VRAM usage and lower runtime
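One possible shape for the interface, sketched on the same toy model as above: a hypothetical targets_per_subgraph argument (name is an assumption, not an existing llm-compressor parameter) that only closes a subgraph after every Nth target.

```python
def partition_sketch(modules, sequential_targets, targets_per_subgraph=1):
    """Toy sketch: close the current subgraph only after every
    `targets_per_subgraph` sequential targets (hypothetical argument)."""
    subgraphs, current, seen = [], [], 0
    for module in modules:
        if module in sequential_targets:
            # only start a new subgraph once the previous one has
            # accumulated `targets_per_subgraph` targets
            if current and seen % targets_per_subgraph == 0:
                subgraphs.append(current)
                current = []
            seen += 1
        current.append(module)
    if current:
        subgraphs.append(current)
    return subgraphs


modules = ["embed", "layer0", "layer1", "layer2", "layer3", "lm_head"]
targets = {"layer0", "layer1", "layer2", "layer3"}
subgraphs = partition_sketch(modules, targets, targets_per_subgraph=2)
# two decoder layers are onloaded per subgraph instead of one
```

With targets_per_subgraph=2, pairs of decoder layers share a subgraph, trading higher peak VRAM for fewer onload/offload cycles and lower runtime.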