Description
Background
MoE models require special logic to calibrate because most expert activations are hidden behind the MoE gating mechanism. Naively calibrating experts with the base Hugging Face model definition means that some experts will not receive enough samples to calibrate properly.
Another reason MoEs are difficult to calibrate is that the Hugging Face model definition sometimes fuses all experts into a single weight. That single weight may be too large to fit into one GPU's memory, which is a problem for memory-limited use cases.
The solution is to write specialized logic that replaces fused MoE expert modules with individual expert modules, which can then be calibrated and offloaded individually.
At a high level, the routing logic is converted as follows:
    if calibrate_all_experts:
        output += expert(x)[top_k_tokens] * weights[expert_index]
    else:
        output += expert(x[top_k_tokens]) * weights[expert_index]
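To make the conversion concrete, below is a minimal sketch of what a replaced, calibration-aware MoE block could look like. The class and attribute names (`CalibrationMoEBlock`, `calibrate_all_experts`, etc.) are illustrative assumptions, not the actual module definitions in this repository.

```python
import torch


class CalibrationMoEBlock(torch.nn.Module):
    """Illustrative replacement for a fused MoE block (not repo code).

    Each expert is a separate submodule, so it can be calibrated and
    offloaded individually.
    """

    def __init__(self, experts, gate, top_k, calibrate_all_experts=True):
        super().__init__()
        self.experts = torch.nn.ModuleList(experts)
        self.gate = gate
        self.top_k = top_k
        self.calibrate_all_experts = calibrate_all_experts

    def forward(self, x):
        # x: (num_tokens, hidden_dim)
        routing_weights = torch.softmax(self.gate(x), dim=-1)
        weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)

        output = torch.zeros_like(x)
        for expert_index, expert in enumerate(self.experts):
            # Tokens routed to this expert, and their routing weights
            routed = selected_experts == expert_index        # (tokens, top_k)
            token_mask = routed.any(dim=-1)                  # (tokens,)
            expert_weights = weights[routed].unsqueeze(-1)   # (routed_tokens, 1)

            if self.calibrate_all_experts:
                # Run the expert on *all* tokens so it sees every calibration
                # sample, but let only the routed tokens contribute to output.
                expert_out = expert(x)[token_mask]
            else:
                # Standard MoE behaviour: only routed tokens pass through.
                expert_out = expert(x[token_mask])

            output[token_mask] += expert_out * expert_weights
        return output
```

The key point is the `calibrate_all_experts` branch: every expert receives every calibration sample, while the numerical output remains equivalent to normal routing.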
Right now, the logic for determining when and how module replacements happen lives in modeling/prepare.py. However, this logic is split between `replace_modules_for_calibration` and `moe_calibration_context`, which makes it confusing and difficult for contributors to add new MoE model replacements.
NOTE: Some models, such as Llama4, can be loaded by vLLM in their replaced form and therefore do not need to be restored when `moe_calibration_context` exits.
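For orientation, the current flow looks roughly like the sketch below. The exact signature of `moe_calibration_context` and the surrounding helper names (`run_calibration`, `calibration_dataloader`) are assumptions based on this issue's description (today the context stack is passed in as an argument), not the real API.

```python
from contextlib import ExitStack

# Rough sketch of the current calibration flow; signatures are assumed.
with ExitStack() as stack:
    # Replaces fused MoE modules with per-expert modules for calibration
    moe_calibration_context(model, stack)
    run_calibration(model, calibration_dataloader)
# Exiting the stack restores the original module structure, except for
# models (e.g. Llama4) whose replaced form vLLM can load directly.
```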
Goals
- Make MoE model contribution as easy and standardized as possible
- Standardize on `moe_calibration_context` (remove/deprecate `replace_modules_for_calibration`); a rough sketch of the target interface follows this list
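One possible shape for the standardized entry point, shown only to make the goal concrete; the signature (in particular the absence of a context-stack argument) is hypothetical:

```python
# Hypothetical target interface after the refactor -- not existing API.
with moe_calibration_context(model):
    for batch in calibration_dataloader:
        model(**batch)
# The original structure is restored on exit, unless the replaced form
# is directly loadable by vLLM (e.g. Llama4), in which case restoration
# can be skipped.
```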
Suggested task list
- Remove/deprecate `replace_modules_for_calibration`
- Remove/deprecate `DatasetArgs.calibrate_moe_context` (it should always be on)
- Refactor `moe_calibration_context` so it no longer requires the context stack to be passed as an argument
- Create a standardized and simple interface that `moe_calibration_context` and the files in the modeling folder can use to easily contribute MoE contexts
- Add tests for changing and restoring the model structure inside and outside of the `moe_calibration_context` (you can skip downloading model weights for the tests, see these examples; a rough test sketch is included below)
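A structure test along the lines of the last task might look like the following. Building the model from a tiny config avoids downloading weights, the model id is just an example MoE checkpoint, and the `moe_calibration_context` signature used here is the hypothetical refactored one rather than the current API.

```python
import pytest
from transformers import AutoConfig, AutoModelForCausalLM


@pytest.mark.parametrize("model_id", ["Qwen/Qwen1.5-MoE-A2.7B"])  # example MoE model
def test_structure_is_replaced_and_restored(model_id):
    # Build the model from its config with random weights so the test
    # never downloads checkpoint files (only the small config is fetched).
    config = AutoConfig.from_pretrained(model_id)
    config.num_hidden_layers = 1  # shrink the model to keep the test cheap
    model = AutoModelForCausalLM.from_config(config)

    original = {name: type(module) for name, module in model.named_modules()}

    with moe_calibration_context(model):  # hypothetical refactored signature
        replaced = {name: type(module) for name, module in model.named_modules()}
        assert replaced != original  # MoE modules were swapped out

    restored = {name: type(module) for name, module in model.named_modules()}
    assert restored == original  # structure is restored after exit
```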