Description
Background
MoE models require special logic to calibrate because most expert activations are hidden behind the MoE gating mechanism. Naively calibrating experts with the base Hugging Face model definition means that some experts will not receive enough samples to calibrate properly.
Another reason MoEs are difficult to calibrate is that the Hugging Face model definition sometimes fuses all experts into a single weight. That single weight may be too large to fit into one GPU's memory, which is a problem for memory-limited use cases.
The solution is to write specialized logic that replaces fused MoE expert modules with individual expert modules, which can then be calibrated and offloaded individually.
At a high level, the routing logic is converted as follows:
    if calibrate_all_experts:
        output += expert(x)[top_k_tokens] * weights[expert_index]
    else:
        output += expert(x[top_k_tokens]) * weights[expert_index]
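To make the conversion concrete, below is a minimal sketch of what a replaced, calibration-aware MoE block could look like. The class and attribute names (`CalibrationMoEBlock`, `calibrate_all_experts`, etc.) are illustrative assumptions, not the actual module definitions in this repository.

```python
import torch


class CalibrationMoEBlock(torch.nn.Module):
    """Illustrative replacement for a fused MoE block (not repo code).

    Each expert is a separate submodule, so it can be calibrated and
    offloaded individually.
    """

    def __init__(self, experts, gate, top_k, calibrate_all_experts=True):
        super().__init__()
        self.experts = torch.nn.ModuleList(experts)
        self.gate = gate
        self.top_k = top_k
        self.calibrate_all_experts = calibrate_all_experts

    def forward(self, x):
        # x: (num_tokens, hidden_dim)
        routing_weights = torch.softmax(self.gate(x), dim=-1)
        weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)

        output = torch.zeros_like(x)
        for expert_index, expert in enumerate(self.experts):
            # Tokens routed to this expert, and their routing weights
            routed = selected_experts == expert_index        # (tokens, top_k)
            token_mask = routed.any(dim=-1)                  # (tokens,)
            expert_weights = weights[routed].unsqueeze(-1)   # (routed_tokens, 1)

            if self.calibrate_all_experts:
                # Run the expert on *all* tokens so it sees every calibration
                # sample, but let only the routed tokens contribute to output.
                expert_out = expert(x)[token_mask]
            else:
                # Standard MoE behaviour: only routed tokens pass through.
                expert_out = expert(x[token_mask])

            output[token_mask] += expert_out * expert_weights
        return output
```

The key point is the `calibrate_all_experts` branch: every expert receives every calibration sample, while the numerical output remains equivalent to normal routing.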
Right now, the logic for determining when and how module replacements happen lives in modeling/prepare.py. However, this logic is split between `replace_modules_for_calibration` and `moe_calibration_context`, which makes it confusing and difficult for contributors to add new MoE model replacements.
NOTE: Some models, such as Llama4, can be loaded by vLLM in their replaced form and therefore do not need to be restored when `moe_calibration_context` exits.
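For orientation, the current flow looks roughly like the sketch below. The exact signature of `moe_calibration_context` and the surrounding helper names (`run_calibration`, `calibration_dataloader`) are assumptions based on this issue's description (today the context stack is passed in as an argument), not the real API.

```python
from contextlib import ExitStack

# Rough sketch of the current calibration flow; signatures are assumed.
with ExitStack() as stack:
    # Replaces fused MoE modules with per-expert modules for calibration
    moe_calibration_context(model, stack)
    run_calibration(model, calibration_dataloader)
# Exiting the stack restores the original module structure, except for
# models (e.g. Llama4) whose replaced form vLLM can load directly.
```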
Goals
- Make MoE model contribution as easy and standardized as possible
- Standardize on `moe_calibration_context` (remove/deprecate `replace_modules_for_calibration`); a rough sketch of the target interface follows this list
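One possible shape for the standardized entry point, shown only to make the goal concrete; the signature (in particular the absence of a context-stack argument) is hypothetical:

```python
# Hypothetical target interface after the refactor -- not existing API.
with moe_calibration_context(model):
    for batch in calibration_dataloader:
        model(**batch)
# The original structure is restored on exit, unless the replaced form
# is directly loadable by vLLM (e.g. Llama4), in which case restoration
# can be skipped.
```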
Suggested task list
- Remove/deprecate `replace_modules_for_calibration`
- Remove/deprecate `DatasetArgs.calibrate_moe_context` (it should always be on)
- Refactor `moe_calibration_context` so it no longer requires the context stack to be passed as an argument
- Create a standardized and simple interface that `moe_calibration_context` and the files in the modeling folder can use to easily contribute MoE contexts
- Add tests for changing and restoring the model structure inside and outside of the `moe_calibration_context` (you can skip downloading model weights for the tests, see these examples; a rough test sketch is included below)
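A structure test along the lines of the last task might look like the following. Building the model from a tiny config avoids downloading weights, the model id is just an example MoE checkpoint, and the `moe_calibration_context` signature used here is the hypothetical refactored one rather than the current API.

```python
import pytest
from transformers import AutoConfig, AutoModelForCausalLM


@pytest.mark.parametrize("model_id", ["Qwen/Qwen1.5-MoE-A2.7B"])  # example MoE model
def test_structure_is_replaced_and_restored(model_id):
    # Build the model from its config with random weights so the test
    # never downloads checkpoint files (only the small config is fetched).
    config = AutoConfig.from_pretrained(model_id)
    config.num_hidden_layers = 1  # shrink the model to keep the test cheap
    model = AutoModelForCausalLM.from_config(config)

    original = {name: type(module) for name, module in model.named_modules()}

    with moe_calibration_context(model):  # hypothetical refactored signature
        replaced = {name: type(module) for name, module in model.named_modules()}
        assert replaced != original  # MoE modules were swapped out

    restored = {name: type(module) for name, module in model.named_modules()}
    assert restored == original  # structure is restored after exit
```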