Commit 215f767

MoE Merge Rework (#263)
Expands the script `mergekit-moe` to support two new output architectures, Deepseek MoE and Qwen 2 MoE. Both architectures include support for "shared" experts. Currently the script supports adding a single shared expert. The Deepseek architecture uses the shared experts ungated and unweighted, so you probably want to set the new `residual_scale` option on the shared expert to a relatively low value (think 0.1ish) to keep the model from being completely overcooked. Qwen 2 MoE has a gate parameter associated with the shared expert, so this is less necessary, but still advisable. Deepseek MoE supports either Llama- or Mistral-based models as inputs. Qwen 2 MoE supports Llama-, Mistral-, or Qwen2-based models. Addresses #117, #244, and #134.
1 parent 846eb3a commit 215f767

File tree

15 files changed: +1327 lines added, −492 lines removed


README.md

Lines changed: 8 additions & 3 deletions

````diff
@@ -10,8 +10,9 @@ Features:
 - Lazy loading of tensors for low memory use
 - Interpolated gradients for parameter values (inspired by Gryphe's [BlockMerge_Gradient](https://github.com/Gryphe/BlockMerge_Gradient) script)
 - Piecewise assembly of language models from layers ("Frankenmerging")
+- [Mixture of Experts merging](#mixture-of-experts-merging)
 
-🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see https://github.com/arcee-ai/mergekit/issues/207.
+🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see <https://github.com/arcee-ai/mergekit/issues/207>.
 
 🌐 GUI Launch Alert 🤗 - We are excited to announce the launch of a graphical user interface for mergekit in Hugging Face Spaces! This GUI simplifies the merging process, making it more accessible to a broader audience. Check it out and contribute at [Hugging Face Spaces - mergekit-community](https://huggingface.co/mergekit-community).
@@ -179,13 +180,17 @@ Parameters:
 
 Mergekit allows extracting PEFT-compatible low-rank approximations of finetuned models.
 
-### Usage:
+### Usage
 
 ```sh
 mergekit-extract-lora finetuned_model_id_or_path base_model_id_or_path output_path [--no-lazy-unpickle] --rank=desired_rank
 ```
 
-# Citation
+## Mixture of Experts merging
+
+The `mergekit-moe` script supports merging multiple dense models into a mixture of experts, either for direct use or for further training. For more details see the [`mergekit-moe` documentation](docs/moe.md).
+
+## Citation
 
 We now have a [paper](https://arxiv.org/abs/2403.13257) you can cite for the MergeKit library:
````

docs/moe.md

Lines changed: 82 additions & 5 deletions

````diff
@@ -1,6 +1,12 @@
 # mergekit-moe
 
-`mergekit-moe` is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models. `mergekit-moe` uses its own YML configuration syntax, which looks like so:
+`mergekit-moe` is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models.
+
+If using the `hidden` or `cheap_embed` gate mode, the output model will be usable without any further training. If you are initializing a model to do further training on, such as for sparse upcycling, then use the `random` gate mode to get a model ready for training.
+
+## Configuration
+
+`mergekit-moe` uses its own YML configuration syntax, which looks like so:
 
 ```yml
 base_model: path/to/self_attn_donor
@@ -21,18 +27,89 @@ experts:
 
 The script takes two arguments, an input config and an output path: `mergekit-moe ./config.yml ./my-clowncar-moe-12x180B`
 
-## Gate Modes
+Currently the script can output models that use the Mixtral, Deepseek MoE, or Qwen MoE architectures. Some output architectures support a shared expert which will be activated for all tokens, which can be configured like this:
+
+```yml
+base_model: path/to/self_attn_donor
+gate_mode: hidden # one of "hidden", "cheap_embed", or "random"
+dtype: bfloat16 # output dtype (float32, float16, or bfloat16)
+experts:
+  ...
+shared_experts:
+  - source_model: model_name
+    positive_prompts: # required by Qwen MoE for "hidden" gate mode, otherwise not allowed
+      - "blah blah"
+    # (optional, but recommended:)
+    residual_scale: 0.1 # downweight output from shared expert to prevent overcooking the model
+```
+
+Currently only up to one shared expert is supported.
+
+An appropriate architecture will be inferred based on the input models and presence or absence of shared experts in your configuration. Alternatively, you can explicitly specify an output architecture by setting the `architecture:` field in your config. For example:
+
+```yml
+base_model: path/to/self_attn_donor
+architecture: qwen
+# ... and so on
+```
+
+### Gate Modes
 
 There are three methods for populating the MoE gates implemented.
 
-### "hidden"
+#### "hidden"
 
 Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model so you might not be able to use this on constrained hardware (depending on the model). You can use `--load-in-8bit` or `--load-in-4bit` to reduce VRAM usage.
 
-### "cheap_embed"
+#### "cheap_embed"
 
 Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower end hardware.
 
-### "random"
+#### "random"
 
 Randomly initializes the MoE gates. Good for if you are going to fine tune the model afterwards, or maybe if you want something a little unhinged? I won't judge.
+
+## Example Configurations
+
+Sparse upcycling of smol_llama into a 8x220M MoE:
+
+```yml
+base_model: BEE-spoke-data/smol_llama-220M-GQA
+gate_mode: random
+dtype: bfloat16
+experts:
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+  - source_model: BEE-spoke-data/smol_llama-220M-GQA
+# and then train the sucker!
+```
+
+Shove some Mistral models in a clown car:
+
+```yml
+base_model: NousResearch/Hermes-2-Pro-Mistral-7B
+gate_mode: hidden
+dtype: bfloat16
+experts:
+  - source_model: NousResearch/Hermes-2-Pro-Mistral-7B
+    positive_prompts:
+      - "<|im_start|>user\nHello, who are you?<|im_end|>"
+      - "<|im_start|>user\nI need help with"
+  - source_model: BioMistral/BioMistral-7B-DARE
+    positive_prompts:
+      - "As a doctor of medicine,"
+  - source_model: PocketDoc/Dans-AdventurousWinds-7b
+    positive_prompts:
+      - "[Genres: Science Fiction]\n[Tags: humor, old school, sci fi]"
+      - "> get ye flask"
+      - "[Mode: Interactive Storyteller]"
+  - source_model: VAGOsolutions/SauerkrautLM-7b-HerO
+    positive_prompts:
+      - "<|im_start|>user\nWie geht es dir?<|im_end|>"
+      - "Das ist ein Satz auf Deutsch."
+```
````
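
To make the `hidden` gate mode above concrete: each expert's router row at every layer is derived from hidden-state representations of that expert's positive prompts. The sketch below captures the idea only, not mergekit's actual implementation; the model id and prompts are placeholders borrowed from the examples above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BEE-spoke-data/smol_llama-220M-GQA"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# One list of positive prompts per expert (placeholders).
expert_prompts = [
    ["<|im_start|>user\nHello, who are you?<|im_end|>"],
    ["As a doctor of medicine,"],
]

num_layers = model.config.num_hidden_layers
per_expert = []  # each entry: (num_layers, hidden_size) prompt representation
with torch.no_grad():
    for prompts in expert_prompts:
        reps = []
        for prompt in prompts:
            out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
            # hidden_states[0] is the embedding layer; take the last token's
            # state at each transformer layer.
            reps.append(torch.stack([h[0, -1, :] for h in out.hidden_states[1:]]))
        per_expert.append(torch.stack(reps).mean(dim=0))

# Router weight for each layer: one row per expert.
router_weights = [
    torch.stack([rep[layer] for rep in per_expert]) for layer in range(num_layers)
]
```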

mergekit/architecture.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -350,6 +350,7 @@ def _load_all_architectures() -> (
 
 JSON_ARCHITECTURES, NAME_TO_ARCH = _load_all_architectures()
 MISTRAL_INFO = _load_json_arch("mistral.json")
+QWEN2_INFO = _load_json_arch("qwen2.json")
 
 
 def get_architecture_info(config: PretrainedConfig) -> ArchitectureInfo:
```
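
For context, a minimal sketch of where the new `QWEN2_INFO` entry fits: `get_architecture_info` maps a Hugging Face `PretrainedConfig` onto mergekit's architecture metadata, and with `qwen2.json` registered a Qwen2 config is expected to resolve. The model id is only an illustration.

```python
from transformers import AutoConfig

from mergekit.architecture import get_architecture_info

# Any Qwen2-family checkpoint id works here; this one is just an example.
cfg = AutoConfig.from_pretrained("Qwen/Qwen1.5-0.5B")
info = get_architecture_info(cfg)  # expected to resolve via the qwen2.json definition
print(info)
```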

mergekit/common.py

Lines changed: 4 additions & 1 deletion

```diff
@@ -184,7 +184,10 @@ def __str__(self) -> str:
         return str(self.model)
 
 
-def dtype_from_name(name: Optional[str]) -> torch.dtype:
+def dtype_from_name(name: Optional[str]) -> Optional[torch.dtype]:
+    if not name:
+        return None
+
     if name.startswith("torch."):
         name = name[len("torch.") :]
 
```
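
A quick illustration of the changed contract; the positive cases assume the rest of the function keeps its usual name-to-`torch.dtype` lookup, which this hunk does not show.

```python
import torch

from mergekit.common import dtype_from_name

# Existing behavior: dtype names resolve with or without the "torch." prefix.
assert dtype_from_name("bfloat16") is torch.bfloat16
assert dtype_from_name("torch.float16") is torch.float16

# New behavior: a missing or empty name returns None instead of erroring,
# so callers (e.g. select_dtype further down) can fall back to the base
# model's torch_dtype.
assert dtype_from_name(None) is None
assert dtype_from_name("") is None
```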

mergekit/moe/__init__.py

Lines changed: 19 additions & 0 deletions

```python
from typing import List

from mergekit.moe.arch import MoEOutputArchitecture
from mergekit.moe.deepseek import DeepseekMoE
from mergekit.moe.mixtral import MixtralMoE

ALL_OUTPUT_ARCHITECTURES: List[MoEOutputArchitecture] = [MixtralMoE(), DeepseekMoE()]

try:
    from mergekit.moe.qwen import QwenMoE
except ImportError:
    pass
else:
    ALL_OUTPUT_ARCHITECTURES.append(QwenMoE())

__all__ = [
    "ALL_OUTPUT_ARCHITECTURES",
    "MoEOutputArchitecture",
]
```
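
A sketch of how this registry might be consumed to choose an output architecture for a merge config. The helper below is illustrative rather than part of the commit, and leans on the `supports_config` interface defined in `mergekit/moe/arch.py` below.

```python
from typing import Optional

from mergekit.moe import ALL_OUTPUT_ARCHITECTURES, MoEOutputArchitecture
from mergekit.moe.config import MoEMergeConfig


def pick_output_architecture(
    config: MoEMergeConfig, trust_remote_code: bool = False
) -> Optional[MoEOutputArchitecture]:
    """Return the first registered architecture that accepts this config, if any."""
    for arch in ALL_OUTPUT_ARCHITECTURES:
        if arch.supports_config(
            config, explain=True, trust_remote_code=trust_remote_code
        ):
            return arch
    # No compatible architecture; each candidate has already logged why it refused.
    return None
```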

mergekit/moe/arch.py

Lines changed: 53 additions & 0 deletions

```python
# Copyright (C) 2024 Charles O. Goddard
#
# This software is free software: you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.

from abc import ABC, abstractmethod
from typing import List, Optional

import torch

from mergekit.moe.config import MoEMergeConfig
from mergekit.options import MergeOptions


class MoEOutputArchitecture(ABC):
    @abstractmethod
    def name(self) -> str:
        """Return a human-readable name for the architecture."""
        pass

    @abstractmethod
    def supports_config(
        self,
        config: MoEMergeConfig,
        explain: bool = False,
        trust_remote_code: bool = False,
    ) -> bool:
        """Return whether this architecture supports the given config.

        If `explain` is True, log an explanation of why the config is not supported."""
        pass

    @abstractmethod
    def write_model(
        self,
        out_path: str,
        config: MoEMergeConfig,
        merge_options: MergeOptions,
        router_weights: List[torch.Tensor],
        shared_router_weights: Optional[List[torch.Tensor]] = None,
    ):
        """Write the config and tensors for the output MoE to the given path."""
        pass
```
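
To show what the interface above demands of a concrete backend, here is a minimal, purely illustrative skeleton; the class and its behavior are placeholders, not part of this commit.

```python
import logging
from typing import List, Optional

import torch

from mergekit.moe.arch import MoEOutputArchitecture
from mergekit.moe.config import MoEMergeConfig
from mergekit.options import MergeOptions


class NullMoE(MoEOutputArchitecture):
    """Placeholder architecture that accepts nothing; illustrates the interface only."""

    def name(self) -> str:
        return "Null MoE (example)"

    def supports_config(
        self,
        config: MoEMergeConfig,
        explain: bool = False,
        trust_remote_code: bool = False,
    ) -> bool:
        if explain:
            logging.warning("NullMoE is an example and supports no configuration")
        return False

    def write_model(
        self,
        out_path: str,
        config: MoEMergeConfig,
        merge_options: MergeOptions,
        router_weights: List[torch.Tensor],
        shared_router_weights: Optional[List[torch.Tensor]] = None,
    ):
        raise NotImplementedError("example only")
```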

mergekit/moe/common.py

Lines changed: 75 additions & 0 deletions

```python
# Copyright (C) 2024 Charles O. Goddard
#
# This software is free software: you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program. If not, see http://www.gnu.org/licenses/.

from typing import Dict, Optional

import torch
import tqdm
import transformers

from mergekit.common import ModelReference, dtype_from_name
from mergekit.io import LazyTensorLoader, TensorWriter
from mergekit.merge import MergeOptions
from mergekit.moe.config import Expert, MoEMergeConfig


def initialize_io(
    config: MoEMergeConfig,
    out_path: str,
    merge_options: MergeOptions,
) -> tuple[Dict[ModelReference, LazyTensorLoader], LazyTensorLoader, TensorWriter]:
    base_model = config.base_model
    loaders: Dict[ModelReference, LazyTensorLoader] = {}
    for model in tqdm.tqdm(
        [base_model] + [e.source_model for e in config.experts], desc="Warm up loaders"
    ):
        loaders[model] = model.lazy_loader(
            cache_dir=merge_options.transformers_cache,
            lazy_unpickle=merge_options.lazy_unpickle,
        )

    base_loader = loaders.get(base_model)
    writer = TensorWriter(
        out_path=out_path,
        max_shard_size=merge_options.out_shard_size,
        safe_serialization=merge_options.safe_serialization,
    )

    return loaders, base_loader, writer


def select_dtype(
    config: MoEMergeConfig, base_cfg: transformers.PretrainedConfig
) -> Optional[torch.dtype]:
    out_dtype = None
    if config.dtype:
        out_dtype = dtype_from_name(config.dtype)

    if out_dtype is None and base_cfg.torch_dtype:
        out_dtype = base_cfg.torch_dtype
        if isinstance(out_dtype, str):
            out_dtype = dtype_from_name(out_dtype)
    return out_dtype


def noise_and_scale(
    tensor: torch.Tensor, expert: Expert, is_residual: bool = False
) -> torch.Tensor:
    if expert.noise_scale is not None:
        noise = torch.randn_like(tensor) * expert.noise_scale
        tensor = tensor + noise
    if is_residual and expert.residual_scale is not None:
        tensor = tensor * expert.residual_scale
    return tensor
```
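
A small illustration of `noise_and_scale` in isolation. It uses a stand-in namespace object instead of the real `Expert` config entry (which this page does not show), since the function only reads the `noise_scale` and `residual_scale` fields.

```python
from types import SimpleNamespace

import torch

from mergekit.moe.common import noise_and_scale

# Stand-in for an Expert entry: the only two fields noise_and_scale reads.
expert = SimpleNamespace(noise_scale=0.01, residual_scale=0.1)

w = torch.randn(16, 16)
perturbed = noise_and_scale(w, expert)                     # adds Gaussian noise with std 0.01
shared_out = noise_and_scale(w, expert, is_residual=True)  # noise, then scaled by 0.1
```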
