Expands the script `mergekit-moe` to support two new output
architectures, Deepseek MoE and Qwen 2 MoE.
Both architectures include support for "shared" experts. Currently the script supports adding a single shared expert. The Deepseek architecture uses its shared experts ungated and unweighted, so you probably want to set the new `residual_scale` option on the shared expert to a relatively low value (think 0.1ish) to keep the model from being completely overcooked. Qwen 2 MoE has a gate parameter associated with the shared expert, so this is less necessary there, but it is still advisable.
Deepseek MoE supports either Llama or Mistral based models as inputs.
Qwen 2 MoE supports Llama, Mistral, or Qwen2 based models.
Addresses #117, #244, and #134.
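For example, a shared expert is declared with the new `shared_experts` section, and its output can be scaled down with `residual_scale`. A minimal sketch (`model_name` is a placeholder; the full syntax is in the updated `docs/moe.md` below):

```yml
shared_experts:
  - source_model: model_name   # placeholder for the model providing the always-on shared expert
    residual_scale: 0.1        # scale down the shared expert's output (especially important for Deepseek MoE)
```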
**README.md** (8 additions, 3 deletions)
Features:

- Lazy loading of tensors for low memory use
- Interpolated gradients for parameter values (inspired by Gryphe's [BlockMerge_Gradient](https://github.com/Gryphe/BlockMerge_Gradient) script)
- Piecewise assembly of language models from layers ("Frankenmerging")
- [Mixture of Experts merging](#mixture-of-experts-merging)

🔊 Call to Evolve - to solve evolutionary merge methods as a community - please see <https://github.com/arcee-ai/mergekit/issues/207>.

🌐 GUI Launch Alert 🤗 - We are excited to announce the launch of a graphical user interface for mergekit in Hugging Face Spaces! This GUI simplifies the merging process, making it more accessible to a broader audience. Check it out and contribute at [Hugging Face Spaces - mergekit-community](https://huggingface.co/mergekit-community).
Mergekit allows extracting PEFT-compatible low-rank approximations of finetuned models.

## Mixture of Experts merging

The `mergekit-moe` script supports merging multiple dense models into a mixture of experts, either for direct use or for further training. For more details see the [`mergekit-moe` documentation](docs/moe.md).

## Citation

We now have a [paper](https://arxiv.org/abs/2403.13257) you can cite for the MergeKit library:
**docs/moe.md** (82 additions, 5 deletions)
# mergekit-moe

`mergekit-moe` is a script for combining Mistral or Llama models of the same size into Mixtral Mixture of Experts models. The script will combine the self-attention and layer normalization parameters from a "base" model with the MLP parameters from a set of "expert" models.

If using the `hidden` or `cheap_embed` gate mode, the output model will be usable without any further training. If you are initializing a model to do further training on, such as for sparse upcycling, then use the `random` gate mode to get a model ready for training.

## Configuration

`mergekit-moe` uses its own YML configuration syntax, which looks like so:

```yml
base_model: path/to/self_attn_donor
# ...
```
The script takes two arguments, an input config and an output path: `mergekit-moe ./config.yml ./my-clowncar-moe-12x180B`

Currently the script can output models that use the Mixtral, Deepseek MoE, or Qwen MoE architectures. Some output architectures support a shared expert which will be activated for all tokens, which can be configured like this:

```yml
base_model: path/to/self_attn_donor
gate_mode: hidden # one of "hidden", "cheap_embed", or "random"
dtype: bfloat16 # output dtype (float32, float16, or bfloat16)
experts:
  ...
shared_experts:
  - source_model: model_name
    positive_prompts: # required by Qwen MoE for "hidden" gate mode, otherwise not allowed
      - "blah blah"
    # (optional, but recommended:)
    residual_scale: 0.1 # downweight output from shared expert to prevent overcooking the model
```

Currently only up to one shared expert is supported.

An appropriate architecture will be inferred based on the input models and presence or absence of shared experts in your configuration. Alternatively, you can explicitly specify an output architecture by setting the `architecture:` field in your config. For example:

```yml
base_model: path/to/self_attn_donor
architecture: qwen
# ... and so on
```

### Gate Modes

There are three methods for populating the MoE gates implemented.

#### "hidden"

Uses the hidden state representations of the positive/negative prompts for MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model so you might not be able to use this on constrained hardware (depending on the model). You can use `--load-in-8bit` or `--load-in-4bit` to reduce VRAM usage.

#### "cheap_embed"

Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower end hardware.

#### "random"

Randomly initializes the MoE gates. Good for if you are going to fine tune the model afterwards, or maybe if you want something a little unhinged? I won't judge.
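Putting the new options together, here is a minimal sketch of a configuration targeting the Qwen 2 MoE output architecture with two routed experts and one shared expert; all model paths and prompts below are placeholders:

```yml
base_model: path/to/self_attn_donor
architecture: qwen            # explicit output architecture; omit to let the script infer it
gate_mode: hidden             # one of "hidden", "cheap_embed", or "random"
dtype: bfloat16
experts:
  - source_model: path/to/expert_code       # placeholder routed expert
    positive_prompts:
      - "Write a Python function that"
  - source_model: path/to/expert_chat       # placeholder routed expert
    positive_prompts:
      - "Tell me about"
shared_experts:
  - source_model: path/to/generalist        # placeholder shared (always-on) expert
    positive_prompts:                       # required by Qwen MoE for "hidden" gate mode
      - "Help me with"
    residual_scale: 0.1                     # recommended: downweight the always-on expert
```

Run it the same way as any other config: `mergekit-moe ./config.yml ./output-moe`.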