Commit 5a2bfe2

feat: add model architecture configuration documentation
This commit introduces a new document detailing the model architecture configuration, including support for transformer architectures and their components. It outlines terminology, properties, and provides an example configuration for better understanding and implementation.

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
1 parent d962898 commit 5a2bfe2

File tree

2 files changed: +363 −0 lines changed


docs/architecture.md

Lines changed: 359 additions & 0 deletions
@@ -0,0 +1,359 @@
# Model Architecture Configuration

Each model artifact has an associated optional architecture configuration that describes the detailed structure and components of the model. Currently, only decoder-type transformer architectures are supported. Future extensions will include:

- Multi-modal language models
- State Space Models
- Diffusion Models

## Terminology

The transformer is the most popular architecture for LLMs. It consists of a stack of structured layers, where each layer contains a self-attention block and a feed-forward network, with normalization layers and residual connections. The complete architecture includes a tokenizer, input embedding layer, position embedding layer, transformer layers, and output embedding layer. The transformer architecture has remained relatively stable since [Attention Is All You Need][attention-paper]. As shown in the table below, current open-weight model architectures are converging, making it feasible to define a common abstraction.

| Model                        | Tokenizer | PE         | Self-Attention | Norm       | Feed-Forward | Residual |
|------------------------------|-----------|------------|----------------|------------|--------------|----------|
| [GPT2][gpt2-repo]            | BPE       | Sinusoidal | MHA            | Layer Norm | MLP          | Yes      |
| [Llama3][llama3-paper]       | BPE       | RoPE       | GQA            | RMS Norm   | MLP          | Yes      |
| [Qwen2][qwen2-paper]         | BPE       | RoPE       | GQA            | RMS Norm   | MoE          | Yes      |
| [Gemma2][gemma2-paper]       | BPE       | RoPE       | GQA            | RMS Norm   | MLP          | Yes      |
| [Mixtral][mixtral-paper]     | BPE       | RoPE       | SWA            | RMS Norm   | MoE          | Yes      |
| [DeepseekV2][deepseek-paper] | BPE       | RoPE       | MLA            | RMS Norm   | MoE          | Yes      |

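To make the layer structure above concrete, the following is a minimal, non-normative sketch of a pre-norm decoder layer in PyTorch-style Python. The attention and feed-forward modules are placeholders for an implementation's own MHA/GQA/MLA and MLP/MoE blocks; none of the names below are part of this specification.

```python
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    """One pre-norm decoder layer: norm -> self-attention -> residual,
    then norm -> feed-forward -> residual."""

    def __init__(self, hidden_size: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        # LayerNorm is used here for simplicity; RMSNorm is the common choice
        # in the models listed above (see the `normalization` property below).
        self.attention_norm = nn.LayerNorm(hidden_size)
        self.attention = attention          # MHA / GQA / MLA block (placeholder)
        self.ffn_norm = nn.LayerNorm(hidden_size)
        self.feed_forward = feed_forward    # MLP or MoE block (placeholder)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Self-attention block with a residual connection.
        hidden_states = hidden_states + self.attention(self.attention_norm(hidden_states))
        # Feed-forward block with a residual connection.
        hidden_states = hidden_states + self.feed_forward(self.ffn_norm(hidden_states))
        return hidden_states
```

The full model wraps a stack of such layers between the token and position embeddings and the output embedding (prediction head).
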
## Properties

- **transformer** _object_, REQUIRED

  Contains the transformer configuration parameters.

  - **architecture_version** _string_, REQUIRED

    The version of the transformer architecture configuration, using semantic versioning. An independent version is required for future extensibility.

  - **type** _string_, REQUIRED

    The type of transformer architecture. Currently supported: `decoder`. The default is `decoder`.

  - **vocabulary_size** _uint64_, REQUIRED

    Vocabulary size of the model.

  - **hidden_size** _uint64_, REQUIRED

    The hidden size of the model.

  - **tokenizer** _object_, REQUIRED

    Contains the tokenizer configuration parameters.

    - **type** _string_, REQUIRED

      Tokenizer type. Currently supported: `bpe`. The default is `bpe`.

    - **library** _string_, REQUIRED

      The name or URL of the tokenizer library. Currently supported: `huggingface`. The default is `huggingface`.

    - **revision** _string_, OPTIONAL

      Revision of the tokenizer library. It can be a branch name, tag name, or commit ID; `main` refers to the latest version. The default is `main`.

  - **token_embedding** _object_, REQUIRED

    Contains the token embedding configuration parameters.

    - **has_bias** _boolean_, REQUIRED

      Whether the embedding has a bias. The default is `false`.

    - **has_norm** _boolean_, REQUIRED

      Whether the embedding has a normalization layer. The default is `true`. The normalization configuration is defined in the `normalization` property.

    - **shared_embedding** _boolean_, REQUIRED

      Whether the embedding is shared with the model prediction head. The default is `false`.

  - **position_embedding** _object_, REQUIRED

    Contains the position embedding configuration parameters.

    - **type** _string_, REQUIRED

      Position embedding type. Currently supported: `rope`. The default is `rope`. For more details, see [RoPE][rope-paper] and its [PyTorch implementation][rope-pytorch].

    - **max_position_embeddings** _uint64_, REQUIRED

      The maximum number of position embeddings. The default is `1024`.

    - **rope_theta** _float_, REQUIRED

      The theta parameter in the RoPE position embedding. The default is `10000`.

    - **rope_scaling** _object_, OPTIONAL

      The scaling configuration for the RoPE embeddings. The default is `null`.

  - **transformer_layer** _object_, REQUIRED

    Contains the transformer layer configuration parameters. Either `uniform_layers` or `mixed_layers` must be specified.

    - **uniform_layers** _object_, OPTIONAL

      Configuration for uniform layers, where all layers have an identical structure.

      - **num_layers** _uint64_, REQUIRED

        Number of transformer layers. The default is `0`.

      - **attention** _object_, REQUIRED

        Contains the attention configuration parameters.

        - **type** _string_, REQUIRED

          Attention mechanism type. Currently supported: [MHA][mha-paper], [GQA][gqa-paper], [MLA][mla-paper]. The default is `mha`.

        - **is_causal** _boolean_, REQUIRED

          Whether the attention is causal. The default is `true`.

        - **is_qkv_merged** _boolean_, REQUIRED

          Whether the QKV projection is merged. The default is `false`.

        - **num_attention_heads** _uint64_, REQUIRED

          Number of attention heads. The default is `0`.

        - **num_key_value_heads** _uint64_, REQUIRED

          Number of key-value heads. The default is `0`.

        - **head_dim** _uint64_, REQUIRED

          The attention head dimension. If `0`, it defaults to `hidden_size / num_attention_heads` (for example, `4096 / 32 = 128`). The default is `0`.

        - **has_residual** _boolean_, REQUIRED

          Whether the attention has a residual connection. The default is `true`.

        - **has_qkv_bias** _boolean_, REQUIRED

          Whether the QKV projection has a bias. The default is `false`.

        - **has_output_bias** _boolean_, REQUIRED

          Whether the output projection has a bias. The default is `false`.

        - **has_pre_norm** _boolean_, REQUIRED

          Whether the attention has a pre-normalization. The default is `false`.

        - **has_post_norm** _boolean_, REQUIRED

          Whether the attention has a post-normalization. The default is `false`.

      - **mlp** _object_, OPTIONAL

        MLP configuration parameters. Either `mlp` or `moe` must be specified.

        - **intermediate_size** _uint64_, REQUIRED

          The size of the intermediate layer. The default is `0`.

        - **activation** _string_, REQUIRED

          The activation function. The default is `gelu`.

        - **use_gated_activation** _boolean_, REQUIRED

          Whether to use gated activation. The default is `true`.

        - **has_residual** _boolean_, REQUIRED

          Whether the MLP has a residual connection. The default is `true`.

        - **has_bias** _boolean_, REQUIRED

          Whether the MLP has a bias. The default is `false`.

        - **has_pre_norm** _boolean_, REQUIRED

          Whether the MLP has a pre-normalization. The default is `false`.

        - **has_post_norm** _boolean_, REQUIRED

          Whether the MLP has a post-normalization. The default is `false`.

        - **is_mlp_merged** _boolean_, REQUIRED

          Whether the MLP projection is merged. The default is `false`.

      - **moe** _object_, OPTIONAL

        MoE configuration parameters. Either `mlp` or `moe` must be specified. A routing sketch illustrating these fields follows this properties list.

        - **has_bias** _boolean_, REQUIRED

          Whether the MoE has a bias. The default is `false`.

        - **activation** _string_, REQUIRED

          The activation function. The default is `gelu`.

        - **use_gated_activation** _boolean_, REQUIRED

          Whether to use gated activation. The default is `true`.

        - **num_experts** _uint64_, REQUIRED

          Number of experts. The default is `0`.

        - **moe_intermediate_size** _uint64_, REQUIRED

          The size of the intermediate layer of the routed expert. The default is `0`.

        - **num_shared_experts** _uint64_, REQUIRED

          Number of shared experts. The default is `0`.

        - **shared_expert_intermediate_size** _uint64_, REQUIRED

          The size of the intermediate layer of the shared expert. The default is `0`.

        - **top_k** _uint64_, REQUIRED

          The number of top-k experts to be used. The default is `0`.

        - **scoring_function** _string_, REQUIRED

          Method of computing expert weights. The default is `softmax`.

        - **norm_topk_prob** _boolean_, REQUIRED

          Whether to normalize the top-k probabilities. The default is `false`.

    - **mixed_layers** _object_, OPTIONAL

      Configuration for mixed layers, where layers have different structures.

      - **num_layers** _uint64_, REQUIRED

        Number of transformer layers. The default is `0`.

      - **mlp_layers** _array_, REQUIRED

        Layers that use MLP. If empty, `moe_frequency` determines sparsity. The default is `[]`.

      - **pre_norm_layers** _array_, OPTIONAL

        Layers that use pre-normalization. The default is `[]`.

      - **post_norm_layers** _array_, OPTIONAL

        Layers that use post-normalization. The default is `[]`.

      - **moe_frequency** _uint64_, REQUIRED

        Frequency of the MoE layer. The default is `0`.

      - **attention** _object_, REQUIRED

        Attention parameters (same structure as in `uniform_layers`).

      - **mlp** _object_, OPTIONAL

        MLP parameters (same structure as in `uniform_layers`).

      - **moe** _object_, OPTIONAL

        MoE parameters (same structure as in `uniform_layers`).

  - **normalization** _object_, REQUIRED

    Contains the normalization configuration parameters.

    - **type** _string_, REQUIRED

      Normalization type. Supported: [`RMSNorm`][rmsnorm-paper], [`LayerNorm`][layernorm-paper]. The default is `rmsnorm`.

    - **epsilon** _float_, REQUIRED

      Epsilon for the normalization. The default is `1e-5`.

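The MoE fields above (`num_experts`, `top_k`, `scoring_function`, `norm_topk_prob`) can be read together as a routing rule. The sketch below is a minimal, non-normative Python illustration of top-k softmax routing under those fields; the function name and tensor shapes are assumptions for illustration, not part of the specification.

```python
import torch
import torch.nn.functional as F


def route_tokens(router_logits: torch.Tensor,
                 top_k: int,
                 scoring_function: str = "softmax",
                 norm_topk_prob: bool = False):
    """Select top_k experts per token and compute their mixing weights.

    router_logits: (num_tokens, num_experts), produced by the router projection.
    Returns the per-token expert weights and expert indices.
    """
    if scoring_function == "softmax":
        scores = F.softmax(router_logits, dim=-1)
    else:
        raise ValueError(f"unsupported scoring_function: {scoring_function}")

    # Keep the top_k highest-scoring experts for every token.
    topk_scores, topk_indices = torch.topk(scores, k=top_k, dim=-1)

    # Optionally renormalize so the selected weights sum to 1 per token.
    if norm_topk_prob:
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

    return topk_scores, topk_indices
```

When `num_shared_experts` is greater than `0`, the shared experts are typically applied to every token in addition to the routed experts selected here.
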
## Example

Here is an example transformer architecture configuration:

```json,title=Transformer%20Architecture%20Configuration&mediatype=application/vnd.cncf.model.architecture.v1%2Bjson
{
  "transformer": {
    "vocabulary_size": 32000,
    "hidden_size": 4096,
    "tokenizer": {
      "type": "bpe",
      "library": "huggingface",
      "revision": "main"
    },
    "token_embedding": {
      "has_bias": false,
      "has_norm": true,
      "shared_embedding": false
    },
    "position_embedding": {
      "type": "rope",
      "max_position_embeddings": 2048,
      "rope_theta": 10000.0,
      "rope_scaling": null
    },
    "transformer_layer": {
      "uniform_layers": {
        "num_layers": 32,
        "attention": {
          "type": "gqa",
          "is_causal": true,
          "is_qkv_merged": false,
          "num_attention_heads": 32,
          "num_key_value_heads": 8,
          "head_dim": 128,
          "has_residual": true,
          "has_qkv_bias": false,
          "has_output_bias": false,
          "has_pre_norm": true,
          "has_post_norm": false
        },
        "mlp": {
          "intermediate_size": 11008,
          "activation": "silu",
          "use_gated_activation": true,
          "has_residual": true,
          "has_bias": false,
          "has_pre_norm": false,
          "has_post_norm": true,
          "is_mlp_merged": false
        }
      }
    },
    "normalization": {
      "type": "rmsnorm",
      "epsilon": 1e-5
    }
  }
}
```

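As a usage illustration only, the sketch below shows how a hypothetical loader might apply the rules from the Properties section (the `head_dim` fallback and the choice between `uniform_layers`/`mixed_layers` and `mlp`/`moe`) to a configuration like the one above. The function name and the `architecture.json` file are assumptions for illustration, not part of the specification.

```python
import json


def resolve_layer_config(config: dict) -> dict:
    """Apply the defaulting rules from the Properties section to a parsed
    architecture configuration (illustrative only, not a reference implementation)."""
    transformer = config["transformer"]
    layer_cfg = transformer["transformer_layer"]

    # Either uniform_layers or mixed_layers must be specified.
    if "uniform_layers" not in layer_cfg and "mixed_layers" not in layer_cfg:
        raise ValueError("transformer_layer needs uniform_layers or mixed_layers")
    layers = layer_cfg["uniform_layers"] if "uniform_layers" in layer_cfg else layer_cfg["mixed_layers"]

    # Either mlp or moe must be specified for the feed-forward block.
    if "mlp" not in layers and "moe" not in layers:
        raise ValueError("layer configuration needs mlp or moe")

    # head_dim of 0 falls back to hidden_size / num_attention_heads.
    attention = layers["attention"]
    if attention.get("head_dim", 0) == 0:
        attention["head_dim"] = transformer["hidden_size"] // attention["num_attention_heads"]

    return config


# Hypothetical usage: the example above stored in architecture.json.
# Its explicit head_dim of 128 already equals 4096 / 32.
with open("architecture.json") as f:
    resolved = resolve_layer_config(json.load(f))
```
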
[attention-paper]: https://arxiv.org/abs/1706.03762
[gpt2-repo]: https://github.com/openai/gpt-2
[llama3-paper]: https://arxiv.org/abs/2407.21783
[qwen2-paper]: https://arxiv.org/abs/2407.10671
[gemma2-paper]: https://arxiv.org/abs/2408.00118
[mixtral-paper]: https://arxiv.org/abs/2401.04088
[deepseek-paper]: https://arxiv.org/abs/2405.04434
[rope-paper]: https://arxiv.org/abs/2104.09864
[rope-pytorch]: https://pytorch.org/torchtune/stable/generated/torchtune.modules.RotaryPositionalEmbeddings.html
[mha-paper]: https://arxiv.org/abs/1706.03762
[gqa-paper]: https://arxiv.org/abs/2305.13245v3
[mla-paper]: https://arxiv.org/abs/2412.19437
[rmsnorm-paper]: https://arxiv.org/abs/1910.07467
[layernorm-paper]: https://arxiv.org/abs/1607.06450

docs/config.md

Lines changed: 4 additions & 0 deletions
@@ -75,6 +75,10 @@ The following terms are used in this section:

  The architecture of the model, such as "transformer", "cnn", or "rnn".

- **architecture_config** _object_, OPTIONAL

  The configuration of the architecture. The details are defined in the [Architecture](./architecture.md) file.

- **format** _string_, OPTIONAL

  The format for the model, such as "onnx", "safetensors", "gguf", or "pt" (PyTorch format).
