@@ -34,20 +34,17 @@ class DeepSeekV3ModelArgs(BaseModelArgs):
n_layers (int): Number of transformer layers.
n_dense_layers (int): Number of dense layers in the model.
n_heads (int): Number of attention heads.
- n_routed_experts (int): Number of routed experts for MoE layers.
- n_shared_experts (int): Number of shared experts for MoE layers.
- n_activated_experts (int): Number of activated experts in MoE layers.
+ norm_eps (float): Epsilon value used for RMSNorm.
+ moe_args (MoEArgs): MoE configuration.
n_expert_groups (int): Number of expert groups.
n_limited_groups (int): Number of limited groups for MoE routing.
- score_func (Literal["softmax", "sigmoid"]): Scoring function for MoE routing.
- route_scale (float): Scaling factor for routing scores.
- use_grouped_mm (bool): Whether to use grouped matrix multiplication for MoE layers.
- load_balance_coeff (float | None): Auxiliary-Loss-Free Load balancing coefficient for MoE layers.
q_lora_rank (int): LoRA rank for query projections.
kv_lora_rank (int): LoRA rank for key-value projections.
qk_nope_head_dim (int): Dimension for query-key projections without positional embeddings.
qk_rope_head_dim (int): Dimension for query-key projections with rotary embeddings.
v_head_dim (int): Dimension for value projections.
+ use_flex_attn (bool): Whether to use FlexAttention.
+ attn_mask_type (str): Type of attention mask.
original_seq_len (int): Original sequence length.
rope_theta (float): Base for rotary positional encoding.
rope_factor (float): Scaling factor for extended sequence lengths.
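For context on what this docstring change reflects, below is a minimal sketch of how a config might be populated after the refactor: the per-expert routing knobs move from top-level args into a nested `MoEArgs`, while `norm_eps`, `use_flex_attn`, and `attn_mask_type` become documented top-level fields. The import paths, the `MoEArgs` field names, and all numeric values are assumptions for illustration; only the top-level field names come from the docstring above.

```python
# Hypothetical sketch; import paths and MoEArgs field names are assumptions,
# not taken from this diff. Values are illustrative only.
from torchtitan.models.deepseek_v3.model.args import DeepSeekV3ModelArgs  # assumed path
from torchtitan.models.moe import MoEArgs  # assumed path

model_args = DeepSeekV3ModelArgs(
    n_layers=27,
    n_dense_layers=1,
    n_heads=16,
    norm_eps=1e-5,                # documented here: RMSNorm epsilon
    moe_args=MoEArgs(             # documented here: MoE knobs consolidated into one struct
        num_experts=64,           # assumed field name (previously n_routed_experts)
        num_shared_experts=2,     # assumed field name (previously n_shared_experts)
        top_k=6,                  # assumed field name (previously n_activated_experts)
        score_func="softmax",     # assumed field name (previously a top-level arg)
        route_scale=1.0,          # assumed field name (previously a top-level arg)
        use_grouped_mm=True,      # assumed field name (previously a top-level arg)
        load_balance_coeff=1e-3,  # assumed field name (previously a top-level arg)
    ),
    q_lora_rank=0,
    kv_lora_rank=512,
    qk_nope_head_dim=128,
    qk_rope_head_dim=64,
    v_head_dim=128,
    use_flex_attn=True,           # documented here: enable FlexAttention
    attn_mask_type="causal",      # documented here: attention mask type
)
```

The sketch mirrors the intent of the docstring update: MoE routing configuration is grouped under `moe_args` rather than listed as separate fields on the model args.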