Qwen2-MOE-57B-A14B Model Structure Explained
Model code files download
The model has 57B parameters in total, of which 14B are activated per token; it is faster at inference than a 32B dense model while delivering better performance.
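Both dumps below can be reproduced with a few lines of transformers code. A minimal sketch, assuming the Hugging Face repo id Qwen/Qwen2-57B-A14B (verify the exact id on the hub):

```python
# Minimal sketch: load the model, then print its module tree and parameter shapes.
# Assumes the repo id "Qwen/Qwen2-57B-A14B"; loading the full 57B checkpoint
# requires substantial GPU/CPU memory.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-57B-A14B",
    torch_dtype="auto",
    device_map="auto",
)
print(type(model))  # the Qwen2MoeForCausalLM class
print(model)        # the overall structure shown below
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")  # the detailed per-layer listing
```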
Overall structure of the Qwen2-MOE-57B-A14B model
<class 'transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM'>
Qwen2MoeForCausalLM(
  (model): Qwen2MoeModel(
    (embed_tokens): Embedding(151936, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2MoeDecoderLayer(
        (self_attn): Qwen2MoeSdpaAttention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2MoeRotaryEmbedding()
        )
        (mlp): Qwen2MoeSparseMoeBlock(
          (gate): Linear(in_features=3584, out_features=64, bias=False)
          (experts): ModuleList(
            (0-63): 64 x Qwen2MoeMLP(
              (gate_proj): Linear(in_features=3584, out_features=2560, bias=False)
              (up_proj): Linear(in_features=3584, out_features=2560, bias=False)
              (down_proj): Linear(in_features=2560, out_features=3584, bias=False)
              (act_fn): SiLU()
            )
          )
          (shared_expert): Qwen2MoeMLP(
            (gate_proj): Linear(in_features=3584, out_features=20480, bias=False)
            (up_proj): Linear(in_features=3584, out_features=20480, bias=False)
            (down_proj): Linear(in_features=20480, out_features=3584, bias=False)
            (act_fn): SiLU()
          )
          (shared_expert_gate): Linear(in_features=3584, out_features=1, bias=False)
        )
        (input_layernorm): Qwen2MoeRMSNorm()
        (post_attention_layernorm): Qwen2MoeRMSNorm()
      )
    )
    (norm): Qwen2MoeRMSNorm()
  )
  (lm_head): Linear(in_features=3584, out_features=151936, bias=False)
)
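One detail worth noticing in the attention block: q_proj keeps the full 3584-dim width while k_proj and v_proj project down to 512, i.e. grouped-query attention with fewer key/value heads than query heads. A quick sanity check of the implied head counts, assuming Qwen2's usual head dimension of 128 (an assumption; confirm against the model config):

```python
# Back out the head counts from the projection shapes above.
hidden_size = 3584
head_dim = 128                           # assumption: Qwen2's usual head_dim
num_q_heads = hidden_size // head_dim    # 3584 / 128 = 28 query heads
num_kv_heads = 512 // head_dim           # 512  / 128 = 4 key/value heads (GQA)
print(num_q_heads, num_kv_heads)         # 28 4
```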
Detailed structure of the Qwen2-MOE-57B-A14B model (the per-layer parameter shapes below are printed in order from input to output)
# Input embedding layer
model.embed_tokens.weight: torch.Size([151936, 3584])
# Main body: the decoder layers; model.layers.0 is the first of 28 layers
# Attention block of model.layers.0
model.layers.0.self_attn.q_proj.weight: torch.Size([3584, 3584])
model.layers.0.self_attn.q_proj.bias: torch.Size([3584])
model.layers.0.self_attn.k_proj.weight: torch.Size([512, 3584])
model.layers.0.self_attn.k_proj.bias: torch.Size([512])
model.layers.0.self_attn.v_proj.weight: torch.Size([512, 3584])
model.layers.0.self_attn.v_proj.bias: torch.Size([512])
model.layers.0.self_attn.o_proj.weight: torch.Size([3584, 3584])
# MoE MLP block of model.layers.0: router gate plus 64 expert MLPs
model.layers.0.mlp.gate.weight: torch.Size([64, 3584])
model.layers.0.mlp.experts.0.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.0.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.0.down_proj.weight: torch.Size([3584, 2560])
model.layers.0.mlp.experts.1.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.1.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.1.down_proj.weight: torch.Size([3584, 2560])
model.layers.0.mlp.experts.2.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.2.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.2.down_proj.weight: torch.Size([3584, 2560])
...there are 64 expert MLPs in total; model.layers.0.mlp.experts.3 through model.layers.0.mlp.experts.62 are omitted here...
model.layers.0.mlp.experts.63.gate_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.63.up_proj.weight: torch.Size([2560, 3584])
model.layers.0.mlp.experts.63.down_proj.weight: torch.Size([3584, 2560])
# Shared-expert MLP of model.layers.0
model.layers.0.mlp.shared_expert.gate_proj.weight: torch.Size([20480, 3584])
model.layers.0.mlp.shared_expert.up_proj.weight: torch.Size([20480, 3584])
model.layers.0.mlp.shared_expert.down_proj.weight: torch.Size([3584, 20480])
model.layers.0.mlp.shared_expert_gate.weight: torch.Size([1, 3584])
# Qwen2MoeRMSNorm layers of model.layers.0
model.layers.0.input_layernorm.weight: torch.Size([3584])
model.layers.0.post_attention_layernorm.weight: torch.Size([3584])
...model.layers.1 through model.layers.27 are omitted here; their structure is identical to model.layers.0...
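Putting the MoE pieces of one layer together: per token, mlp.gate scores the 64 experts, a top-k subset of them is evaluated (top_k=8 is taken from the published config and is an assumption here), and the shared expert runs for every token, scaled by a sigmoid of shared_expert_gate. A simplified sketch of the routing logic (illustrative only, not the library code):

```python
import torch
import torch.nn.functional as F

def moe_block(x, gate, experts, shared_expert, shared_expert_gate, top_k=8):
    # x: [num_tokens, 3584]; the other arguments are the submodules listed above.
    routing_weights = F.softmax(gate(x), dim=-1)               # [num_tokens, 64]
    topk_w, topk_idx = torch.topk(routing_weights, top_k, dim=-1)
    # (the library optionally renormalizes topk_w, controlled by norm_topk_prob)
    out = torch.zeros_like(x)
    for t in range(x.size(0)):        # token loop for clarity; the real code is vectorized
        for w, idx in zip(topk_w[t], topk_idx[t]):
            out[t] += w * experts[int(idx)](x[t])
    # the shared expert runs on every token, gated by a learned sigmoid scalar
    out = out + torch.sigmoid(shared_expert_gate(x)) * shared_expert(x)
    return out
```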
# Final normalization (RMSNorm) right before the output head
model.norm.weight: torch.Size([3584])
# Output projection producing the final distribution over the 151936-token vocabulary
lm_head.weight: torch.Size([151936, 3584])
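As a cross-check, the shapes above are enough to roughly reproduce the headline numbers of 57B total and 14B activated parameters (biases and norm weights are negligible; top_k=8 per token is again an assumption from the published config):

```python
# Rough parameter tally from the shapes listed above.
hidden, vocab, layers = 3584, 151936, 28
n_experts, top_k = 64, 8                 # top_k=8 per token is an assumption
expert_ffn, shared_ffn = 2560, 20480
kv_dim = 512

attn = 2 * hidden * hidden + 2 * kv_dim * hidden   # q/o plus k/v projections
expert = 3 * expert_ffn * hidden                   # gate_proj + up_proj + down_proj
shared = 3 * shared_ffn * hidden
router = n_experts * hidden + hidden               # mlp.gate + shared_expert_gate
per_layer_total = attn + n_experts * expert + shared + router
per_layer_active = attn + top_k * expert + shared + router
embed_and_head = 2 * vocab * hidden

total = layers * per_layer_total + embed_and_head
active = layers * per_layer_active + embed_and_head
print(f"total ≈ {total/1e9:.1f}B, activated ≈ {active/1e9:.1f}B")  # ≈ 57.4B / 14.2B
```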