[CLIP-VIT-L + Qwen] Multimodal Large Model Source Code Reading


[CLIP-VIT-L + Qwen] Multimodal Large Model Source Code Reading - MultiModal Part

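Background
Source code walkthrough
- Imports
  - Line-by-line explanation
- The dataclass section
  - Overall meaning
  - Line-by-line walkthrough
- Model fine-tuning
  - Overall meaning
  - Line-by-line walkthrough
- The MultiModal class
  - Overall meaning
  - Line-by-line walkthrough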

Reference repo: WatchTower-Liu/VLM-learning; url: VLLM-BASE

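Background

For the language model part of the multimodal architecture (MQwen.py), see Multimodal Large Model Source Code Reading - 1, - 2, - 3 and - 4. For the vision model part (visual/CLIP-VIT.py), see Multimodal Large Model Source Code Reading - 5. For the trainer (trainer.py), see Multimodal Large Model Source Code Reading - 6.
This part explains how the previously refactored MQwen language model and the CLIP-VIT vision model are combined into the MultiModal class, and how that class runs the forward pass and generates predictions.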

Source code walkthrough

Imports

import torch 
from torch import nn
from typing import Optional
import os
import sys
sys.path.append("../")
from transformers.modeling_outputs import CausalLMOutputWithPast
from dataclasses import dataclass, asdict
from peft import get_peft_model, LoraConfig, TaskType, PeftModel
from visual.CLIP_VIT import visualModel
from qwen.Mqwen import MQWenLMHeadModel

Line-by-line explanation

Modules that have already shown up countless times are not explained again~

from typing import Optional

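The main job of the typing module is type annotation. The Optional imported here means a variable can be either the specified type or None. For example, Optional[str] means the variable is either a str or None.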
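As a small illustration of the annotation, here is a sketch with a made-up function (`describe_labels` is hypothetical and not part of the repo). It mirrors how `labels` is optional in the forward pass later on:

```python
from typing import Optional


def describe_labels(labels: Optional[list] = None) -> str:
    """Return a short description; `labels` may be a list or None."""
    if labels is None:
        return "no labels (inference)"
    return f"{len(labels)} labels (training)"


print(describe_labels())           # no labels (inference)
print(describe_labels([1, 2, 3]))  # 3 labels (training)
```

Note that Optional only documents intent; Python does not enforce it at runtime.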

import sys
sys.path.append("../")

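This adds the parent directory to the module search path, so modules located there can be imported directly by name. For example, if the parent directory contains a module called my_utils.py (a made-up name; avoid names such as abc that collide with the standard library), it can be imported with import my_utils. Note that "../" is resolved relative to the current working directory, not relative to the location of this file.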
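The effect of appending a directory to sys.path can be checked with a throwaway module. This is only a sketch: the temporary directory and the module name `my_helper_mod` are invented for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory containing a tiny module (hypothetical name).
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "my_helper_mod.py"), "w", encoding="utf-8") as f:
    f.write("ANSWER = 42\n")

# Before the directory is on sys.path, importing by name fails.
try:
    importlib.import_module("my_helper_mod")
    found_before = True
except ModuleNotFoundError:
    found_before = False

# After appending the directory, the module can be imported by name.
sys.path.append(tmp_dir)
importlib.invalidate_caches()
my_helper_mod = importlib.import_module("my_helper_mod")

print(found_before)          # False
print(my_helper_mod.ANSWER)  # 42
```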

from transformers.modeling_outputs import CausalLMOutputWithPast
from dataclasses import dataclass, asdict

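CausalLMOutputWithPast is the container used for the outputs of causal language models. It holds the loss and logits, plus the cached key/value states of past tokens (past_key_values) and, optionally, the hidden states and attentions.
The dataclass decorator turns a class into a data container by generating methods such as __init__ from its annotated fields, and asdict converts a dataclass instance into a dictionary.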
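A minimal sketch of dataclass and asdict working together; `DemoConfig` is a hypothetical class used only for illustration:

```python
from dataclasses import dataclass, asdict


@dataclass
class DemoConfig:
    # Hypothetical config used only for this illustration.
    name: str                # required: no default value
    hidden_size: int = 4096  # optional: has a default


cfg = DemoConfig(name="demo")
print(cfg)          # DemoConfig(name='demo', hidden_size=4096)
print(asdict(cfg))  # {'name': 'demo', 'hidden_size': 4096}

# Omitting a required field raises TypeError.
try:
    DemoConfig()
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```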

from peft import get_peft_model, LoraConfig, TaskType, PeftModel

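peft is used for parameter-efficient fine-tuning. get_peft_model wraps a model according to a config and returns the fine-tunable version for methods such as LoRA or prefix tuning. LoraConfig holds the configuration parameters LoRA needs. TaskType enumerates the kinds of task the model performs, such as sequence classification, causal language modeling and sequence-to-sequence language modeling. PeftModel is the base wrapper class that combines a base model with PEFT adapters; it is used later to load the saved adapter weights.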

The dataclass section

@dataclass
class LanguageConfig():
    model_path: str
    torch_dtype: torch.dtype = torch.bfloat16
    trust_remote_code: bool = True

@dataclass
class VisualConfig():
    model_path: str
    pretrained: bool = True

@dataclass
class MultiModalConfig():
    replace_token_id: int
    # image_context_length: int = 256
    image_context_length: int = 728
    image_feature_hidden_size: int = 4096

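Overall meaning

These classes bundle the parameter types and default values of the different configurations.

Line-by-line walkthrough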

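@dataclass
class LanguageConfig():
    model_path: str
    torch_dtype: torch.dtype = torch.bfloat16
    trust_remote_code: bool = True

The LanguageConfig class stores and manages the parameters and configuration of the language model.
model_path is the location of the model, a string. It has no default value, so the user must pass it in.
torch_dtype is the data type the model uses, here the 16-bit floating-point format bfloat16.
trust_remote_code defaults to True. It allows transformers to execute custom model code that ships with a checkpoint, which is usually required when loading this kind of pretrained model from Hugging Face, so only enable it for sources you trust. It matters less when the model code is already local.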

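@dataclass
class VisualConfig():
    model_path: str
    pretrained: bool = True

VisualConfig is the parameter and configuration class for the vision model. model_path has the same meaning as above.
pretrained indicates whether to load the pretrained weights.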

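@dataclass
class MultiModalConfig():
    replace_token_id: int
    # image_context_length: int = 256
    image_context_length: int = 728
    image_feature_hidden_size: int = 4096

MultiModalConfig is the parameter and configuration class for the multimodal model.
replace_token_id is the token id in input_ids that gets replaced. For example, given the input [102, 103, 101] with 101 as replace_token_id, the position holding 101 is replaced with image feature data.
image_context_length is the image context length, i.e. how many positions the image occupies in the sequence.
image_feature_hidden_size is the hidden dimension of each image feature vector.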
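The actual replacement logic lives in MQwen and is covered in the earlier parts. The following is only a rough sketch of the idea under stated assumptions: a toy embedding table, a single sequence, and made-up sizes (image context length 2, hidden size 4). It should not be read as the repo's implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)

replace_token_id = 101    # placeholder id, as in the example above
image_context_length = 2  # toy value (the config above uses 728)
hidden_size = 4           # toy value (the config above uses 4096)

# Toy embedding table standing in for the language model's token embeddings.
embed = nn.Embedding(200, hidden_size)

# One sequence: two text tokens followed by two image placeholders.
input_ids = torch.tensor([[102, 103, 101, 101]])
text_embeds = embed(input_ids).detach()  # (1, 4, hidden_size)

# Projected image features, one vector per placeholder position.
image_feature = torch.ones(1, image_context_length, hidden_size)

# Overwrite the placeholder positions with the image features.
mask = input_ids == replace_token_id     # (1, 4) boolean mask
merged = text_embeds.clone()
merged[mask] = image_feature.reshape(-1, hidden_size)

print(mask.tolist())           # [[False, False, True, True]]
print(merged[0, 2:].tolist())  # [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
```

The text positions keep their original embeddings; only the placeholder positions change.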

Model fine-tuning

def make_lora(model, finetune_args):
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=finetune_args.lora_rank,
        lora_alpha=32,
        lora_dropout=finetune_args.lora_dropout,
        target_modules = finetune_args.target_modules.split('|') # print the model and look for the attention-related modules
    )
    model = get_peft_model(model, peft_config)
    return model

Overall meaning

Fine-tunes the model with LoRA.

Line-by-line walkthrough

def make_lora(model, finetune_args):

model is the model instance to be fine-tuned.
finetune_args is an object holding the fine-tuning parameters.

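    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=finetune_args.lora_rank,
        lora_alpha=32,
        lora_dropout=finetune_args.lora_dropout,
        target_modules = finetune_args.target_modules.split('|') # print the model and look for the attention-related modules
    )

LoraConfig sets the LoRA fine-tuning parameters. task_type is set to causal language modeling. inference_mode=False means we are not in inference, i.e. we are training. finetune_args.lora_rank sets the LoRA rank r, which determines the size of the low-rank matrices added to the model. lora_alpha is the LoRA scaling factor: the low-rank update is scaled by lora_alpha / r. lora_dropout sets the dropout rate inside the LoRA layers to reduce overfitting. target_modules specifies the modules LoRA is applied to; several module names are given in one string separated by '|', which split('|') turns into a list.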
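To make the roles of r and lora_alpha concrete, here is a hand-written sketch of a LoRA-style linear layer in plain PyTorch. The layer sizes and rank are toy values chosen for illustration, and this is not how peft implements it internally; it only shows the arithmetic.

```python
import torch

torch.manual_seed(0)

d_in, d_out = 16, 16    # toy layer size
r, lora_alpha = 4, 32   # rank (toy) and scaling numerator (32 as in make_lora)

W = torch.randn(d_out, d_in)     # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01  # trainable down-projection
B = torch.zeros(d_out, r)        # trainable up-projection, starts at zero

scaling = lora_alpha / r
x = torch.randn(2, d_in)

# Forward pass: frozen path plus the scaled low-rank path.
y = x @ W.T + scaling * (x @ A.T @ B.T)

# With B initialised to zero, the adapted layer matches the frozen one.
print(torch.allclose(y, x @ W.T))        # True
print(scaling)                           # 8.0
print(A.numel() + B.numel(), W.numel())  # 128 256
```

Even in this tiny case the trainable matrices hold half as many values as the frozen weight; with real hidden sizes and a small rank the gap is far larger.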

The MultiModal class

class MMultiModal(nn.Module):
    def __init__(self, Lconfig: LanguageConfig, Vconfig: VisualConfig, MMconfig: MultiModalConfig, finetune_args = None, train = False, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        image_feature_length = MMconfig.image_context_length * MMconfig.image_feature_hidden_size

        self.LLM = MQWenLMHeadModel.from_pretrained(Lconfig.model_path, asdict(MMconfig), torch_dtype = Lconfig.torch_dtype, trust_remote_code = Lconfig.trust_remote_code)
        # self.LLM = MMiniCPMLMHeadModel.from_pretrained(Lconfig.model_path, asdict(MMconfig), torch_dtype = Lconfig.torch_dtype, trust_remote_code = Lconfig.trust_remote_code)

        if train:
            self.LLM.gradient_checkpointing_enable()
            self.LLM.enable_input_require_grads()

        self.LLM.config.image_feature_length = image_feature_length

        if train and finetune_args is not None:
            self.LLM = make_lora(self.LLM, finetune_args)

        assert MMconfig.image_feature_hidden_size == self.LLM.config.hidden_size

        self.visualModel = visualModel.from_pretrained(Vconfig.model_path).to(Lconfig.torch_dtype)

        Vhidden_dim = self.visualModel.vision_embed_dim
        Lhidden_dim = self.LLM.config.hidden_size

        self.make_feature_proj(Vhidden_dim, Lhidden_dim, Lconfig)

        self.MMconfig = MMconfig

        print(f"LLM dtype: {self.LLM.dtype}")
        print(f"Visual model dtype: {self.visualModel.dtype}")
        print(f"Feature projection dtype: {self.feature_proj[0].weight.dtype}")

    def make_feature_proj(self, Vhidden_dim, Lhidden_dim, Lconfig):
        self.feature_proj = nn.Sequential(
            nn.Linear(Vhidden_dim, Lhidden_dim, dtype=Lconfig.torch_dtype),
            nn.GELU(),
            nn.Linear(Lhidden_dim, Lhidden_dim, dtype=Lconfig.torch_dtype)
        )

        for name, module in self.feature_proj.named_children():
            if "Linear" in module._get_name():
                module.weight.data.normal_(mean=0.0, std = 0.01)
                module.bias.data.zero_()

    def forward(self, image: torch.Tensor, input_ids: torch.LongTensor, labels: Optional[torch.LongTensor] = None):
        with torch.no_grad():
            # make sure image has dtype bfloat16
            image = image.to(dtype=torch.bfloat16)
            image_feature = self.visualModel.get_image_features(pixel_values=image)[:,1:, :]
            image_feature = image_feature.detach()

        image_feature = self.feature_proj(image_feature)
        out = self.LLM(input_ids, labels=labels, images=image_feature)

        loss1 = out.loss

        return CausalLMOutputWithPast(
            loss=loss1,
            logits=out.logits,
            past_key_values=out.past_key_values,
            hidden_states=out.hidden_states,
            attentions=out.attentions,
        )

    def to(self, *args, **kwargs):
        return super().to(*args, **kwargs)

    def load(self, modelPath):
        self.LLM = PeftModel.from_pretrained(self.LLM, modelPath, inference_mode=True)
        other_params = torch.load(os.path.join(modelPath, "other_params.bin"))
        self.feature_proj.load_state_dict(other_params)

    @torch.no_grad()
    def generate(self, image: torch.Tensor, input_ids: torch.LongTensor):
        if image is None:
            image_feature = None
        else:
            image_feature=self.visualModel.get_image_features(pixel_values=image)[:,1:, :]
            image_feature = self.feature_proj(image_feature)

        input_ids = torch.tensor([input_ids]).long().to(self.LLM.device)
        out = self.LLM.generate(inputs = input_ids, images=image_feature)[:, len(input_ids[0]):-1]
        return out.long().cpu()

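Overall meaning

An intermediate projection layer maps the vision model's output into the same vector space as the language model, and the class builds forward propagation and prediction generation on top of that.

Line-by-line walkthrough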

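class MMultiModal(nn.Module):
    def __init__(self, Lconfig: LanguageConfig, Vconfig: VisualConfig, MMconfig: MultiModalConfig, finetune_args = None, train = False, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        image_feature_length = MMconfig.image_context_length * MMconfig.image_feature_hidden_size

nn.Module needs no introduction: it is the base class of all neural network modules.
Lconfig, Vconfig and MMconfig are the configuration objects for the language model, the vision model and the multimodal model respectively.
finetune_args holds the fine-tuning parameters (if fine-tuning is needed).
train defaults to False and indicates whether the model is being built for training. *args and **kwargs are arbitrary positional and keyword arguments.
Passing *args and **kwargs on to the parent constructor keeps the code extensible.
image_feature_length is the image context length multiplied by the image feature hidden size, which gives the total number of values in the image features.
Here image_context_length can be understood as the number of blocks the image is split into, each of which is converted into a vector of length image_feature_hidden_size.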
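With the default values from MultiModalConfig, the arithmetic works out as follows (a plain calculation, nothing repo-specific beyond the two defaults):

```python
image_context_length = 728        # default in MultiModalConfig
image_feature_hidden_size = 4096  # default in MultiModalConfig

image_feature_length = image_context_length * image_feature_hidden_size
print(image_feature_length)  # 2981888

# The commented-out alternative of 256 positions would give:
print(256 * image_feature_hidden_size)  # 1048576
```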

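        self.LLM = MQWenLMHeadModel.from_pretrained(Lconfig.model_path, asdict(MMconfig), torch_dtype = Lconfig.torch_dtype, trust_remote_code = Lconfig.trust_remote_code)

This initializes the language model by loading the pretrained weights and configuration from the model path, using the from_pretrained class method of the refactored MQWenLMHeadModel. The multimodal config is converted to a dictionary and passed as an extra positional argument, which from_pretrained forwards to the model's constructor. The model dtype is Lconfig.torch_dtype, and trust_remote_code comes from Lconfig (True by default).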

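        if train:
            self.LLM.gradient_checkpointing_enable()
            self.LLM.enable_input_require_grads()

In training mode, this enables gradient checkpointing on the language model, which saves memory by recomputing intermediate activations during the backward pass instead of storing them (this is not gradient accumulation). enable_input_require_grads makes the output of the input embeddings require gradients, which is useful when adapter weights are trained while the base model stays frozen.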
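The idea behind gradient checkpointing can be tried on a toy block with PyTorch's own torch.utils.checkpoint. This is a general PyTorch illustration with made-up layer sizes, not a reproduction of what gradient_checkpointing_enable does inside the language model.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

block = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8))
x = torch.randn(4, 8, requires_grad=True)

# Regular forward/backward: intermediate activations are kept in memory.
block(x).sum().backward()
grad_plain = x.grad.clone()

# Checkpointed forward/backward: activations inside `block` are recomputed
# during the backward pass instead of being stored.
x.grad = None
block.zero_grad()
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(grad_plain, grad_ckpt))  # True
```

The gradients are the same either way; the trade-off is extra compute in exchange for lower peak memory.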

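        self.LLM.config.image_feature_length = image_feature_length

Sets image_feature_length in the language model's config to the value computed earlier.

        if train and finetune_args is not None:
            self.LLM = make_lora(self.LLM, finetune_args)

If we are training and fine-tuning parameters were provided, the make_lora function defined earlier wraps the language model with LoRA, and the wrapped model replaces self.LLM.

        assert MMconfig.image_feature_hidden_size == self.LLM.config.hidden_size

Makes sure the language model's hidden size matches the image feature hidden size of the multimodal config.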

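        self.visualModel = visualModel.from_pretrained(Vconfig.model_path).to(Lconfig.torch_dtype)

        Vhidden_dim = self.visualModel.vision_embed_dim
        Lhidden_dim = self.LLM.config.hidden_size

        self.make_feature_proj(Vhidden_dim, Lhidden_dim, Lconfig)

        self.MMconfig = MMconfig

Initializes the vision model from the model path in the vision config and casts it to the language model's dtype.
Reads the hidden dimension of the vision model and the hidden size of the language model.
Builds the intermediate projection layer from those two dimensions and the language model config.
Stores the multimodal config as a member variable.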

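    def make_feature_proj(self, Vhidden_dim, Lhidden_dim, Lconfig):
        self.feature_proj = nn.Sequential(
            nn.Linear(Vhidden_dim, Lhidden_dim, dtype=Lconfig.torch_dtype),
            nn.GELU(),
            nn.Linear(Lhidden_dim, Lhidden_dim, dtype=Lconfig.torch_dtype)
        )

Creates the intermediate projection layer. The arguments are the vision model's hidden dimension, the language model's hidden dimension and the language model config.
nn.Sequential builds a block whose layers run in order, and nn.Linear is a fully connected linear layer. The first linear layer takes the vision model's features and maps them to the language model's hidden dimension, a GELU activation adds non-linearity, and a second linear layer produces the output.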

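        for name, module in self.feature_proj.named_children():
            if "Linear" in module._get_name():
                module.weight.data.normal_(mean=0.0, std = 0.01)
                module.bias.data.zero_()

Iterates over the child modules of the projection layer. For each linear layer, the weights are initialized from a normal distribution with mean 0 and standard deviation 0.01 (std is the standard deviation, not the variance), and the bias is set to 0.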
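A standalone version of the projection layer makes the shapes easy to check. The dimensions below are toy values chosen for illustration, the dtype argument is left out so it runs in the default float32, and isinstance is swapped in for the name check used in the original code.

```python
import torch
from torch import nn

torch.manual_seed(0)

Vhidden_dim, Lhidden_dim = 32, 64  # toy sizes

feature_proj = nn.Sequential(
    nn.Linear(Vhidden_dim, Lhidden_dim),
    nn.GELU(),
    nn.Linear(Lhidden_dim, Lhidden_dim),
)

# Same initialisation as above, using isinstance instead of a name check.
for module in feature_proj.children():
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

image_feature = torch.randn(2, 10, Vhidden_dim)  # (batch, patches, vision dim)
projected = feature_proj(image_feature)

print(projected.shape)                          # torch.Size([2, 10, 64])
print(feature_proj[0].bias.abs().sum().item())  # 0.0
```

Only the last dimension changes; the batch and patch dimensions pass through untouched.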

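    def forward(self, image: torch.Tensor, input_ids: torch.LongTensor, labels: Optional[torch.LongTensor] = None):
        with torch.no_grad():
            # make sure image has dtype bfloat16
            image = image.to(dtype=torch.bfloat16)
            image_feature = self.visualModel.get_image_features(pixel_values=image)[:,1:, :]
            image_feature = image_feature.detach()

This is the forward function of the multimodal model. It takes the float tensor image, the integer tensor input_ids and the integer tensor labels. labels is annotated as Optional, so it may be None.
With gradient computation disabled, image is cast to bfloat16, the vision model of the multimodal architecture processes the pixel values and extracts image features, and the slice [:, 1:, :] drops the first entry along the second dimension of the result. (I am not sure about the exact shape of image_feature here. In a CLIP-style ViT the first position usually holds the class token, so the slice most likely keeps only the patch features. P.S.: it is not just neural networks that are black boxes, the code is one too; curious readers can print the shape themselves.)
Finally, .detach() separates the feature tensor from the current computation graph so that no gradients flow back into the vision model.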
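The slicing and the no-gradient behaviour can be checked on a dummy tensor. The encoder below is a stand-in linear layer and the sizes are assumptions for illustration; the real shape depends on the visualModel from the earlier part.

```python
import torch
from torch import nn

# Stand-in for the vision encoder output: (batch, 1 + num_patches, dim).
encoder = nn.Linear(16, 16)
tokens = torch.randn(2, 1 + 9, 16)

with torch.no_grad():
    # Drop position 0 along the sequence axis, keep everything else.
    features = encoder(tokens)[:, 1:, :]

print(features.shape)          # torch.Size([2, 9, 16])
print(features.requires_grad)  # False
```

Because the tensor is produced under torch.no_grad(), it already carries no gradient history, so the extra .detach() in forward is harmless but not strictly needed.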

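        image_feature = self.feature_proj(image_feature)
        out = self.LLM(input_ids, labels=labels, images=image_feature)

        loss1 = out.loss

        return CausalLMOutputWithPast(
            loss=loss1,
            logits=out.logits,
            past_key_values=out.past_key_values,
            hidden_states=out.hidden_states,
            attentions=out.attentions,
        )

The projection layer defined earlier projects the extracted image features so that their dimension matches the language model's hidden size. The language model then produces its output from input_ids, labels and the projected image_feature.
The loss is read from the output and assigned to loss1.
The result is returned wrapped in a CausalLMOutputWithPast, whose loss, logits and other fields all come from the language model's output out.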

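    def to(self, *args, **kwargs):
        return super().to(*args, **kwargs)

Calls the parent class's to method with *args and **kwargs. The override simply delegates to nn.Module.to, leaving room for later extension. to is mainly used to change the dtype or move the model between devices, e.g. cuda -> cpu or cpu -> cuda.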

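    def load(self, modelPath):
        self.LLM = PeftModel.from_pretrained(self.LLM, modelPath, inference_mode=True)
        other_params = torch.load(os.path.join(modelPath, "other_params.bin"))
        self.feature_proj.load_state_dict(other_params)

This method loads the fine-tuned model. The fine-tuned language model adapter is loaded from modelPath in inference mode. other_params holds the saved parameters of the intermediate projection layer, and load_state_dict loads that parameter dictionary into feature_proj.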
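The projection half of load can be tried in isolation. How the repo actually writes other_params.bin is handled by the trainer from the earlier part; the sketch below simply assumes the file contains the projection layer's state_dict, and uses a toy layer and a temporary directory.

```python
import os
import tempfile

import torch
from torch import nn


def make_proj():
    # Toy projection layer with the same structure as feature_proj.
    return nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 16))


trained = make_proj()
save_dir = tempfile.mkdtemp()

# Save only the projection weights, mirroring the "other_params.bin" file name.
torch.save(trained.state_dict(), os.path.join(save_dir, "other_params.bin"))

# Later: rebuild the same architecture and restore the weights.
restored = make_proj()
other_params = torch.load(os.path.join(save_dir, "other_params.bin"), map_location="cpu")
restored.load_state_dict(other_params)

print(torch.equal(trained[0].weight, restored[0].weight))  # True
```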

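    @torch.no_grad()
    def generate(self, image: torch.Tensor, input_ids: torch.LongTensor):
        if image is None:
            image_feature = None
        else:
            image_feature=self.visualModel.get_image_features(pixel_values=image)[:,1:, :]
            image_feature = self.feature_proj(image_feature)

        input_ids = torch.tensor([input_ids]).long().to(self.LLM.device)
        out = self.LLM.generate(inputs = input_ids, images=image_feature)[:, len(input_ids[0]):-1]
        return out.long().cpu()

The @torch.no_grad() decorator makes the function run without gradient computation, which saves compute and memory and is typically used outside of training.
The function takes the float tensor image and the token ids input_ids.
If no image is passed, image_feature is set to None. Otherwise the vision model of the multimodal architecture extracts the image features, and the projection layer maps them into the same vector space as the language model (the same processing as in the forward pass).
input_ids is wrapped in a list and converted to an integer (long) tensor, which adds a batch dimension of size 1, and is then moved to the same device as the language model for the following steps.
The language model's generate method is called with input_ids and image_feature, and a slice keeps everything in the generated result except the input_ids part.
Suppose input_ids has size (1, 5) and the generated result has size (1, 10). The slice out[:, 5:-1] then keeps only what the model generated, i.e. the prediction. The -1 additionally drops the last position, presumably the end-of-sequence token.
The result is converted to an integer tensor and moved to the CPU. (Some libraries, such as numpy, cannot work with CUDA tensors, and data also has to be moved before it is saved to a file.)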
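The slicing step can be reproduced with made-up token ids. The generated tensor below is hand-written to stand in for the output of generate, and 99 is an arbitrary stand-in for an end token.

```python
import torch

prompt_ids = [11, 12, 13, 14, 15]              # plain Python list of token ids
input_ids = torch.tensor([prompt_ids]).long()  # shape (1, 5): adds a batch dimension

# Pretend output of generate(): the prompt, four new tokens, then an end token (99).
generated = torch.tensor([[11, 12, 13, 14, 15, 21, 22, 23, 24, 99]])

# Skip the prompt and drop the final position.
new_tokens = generated[:, len(input_ids[0]):-1]
print(new_tokens.tolist())  # [[21, 22, 23, 24]]
```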
