MixtralForCausalLM DeepSpeed Inference: Saving HOST Memory [Latest Approach]


  • I. Results
  • II. Notes
  • III. Test Steps
    • 1. Create the Mixtral-8x7B config file (simplified)
    • 2. Generate a random model, run float32 inference on CPU, print the output
    • 3. Load the model, single-GPU CUDA inference
    • 4. DS 4TP CUDA inference
    • 5. Save each rank's engine.module weights separately under DS 4TP
    • 6. DS 4TP inference: initialize the model with init_empty_weights; each rank loads its own engine.module weights

This article demonstrates how to save HOST memory with MixtralForCausalLM DeepSpeed Inference.
Method: each rank saves its own shard of the weights, and the model is constructed with accelerate's init_empty_weights (see the sketch after the list below).
Added functionality:

  • chunked saving and loading of safetensors
  • working around the parameter-initialization problem of buffers registered with register_buffer(..., persistent=False)
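
For context, here is a minimal sketch of what init_empty_weights does; a toy nn.Linear stands in for the real model. Note that in the accelerate versions I have used, buffers stay materialized by default (include_buffers is effectively False), which is what makes the buffer trick in step 6 possible.

import torch
from torch import nn
from accelerate import init_empty_weights

# A toy stand-in for a large model: parameters land on the meta device,
# so no HOST RAM is allocated for their storage.
with init_empty_weights():
    layer = nn.Linear(4096, 4096)

print(layer.weight.device)   # meta
print(layer.weight.numel())  # 16777216 elements, but no storage behind them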

I. Results

Run mode                    HOST memory usage   Notes
Single-GPU inference        13198 MB
DS 4TP                      13246 MB/GPU
DS 4TP, memory-optimized    369 MB/GPU          Weights load directly onto the device, saving HOST memory

II. Notes

  • 1. MixtralRotaryEmbedding registers self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False).
    Because persistent is False, this buffer is not saved into the state_dict, and module.to_empty(device) does not preserve its value either.
    The only option is to save it right after model initialization, then copy it back into the buffer once engine.module has finished loading its weights (a minimal sketch follows).
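
A minimal sketch of the problem and the workaround, using a toy module rather than the real Mixtral classes:

import torch
from torch import nn

class RotaryLike(nn.Module):
    def __init__(self):
        super().__init__()
        # persistent=False: excluded from the state_dict, like Mixtral's sin/cos caches
        self.register_buffer("sin_cached", torch.sin(torch.arange(8.0)), persistent=False)

m = RotaryLike()
print("sin_cached" in m.state_dict())  # False: load_state_dict cannot restore it

# Workaround used in step 6: stash all buffers right after init...
buffers = {name: buf.clone() for name, buf in m.named_buffers()}

# ...then, after the module has been re-materialized and weights loaded,
# copy the stashed values back in.
for name, buf in m.named_buffers():
    buf.copy_(buffers[name])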

III. Test Steps

1. Create the Mixtral-8x7B config file (simplified)

mkdir skip_init_demo
cd skip_init_demo
tee ./config.json <<-'EOF'
{"architectures": ["MixtralForCausalLM"],"attention_dropout": 0.0,"bos_token_id": 1,"eos_token_id": 2,"hidden_act": "silu","hidden_size": 1024,"initializer_range": 0.02,"intermediate_size": 4096,"max_position_embeddings": 1024,"model_type": "mixtral","num_attention_heads": 32,"num_experts_per_tok": 2,"num_hidden_layers": 32,"num_key_value_heads": 8,"num_local_experts": 8,"output_router_logits": false,"rms_norm_eps": 1e-05,"rope_theta": 1000000.0,"router_aux_loss_coef": 0.02,"sliding_window": 128,"tie_word_embeddings": false,"torch_dtype": "bfloat16","transformers_version": "4.36.0.dev0","use_cache": true,"vocab_size": 32000
}
EOF

2. Generate a random model, run float32 inference on CPU, print the output

rm -rf Mixtral-8x7B
tee gen_model.py <<-'EOF'
import torch
import os
import time
def main():
    torch.manual_seed(1)
    from transformers import MixtralForCausalLM, MixtralConfig
    config = MixtralConfig.from_pretrained("./config.json")
    model = MixtralForCausalLM(config).half()
    model.eval()
    model.save_pretrained("./Mixtral-8x7B", safe_serialization=True)
    torch.manual_seed(2)
    input_tokens = torch.randint(0, 32000, (1, 128))
    model = model.float()
    output = model(input_tokens)
    output = output.logits.detach().reshape(-1).cpu().numpy()[:8]
    print(output)

if __name__ == "__main__":
    main()
EOF
python gen_model.py
du Mixtral-8x7B -lh

Output:

6.3G    Mixtral-8x7B
[-0.9623295  -0.36580455  0.767425    1.7021806  -0.17950581  0.36059803
 -0.49157432 -0.58618194]
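
The ~6.3 GB on disk matches the parameter count implied by config.json at 2 bytes per fp16 parameter. A quick way to verify this without allocating the weights is to count parameters on the meta device (a sketch; assumes accelerate is installed):

import torch
from accelerate import init_empty_weights
from transformers import MixtralForCausalLM, MixtralConfig

config = MixtralConfig.from_pretrained("./config.json")
# Parameters are created on the meta device, so this costs almost no HOST RAM
with init_empty_weights():
    model = MixtralForCausalLM(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B params, ~{n_params * 2 / 1024**3:.1f} GiB in fp16")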

3. Load the model, single-GPU CUDA inference

tee open_model.py <<-'EOF'
import torch
import os
import psutil
import time
from transformers.modeling_utils import load_sharded_checkpoint,load_state_dict
import json
from safetensors import safe_open

def get_mem_info():
    pid = os.getpid()
    current_process = psutil.Process(pid)
    memory_info = current_process.memory_info()
    print(f"RSS: {memory_info.rss / (1024 * 1024):.2f}MB VMS:{memory_info.vms / (1024 * 1024):.2f}MB")

def main():
    from transformers import MixtralForCausalLM, MixtralConfig
    get_mem_info()
    config = MixtralConfig.from_pretrained("./config.json")
    model = MixtralForCausalLM(config).half()
    get_mem_info()
    # The index maps each weight name to the safetensors shard that holds it
    with open("Mixtral-8x7B/model.safetensors.index.json", "r") as file:
        index_data = json.load(file)
    weight_files = index_data.get('weight_map', {})
    state_dict = {}
    for _, shard in weight_files.items():
        weights_path = os.path.join("Mixtral-8x7B", shard)
        with safe_open(weights_path, framework="pt") as f:
            for name in f.keys():
                state_dict[name] = f.get_tensor(name)
    model.load_state_dict(state_dict, strict=True)
    get_mem_info()
    model = model.to("cuda:0")
    torch.manual_seed(2)
    input_tokens = torch.randint(0, 32000, (1, 128)).to("cuda:0")
    output = model(input_tokens)
    output = output.logits.detach().reshape(-1).cpu().numpy()[:8]
    print(output)

if __name__ == "__main__":
    main()
EOF
python open_model.py

Output:

RSS: 251.70MB VMS:3292.21MB
RSS: 6697.91MB VMS:13695.17MB
RSS: 13198.57MB VMS:26385.02MB
[-0.9633789  -0.36450195  0.76708984  1.703125   -0.1772461   0.3581543
 -0.48901367 -0.5888672 ]

The final RSS of ~13 GB is roughly two fp16 copies of the weights in HOST RAM: the model's own tensors plus the fully materialized state_dict.

4. DS 4TP CUDA inference

tee open_model.py <<-'EOF'
import torch
import os
import psutil
import time
from transformers.modeling_utils import load_sharded_checkpoint,load_state_dict
import deepspeed
from deepspeed.accelerator import get_accelerator
import json
from safetensors import safe_open

deepspeed.init_distributed(dist_backend='nccl')
world_size = torch.distributed.get_world_size()
local_rank = int(os.environ['LOCAL_RANK'])
rank = torch.distributed.get_rank()

def get_mem_info(prefix):
    pid = os.getpid()
    current_process = psutil.Process(pid)
    memory_info = current_process.memory_info()
    print(f"{prefix} RANK:{os.environ['LOCAL_RANK']} RSS: {memory_info.rss / (1024 * 1024):.2f}MB VMS:{memory_info.vms / (1024 * 1024):.2f}MB")

def main():
    torch.set_num_threads(1)
    from transformers import MixtralForCausalLM, MixtralConfig
    get_mem_info("Init")
    config = MixtralConfig.from_pretrained("./config.json")
    model = MixtralForCausalLM(config).half()
    get_mem_info("ModelCreate")
    print("-----------------------")
    with open("Mixtral-8x7B/model.safetensors.index.json", "r") as file:
        index_data = json.load(file)
    weight_files = index_data.get('weight_map', {})
    state_dict = {}
    for _, shard in weight_files.items():
        weights_path = os.path.join("Mixtral-8x7B", shard)
        with safe_open(weights_path, framework="pt") as f:
            for name in f.keys():
                state_dict[name] = f.get_tensor(name)
    model.load_state_dict(state_dict, strict=True)
    get_mem_info("LoadState")
    print("-----------------------")
    engine = deepspeed.init_inference(model,
                                      tensor_parallel={"tp_size": world_size},
                                      dtype=torch.float16,
                                      replace_with_kernel_inject=False)
    device = get_accelerator().current_device_name()
    print("device:", device)
    torch.manual_seed(2)
    input_tokens = torch.randint(0, 32000, (1, 128)).to(device)
    output = engine(input_tokens)
    output = output.logits.detach().reshape(-1).cpu().numpy()[:8]
    if rank == 0:
        print(output)

if __name__ == "__main__":
    main()
EOF
deepspeed --num_gpus=4 open_model.py

Output:


Init RANK:1 RSS: 270.02MB VMS:3414.44MB
Init RANK:3 RSS: 270.43MB VMS:3414.45MB
Init RANK:2 RSS: 270.22MB VMS:3414.45MB
Init RANK:0 RSS: 270.38MB VMS:3486.45MB
ModelCreate RANK:0 RSS: 6757.33MB VMS:9965.12MB
ModelCreate RANK:3 RSS: 6727.30MB VMS:9862.06MB
ModelCreate RANK:2 RSS: 6757.18MB VMS:9893.12MB
ModelCreate RANK:1 RSS: 6756.99MB VMS:9893.12MB
LoadState RANK:2 RSS: 13248.96MB VMS:22772.97MB
LoadState RANK:0 RSS: 13245.91MB VMS:22616.97MB
LoadState RANK:3 RSS: 13233.00MB VMS:22490.91MB
LoadState RANK:1 RSS: 13246.22MB VMS:23240.97MB
[-0.96240234 -0.36547852  0.7680664   1.703125   -0.17382812  0.359375
 -0.49169922 -0.5883789 ]

Every rank pays the full ~13 GB HOST cost: each process builds the entire model and materializes the entire state_dict before DeepSpeed shards it across GPUs.

5. Save each rank's engine.module weights separately under DS 4TP

tee open_model.py <<-'EOF'
import torch
import os
import psutil
import time
from transformers.modeling_utils import load_sharded_checkpoint,load_state_dict
import deepspeed
from deepspeed.accelerator import get_accelerator
import json
from safetensors import safe_open
from safetensors.torch import save_file, load_file

deepspeed.init_distributed(dist_backend='nccl')
world_size = torch.distributed.get_world_size()
local_rank = int(os.environ['LOCAL_RANK'])
rank = torch.distributed.get_rank()

def get_mem_info(prefix):
    pid = os.getpid()
    current_process = psutil.Process(pid)
    memory_info = current_process.memory_info()
    print(f"{prefix} RANK:{os.environ['LOCAL_RANK']} RSS: {memory_info.rss / (1024 * 1024):.2f}MB VMS:{memory_info.vms / (1024 * 1024):.2f}MB")

def save_state_dict(state_dict, save_dir):
    max_bytes_per_file = 1 * 1024 * 1024 * 1024  # 1GB per shard
    # Measure each tensor and split the state_dict into shards
    split_state_dicts = []
    current_state_dict = {}
    current_size = 0
    for param_name, param_tensor in state_dict.items():
        tensor_size = param_tensor.element_size() * param_tensor.nelement()
        # If this tensor would push the shard over the limit, flush the current shard first
        if current_size + tensor_size > max_bytes_per_file:
            split_state_dicts.append(current_state_dict)
            current_state_dict = {}
            current_size = 0
        current_state_dict[param_name] = param_tensor
        current_size += tensor_size
    # Append the last shard
    if current_state_dict:
        split_state_dicts.append(current_state_dict)
    # Save the shards and build an index file
    os.makedirs(save_dir, exist_ok=True)
    index = {"metadata": {"total_parts": len(split_state_dicts)}, "weight_map": []}
    for i, sd in enumerate(split_state_dicts):
        part_file = os.path.join(save_dir, f"model_part_{i}.safetensors")
        save_file(sd, part_file)
        index["weight_map"].append(f"model_part_{i}.safetensors")
    # Save the index file
    index_file = os.path.join(save_dir, "index.json")
    with open(index_file, 'w') as f:
        json.dump(index, f, indent=4)

def main():
    from transformers import MixtralForCausalLM, MixtralConfig
    get_mem_info("Init")
    config = MixtralConfig.from_pretrained("./config.json")
    model = MixtralForCausalLM(config).half()
    get_mem_info("ModelCreate")
    print("-----------------------")
    with open("Mixtral-8x7B/model.safetensors.index.json", "r") as file:
        index_data = json.load(file)
    weight_files = index_data.get('weight_map', {})
    state_dict = {}
    for _, shard in weight_files.items():
        weights_path = os.path.join("Mixtral-8x7B", shard)
        with safe_open(weights_path, framework="pt") as f:
            for name in f.keys():
                state_dict[name] = f.get_tensor(name)
    model.load_state_dict(state_dict, strict=True)
    get_mem_info("LoadState")
    print("-----------------------")
    engine = deepspeed.init_inference(model,
                                      tensor_parallel={"tp_size": world_size},
                                      dtype=torch.float16,
                                      replace_with_kernel_inject=False)
    # Each rank writes its own TP shard of engine.module
    save_state_dict(engine.module.state_dict(), f"./Mixtral-8x7B-{local_rank}")

if __name__ == "__main__":
    main()
EOF
deepspeed --num_gpus=4 open_model.py
du Mixtral-8x7B-* -lh

Output:

1.7G    Mixtral-8x7B-0
1.7G    Mixtral-8x7B-1
1.7G    Mixtral-8x7B-2
1.7G    Mixtral-8x7B-3
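
Four shards of ~1.7 GB each is roughly the 6.3 GB checkpoint split four ways (slightly more, since tensors that tensor parallelism does not shard are replicated on every rank). A small sketch to check the shard sizes, assuming the index.json layout written by save_state_dict above:

import json
import os

total = 0
for rank in range(4):
    save_dir = f"./Mixtral-8x7B-{rank}"
    with open(os.path.join(save_dir, "index.json")) as f:
        index = json.load(f)
    # Sum the on-disk size of every safetensors part listed in this rank's index
    rank_bytes = sum(os.path.getsize(os.path.join(save_dir, part))
                     for part in index["weight_map"])
    total += rank_bytes
    print(f"rank {rank}: {rank_bytes / 1024**3:.2f} GiB")
print(f"total: {total / 1024**3:.2f} GiB")  # close to the 6.3 GiB full checkpoint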

6. DS 4TP inference: initialize the model with init_empty_weights; each rank loads its own engine.module weights

tee open_model.py <<-'EOF'
import torch
import os
import psutil
import time
from accelerate import init_empty_weights
from transformers.modeling_utils import load_sharded_checkpoint,load_state_dict
import deepspeed
from deepspeed.accelerator import get_accelerator
import json
from safetensors import safe_open
from safetensors.torch import save_file, load_file

deepspeed.init_distributed(dist_backend='nccl')
world_size = torch.distributed.get_world_size()
local_rank = int(os.environ['LOCAL_RANK'])
rank = torch.distributed.get_rank()

def get_mem_info(prefix):
    pid = os.getpid()
    current_process = psutil.Process(pid)
    memory_info = current_process.memory_info()
    print(f"{prefix} RANK:{os.environ['LOCAL_RANK']} RSS: {memory_info.rss / (1024 * 1024):.2f}MB VMS:{memory_info.vms / (1024 * 1024):.2f}MB")

def my_load_state_dict(model, save_dir):
    index_file = os.path.join(save_dir, "index.json")
    with open(index_file, "r") as file:
        index_data = json.load(file)
    weight_files = index_data.get('weight_map', [])
    state_dict = {}
    for shard in weight_files:
        weights_path = os.path.join(save_dir, shard)
        with safe_open(weights_path, framework="pt") as f:
            for name in f.keys():
                state_dict[name] = f.get_tensor(name)
    model.load_state_dict(state_dict, strict=True)

def main():
    from transformers import MixtralForCausalLM, MixtralConfig
    get_mem_info("Init")
    config = MixtralConfig.from_pretrained("./config.json")
    # Build the model skeleton on the meta device: no HOST memory for parameters
    with init_empty_weights():
        model = MixtralForCausalLM(config).half()
    get_mem_info("ModelCreate")
    print("-----------------------")
    # Non-persistent buffers (the rotary sin/cos caches) are not in the state_dict,
    # so stash them now, while they still hold their initialized values
    buffer_dict = {}
    for name, param in model.named_buffers():
        buffer_dict[name] = param
    engine = deepspeed.init_inference(model,
                                      tensor_parallel={"tp_size": world_size},
                                      dtype=torch.float16,
                                      replace_with_kernel_inject=False)
    # Each rank loads only its own TP shard saved in step 5
    my_load_state_dict(engine.module, f"./Mixtral-8x7B-{local_rank}")
    # Restore the stashed non-persistent buffers
    for name, param in engine.module.named_buffers():
        param.copy_(buffer_dict[name])
    get_mem_info("LoadState")
    device = get_accelerator().current_device_name()
    torch.manual_seed(2)
    input_tokens = torch.randint(0, 32000, (1, 128)).to(device)
    output = engine(input_tokens)
    output = output.logits.detach().reshape(-1).cpu().numpy()[:8]
    if rank == 0:
        print(output)

if __name__ == "__main__":
    main()
EOF
deepspeed --num_gpus=4 open_model.py

Output:


Init RANK:1 RSS: 269.73MB VMS:3382.40MB
Init RANK:2 RSS: 269.45MB VMS:3382.39MB
Init RANK:3 RSS: 269.86MB VMS:3382.39MB
Init RANK:0 RSS: 269.96MB VMS:3454.39MB
ModelCreate RANK:1 RSS: 300.44MB VMS:17064.71MB
ModelCreate RANK:0 RSS: 297.03MB VMS:17136.70MB
ModelCreate RANK:2 RSS: 299.22MB VMS:17064.70MB
ModelCreate RANK:3 RSS: 300.66MB VMS:17065.70MB
LoadState RANK:0 RSS: 366.28MB VMS:20159.03MB
LoadState RANK:3 RSS: 369.87MB VMS:20152.03MB
LoadState RANK:2 RSS: 368.37MB VMS:20151.02MB
LoadState RANK:1 RSS: 369.16MB VMS:20087.04MB
[-0.96240234 -0.36547852  0.7680664   1.703125   -0.17382812  0.359375
 -0.49169922 -0.5883789 ]

Per-rank RSS stays under ~370 MB and the logits match step 4: the skeleton is created on the meta device, so the full fp16 checkpoint is never materialized in HOST RAM, and each rank reads only its own shard.
