Launching the Mixtral 8x7B Large Language Model in 4-bit/8-bit
- 0. Background
- 1. Modify the code
0. Background
My personal machine simply cannot run the Mixtral 8x7B large language model in float16, so I launch it with the weights quantized to 4-bit or 8-bit instead.
In actual testing, inference became noticeably faster at 4-bit, while 8-bit inference was still very slow.
The inference framework used is FastChat.
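A quick back-of-the-envelope calculation shows why quantization is needed: Mixtral 8x7B keeps all eight experts in memory (roughly 46.7B total parameters), even though only two are active per token. The sketch below is illustrative only, assuming 2 bytes per parameter for float16 and ignoring activations, the KV cache, and quantization overhead:

```python
# Rough weight-memory estimate for Mixtral 8x7B (~46.7B total parameters).
# Illustrative only: activations, KV cache, and quantization overhead are ignored.
params = 46.7e9

for name, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name:>7}: ~{gib:.0f} GiB")  # ~87 GiB, ~43 GiB, ~22 GiB
```

Even at 4-bit, the weights alone take on the order of 22 GiB, which is why a consumer GPU plus CPU offload is about the practical floor for this model.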
1. Modify the code
```bash
vi fastchat/model/model_adapter.py
```
Before the change:
```python
class MistralAdapter(BaseModelAdapter):
    """The model adapter for Mistral AI models"""

    def match(self, model_path: str):
        return "mistral" in model_path.lower() or "mixtral" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id
        return model, tokenizer
```
After the change:
```python
class MistralAdapter(BaseModelAdapter):
    """The model adapter for Mistral AI models"""

    def match(self, model_path: str):
        return "mistral" in model_path.lower() or "mixtral" in model_path.lower()

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        # model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        if "mixtral" in model_path.lower():
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                # attn_implementation="flash_attention_2",
                # load_in_8bit=True,
                load_in_4bit=True,
                **from_pretrained_kwargs,
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                **from_pretrained_kwargs,
            )
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id
        return model, tokenizer
```

FastChat's model_adapter.py already imports AutoTokenizer and AutoModelForCausalLM from transformers at the top of the file, so the edit above should need no extra imports. To switch to 8-bit, comment out load_in_4bit=True and uncomment load_in_8bit=True.
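Note that recent transformers releases deprecate passing load_in_4bit / load_in_8bit directly to from_pretrained in favor of a BitsAndBytesConfig. Below is a minimal, hedged sketch of the equivalent standalone call, assuming bitsandbytes is installed and "path/to/Mixtral-8x7B" is a placeholder for your local checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Equivalent 4-bit load expressed via BitsAndBytesConfig instead of the
# load_in_4bit kwarg. Requires the bitsandbytes package and a CUDA GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # use load_in_8bit=True for 8-bit instead
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16 for speed
    bnb_4bit_quant_type="nf4",             # NF4 generally preserves quality better than fp4
)

model_path = "path/to/Mixtral-8x7B"  # hypothetical path; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
print(f"~{model.get_memory_footprint() / 1024**3:.1f} GiB loaded")
```

With the adapter patched, FastChat can then be started as usual, for example `python3 -m fastchat.serve.cli --model-path /path/to/Mixtral-8x7B`; the adapter's match() picks up any model path containing "mixtral".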
Done!