The source code of model.generate for large language model inference

2024-06-11 16:52

This article walks through the source code of model.generate, the method called when running inference with large language models, in the hope that it offers a useful reference for developers working through generation problems.


File path: anaconda3/envs/<env-name>/lib/python3.10/site-packages/transformers/generation/utils.py

Note that the body of generate() changes between transformers releases, so the listing below may differ in detail from the copy installed in your environment.
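Before stepping through the source, it helps to see how the method is normally called. Below is a minimal sketch; the checkpoint name ("gpt2") and the parameter values are illustrative choices, not requirements:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Any keyword argument that matches a GenerationConfig attribute
# (max_new_tokens, do_sample, num_beams, ...) is absorbed into
# generation_config inside generate(); everything else is forwarded
# to the model's forward().
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))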

def generate(
    self,
    inputs: Optional[torch.Tensor] = None,
    generation_config: Optional[GenerationConfig] = None,
    logits_processor: Optional[LogitsProcessorList] = None,
    stopping_criteria: Optional[StoppingCriteriaList] = None,
    prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
    synced_gpus: Optional[bool] = None,
    assistant_model: Optional["PreTrainedModel"] = None,
    streamer: Optional["BaseStreamer"] = None,
    negative_prompt_ids: Optional[torch.Tensor] = None,
    negative_prompt_attention_mask: Optional[torch.Tensor] = None,
    **kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
    r"""
    Generates sequences of token ids for models with a language modeling head.

    <Tip warning={true}>

    Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
    model's default generation configuration. You can override any `generation_config` by passing the corresponding
    parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.

    For an overview of generation strategies and code examples, check out the [following
    guide](../generation_strategies).

    </Tip>

    Parameters:
        inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
            The sequence used as a prompt for the generation or as model inputs to the encoder. If `None` the
            method initializes it with `bos_token_id` and a batch size of 1. For decoder-only models `inputs`
            should be in the format of `input_ids`. For encoder-decoder models *inputs* can represent any of
            `input_ids`, `input_values`, `input_features`, or `pixel_values`.
        generation_config ([`~generation.GenerationConfig`], *optional*):
            The generation configuration to be used as base parametrization for the generation call. `**kwargs`
            passed to generate matching the attributes of `generation_config` will override them. If
            `generation_config` is not provided, the default will be used, which has the following loading
            priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model
            configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s
            default values, whose documentation should be checked to parameterize generation.
        logits_processor (`LogitsProcessorList`, *optional*):
            Custom logits processors that complement the default logits processors built from arguments and
            generation config. If a logit processor is passed that is already created with the arguments or a
            generation config an error is thrown. This feature is intended for advanced users.
        stopping_criteria (`StoppingCriteriaList`, *optional*):
            Custom stopping criteria that complements the default stopping criteria built from arguments and a
            generation config. If a stopping criteria is passed that is already created with the arguments or a
            generation config an error is thrown. If your stopping criteria depends on the `scores` input, make
            sure you pass `return_dict_in_generate=True, output_scores=True` to `generate`. This feature is
            intended for advanced users.
        prefix_allowed_tokens_fn (`Callable[[int, torch.Tensor], List[int]]`, *optional*):
            If provided, this function constraints the beam search to allowed tokens only at each step. If not
            provided no constraint is applied. This function takes 2 arguments: the batch ID `batch_id` and
            `input_ids`. It has to return a list with the allowed tokens for the next generation step conditioned
            on the batch ID `batch_id` and the previously generated tokens `inputs_ids`. This argument is useful
            for constrained generation conditioned on the prefix, as described in [Autoregressive Entity
            Retrieval](https://arxiv.org/abs/2010.00904).
        synced_gpus (`bool`, *optional*):
            Whether to continue running the while loop until max_length. Unless overridden this flag will be set to
            `True` under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished
            generating before other GPUs. Otherwise it'll be set to `False`.
        assistant_model (`PreTrainedModel`, *optional*):
            An assistant model that can be used to accelerate generation. The assistant model must have the exact
            same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistant model
            is much faster than running generation with the model you're calling generate from. As such, the
            assistant model should be much smaller.
        streamer (`BaseStreamer`, *optional*):
            Streamer object that will be used to stream the generated sequences. Generated tokens are passed
            through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
        negative_prompt_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            The negative prompt needed for some processors such as CFG. The batch size must match the input batch
            size. This is an experimental feature, subject to breaking API changes in future versions.
        negative_prompt_attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Attention_mask for `negative_prompt_ids`.
        kwargs (`Dict[str, Any]`, *optional*):
            Ad hoc parametrization of `generation_config` and/or additional model-specific kwargs that will be
            forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder
            specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder_*.

    Return:
        [`~utils.ModelOutput`] or `torch.LongTensor`: A [`~utils.ModelOutput`] (if `return_dict_in_generate=True`
        or when `config.return_dict_in_generate=True`) or a `torch.LongTensor`.

            If the model is *not* an encoder-decoder model (`model.config.is_encoder_decoder=False`), the possible
            [`~utils.ModelOutput`] types are:

                - [`~generation.GenerateDecoderOnlyOutput`],
                - [`~generation.GenerateBeamDecoderOnlyOutput`]

            If the model is an encoder-decoder model (`model.config.is_encoder_decoder=True`), the possible
            [`~utils.ModelOutput`] types are:

                - [`~generation.GenerateEncoderDecoderOutput`],
                - [`~generation.GenerateBeamEncoderDecoderOutput`]
    """
    # 1. Handle `generation_config` and kwargs that might update it, and validate the `.generate()` call
    self._validate_model_class()
    tokenizer = kwargs.pop("tokenizer", None)  # Pull this out first, we only use it for stopping criteria
    generation_config, model_kwargs = self._prepare_generation_config(generation_config, **kwargs)
    self._validate_model_kwargs(model_kwargs.copy())

    # 2. Set generation parameters if not already defined
    if synced_gpus is None:
        if is_deepspeed_zero3_enabled() and dist.get_world_size() > 1:
            synced_gpus = True
        else:
            synced_gpus = False

    logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
    stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()

    accepts_attention_mask = "attention_mask" in set(inspect.signature(self.forward).parameters.keys())
    requires_attention_mask = "encoder_outputs" not in model_kwargs
    kwargs_has_attention_mask = model_kwargs.get("attention_mask", None) is not None

    # 3. Define model inputs
    inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(
        inputs, generation_config.bos_token_id, model_kwargs
    )
    batch_size = inputs_tensor.shape[0]

    device = inputs_tensor.device
    self._prepare_special_tokens(generation_config, kwargs_has_attention_mask, device=device)

    # decoder-only models must use left-padding for batched generation.
    if not self.config.is_encoder_decoder and not is_torchdynamo_compiling():
        # If `input_ids` was given, check if the last id in any sequence is `pad_token_id`
        # Note: If using, `inputs_embeds` this check does not work, because we want to be more hands-off.
        if (
            generation_config.pad_token_id is not None
            and batch_size > 1
            and len(inputs_tensor.shape) == 2
            and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
        ):
            logger.warning(
                "A decoder-only architecture is being used, but right-padding was detected! For correct "
                "generation results, please set `padding_side='left'` when initializing the tokenizer."
            )

    # 4. Define other model kwargs
    # decoder-only models with inputs_embeds forwarding must use caching (otherwise we can't detect whether we are
    # generating the first new token or not, and we only want to use the embeddings for the first new token)
    if not self.config.is_encoder_decoder and model_input_name == "inputs_embeds":
        model_kwargs["use_cache"] = True
    else:
        model_kwargs["use_cache"] = generation_config.use_cache

    if not kwargs_has_attention_mask and requires_attention_mask and accepts_attention_mask:
        model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
            inputs_tensor, generation_config.pad_token_id, generation_config.eos_token_id
        )

    if self.config.is_encoder_decoder and "encoder_outputs" not in model_kwargs:
        # if model is encoder decoder encoder_outputs are created and added to `model_kwargs`
        model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
            inputs_tensor, model_kwargs, model_input_name, generation_config
        )

    # 5. Prepare `input_ids` which will be used for auto-regressive generation
    if self.config.is_encoder_decoder:
        input_ids, model_kwargs = self._prepare_decoder_input_ids_for_generation(
            batch_size=batch_size,
            model_input_name=model_input_name,
            model_kwargs=model_kwargs,
            decoder_start_token_id=generation_config.decoder_start_token_id,
            device=inputs_tensor.device,
        )
    else:
        input_ids = inputs_tensor if model_input_name == "input_ids" else model_kwargs.pop("input_ids")

    if streamer is not None:
        streamer.put(input_ids.cpu())

    # 6. Prepare `max_length` depending on other stopping criteria.
    input_ids_length = input_ids.shape[-1]
    has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
    has_default_min_length = kwargs.get("min_length") is None and generation_config.min_length is not None
    generation_config = self._prepare_generated_length(
        generation_config=generation_config,
        has_default_max_length=has_default_max_length,
        has_default_min_length=has_default_min_length,
        model_input_name=model_input_name,
        inputs_tensor=inputs_tensor,
        input_ids_length=input_ids_length,
    )

    if generation_config.cache_implementation is not None and model_kwargs.get("past_key_values") is not None:
        raise ValueError(
            "Passing both `cache_implementation` (used to initialize certain caches) and `past_key_values` (a "
            "Cache object) is unsupported. Please use only one of the two."
        )
    elif generation_config.cache_implementation in NEED_SETUP_CACHE_CLASSES_MAPPING:
        if not self._supports_cache_class:
            raise ValueError(
                "This model does not support the `cache_implementation` argument. Please check the following "
                "issue: https://github.com/huggingface/transformers/issues/28981."
            )
        if generation_config.cache_implementation == "static":
            if not self._supports_static_cache:
                raise ValueError(
                    "This model does not support `cache_implementation='static'`. Please check the following "
                    "issue: https://github.com/huggingface/transformers/issues/28981"
                )
            model_kwargs["past_key_values"] = self._get_static_cache(batch_size, generation_config.max_length)

    self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)

    # 7. determine generation mode
    generation_mode = generation_config.get_generation_mode(assistant_model)

    if streamer is not None and (generation_config.num_beams > 1):
        raise ValueError(
            "`streamer` cannot be used with beam search (yet!). Make sure that `num_beams` is set to 1."
        )

    if self.device.type != input_ids.device.type:
        warnings.warn(
            "You are calling .generate() with the `input_ids` being on a device type different"
            f" than your model's device. `input_ids` is on {input_ids.device.type}, whereas the model"
            f" is on {self.device.type}. You may experience unexpected behaviors or slower generation."
            " Please make sure that you have put `input_ids` to the"
            f" correct device by calling for example input_ids = input_ids.to('{self.device.type}') before"
            " running `.generate()`.",
            UserWarning,
        )

    # 8. prepare distribution pre_processing samplers
    prepared_logits_processor = self._get_logits_processor(
        generation_config=generation_config,
        input_ids_seq_length=input_ids_length,
        encoder_input_ids=inputs_tensor,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
        logits_processor=logits_processor,
        device=inputs_tensor.device,
        model_kwargs=model_kwargs,
        negative_prompt_ids=negative_prompt_ids,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
    )

    # 9. prepare stopping criteria
    prepared_stopping_criteria = self._get_stopping_criteria(
        generation_config=generation_config, stopping_criteria=stopping_criteria, tokenizer=tokenizer, **kwargs
    )

    # 10. go into different generation modes
    if generation_mode == GenerationMode.ASSISTED_GENERATION:
        if generation_config.num_return_sequences > 1:
            raise ValueError(
                "num_return_sequences has to be 1 when doing assisted generate, "
                f"but is {generation_config.num_return_sequences}."
            )
        if batch_size > 1:
            raise ValueError("assisted generate is only supported for batch_size = 1")
        if not model_kwargs["use_cache"]:
            raise ValueError("assisted generate requires `use_cache=True`")
        if generation_config.cache_implementation == "static":
            raise ValueError("assisted generate is not supported with `static_cache`")

        # 11. Get the candidate generator, given the parameterization
        candidate_generator = self._get_candidate_generator(
            generation_config=generation_config,
            input_ids=input_ids,
            inputs_tensor=inputs_tensor,
            assistant_model=assistant_model,
            logits_processor=logits_processor,
            model_kwargs=model_kwargs,
        )

        # 12. prepare logits warper (if `do_sample` is `True`)
        prepared_logits_warper = (
            self._get_logits_warper(generation_config) if generation_config.do_sample else None
        )

        # 13. run assisted generate
        result = self._assisted_decoding(
            input_ids,
            candidate_generator=candidate_generator,
            logits_processor=prepared_logits_processor,
            logits_warper=prepared_logits_warper,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            streamer=streamer,
            **model_kwargs,
        )
    elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
        if not model_kwargs["use_cache"]:
            raise ValueError("Contrastive search requires `use_cache=True`")

        result = self._contrastive_search(
            input_ids,
            logits_processor=prepared_logits_processor,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            streamer=streamer,
            **model_kwargs,
        )
    elif generation_mode in (GenerationMode.SAMPLE, GenerationMode.GREEDY_SEARCH):
        # 11. prepare logits warper
        prepared_logits_warper = (
            self._get_logits_warper(generation_config) if generation_config.do_sample else None
        )

        # 12. expand input_ids with `num_return_sequences` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_return_sequences,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
        result = self._sample(
            input_ids,
            logits_processor=prepared_logits_processor,
            logits_warper=prepared_logits_warper,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            streamer=streamer,
            **model_kwargs,
        )
    elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
        # 11. prepare logits warper
        prepared_logits_warper = (
            self._get_logits_warper(generation_config) if generation_config.do_sample else None
        )

        # 12. prepare beam search scorer
        beam_scorer = BeamSearchScorer(
            batch_size=batch_size,
            num_beams=generation_config.num_beams,
            device=inputs_tensor.device,
            length_penalty=generation_config.length_penalty,
            do_early_stopping=generation_config.early_stopping,
            num_beam_hyps_to_keep=generation_config.num_return_sequences,
            max_length=generation_config.max_length,
        )

        # 13. interleave input_ids with `num_beams` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_beams,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # 14. run beam sample
        result = self._beam_search(
            input_ids,
            beam_scorer,
            logits_processor=prepared_logits_processor,
            logits_warper=prepared_logits_warper,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            **model_kwargs,
        )
    elif generation_mode == GenerationMode.GROUP_BEAM_SEARCH:
        # 11. prepare beam search scorer
        beam_scorer = BeamSearchScorer(
            batch_size=batch_size,
            num_beams=generation_config.num_beams,
            device=inputs_tensor.device,
            length_penalty=generation_config.length_penalty,
            do_early_stopping=generation_config.early_stopping,
            num_beam_hyps_to_keep=generation_config.num_return_sequences,
            num_beam_groups=generation_config.num_beam_groups,
            max_length=generation_config.max_length,
        )

        # 12. interleave input_ids with `num_beams` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_beams,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # 13. run beam search
        result = self._group_beam_search(
            input_ids,
            beam_scorer,
            logits_processor=prepared_logits_processor,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            **model_kwargs,
        )
    elif generation_mode == GenerationMode.CONSTRAINED_BEAM_SEARCH:
        final_constraints = []
        if generation_config.constraints is not None:
            final_constraints = generation_config.constraints

        if generation_config.force_words_ids is not None:

            def typeerror():
                raise ValueError(
                    "`force_words_ids` has to either be a `List[List[List[int]]]` or `List[List[int]]` "
                    f"of positive integers, but is {generation_config.force_words_ids}."
                )

            if (
                not isinstance(generation_config.force_words_ids, list)
                or len(generation_config.force_words_ids) == 0
            ):
                typeerror()

            for word_ids in generation_config.force_words_ids:
                if isinstance(word_ids[0], list):
                    if not isinstance(word_ids, list) or len(word_ids) == 0:
                        typeerror()
                    if any(not isinstance(token_ids, list) for token_ids in word_ids):
                        typeerror()
                    if any(
                        any((not isinstance(token_id, int) or token_id < 0) for token_id in token_ids)
                        for token_ids in word_ids
                    ):
                        typeerror()

                    constraint = DisjunctiveConstraint(word_ids)
                else:
                    if not isinstance(word_ids, list) or len(word_ids) == 0:
                        typeerror()
                    if any((not isinstance(token_id, int) or token_id < 0) for token_id in word_ids):
                        typeerror()

                    constraint = PhrasalConstraint(word_ids)
                final_constraints.append(constraint)

        # 11. prepare beam search scorer
        constrained_beam_scorer = ConstrainedBeamSearchScorer(
            constraints=final_constraints,
            batch_size=batch_size,
            num_beams=generation_config.num_beams,
            device=inputs_tensor.device,
            length_penalty=generation_config.length_penalty,
            do_early_stopping=generation_config.early_stopping,
            num_beam_hyps_to_keep=generation_config.num_return_sequences,
            max_length=generation_config.max_length,
        )
        # 12. interleave input_ids with `num_beams` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_beams,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )
        # 13. run beam search
        result = self._constrained_beam_search(
            input_ids,
            constrained_beam_scorer=constrained_beam_scorer,
            logits_processor=prepared_logits_processor,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            **model_kwargs,
        )

    return result
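Steps 7 and 10 above show that the decoding routine is picked from the combination of settings in generation_config rather than from an explicit mode argument. Continuing the sketch from the beginning of the article, the calls below should land in the main branches of this version of the code; the parameter values are illustrative:

# Greedy search: num_beams=1, do_sample=False -> _sample() (degenerates to greedy)
out = model.generate(**inputs, max_new_tokens=20, num_beams=1, do_sample=False)

# Multinomial sampling: num_beams=1, do_sample=True -> _sample() with a logits warper
out = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.8, top_p=0.9)

# Beam search: num_beams>1, do_sample=False -> _beam_search()
out = model.generate(**inputs, max_new_tokens=20, num_beams=4, do_sample=False)

# Beam sample: num_beams>1, do_sample=True -> also _beam_search(), with a logits warper
out = model.generate(**inputs, max_new_tokens=20, num_beams=4, do_sample=True)

# Contrastive search: penalty_alpha>0 and top_k>1 -> _contrastive_search()
out = model.generate(**inputs, max_new_tokens=20, penalty_alpha=0.6, top_k=4)

Assisted generation is selected by passing assistant_model=; as the checks in step 10 enforce, it requires batch_size=1, use_cache=True, and num_return_sequences=1.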

That wraps up this look at the source code of model.generate for large language model inference; hopefully it serves as a useful reference for fellow developers.


