Source code of model.generate for large-model inference

2024-06-11 16:52

This article walks through the source code of model.generate, the method that transformers runs when you generate text with a large language model. It is meant as a reference for developers who want to understand what happens under the hood of a generate() call; follow along below.
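
For context, this is roughly how the method below is reached from user code. A minimal sketch: the model name "gpt2" and the prompt are only placeholders, and any keyword arguments that match GenerationConfig attributes are merged into the generation config inside generate(), overriding the model's defaults.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
# kwargs matching GenerationConfig attributes (max_new_tokens, do_sample, ...) are
# folded into the generation config inside generate() and override its defaults.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))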

Source code of model.generate

File path: anaconda3/envs/<env-name>/lib/python3.10/site-packages/transformers/generation/utils.py
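
If you are not sure where this file lives in your own environment, you can ask Python directly; the exact path and the body of generate() will depend on your installed transformers version, so the listing below may differ slightly from yours.

import inspect

import transformers
from transformers.generation import utils as generation_utils

print(transformers.__version__)                  # generate() changes between versions
print(inspect.getsourcefile(generation_utils))   # actual location of generation/utils.py
print(inspect.getsource(generation_utils.GenerationMixin.generate)[:300])  # start of the method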

def generate(
    self,
    inputs: Optional[torch.Tensor] = None,
    generation_config: Optional[GenerationConfig] = None,
    logits_processor: Optional[LogitsProcessorList] = None,
    stopping_criteria: Optional[StoppingCriteriaList] = None,
    prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
    synced_gpus: Optional[bool] = None,
    assistant_model: Optional["PreTrainedModel"] = None,
    streamer: Optional["BaseStreamer"] = None,
    negative_prompt_ids: Optional[torch.Tensor] = None,
    negative_prompt_attention_mask: Optional[torch.Tensor] = None,
    **kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
    r"""
    Generates sequences of token ids for models with a language modeling head.

    <Tip warning={true}>

    Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the
    model's default generation configuration. You can override any `generation_config` by passing the corresponding
    parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`.

    For an overview of generation strategies and code examples, check out the [following
    guide](../generation_strategies).

    </Tip>

    Parameters:
        inputs (`torch.Tensor` of varying shape depending on the modality, *optional*):
            The sequence used as a prompt for the generation or as model inputs to the encoder. If `None` the
            method initializes it with `bos_token_id` and a batch size of 1. For decoder-only models `inputs`
            should be in the format of `input_ids`. For encoder-decoder models *inputs* can represent any of
            `input_ids`, `input_values`, `input_features`, or `pixel_values`.
        generation_config ([`~generation.GenerationConfig`], *optional*):
            The generation configuration to be used as base parametrization for the generation call. `**kwargs`
            passed to generate matching the attributes of `generation_config` will override them. If
            `generation_config` is not provided, the default will be used, which has the following loading
            priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model
            configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s
            default values, whose documentation should be checked to parameterize generation.
        logits_processor (`LogitsProcessorList`, *optional*):
            Custom logits processors that complement the default logits processors built from arguments and
            generation config. If a logit processor is passed that is already created with the arguments or a
            generation config an error is thrown. This feature is intended for advanced users.
        stopping_criteria (`StoppingCriteriaList`, *optional*):
            Custom stopping criteria that complements the default stopping criteria built from arguments and a
            generation config. If a stopping criteria is passed that is already created with the arguments or a
            generation config an error is thrown. If your stopping criteria depends on the `scores` input, make
            sure you pass `return_dict_in_generate=True, output_scores=True` to `generate`. This feature is
            intended for advanced users.
        prefix_allowed_tokens_fn (`Callable[[int, torch.Tensor], List[int]]`, *optional*):
            If provided, this function constraints the beam search to allowed tokens only at each step. If not
            provided no constraint is applied. This function takes 2 arguments: the batch ID `batch_id` and
            `input_ids`. It has to return a list with the allowed tokens for the next generation step conditioned
            on the batch ID `batch_id` and the previously generated tokens `inputs_ids`. This argument is useful
            for constrained generation conditioned on the prefix, as described in [Autoregressive Entity
            Retrieval](https://arxiv.org/abs/2010.00904).
        synced_gpus (`bool`, *optional*):
            Whether to continue running the while loop until max_length. Unless overridden this flag will be set to
            `True` under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished
            generating before other GPUs. Otherwise it'll be set to `False`.
        assistant_model (`PreTrainedModel`, *optional*):
            An assistant model that can be used to accelerate generation. The assistant model must have the exact
            same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistant model
            is much faster than running generation with the model you're calling generate from. As such, the
            assistant model should be much smaller.
        streamer (`BaseStreamer`, *optional*):
            Streamer object that will be used to stream the generated sequences. Generated tokens are passed
            through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
        negative_prompt_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            The negative prompt needed for some processors such as CFG. The batch size must match the input batch
            size. This is an experimental feature, subject to breaking API changes in future versions.
        negative_prompt_attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Attention_mask for `negative_prompt_ids`.
        kwargs (`Dict[str, Any]`, *optional*):
            Ad hoc parametrization of `generation_config` and/or additional model-specific kwargs that will be
            forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder
            specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder_*.

    Return:
        [`~utils.ModelOutput`] or `torch.LongTensor`: A [`~utils.ModelOutput`] (if `return_dict_in_generate=True`
        or when `config.return_dict_in_generate=True`) or a `torch.LongTensor`.

        If the model is *not* an encoder-decoder model (`model.config.is_encoder_decoder=False`), the possible
        [`~utils.ModelOutput`] types are:

            - [`~generation.GenerateDecoderOnlyOutput`],
            - [`~generation.GenerateBeamDecoderOnlyOutput`]

        If the model is an encoder-decoder model (`model.config.is_encoder_decoder=True`), the possible
        [`~utils.ModelOutput`] types are:

            - [`~generation.GenerateEncoderDecoderOutput`],
            - [`~generation.GenerateBeamEncoderDecoderOutput`]
    """
    # 1. Handle `generation_config` and kwargs that might update it, and validate the `.generate()` call
    self._validate_model_class()
    tokenizer = kwargs.pop("tokenizer", None)  # Pull this out first, we only use it for stopping criteria
    generation_config, model_kwargs = self._prepare_generation_config(generation_config, **kwargs)
    self._validate_model_kwargs(model_kwargs.copy())

    # 2. Set generation parameters if not already defined
    if synced_gpus is None:
        if is_deepspeed_zero3_enabled() and dist.get_world_size() > 1:
            synced_gpus = True
        else:
            synced_gpus = False

    logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
    stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()

    accepts_attention_mask = "attention_mask" in set(inspect.signature(self.forward).parameters.keys())
    requires_attention_mask = "encoder_outputs" not in model_kwargs
    kwargs_has_attention_mask = model_kwargs.get("attention_mask", None) is not None

    # 3. Define model inputs
    inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(
        inputs, generation_config.bos_token_id, model_kwargs
    )
    batch_size = inputs_tensor.shape[0]

    device = inputs_tensor.device
    self._prepare_special_tokens(generation_config, kwargs_has_attention_mask, device=device)

    # decoder-only models must use left-padding for batched generation.
    if not self.config.is_encoder_decoder and not is_torchdynamo_compiling():
        # If `input_ids` was given, check if the last id in any sequence is `pad_token_id`
        # Note: If using, `inputs_embeds` this check does not work, because we want to be more hands-off.
        if (
            generation_config.pad_token_id is not None
            and batch_size > 1
            and len(inputs_tensor.shape) == 2
            and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
        ):
            logger.warning(
                "A decoder-only architecture is being used, but right-padding was detected! For correct "
                "generation results, please set `padding_side='left'` when initializing the tokenizer."
            )

    # 4. Define other model kwargs
    # decoder-only models with inputs_embeds forwarding must use caching (otherwise we can't detect whether we are
    # generating the first new token or not, and we only want to use the embeddings for the first new token)
    if not self.config.is_encoder_decoder and model_input_name == "inputs_embeds":
        model_kwargs["use_cache"] = True
    else:
        model_kwargs["use_cache"] = generation_config.use_cache

    if not kwargs_has_attention_mask and requires_attention_mask and accepts_attention_mask:
        model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
            inputs_tensor, generation_config.pad_token_id, generation_config.eos_token_id
        )

    if self.config.is_encoder_decoder and "encoder_outputs" not in model_kwargs:
        # if model is encoder decoder encoder_outputs are created and added to `model_kwargs`
        model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
            inputs_tensor, model_kwargs, model_input_name, generation_config
        )

    # 5. Prepare `input_ids` which will be used for auto-regressive generation
    if self.config.is_encoder_decoder:
        input_ids, model_kwargs = self._prepare_decoder_input_ids_for_generation(
            batch_size=batch_size,
            model_input_name=model_input_name,
            model_kwargs=model_kwargs,
            decoder_start_token_id=generation_config.decoder_start_token_id,
            device=inputs_tensor.device,
        )
    else:
        input_ids = inputs_tensor if model_input_name == "input_ids" else model_kwargs.pop("input_ids")

    if streamer is not None:
        streamer.put(input_ids.cpu())

    # 6. Prepare `max_length` depending on other stopping criteria.
    input_ids_length = input_ids.shape[-1]
    has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
    has_default_min_length = kwargs.get("min_length") is None and generation_config.min_length is not None
    generation_config = self._prepare_generated_length(
        generation_config=generation_config,
        has_default_max_length=has_default_max_length,
        has_default_min_length=has_default_min_length,
        model_input_name=model_input_name,
        inputs_tensor=inputs_tensor,
        input_ids_length=input_ids_length,
    )

    if generation_config.cache_implementation is not None and model_kwargs.get("past_key_values") is not None:
        raise ValueError(
            "Passing both `cache_implementation` (used to initialize certain caches) and `past_key_values` (a "
            "Cache object) is unsupported. Please use only one of the two."
        )
    elif generation_config.cache_implementation in NEED_SETUP_CACHE_CLASSES_MAPPING:
        if not self._supports_cache_class:
            raise ValueError(
                "This model does not support the `cache_implementation` argument. Please check the following "
                "issue: https://github.com/huggingface/transformers/issues/28981."
            )
        if generation_config.cache_implementation == "static":
            if not self._supports_static_cache:
                raise ValueError(
                    "This model does not support `cache_implementation='static'`. Please check the following "
                    "issue: https://github.com/huggingface/transformers/issues/28981"
                )
            model_kwargs["past_key_values"] = self._get_static_cache(batch_size, generation_config.max_length)

    self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)

    # 7. determine generation mode
    generation_mode = generation_config.get_generation_mode(assistant_model)

    if streamer is not None and (generation_config.num_beams > 1):
        raise ValueError(
            "`streamer` cannot be used with beam search (yet!). Make sure that `num_beams` is set to 1."
        )

    if self.device.type != input_ids.device.type:
        warnings.warn(
            "You are calling .generate() with the `input_ids` being on a device type different"
            f" than your model's device. `input_ids` is on {input_ids.device.type}, whereas the model"
            f" is on {self.device.type}. You may experience unexpected behaviors or slower generation."
            " Please make sure that you have put `input_ids` to the"
            f" correct device by calling for example input_ids = input_ids.to('{self.device.type}') before"
            " running `.generate()`.",
            UserWarning,
        )

    # 8. prepare distribution pre_processing samplers
    prepared_logits_processor = self._get_logits_processor(
        generation_config=generation_config,
        input_ids_seq_length=input_ids_length,
        encoder_input_ids=inputs_tensor,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
        logits_processor=logits_processor,
        device=inputs_tensor.device,
        model_kwargs=model_kwargs,
        negative_prompt_ids=negative_prompt_ids,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
    )

    # 9. prepare stopping criteria
    prepared_stopping_criteria = self._get_stopping_criteria(
        generation_config=generation_config, stopping_criteria=stopping_criteria, tokenizer=tokenizer, **kwargs
    )

    # 10. go into different generation modes
    if generation_mode == GenerationMode.ASSISTED_GENERATION:
        if generation_config.num_return_sequences > 1:
            raise ValueError(
                "num_return_sequences has to be 1 when doing assisted generate, "
                f"but is {generation_config.num_return_sequences}."
            )
        if batch_size > 1:
            raise ValueError("assisted generate is only supported for batch_size = 1")
        if not model_kwargs["use_cache"]:
            raise ValueError("assisted generate requires `use_cache=True`")
        if generation_config.cache_implementation == "static":
            raise ValueError("assisted generate is not supported with `static_cache`")

        # 11. Get the candidate generator, given the parameterization
        candidate_generator = self._get_candidate_generator(
            generation_config=generation_config,
            input_ids=input_ids,
            inputs_tensor=inputs_tensor,
            assistant_model=assistant_model,
            logits_processor=logits_processor,
            model_kwargs=model_kwargs,
        )

        # 12. prepare logits warper (if `do_sample` is `True`)
        prepared_logits_warper = (
            self._get_logits_warper(generation_config) if generation_config.do_sample else None
        )

        # 13. run assisted generate
        result = self._assisted_decoding(
            input_ids,
            candidate_generator=candidate_generator,
            logits_processor=prepared_logits_processor,
            logits_warper=prepared_logits_warper,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            streamer=streamer,
            **model_kwargs,
        )
    elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
        if not model_kwargs["use_cache"]:
            raise ValueError("Contrastive search requires `use_cache=True`")

        result = self._contrastive_search(
            input_ids,
            logits_processor=prepared_logits_processor,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            streamer=streamer,
            **model_kwargs,
        )

    elif generation_mode in (GenerationMode.SAMPLE, GenerationMode.GREEDY_SEARCH):
        # 11. prepare logits warper
        prepared_logits_warper = (
            self._get_logits_warper(generation_config) if generation_config.do_sample else None
        )

        # 12. expand input_ids with `num_return_sequences` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_return_sequences,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
        result = self._sample(
            input_ids,
            logits_processor=prepared_logits_processor,
            logits_warper=prepared_logits_warper,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            streamer=streamer,
            **model_kwargs,
        )

    elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
        # 11. prepare logits warper
        prepared_logits_warper = (
            self._get_logits_warper(generation_config) if generation_config.do_sample else None
        )

        # 12. prepare beam search scorer
        beam_scorer = BeamSearchScorer(
            batch_size=batch_size,
            num_beams=generation_config.num_beams,
            device=inputs_tensor.device,
            length_penalty=generation_config.length_penalty,
            do_early_stopping=generation_config.early_stopping,
            num_beam_hyps_to_keep=generation_config.num_return_sequences,
            max_length=generation_config.max_length,
        )

        # 13. interleave input_ids with `num_beams` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_beams,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # 14. run beam sample
        result = self._beam_search(
            input_ids,
            beam_scorer,
            logits_processor=prepared_logits_processor,
            logits_warper=prepared_logits_warper,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            **model_kwargs,
        )

    elif generation_mode == GenerationMode.GROUP_BEAM_SEARCH:
        # 11. prepare beam search scorer
        beam_scorer = BeamSearchScorer(
            batch_size=batch_size,
            num_beams=generation_config.num_beams,
            device=inputs_tensor.device,
            length_penalty=generation_config.length_penalty,
            do_early_stopping=generation_config.early_stopping,
            num_beam_hyps_to_keep=generation_config.num_return_sequences,
            num_beam_groups=generation_config.num_beam_groups,
            max_length=generation_config.max_length,
        )

        # 12. interleave input_ids with `num_beams` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_beams,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # 13. run beam search
        result = self._group_beam_search(
            input_ids,
            beam_scorer,
            logits_processor=prepared_logits_processor,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            **model_kwargs,
        )

    elif generation_mode == GenerationMode.CONSTRAINED_BEAM_SEARCH:
        final_constraints = []
        if generation_config.constraints is not None:
            final_constraints = generation_config.constraints

        if generation_config.force_words_ids is not None:

            def typeerror():
                raise ValueError(
                    "`force_words_ids` has to either be a `List[List[List[int]]]` or `List[List[int]]` "
                    f"of positive integers, but is {generation_config.force_words_ids}."
                )

            if (
                not isinstance(generation_config.force_words_ids, list)
                or len(generation_config.force_words_ids) == 0
            ):
                typeerror()

            for word_ids in generation_config.force_words_ids:
                if isinstance(word_ids[0], list):
                    if not isinstance(word_ids, list) or len(word_ids) == 0:
                        typeerror()
                    if any(not isinstance(token_ids, list) for token_ids in word_ids):
                        typeerror()
                    if any(
                        any((not isinstance(token_id, int) or token_id < 0) for token_id in token_ids)
                        for token_ids in word_ids
                    ):
                        typeerror()

                    constraint = DisjunctiveConstraint(word_ids)
                else:
                    if not isinstance(word_ids, list) or len(word_ids) == 0:
                        typeerror()
                    if any((not isinstance(token_id, int) or token_id < 0) for token_id in word_ids):
                        typeerror()

                    constraint = PhrasalConstraint(word_ids)
                final_constraints.append(constraint)

        # 11. prepare beam search scorer
        constrained_beam_scorer = ConstrainedBeamSearchScorer(
            constraints=final_constraints,
            batch_size=batch_size,
            num_beams=generation_config.num_beams,
            device=inputs_tensor.device,
            length_penalty=generation_config.length_penalty,
            do_early_stopping=generation_config.early_stopping,
            num_beam_hyps_to_keep=generation_config.num_return_sequences,
            max_length=generation_config.max_length,
        )

        # 12. interleave input_ids with `num_beams` additional sequences per batch
        input_ids, model_kwargs = self._expand_inputs_for_generation(
            input_ids=input_ids,
            expand_size=generation_config.num_beams,
            is_encoder_decoder=self.config.is_encoder_decoder,
            **model_kwargs,
        )

        # 13. run beam search
        result = self._constrained_beam_search(
            input_ids,
            constrained_beam_scorer=constrained_beam_scorer,
            logits_processor=prepared_logits_processor,
            stopping_criteria=prepared_stopping_criteria,
            generation_config=generation_config,
            synced_gpus=synced_gpus,
            **model_kwargs,
        )

    return result
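
The dispatch in steps 7 and 10 means the arguments you pass decide which private helper actually runs. The sketch below illustrates that mapping from the caller's side, reusing the tokenizer and model objects from the earlier sketch; the helper names are the ones shown in the listing above, and the exact routing rules are an assumption that can change between transformers versions.

# Reusing the tokenizer/model placeholders from the earlier sketch.
prompt = tokenizer("The quick brown fox", return_tensors="pt")

# GREEDY_SEARCH / SAMPLE -> _sample() (degenerates to greedy search when do_sample=False)
greedy = model.generate(**prompt, max_new_tokens=20)
sampled = model.generate(**prompt, max_new_tokens=20, do_sample=True, top_p=0.9, temperature=0.8)

# BEAM_SEARCH -> _beam_search() (num_beams > 1 with do_sample=False)
beams = model.generate(**prompt, max_new_tokens=20, num_beams=4, num_return_sequences=2)

# CONTRASTIVE_SEARCH -> _contrastive_search() (penalty_alpha > 0 and top_k > 1)
contrastive = model.generate(**prompt, max_new_tokens=20, penalty_alpha=0.6, top_k=4)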

That concludes this walkthrough of the source code of model.generate for large-model inference; hopefully it is a useful reference for developers.

