data_collator in transformers

2023-11-24 08:15
Tags: data, transformers, collator

Preface

A dataset is loaded with huggingface's Dataset, and the text is then encoded with a tokenizer. At that point, however, the features are still plain Python lists rather than tensors, and they must be converted to the tensor type required by the deep learning framework. The job of a data_collator is precisely this: to turn the encoded features into a batch of tensors.

This article documents two of the more commonly used data collators in huggingface transformers: default_data_collator and DataCollatorWithPadding. Throughout, BertTokenizer is used as the base tokenizer, as shown below:

from transformers import BertTokenizer
from transformers import default_data_collator, DataCollatorWithPadding
from datasets import Dataset

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")

def func(exam):
    return tokenizer(exam["text"])
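
It is worth checking what a single encoded feature looks like before any collation. A minimal sketch (the output comments reflect the usual behavior of a BERT tokenizer; exact ids are omitted):

enc = tokenizer("我爱中国。")
print(type(enc))               # <class 'transformers.tokenization_utils_base.BatchEncoding'>
print(list(enc.keys()))        # ['input_ids', 'token_type_ids', 'attention_mask']
print(type(enc["input_ids"]))  # <class 'list'> -- plain Python ints, not tensors yet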

default_data_collator

When the PyTorch framework is used, default_data_collator essentially just executes torch_default_data_collator. Note that the input is required to be in List[Any] format, and the output satisfies Dict[str, Any].

def default_data_collator(features: List[InputDataClass], return_tensors="pt") -> Dict[str, Any]:
    """
    Very simple data collator that simply collates batches of dict-like objects and performs special handling for
    potential keys named:

        - `label`: handles a single value (int or float) per object
        - `label_ids`: handles a list of values per object

    Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
    to the model. See glue and ner for example of how it's useful.
    """

    # In this function we'll make the assumption that all `features` in the batch
    # have the same attributes.
    # So we will look at the first element as a proxy for what attributes exist
    # on the whole batch.

    if return_tensors == "pt":
        return torch_default_data_collator(features)
    elif return_tensors == "tf":
        return tf_default_data_collator(features)
    elif return_tensors == "np":
        return numpy_default_data_collator(features)

The source of torch_default_data_collator is shown below. It assumes that all features in the batch share the same attributes, so it uses the first example as a proxy for its type checks. The source also gives special handling to the label and label_ids keys of the features, which correspond to single-label and multi-label classification respectively, and renames either key to "labels", since most pretrained models name the keyword argument labels in their forward method.

def torch_default_data_collator(features: List[InputDataClass]) -> Dict[str, Any]:
    import torch

    if not isinstance(features[0], Mapping):
        features = [vars(f) for f in features]
    first = features[0]
    batch = {}

    # Special handling for labels.
    # Ensure that tensor is created with the correct type
    # (it should be automatically the case, but let's make sure of it.)
    if "label" in first and first["label"] is not None:
        label = first["label"].item() if isinstance(first["label"], torch.Tensor) else first["label"]
        dtype = torch.long if isinstance(label, int) else torch.float
        batch["labels"] = torch.tensor([f["label"] for f in features], dtype=dtype)
    elif "label_ids" in first and first["label_ids"] is not None:
        if isinstance(first["label_ids"], torch.Tensor):
            batch["labels"] = torch.stack([f["label_ids"] for f in features])
        else:
            dtype = torch.long if type(first["label_ids"][0]) is int else torch.float
            batch["labels"] = torch.tensor([f["label_ids"] for f in features], dtype=dtype)

    # Handling of all other possible keys.
    # Again, we will use the first element to figure out which key/values are not None for this model.
    for k, v in first.items():
        if k not in ("label", "label_ids") and v is not None and not isinstance(v, str):
            if isinstance(v, torch.Tensor):
                batch[k] = torch.stack([f[k] for f in features])
            elif isinstance(v, np.ndarray):
                batch[k] = torch.tensor(np.stack([f[k] for f in features]))
            else:
                batch[k] = torch.tensor([f[k] for f in features])

    return batch

Example:

x = [{"text": "我爱中国。", "label": 1}, {"text": "我爱中国。", "label": 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
dataset = default_data_collator(features)  # despite the name, this is a dict of batched tensors
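
Since both sentences encode to the same length, no padding is needed and the default collator succeeds. A hedged check of the result, plus a small illustration (with made-up input_ids) of the multi-label label_ids branch:

print(dataset["labels"])           # tensor([1, 1]) -- int labels are collated with dtype torch.long
print(dataset["input_ids"].shape)  # expected torch.Size([2, 7]): [CLS] + 5 characters + [SEP], twice

# `label_ids` (one list of values per example) lands in the multi-label branch
multi = [{"input_ids": [101, 102], "label_ids": [0.0, 1.0]},
         {"input_ids": [101, 102], "label_ids": [1.0, 0.0]}]
print(default_data_collator(multi)["labels"])  # tensor of shape (2, 2), dtype torch.float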

DataCollatorWithPadding

Note that DataCollatorWithPadding is a class: it must be instantiated first, and the instance is then called to convert the features into batched tensors. Compared with default_data_collator, DataCollatorWithPadding pads the received features, filling each dimension out to a common size. Its source is as follows:

@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
        return_tensors (`str`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        if "label" in batch:
            batch["labels"] = batch["label"]
            del batch["label"]
        if "label_ids" in batch:
            batch["labels"] = batch["label_ids"]
            del batch["label_ids"]
        return batch

During instantiation, note the meaning of pad_to_multiple_of: it rounds max_length up to an integer multiple of the given value. For example, if max_length=510 and pad_to_multiple_of=8, then max_length is set to 512. See the source of transformers.tokenization_utils_base.PreTrainedTokenizerBase._pad:

    def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs:
                Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask:
                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        ...
        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
        ...
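
The rounding arithmetic is easy to verify in isolation (a standalone sketch mirroring the branch above, not a tokenizer call):

max_length, pad_to_multiple_of = 510, 8
if max_length % pad_to_multiple_of != 0:
    max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
print(max_length)  # 512 = 64 * 8, the smallest multiple of 8 that is >= 510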

In the __call__ method of DataCollatorWithPadding, label or label_ids is likewise renamed to labels. The padding itself is in fact implemented by transformers.tokenization_utils_base.PreTrainedTokenizerBase.pad.

    def pad(
        self,
        encoded_inputs: Union[
            BatchEncoding,
            List[BatchEncoding],
            Dict[str, EncodedInput],
            Dict[str, List[EncodedInput]],
            List[Dict[str, EncodedInput]],
        ],
        padding: Union[bool, str, PaddingStrategy] = True,
        max_length: Optional[int] = None,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        verbose: bool = True,
    ) -> BatchEncoding:
        """
        Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length
        in the batch.

        Padding side (left/right) padding token ids are defined at the tokenizer level (with `self.padding_side`,
        `self.pad_token_id` and `self.pad_token_type_id`).

        Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the
        text followed by a call to the `pad` method to get a padded encoding.

        <Tip>

        If the `encoded_inputs` passed are dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the
        result will use the same type unless you provide a different tensor type with `return_tensors`. In the case of
        PyTorch tensors, you will lose the specific device of your tensors however.

        </Tip>

        Args:
            encoded_inputs ([`BatchEncoding`], list of [`BatchEncoding`], `Dict[str, List[int]]`, `Dict[str, List[List[int]]` or `List[Dict[str, List[int]]]`):
                Tokenized inputs. Can represent one input ([`BatchEncoding`] or `Dict[str, List[int]]`) or a batch of
                tokenized inputs (list of [`BatchEncoding`], *Dict[str, List[List[int]]]* or *List[Dict[str,
                List[int]]]*) so you can use this method during preprocessing as well as in a PyTorch Dataloader
                collate function.

                Instead of `List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see
                the note above for the return type.
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence if provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
                  lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            pad_to_multiple_of (`int`, *optional*):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask (`bool`, *optional*):
                Whether to return the attention mask. If left to the default, will return the attention mask according
                to the specific tokenizer's default, defined by the `return_outputs` attribute.

                [What are attention masks?](../glossary#attention-mask)
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            verbose (`bool`, *optional*, defaults to `True`):
                Whether or not to print more information and warnings.
        """
        ...
        # If we have a list of dicts, let's convert it in a dict of lists
        # We do this to allow using this method as a collate_fn function in PyTorch Dataloader
        if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
            encoded_inputs = {key: [example[key] for example in encoded_inputs] for key in encoded_inputs[0].keys()}
        ...
  • First, note the input formats that pad accepts: EncodedInput is an alias of List[int], and BatchEncoding can be regarded as a dictionary satisfying Dict[str, Any], with its data stored in the data attribute. During BatchEncoding instantiation, the convert_to_tensors method is invoked, which converts the data in the data attribute to tensors (when a tensor type is requested).
  • If the input features are in List[Dict[str, Any]] format, they are converted to Dict[str, List] so that pad can serve as a collate_fn for a PyTorch DataLoader, as shown in the sketch after this list. Also note that passing a datasets.Dataset instance directly to pad raises an error, because datasets.Dataset has no keys attribute.
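
This conversion is exactly what makes a DataCollatorWithPadding instance usable as the collate_fn of a PyTorch DataLoader. A minimal sketch, assuming the tokenizer and the mapped features Dataset from the earlier example (variable names here are mine):

from torch.utils.data import DataLoader

collator = DataCollatorWithPadding(tokenizer=tokenizer)
# each item of `features` is a dict such as {"input_ids": [...], "label": 1, ...};
# the DataLoader hands the collator a List[Dict[str, Any]], which it pads into tensors
loader = DataLoader(features, batch_size=2, collate_fn=collator)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # (2, L), where L is the longest sequence in this batch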

Example:

x += [{"text": "中国是一个伟大国家。", "label": 1}]
ds = Dataset.from_list(x)
features = ds.map(func, batched=False, remove_columns=["text"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
dataset = data_collator(features=features.to_list())  # convert Dataset into List
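
Assuming the Chinese BERT vocab tokenizes these sentences character by character, the first two examples encode to 7 tokens and the new one to 12, so padding=True should pad the whole batch to the longest length. A hedged check of the output:

print(dataset["input_ids"].shape)    # expected torch.Size([3, 12]) -- padded to the longest sequence
print(dataset["attention_mask"][0])  # ends in zeros: the padded positions of the shorter sentence
print(dataset["labels"])             # tensor([1, 1, 1]) -- `label` has been renamed to `labels`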

