运行paddlenlp入门示例:训练与演算

2023-10-29 05:40

本文主要是介绍运行paddlenlp入门示例:训练与演算,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

0. 环境

win10 + NVIDIA GeForce GTX 1660 Ti 6GB
python3.9
cuda 10.2
cudnn 7.6.5
paddlepaddle 2.2.0(已经搭建好GPU版本)

1. 安装PaddleNLP

python -m pip install --upgrade paddlenlp -i https://pypi.org/simple

2. 运行脚本

2.1 创建文件E:\Workspaces\python\nlp\pynlp_10min.py,添加以下内容

        

import paddlenlp as ppnlp
from paddlenlp.datasets import load_datasettrain_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])print(train_ds.label_list)for data in train_ds.data[:5]:print(data)# 设置想要使用模型的名称
MODEL_NAME = "ernie-1.0"tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME)
ernie_model = ppnlp.transformers.ErnieModel.from_pretrained(MODEL_NAME)import paddle# 将原始输入文本切分token,
tokens = tokenizer._tokenize("请输入测试样例")
print("Tokens: {}".format(tokens))# token映射为对应token id
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Tokens id: {}".format(tokens_ids))# 拼接上预训练模型对应的特殊token ,如[CLS]、[SEP]
tokens_ids = tokenizer.build_inputs_with_special_tokens(tokens_ids)# 转化成paddle框架数据格式
tokens_pd = paddle.to_tensor([tokens_ids])
print("Tokens : {}".format(tokens_pd))# 此时即可输入ERNIE模型中得到相应输出
sequence_output, pooled_output = ernie_model(tokens_pd)
print("Token wise output: {}, Pooled output: {}".format(sequence_output.shape, pooled_output.shape))# 一行代码完成切分token,映射token ID以及拼接特殊token
encoded_text = tokenizer(text="请输入测试样例")
for key, value in encoded_text.items():print("{}:\n\t{}".format(key, value))# 转化成paddle框架数据格式
input_ids = paddle.to_tensor([encoded_text['input_ids']])
print("input_ids : {}".format(input_ids))
segment_ids = paddle.to_tensor([encoded_text['token_type_ids']])
print("token_type_ids : {}".format(segment_ids))# 此时即可输入ERNIE模型中得到相应输出
sequence_output, pooled_output = ernie_model(input_ids, segment_ids)
print("Token wise output: {}, Pooled output: {}".format(sequence_output.shape, pooled_output.shape))# 单句输入
single_seg_input = tokenizer(text="请输入测试样例")
# 句对输入
multi_seg_input = tokenizer(text="请输入测试样例1", text_pair="请输入测试样例2")print("单句输入token (str): {}".format(tokenizer.convert_ids_to_tokens(single_seg_input['input_ids'])))
print("单句输入token (int): {}".format(single_seg_input['input_ids']))
print("单句输入segment ids : {}".format(single_seg_input['token_type_ids']))print()
print("句对输入token (str): {}".format(tokenizer.convert_ids_to_tokens(multi_seg_input['input_ids'])))
print("句对输入token (int): {}".format(multi_seg_input['input_ids']))
print("句对输入segment ids : {}".format(multi_seg_input['token_type_ids']))# Highlight: padding到统一长度
encoded_text = tokenizer(text="请输入测试样例",  max_seq_len=15)for key, value in encoded_text.items():print("{}:\n\t{}".format(key, value))# ---------------------------------------------------------------------------------------------------
# 数据读入from functools import partial
from paddlenlp.data import Stack, Tuple, Pad
import numpy as npdef convert_example(example,tokenizer,max_seq_length=512,is_test=False):# 将原数据处理成model可读入的格式,enocded_inputs是一个dict,包含input_ids、token_type_ids等字段encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)# input_ids:对文本切分token后,在词汇表中对应的token idinput_ids = encoded_inputs["input_ids"]# token_type_ids:当前token属于句子1还是句子2,即上述图中表达的segment idstoken_type_ids = encoded_inputs["token_type_ids"]if not is_test:# label:情感极性类别label = np.array([example["label"]], dtype="int64")return input_ids, token_type_ids, labelelse:# qid:每条数据的编号qid = np.array([example["qid"]], dtype="int64")return input_ids, token_type_ids, qid
def create_dataloader(dataset,trans_fn=None,mode='train',batch_size=1,batchify_fn=None):if trans_fn:dataset = dataset.map(trans_fn)shuffle = True if mode == 'train' else Falseif mode == "train":sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)else:sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn)return dataloader# 模型运行批处理大小
batch_size = 8     # 32
max_seq_length = 128trans_func = partial(convert_example,tokenizer=tokenizer,max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),  # inputPad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segmentStack(dtype="int64")  # label
): [data for data in fn(samples)]
train_data_loader = create_dataloader(train_ds,mode='train',batch_size=batch_size,batchify_fn=batchify_fn,trans_fn=trans_func)
dev_data_loader = create_dataloader(dev_ds,mode='dev',batch_size=batch_size,batchify_fn=batchify_fn,trans_fn=trans_func)# ---------------------------------------------------------------------------------------------------
# PaddleNLP一键加载预训练模型ernie_model = ppnlp.transformers.ErnieModel.from_pretrained(MODEL_NAME)
model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=len(train_ds.label_list))# ---------------------------------------------------------------------------------------------------
# 设置Fine-Tune优化策略,接入评价指标from paddlenlp.transformers import LinearDecayWithWarmup# 训练过程中的最大学习率
learning_rate = 5e-5 
# 训练轮次
epochs = 1 #3
# 学习率预热比例
warmup_proportion = 0.1
# 权重衰减系数,类似模型正则项策略,避免模型过拟合
weight_decay = 0.01num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)
optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler,parameters=model.parameters(),weight_decay=weight_decay,apply_decay_param_fun=lambda x: x in [p.name for n, p in model.named_parameters()if not any(nd in n for nd in ["bias", "norm"])])criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()# ---------------------------------------------------------------------------------------------------
# 模型训练与评估
import paddle.nn.functional as F
def evaluate(model, criterion, metric, data_loader):model.eval()metric.reset()losses = []for batch in data_loader:input_ids, token_type_ids, labels = batchlogits = model(input_ids, token_type_ids)loss = criterion(logits, labels)losses.append(loss.numpy())correct = metric.compute(logits, labels)metric.update(correct)accu = metric.accumulate()# print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))model.train()metric.reset()return  np.mean(losses), accuglobal_step = 0
for epoch in range(1, epochs + 1):for step, batch in enumerate(train_data_loader, start=1):input_ids, segment_ids, labels = batchlogits = model(input_ids, segment_ids)loss = criterion(logits, labels)probs = F.softmax(logits, axis=1)correct = metric.compute(probs, labels)metric.update(correct)acc = metric.accumulate()global_step += 1if global_step % 10 == 0 :print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))loss.backward()optimizer.step()lr_scheduler.step()optimizer.clear_grad()evaluate(model, criterion, metric, dev_data_loader)model.save_pretrained('E:\\Workspaces\\python\\nlp\\checkpoint')
tokenizer.save_pretrained('E:\\Workspaces\\python\\nlp\\checkpoint')# ---------------------------------------------------------------------------------------------------
# 模型预测
from utils import predictdata = [{"text":'这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'},{"text":'怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'},{"text":'作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'},
]
label_map = {0: 'negative', 1: 'positive'}results = predict(model, data, tokenizer, label_map, batch_size=batch_size)
for idx, text in enumerate(data):print('Data: {} \t Lable: {}'.format(text, results[idx]))

2.2 创建文件E:\Workspaces\python\nlp\utils.py,添加以下内容

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Paddef predict(model, data, tokenizer, label_map, batch_size=1):"""Predicts the data labels.Args:model (obj:`paddle.nn.Layer`): A model to classify texts.data (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object.A Example object contains `text`(word_ids) and `se_len`(sequence length).tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. Users should refer to the superclass for more information regarding methods.label_map(obj:`dict`): The label id (key) to label str (value) map.batch_size(obj:`int`, defaults to 1): The number of batch.Returns:results(obj:`dict`): All the predictions labels."""examples = []for text in data:input_ids, segment_ids = convert_example(text,tokenizer,max_seq_length=128,is_test=True)examples.append((input_ids, segment_ids))batchify_fn = lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input idPad(axis=0, pad_val=tokenizer.pad_token_id),  # segment id): fn(samples)# Seperates data into some batches.batches = []one_batch = []for example in examples:one_batch.append(example)if len(one_batch) == batch_size:batches.append(one_batch)one_batch = []if one_batch:# The last batch whose size is less than the config batch_size setting.batches.append(one_batch)results = []model.eval()for batch in batches:input_ids, segment_ids = batchify_fn(batch)input_ids = paddle.to_tensor(input_ids)segment_ids = paddle.to_tensor(segment_ids)logits = model(input_ids, segment_ids)probs = F.softmax(logits, axis=1)idx = paddle.argmax(probs, axis=1).numpy()idx = idx.tolist()labels = [label_map[i] for i in idx]results.extend(labels)return results@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):"""Given a dataset, it evals model and computes the metric.Args:model(obj:`paddle.nn.Layer`): A model to classify texts.data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.criterion(obj:`paddle.nn.Layer`): It can compute the loss.metric(obj:`paddle.metric.Metric`): The evaluation metric."""model.eval()metric.reset()losses = []for batch in data_loader:input_ids, token_type_ids, labels = batchlogits = model(input_ids, token_type_ids)loss = criterion(logits, labels)losses.append(loss.numpy())correct = metric.compute(logits, labels)metric.update(correct)accu = metric.accumulate()print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))model.train()metric.reset()def convert_example(example, tokenizer, max_seq_length=512, is_test=False):"""Builds model inputs from a sequence or a pair of sequence for sequence classification tasksby concatenating and adding special tokens. And creates a mask from the two sequences passed to be used in a sequence-pair classification task.A BERT sequence has the following format:- single sequence: ``[CLS] X [SEP]``- pair of sequences: ``[CLS] A [SEP] B [SEP]``A BERT sequence pair mask has the following format:::0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1| first sequence    | second sequence |If only one sequence, only returns the first portion of the mask (0's).Args:example(obj:`list[str]`): List of input data, containing text and label if it have label.tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most of the methods. Users should refer to the superclass for more information regarding methods.max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.Returns:input_ids(obj:`list[int]`): The list of token ids.token_type_ids(obj: `list[int]`): List of sequence pair mask.label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test."""encoded_inputs = tokenizer(text=example["text"], max_seq_len=max_seq_length)input_ids = encoded_inputs["input_ids"]token_type_ids = encoded_inputs["token_type_ids"]if not is_test:label = np.array([example["label"]], dtype="int64")return input_ids, token_type_ids, labelelse:return input_ids, token_type_idsdef create_dataloader(dataset,mode='train',batch_size=1,batchify_fn=None,trans_fn=None):if trans_fn:dataset = dataset.map(trans_fn)shuffle = True if mode == 'train' else Falseif mode == 'train':batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)else:batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)return paddle.io.DataLoader(dataset=dataset,batch_sampler=batch_sampler,collate_fn=batchify_fn,return_list=True)

3. 运行结果

 


4. 小结

4.1 缺少了utils.py,在参考[3]中找到了。

4.2 当GPU内存不够用时候,需要将batch_size降低。原本batch_size是32的,我的GPU顶不住,改为8,可以顺利运行本示例。

参考

参考[1],10分钟完成高精度中文情感分析
https://paddlenlp.readthedocs.io/zh/latest/get_started/quick_start.html
参考[2],超简单【推特文本情感13分类练习赛】高分baseline
https://blog.csdn.net/weixin_41450123/article/details/120520141?spm=1001.2014.3001.5501
参考[3],『NLP经典项目集』02:使用预训练模型ERNIE优化情感分析,https://blog.csdn.net/qq_15821487/article/details/117123555
 

这篇关于运行paddlenlp入门示例:训练与演算的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/298603

相关文章

Spring Boot 3.4.3 基于 Spring WebFlux 实现 SSE 功能(代码示例)

《SpringBoot3.4.3基于SpringWebFlux实现SSE功能(代码示例)》SpringBoot3.4.3结合SpringWebFlux实现SSE功能,为实时数据推送提供... 目录1. SSE 简介1.1 什么是 SSE?1.2 SSE 的优点1.3 适用场景2. Spring WebFlu

springboot security快速使用示例详解

《springbootsecurity快速使用示例详解》:本文主要介绍springbootsecurity快速使用示例,具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝... 目录创www.chinasem.cn建spring boot项目生成脚手架配置依赖接口示例代码项目结构启用s

golang 日志log与logrus示例详解

《golang日志log与logrus示例详解》log是Go语言标准库中一个简单的日志库,本文给大家介绍golang日志log与logrus示例详解,感兴趣的朋友一起看看吧... 目录一、Go 标准库 log 详解1. 功能特点2. 常用函数3. 示例代码4. 优势和局限二、第三方库 logrus 详解1.

SpringBoot实现MD5加盐算法的示例代码

《SpringBoot实现MD5加盐算法的示例代码》加盐算法是一种用于增强密码安全性的技术,本文主要介绍了SpringBoot实现MD5加盐算法的示例代码,文中通过示例代码介绍的非常详细,对大家的学习... 目录一、什么是加盐算法二、如何实现加盐算法2.1 加盐算法代码实现2.2 注册页面中进行密码加盐2.

Redis 中的热点键和数据倾斜示例详解

《Redis中的热点键和数据倾斜示例详解》热点键是指在Redis中被频繁访问的特定键,这些键由于其高访问频率,可能导致Redis服务器的性能问题,尤其是在高并发场景下,本文给大家介绍Redis中的热... 目录Redis 中的热点键和数据倾斜热点键(Hot Key)定义特点应对策略示例数据倾斜(Data S

JavaScript Array.from及其相关用法详解(示例演示)

《JavaScriptArray.from及其相关用法详解(示例演示)》Array.from方法是ES6引入的一个静态方法,用于从类数组对象或可迭代对象创建一个新的数组实例,本文将详细介绍Array... 目录一、Array.from 方法概述1. 方法介绍2. 示例演示二、结合实际场景的使用1. 初始化二

C#中的 StreamReader/StreamWriter 使用示例详解

《C#中的StreamReader/StreamWriter使用示例详解》在C#开发中,StreamReader和StreamWriter是处理文本文件的核心类,属于System.IO命名空间,本... 目录前言一、什么是 StreamReader 和 StreamWriter?1. 定义2. 特点3. 用

Java中&和&&以及|和||的区别、应用场景和代码示例

《Java中&和&&以及|和||的区别、应用场景和代码示例》:本文主要介绍Java中的逻辑运算符&、&&、|和||的区别,包括它们在布尔和整数类型上的应用,文中通过代码介绍的非常详细,需要的朋友可... 目录前言1. & 和 &&代码示例2. | 和 ||代码示例3. 为什么要使用 & 和 | 而不是总是使

Java强制转化示例代码详解

《Java强制转化示例代码详解》:本文主要介绍Java编程语言中的类型转换,包括基本类型之间的强制类型转换和引用类型的强制类型转换,文中通过代码介绍的非常详细,需要的朋友可以参考下... 目录引入基本类型强制转换1.数字之间2.数字字符之间引入引用类型的强制转换总结引入在Java编程语言中,类型转换(无论

redis+lua实现分布式限流的示例

《redis+lua实现分布式限流的示例》本文主要介绍了redis+lua实现分布式限流的示例,可以实现复杂的限流逻辑,如滑动窗口限流,并且避免了多步操作导致的并发问题,具有一定的参考价值,感兴趣的可... 目录为什么使用Redis+Lua实现分布式限流使用ZSET也可以实现限流,为什么选择lua的方式实现