ALBERT: sharing parameters every two layers

2024-08-28 01:32

This post shows how to modify the ALBERT implementation so that parameters are shared every two layers instead of across all layers.

1. The original ALBERT implementation (brightmart's version)

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      name_variable_scope = "layer_shared"
    else:
      name_variable_scope = "layer_%d" % layer_idx

    # share all parameters across layers. add by brightmart, 2019-09-28.
    # previous it is like this: "layer_%d" % layer_idx
    with tf.variable_scope(
        name_variable_scope,
        reuse=True if (share_parameter_across_layers and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

2. Changing it to share parameters every two layers

In TensorFlow, to save on variable storage, we often implement variable sharing by sharing a variable scope (variable_scope).

A common but rather clumsy approach is to set reuse=True whenever the scope is entered again (i.e., on every use after the first) so that the shared variable_scope is re-opened. The catch is that if the variables have not been created yet, entering the scope with reuse=True raises an error (see the post on parameter sharing at https://blog.csdn.net/qq_35203425/article/details/82469348). Using tf.variable_scope(..., reuse=tf.AUTO_REUSE) handles both cases in one shot: variables are created on the first use and reused afterwards. A minimal sketch of the two styles follows, and the full modified transformer_model template is then given below:
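To make the two reuse styles concrete, here is a minimal sketch (my illustration, not from the ALBERT code; the scope and variable names are made up): with a plain boolean flag the caller has to know whether this is the first use, while tf.AUTO_REUSE creates the variable the first time and reuses it afterwards.

import tensorflow as tf  # TF 1.x graph-mode API

# Style 1: explicit reuse flag -- must be False on the first use and True
# afterwards, otherwise tf.get_variable raises a ValueError.
def shared_dense_v1(x, is_first_call):
  with tf.variable_scope("shared_dense", reuse=not is_first_call):
    w = tf.get_variable("w", shape=[128, 128])
    return tf.matmul(x, w)

# Style 2: tf.AUTO_REUSE -- the variable is created on the first call and
# silently reused on every later call, no bookkeeping needed.
def shared_dense_v2(x):
  with tf.variable_scope("shared_dense_auto", reuse=tf.AUTO_REUSE):
    w = tf.get_variable("w", shape=[128, 128])
    return tf.matmul(x, w)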

 

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      # name_variable_scope = "layer_shared"
      name_variable_scope = "layer_%d" % (layer_idx // 2)
    else:
      name_variable_scope = "layer_%d" % layer_idx

    # share all parameters across layers. add by brightmart, 2019-09-28.
    # previous it is like this: "layer_%d" % layer_idx
    with tf.variable_scope(
        name_variable_scope,
        reuse=tf.AUTO_REUSE if (share_parameter_across_layers and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
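As a quick sanity check (my addition, not part of the original post): with the scope name computed as "layer_%d" % (layer_idx // 2), each consecutive pair of layers lands in the same variable scope, so a 12-layer encoder keeps only 6 sets of encoder weights.

# Illustrative only: reproduce the scope-name mapping used in the loop above.
num_hidden_layers = 12
for layer_idx in range(num_hidden_layers):
  print(layer_idx, "->", "layer_%d" % (layer_idx // 2))
# layers 0,1 -> layer_0; layers 2,3 -> layer_1; ...; layers 10,11 -> layer_5

After building the graph, listing tf.trainable_variables() should show encoder variables only under layer_0 through layer_5, which is an easy way to confirm that the sharing behaves as intended.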

 

3. Advantages of weight sharing

1) Reduced computation is only icing on the cake

Weight sharing reduces the number of distinct weights that have to be stored and learned; every layer is still executed, so the saving is mainly in parameter count rather than in computation per forward pass.
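A rough back-of-the-envelope estimate (my numbers, assuming a BERT-base sized encoder with hidden_size=768 and intermediate_size=3072) shows where the saving actually is:

# Approximate encoder parameters per Transformer layer (weights, biases, LayerNorms).
hidden, ffn = 768, 3072
attention = 4 * (hidden * hidden + hidden)                     # Q, K, V and output projections
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # up- and down-projection
layer_norms = 2 * 2 * hidden                                   # two LayerNorms, scale + bias
per_layer = attention + feed_forward + layer_norms             # ~7.1M

layers = 12
print(per_layer * layers)          # no sharing:              ~85M encoder parameters
print(per_layer)                   # shared across all 12:    ~7.1M
print(per_layer * (layers // 2))   # shared every two layers: ~42.5M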

2) The essence of weight sharing is feature extraction

As mentioned earlier, weights act like templates: we compare the sample against a template and check whether it shows the appearance (feature) that the template encodes.

3) Weight sharing helps the model generalize

An ordinary neural network expects a fixed-size input, whereas weight sharing lets the model handle inputs that are not fixed.

For example, given many images that each contain a face, where the face appears at different positions and the images have different sizes, shared weights can scan the whole image, locate the face, and extract its features.
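A tiny illustration (shapes chosen arbitrarily, not tied to any particular model): one 3x3 kernel bank is created once and slid over every position of an image of any size, so the same weights respond to the pattern wherever it appears.

import tensorflow as tf  # TF 1.x

image = tf.placeholder(tf.float32, shape=[None, None, None, 1])  # any batch and image size
kernel = tf.get_variable("shared_kernel", shape=[3, 3, 1, 8])    # one shared 3x3 kernel bank
feature_map = tf.nn.conv2d(image, kernel, strides=[1, 1, 1, 1], padding="SAME")
# The 3*3*1*8 = 72 weights are reused at every spatial position,
# regardless of where the face is or how large the image is.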

Another example is an RNN doing semantic analysis on two sentences: "I went to Beijing last year" and "Last year my parents and I went to Beijing." They mean roughly the same thing, but the words appear in different positions and the sentences have different lengths.

Weight sharing lets the model process the features of a continuous sequence regardless of the total sequence length.

When that sequence appears at different positions in a sample, the model can still recognize it instead of learning a separate rule for every position; this captures the continuity between features and reduces the number of rules that have to be learned.
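In the same spirit, a minimal RNN sketch (illustrative only): a single cell, and therefore a single set of weights, is applied at every time step, so sentences of different lengths and word orders pass through exactly the same parameters.

import tensorflow as tf  # TF 1.x

inputs = tf.placeholder(tf.float32, shape=[None, None, 64])  # [batch, time, features], any length
cell = tf.nn.rnn_cell.BasicRNNCell(num_units=128)            # one set of recurrent weights
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
# dynamic_rnn unrolls over the time dimension and reuses the same cell weights
# at every step, whether the sentence has 6 words or 10.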

 

References

https://blog.csdn.net/qq_35203425/article/details/82469348 (scope sharing with tf.AUTO_REUSE)

https://zhuanlan.zhihu.com/p/103226488 (a detailed walkthrough of each BERT module; a very thorough post)

https://www.cnblogs.com/yanshw/p/10483014.html (more on weight sharing; this post argues that weight sharing is necessary)

 

 
