This article describes how to make ALBERT share parameters across every two layers rather than across all layers. Hopefully it serves as a useful reference for developers working on the same problem.
1. The original ALBERT implementation (brightmart's version)
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      name_variable_scope = "layer_shared"
    else:
      name_variable_scope = "layer_%d" % layer_idx

    # share all parameters across layers. add by brightmart, 2019-09-28.
    # previous it is like this: "layer_%d" % layer_idx
    with tf.variable_scope(name_variable_scope,
                           reuse=True if (share_parameter_across_layers
                                          and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
2. Changing the code to share parameters every two layers
In TensorFlow, to save variable storage space we usually implement shared variables by reusing a variable scope (variable_scope).
The common but rather clumsy approach is to set reuse=True whenever the scope is entered again (i.e., on every use after the first). The catch is that entering a scope with reuse=True before its variables have been created raises an error (see this post on parameter sharing: https://blog.csdn.net/qq_35203425/article/details/82469348). The cleaner approach is tf.variable_scope(..., reuse=tf.AUTO_REUSE), which creates each variable on first use and reuses it on every later entry. A minimal sketch of the two approaches is shown below, followed by transformer_model modified to share parameters every two layers.
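The following is a minimal sketch, independent of the ALBERT code, contrasting the two reuse styles (it assumes TensorFlow 1.x; the scope and variable names are made up for illustration):

import tensorflow as tf  # TensorFlow 1.x


def dense_block(x):
  # tf.get_variable either creates the variable or returns the existing one,
  # depending on the reuse setting of the enclosing variable_scope.
  w = tf.get_variable("w", shape=[4, 4],
                      initializer=tf.truncated_normal_initializer(stddev=0.02))
  return tf.matmul(x, w)


x = tf.placeholder(tf.float32, [None, 4])

# Clumsy way: the first entry must create the variables (reuse=False);
# every later entry must explicitly pass reuse=True, and reuse=True on a
# scope whose variables do not exist yet raises a ValueError.
with tf.variable_scope("block"):               # creates block/w
  y0 = dense_block(x)
with tf.variable_scope("block", reuse=True):   # reuses block/w
  y1 = dense_block(x)

# Cleaner way: AUTO_REUSE creates the variable on first use and reuses it
# afterwards, so the same line works for every iteration of a loop.
for _ in range(3):
  with tf.variable_scope("block_auto", reuse=tf.AUTO_REUSE):
    y = dense_block(x)

With tf.AUTO_REUSE in hand, the only changes needed in transformer_model are the scope name (group the layers in pairs) and the reuse argument: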
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      # name_variable_scope = "layer_shared"
      name_variable_scope = "layer_%d" % (layer_idx // 2)
    else:
      name_variable_scope = "layer_%d" % layer_idx

    # share all parameters across layers. add by brightmart, 2019-09-28.
    # previous it is like this: "layer_%d" % layer_idx
    with tf.variable_scope(name_variable_scope,
                           reuse=tf.AUTO_REUSE if (share_parameter_across_layers
                                                   and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
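Two details of this change are worth spelling out. First, layer_idx // 2 maps consecutive pairs of layers onto the same scope name, so layers 0 and 1 share one set of weights, layers 2 and 3 share another, and so on. A quick (hypothetical) check of the mapping:

num_hidden_layers = 12
for layer_idx in range(num_hidden_layers):
  print(layer_idx, "->", "layer_%d" % (layer_idx // 2))
# 0 -> layer_0, 1 -> layer_0, 2 -> layer_1, 3 -> layer_1, ..., 11 -> layer_5

A 12-layer encoder therefore creates only six distinct sets of encoder weights, and changing the divisor (e.g. layer_idx // 3) generalizes the same trick to sharing every three layers. Second, reuse=True would no longer work here: when layer_idx == 2 the scope layer_1 has not been created yet, so its variables still need to be created even though layer_idx > 0. tf.AUTO_REUSE handles both creation and reuse, which is exactly why it replaces the plain reuse=True of the original implementation.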
3. Advantages of weight sharing
1) The reduced cost is only icing on the cake
Weight sharing does cut the amount of computation and, above all, the number of parameters that have to be stored and trained, but this saving is a side benefit rather than the main point; a back-of-the-envelope count for the sharing schemes above is sketched below.
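The following is a rough, hypothetical estimate (BERT-base sizes assumed, embeddings excluded) of how the sharing scheme affects the encoder's parameter count:

# Back-of-the-envelope parameter count for one Transformer encoder block
# (hidden_size=768, intermediate_size=3072, 12 layers; embeddings excluded).
hidden, intermediate, layers = 768, 3072, 12

attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections (+ biases)
ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
layer_norm = 2 * (2 * hidden)                # two LayerNorms per block
per_layer = attention + ffn + layer_norm     # roughly 7.1M parameters

print("no sharing          :", layers * per_layer)          # ~85M
print("share every 2 layers:", (layers // 2) * per_layer)   # ~42.5M
print("share all (ALBERT)  :", per_layer)                   # ~7.1M

Note that the forward pass still runs all 12 layers in every case; what shrinks is the set of distinct weights the model has to store and learn.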
2) The essence of weight sharing is feature extraction
As mentioned earlier, a weight matrix acts as a template: we compare samples against the template and check whether they exhibit the pattern (feature) that the template encodes.
3) Weight sharing helps the model generalize
An ordinary fully connected network expects a fixed input layout, whereas weight sharing frees the model from that constraint.
For example, take many images that each contain a face, but with the face at different positions and the images at different sizes; shared weights (as in a convolution) can scan the whole image, find the face wherever it is, and extract the feature.
Likewise, for semantic analysis with an RNN, the sentences "I went to Beijing last year" and "Last year my parents and I went to Beijing" mean roughly the same thing, yet the words sit at different positions and the sentences have different lengths.
Weight sharing lets the model handle the features of a continuous sequence regardless of the total input length. It can still recognize that sequence when it appears at a different position in the sample, instead of learning a separate rule for every position; this captures the continuity between features and reduces the number of rules to learn. A toy illustration is given below.
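As a toy illustration (plain NumPy, not taken from the original post): the same shared weights produce the same response wherever the pattern occurs, which is the position invariance described above.

import numpy as np

# A shared 1-D filter (the same weights applied at every position),
# showing that a pattern is detected regardless of where it appears.
pattern = np.array([1.0, -1.0, 1.0])                       # shared weights (the "template")
signal_a = np.array([0, 0, 1, -1, 1, 0, 0], dtype=float)   # pattern near the start
signal_b = np.array([0, 0, 0, 0, 1, -1, 1], dtype=float)   # same pattern at the end


def correlate(signal, w):
  # Slide the shared weights over the signal and record the response at each position.
  return [float(np.dot(signal[i:i + len(w)], w)) for i in range(len(signal) - len(w) + 1)]


print(correlate(signal_a, pattern))  # peak response (3.0) at position 2
print(correlate(signal_b, pattern))  # identical peak response (3.0), now at position 4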
Reference blog posts
https://blog.csdn.net/qq_35203425/article/details/82469348 (variable-scope sharing with tf.AUTO_REUSE)
https://zhuanlan.zhihu.com/p/103226488 (a module-by-module explanation of BERT; a very detailed post)
https://www.cnblogs.com/yanshw/p/10483014.html (revisiting weight sharing; this post argues that weight sharing is essential)
That concludes this note on sharing parameters every two layers in ALBERT; hopefully the approach above is useful.