ALBERT: sharing parameters every two layers

2024-08-28 01:32
Tags: parameters, sharing, every two layers, albert

This article describes how to make ALBERT share parameters every two layers instead of across all layers, in the hope that it offers a useful reference to developers facing the same problem.

1. ALBERT's original implementation (the brightmart implementation)

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      name_variable_scope = "layer_shared"
    else:
      name_variable_scope = "layer_%d" % layer_idx
    # share all parameters across layers. add by brightmart, 2019-09-28.
    # previous it is like this: "layer_%d" % layer_idx
    with tf.variable_scope(
        name_variable_scope,
        reuse=True if (share_parameter_across_layers and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
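In this original version, when share_parameter_across_layers is True every iteration of the loop enters the same "layer_shared" scope, and reuse is False only on the first iteration, so all num_hidden_layers blocks end up using a single set of encoder weights. A small trace of the scope assignment (illustration only, not part of the original code):

# How the loop above picks scope names when sharing is enabled (12 layers):
share_parameter_across_layers = True
for layer_idx in range(12):
    if share_parameter_across_layers:
        scope = "layer_shared"
    else:
        scope = "layer_%d" % layer_idx
    reuse = share_parameter_across_layers and layer_idx > 0
    print(layer_idx, scope, reuse)
# 0  layer_shared False   <- variables are created here
# 1  layer_shared True    <- every later layer reuses them
# ...
# 11 layer_shared True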

2. Changing it to share parameters every two layers

In TensorFlow, to save on variable storage, we often need to share variables by sharing a variable scope (variable_scope).

The most common (and rather clumsy) approach is to set reuse=True whenever a variable_scope is entered again, i.e., on every use after the first. The drawback is that entering a scope with reuse=True before its variables have been created raises an error; see the blog post on parameter sharing at https://blog.csdn.net/qq_35203425/article/details/82469348. Using tf.variable_scope(..., reuse=tf.AUTO_REUSE) handles both cases in one shot: variables are created on first use and reused afterwards. A minimal sketch of the mechanism is shown first, followed by the full code template:
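The sketch below assumes TensorFlow 1.x; the scope and variable names are invented purely for illustration.

import tensorflow as tf  # TF 1.x

def dense_block(x):
    # tf.get_variable participates in variable_scope sharing.
    w = tf.get_variable("w", shape=[4, 4],
                        initializer=tf.truncated_normal_initializer(stddev=0.02))
    return tf.matmul(x, w)

x = tf.placeholder(tf.float32, [None, 4])

# Clumsy approach: the caller must know whether this is the first use.
with tf.variable_scope("block"):              # first use: creates block/w
    y1 = dense_block(x)
with tf.variable_scope("block", reuse=True):  # later uses: must pass reuse=True
    y2 = dense_block(x)

# AUTO_REUSE: create on first use, reuse afterwards, no bookkeeping needed.
with tf.variable_scope("block2", reuse=tf.AUTO_REUSE):
    z1 = dense_block(x)
with tf.variable_scope("block2", reuse=tf.AUTO_REUSE):
    z2 = dense_block(x)

print([v.name for v in tf.trainable_variables()])  # ['block/w:0', 'block2/w:0']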

 

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      # name_variable_scope = "layer_shared"               # original: all layers share one scope
      name_variable_scope = "layer_%d" % (layer_idx // 2)  # modified: every two layers share a scope
    else:
      name_variable_scope = "layer_%d" % layer_idx
    # original comment by brightmart, 2019-09-28: share all parameters across layers;
    # previously it was: "layer_%d" % layer_idx
    with tf.variable_scope(
        name_variable_scope,
        reuse=tf.AUTO_REUSE if (share_parameter_across_layers and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
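With `layer_idx // 2`, layers 0 and 1 map to scope layer_0, layers 2 and 3 to layer_1, and so on, so a 12-layer encoder keeps only 6 distinct sets of block parameters. The following self-contained sketch (TensorFlow 1.x; a single dense layer stands in for the full Transformer block, and the names are illustrative only) shows that just one set of variables is created per pair of layers:

import tensorflow as tf  # TF 1.x

def tiny_block(x, hidden_size=8):
    # Stand-in for one Transformer block.
    return tf.layers.dense(x, hidden_size, name="dense")

x = tf.placeholder(tf.float32, [None, 8])
out = x
num_hidden_layers = 4
for layer_idx in range(num_hidden_layers):
    scope = "layer_%d" % (layer_idx // 2)      # every two layers share one scope
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        out = tiny_block(out)

# Four layers, but only two kernels and two biases are created.
print([v.name for v in tf.trainable_variables()])
# ['layer_0/dense/kernel:0', 'layer_0/dense/bias:0',
#  'layer_1/dense/kernel:0', 'layer_1/dense/bias:0']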

 

3. Advantages of weight sharing

1) Reduced computation is only icing on the cake

Weight sharing can reduce the amount of computation and the number of parameters, but that is only a secondary benefit.
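For the ALBERT-style sharing above, the most direct saving is in the number of stored parameters. A rough back-of-the-envelope count, assuming BERT-base sizes (hidden size 768, feed-forward size 3072, 12 layers) and ignoring biases and layer-norm parameters:

# Approximate encoder parameter counts under three sharing schemes.
hidden, ffn, layers = 768, 3072, 12
attention    = 4 * hidden * hidden      # Q, K, V and output projections
feed_forward = 2 * hidden * ffn         # up- and down-projection
per_layer = attention + feed_forward    # ~7.1M per block

print("no sharing          :", layers * per_layer)         # ~85.0M
print("share every 2 layers:", (layers // 2) * per_layer)  # ~42.5M
print("share all layers    :", per_layer)                  # ~7.1M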

2) The essence of weight sharing is feature extraction

As mentioned earlier, a weight is essentially a template: we compare samples against the template and check whether they exhibit the pattern (feature) that the template describes.

3) Weight sharing helps the model generalize

An ordinary neural network expects a fixed input, whereas weight sharing allows the input to vary.

For example, given many images that each contain a face, the face may appear at different positions and the images may differ in size; with weight sharing, the same filter can scan the whole image, locate the face, and extract its features.

Another example is an RNN doing semantic analysis. The two sentences "I went to Beijing last year" and "Last year my parents and I went to Beijing" mean roughly the same thing, yet the words sit at different positions and the sentences have different lengths.

Weight sharing lets the model process the features of a continuous sequence regardless of the total length of the input.

Even when that subsequence appears at different positions in different samples, the model can still recognize it instead of learning a separate rule for every position; this not only captures the continuity between features, it also reduces the number of rules that have to be learned.
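As a minimal TensorFlow 1.x illustration of this point (the sizes here are made up), a single RNN cell's weights are applied at every time step, so the trainable parameters do not depend on how long the input sequence is or where a pattern appears in it:

import tensorflow as tf  # TF 1.x

# One RNN cell whose kernel/bias are reused at every time step.
cell = tf.nn.rnn_cell.BasicRNNCell(num_units=16)

# [batch, time, features]; the time dimension is left unspecified on purpose.
inputs = tf.placeholder(tf.float32, [None, None, 8])
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

# Only one kernel and one bias exist, no matter the sequence length.
print([(v.name, v.shape.as_list()) for v in tf.trainable_variables()])
# [('rnn/basic_rnn_cell/kernel:0', [24, 16]), ('rnn/basic_rnn_cell/bias:0', [16])]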

 

References

https://blog.csdn.net/qq_35203425/article/details/82469348 (sharing a variable scope with tf.AUTO_REUSE)

https://zhuanlan.zhihu.com/p/103226488 (a very detailed walkthrough of each BERT module)

https://www.cnblogs.com/yanshw/p/10483014.html (revisiting weight sharing; this post argues that weight sharing is necessary)

 

 

