ALBERT: sharing parameters every two layers

2024-08-28 01:32

1. The original ALBERT implementation (brightmart's implementation)

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      name_variable_scope = "layer_shared"
    else:
      name_variable_scope = "layer_%d" % layer_idx
    # share all parameters across layers. add by brightmart, 2019-09-28.
    # previous it is like this: "layer_%d" % layer_idx
    with tf.variable_scope(
        name_variable_scope,
        reuse=True if (share_parameter_across_layers and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
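To make the scope logic of the loop concrete, the sketch below lists the scope name and `reuse` flag that each layer receives. `scope_plan` is a hypothetical helper written only for illustration; it mirrors the two expressions from the loop above and does not call TensorFlow.

```python
def scope_plan(num_hidden_layers=12, share_parameter_across_layers=True):
    """Return a (scope_name, reuse) pair per layer, mirroring the loop above."""
    plan = []
    for layer_idx in range(num_hidden_layers):
        if share_parameter_across_layers:
            name_variable_scope = "layer_shared"
        else:
            name_variable_scope = "layer_%d" % layer_idx
        # Only the first layer creates variables; later layers reuse them.
        reuse = True if (share_parameter_across_layers and layer_idx > 0) else False
        plan.append((name_variable_scope, reuse))
    return plan


shared = scope_plan(4, share_parameter_across_layers=True)
separate = scope_plan(4, share_parameter_across_layers=False)
print(shared)    # every layer enters the same scope
print(separate)  # every layer enters its own scope
```

With sharing enabled, all layers enter the single scope `layer_shared`, so only one set of layer weights is ever created.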

2. Changing it to share parameters every two layers

In TensorFlow, to save variable storage, we often share variables by sharing a variable scope (variable_scope).

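A common but rather clumsy approach is to set reuse=True whenever the shared variable scope is entered again (that is, on every use after the first). However, reuse=True raises an error when the variables have not been created yet. That is exactly the situation here: once the scope name becomes "layer_%d" % (layer_idx//2), layers 2, 4, ... are the first to enter a new scope even though layer_idx > 0. Following the blog post on parameter sharing at https://blog.csdn.net/qq_35203425/article/details/82469348, we instead use tf.variable_scope(…, reuse=tf.AUTO_REUSE), which creates the variables on first use and reuses them afterwards. The code template is summarized below: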

 

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False,
                      share_parameter_across_layers=True):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.
    share_parameter_across_layers: bool. If True, each pair of adjacent layers
      (0 and 1, 2 and 3, ...) shares one set of parameters.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    if share_parameter_across_layers:
      # name_variable_scope = "layer_shared"
      # Layers 0 and 1 map to "layer_0", layers 2 and 3 to "layer_1", and so on.
      name_variable_scope = "layer_%d" % (layer_idx // 2)
    else:
      name_variable_scope = "layer_%d" % layer_idx
    # Share parameters within each pair of adjacent layers (changed from
    # brightmart's all-layer sharing of 2019-09-28). tf.AUTO_REUSE creates the
    # variables the first time a scope is entered and reuses them afterwards.
    with tf.variable_scope(
        name_variable_scope,
        reuse=tf.AUTO_REUSE if (share_parameter_across_layers and layer_idx > 0) else False):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output
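Why reuse=True is no longer enough once the scope name depends on `layer_idx // 2` can be seen in a small toy model. The sketch below does not use TensorFlow: `ToyVariableStore`, `AUTO_REUSE` and `build` are hypothetical stand-ins that only imitate the create-or-reuse rules of TF1 variable scopes, under the assumption that a missing variable with reuse=True and an existing variable with reuse=False both raise ValueError.

```python
AUTO_REUSE = "auto_reuse"  # stand-in for tf.AUTO_REUSE in this toy model


class ToyVariableStore:
    """Toy imitation of TF1 get_variable rules; not TensorFlow itself."""

    def __init__(self):
        self.variables = {}

    def get_variable(self, name, reuse):
        exists = name in self.variables
        if reuse is True and not exists:
            raise ValueError("Variable %s does not exist" % name)
        if reuse is False and exists:
            raise ValueError("Variable %s already exists" % name)
        if not exists:
            self.variables[name] = object()  # placeholder for a weight tensor
        return self.variables[name]


def build(num_layers, reuse_mode):
    """Enter scope layer_(idx // 2) for every layer, as in the code above."""
    store = ToyVariableStore()
    for layer_idx in range(num_layers):
        scope = "layer_%d" % (layer_idx // 2)
        reuse = reuse_mode if layer_idx > 0 else False
        store.get_variable(scope + "/kernel", reuse)
    return sorted(store.variables)


created = build(6, AUTO_REUSE)
print(created)  # one variable per pair of layers

try:
    build(6, True)
    failed_with_reuse_true = False
except ValueError:
    # Layer 2 is the first to enter "layer_1", so there is nothing to reuse.
    failed_with_reuse_true = True
print(failed_with_reuse_true)
```

In this toy model six layers end up with three variable groups under `AUTO_REUSE`, while plain reuse=True fails at layer 2.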

 

3. Advantages of weight sharing

1). Reduced computation is just icing on the cake

Weight sharing can reduce computation.
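For the transformer_model function in sections 1 and 2, what sharing most directly reduces is the number of stored parameters. The rough count below is a sketch under stated assumptions: `attention_layer` is not shown in this post, so it is assumed to contain query, key and value dense layers with biases, and `layer_norm` is assumed to hold a gamma and a beta vector. `layer_params` and `encoder_params` are hypothetical helper names.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def layer_params(hidden_size=768, intermediate_size=3072):
    """Approximate trainable parameters in one encoder layer."""
    layer_norm = 2 * hidden_size  # assumed: gamma and beta
    attention = 3 * dense_params(hidden_size, hidden_size)  # assumed: query, key, value
    attention_out = dense_params(hidden_size, hidden_size) + layer_norm
    feed_forward = (dense_params(hidden_size, intermediate_size)
                    + dense_params(intermediate_size, hidden_size)
                    + layer_norm)
    return attention + attention_out + feed_forward


def encoder_params(num_hidden_layers=12, layers_per_group=1):
    """Parameters stored when every `layers_per_group` layers share one set."""
    groups = -(-num_hidden_layers // layers_per_group)  # ceiling division
    return groups * layer_params()


no_sharing = encoder_params(12, layers_per_group=1)
pairwise = encoder_params(12, layers_per_group=2)
all_shared = encoder_params(12, layers_per_group=12)
print(no_sharing, pairwise, all_shared)
```

Under these assumptions, sharing every two layers stores half as many encoder-layer parameters as no sharing, and six times as many as sharing across all twelve layers. The forward pass still runs all twelve layers in each case.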

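2). The essence of weight sharing is feature extraction

Weights can be seen as templates: we compare each sample against a template to see whether it shows the outward pattern (feature) that the template describes.

3). Weight sharing helps the model generalize

An ordinary neural network expects input of a fixed size, whereas weight sharing allows the input size to vary.

For example, take many images that each contain a face, where the face appears at different positions and the images differ in size. With weight sharing, the same weights can scan the whole image, search for the face, and extract its features.

Another example is semantic analysis with an RNN. The two sentences "I went to Beijing last year" and "Last year my parents and I went to Beijing" mean roughly the same thing, but the words sit at different positions and the sentences differ in length.

Weight sharing lets the model handle the features of a contiguous sequence regardless of the total length of the input.

The model can still recognize that contiguous sequence when it appears at a different position in the sample, instead of learning a separate rule for each position. This captures the continuity between features and also reduces the number of rules to learn.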
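The idea of one shared template being applied at every position can be sketched in a few lines. This is only an illustration: `match_positions` is a hypothetical helper, and exact token matching stands in for the learned weights of a real convolutional or recurrent model.

```python
def match_positions(sequence, template):
    """Slide one shared template over the sequence and return matching start indexes."""
    width = len(template)
    return [start for start in range(len(sequence) - width + 1)
            if sequence[start:start + width] == template]


# The same template is used for both sentences, whatever their length.
template = ["went", "to", "Beijing"]
sentence_a = ["I", "went", "to", "Beijing", "last", "year"]
sentence_b = ["last", "year", "my", "parents", "and", "I", "went", "to", "Beijing"]

positions_a = match_positions(sentence_a, template)
positions_b = match_positions(sentence_b, template)
print(positions_a, positions_b)
```

The template finds the phrase in both sentences even though it starts at a different index in each and the sentences have different lengths; no position-specific rule is needed.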

 

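Reference blogs

https://blog.csdn.net/qq_35203425/article/details/82469348 (sharing a variable scope with tf.AUTO_REUSE)

https://zhuanlan.zhihu.com/p/103226488 (a detailed walkthrough of each BERT module; I found this post very thorough, rewrite)

https://www.cnblogs.com/yanshw/p/10483014.html (weight sharing revisited; this post argues that weight sharing is necessary)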

 

 
