NLP Ramblings: Starting from BERT

2023-10-28 01:21


I have recently been working on an NLP course assignment and spent two days going over the best-known models in the BERT family from 2018-2021, taking some notes along the way. I summarized the GPT series earlier; see: GPT系列登神之路

P.S.: Since most of the papers and other references are in English, the notes are written in English as well.


Non-contextual and contextual embeddings

Non-contextual embeddings: Word2Vec, GloVe, FastText.

They are (static) non-contextual embeddings: you get the exact same embedding for one token no matter how you use it in your sentence.

FastText uses the same Skip-gram and CBOW techniques, but instead of working on whole words (as in Word2Vec), it splits each word into character n-grams and runs the same process on those n-grams. A word is then represented as the sum of the vectors of its character n-grams.

Contextual embeddings: ELMo, BERT.

They learn contextual embeddings, producing different vectors for the same token in different contexts, and therefore carry more semantic information.

Note: when people say contextual embeddings, they do not mean the vectors from the lookup table (the output of the embedding layer); they mean the hidden states of the pre-trained model. In that sense, all word embeddings are fundamentally non-contextual but can be made contextual by passing them through hidden layers. Word2Vec and GloVe embeddings can also be fed into hidden layers to derive contextual embeddings.

cf. the discussion: difference-between-non-contextual-and-contextual-word-embeddings

Generally, the hidden layers are LSTMs (in ELMo) or Transformer layers (in BERT), which capture the contextual information.

ELMo uses a bidirectional LSTM, but it is not a truly bidirectional model: simply concatenating the forward and backward hidden states does not let the two directions interact. In this sense, BERT is a genuinely bidirectional model.


BERT

Three parts of embeddings in BERT:

  • Token Embeddings

Nothing special: a trainable lookup table of size (30522, 768).

  • Segment Embeddings

The Segment Embeddings layer has only 2 vector representations: it is a trainable lookup table of size (2, 768). The first vector (index 0) is assigned to all tokens that belong to input 1, while the second vector (index 1) is assigned to all tokens that belong to input 2.

  • Position Embeddings

Unlike the fixed positional encodings in the Transformer, BERT learns its position embeddings, a lookup table of size (512, 768). [BERT handles input sequences up to 512 tokens long, and the embedding vector of each position is 768-dimensional.] That is to say, the first row is the vector representation of whatever token sits in the first position, the second row of whatever token sits in the second position, and so on.

cf: How the Embedding Layers in BERT Were Implemented

Note: the Position Embeddings in BERT are trainable, while the positional encodings in the Transformer are fixed and computed by a formula. One drawback of the former is that it does not extend: BERT cannot handle sequences longer than 512 tokens.
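
The three tables are summed element-wise before the first Transformer layer. A minimal PyTorch sketch (the class is illustrative, with sizes following bert-base-uncased; dropout is omitted):

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """Sums token, segment and learned position embeddings, as described above."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)     # (30522, 768)
        self.segment_emb = nn.Embedding(n_segments, hidden)   # (2, 768)
        self.position_emb = nn.Embedding(max_len, hidden)     # (512, 768), trainable
        self.norm = nn.LayerNorm(hidden)

    def forward(self, input_ids, token_type_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = (self.token_emb(input_ids)
             + self.segment_emb(token_type_ids)
             + self.position_emb(positions).unsqueeze(0))
        return self.norm(x)

emb = BertStyleEmbeddings()
input_ids = torch.tensor([[101, 7592, 2088, 102]])   # e.g. [CLS] hello world [SEP]
segments = torch.zeros_like(input_ids)               # all tokens belong to input 1
print(emb(input_ids, segments).shape)                # torch.Size([1, 4, 768])
```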

Sentence level embedding

BERT uses two pre-training tasks: MLM and NSP (next sentence prediction). The latter is designed to capture sentence-level relationships. NSP: predict whether sentence 2 is the actual next sentence of sentence 1; it is a binary classification task.


Two aspects reveal the sentence-level information:

  • Segment Embeddings
  • Adding the special tokens [CLS] and [SEP]. [CLS] is inserted at the head of the input and is supposed to represent the information of the whole sequence.

However, RoBERTa (Robustly Optimized BERT Approach) shows that NSP is in fact not useful: removing the NSP loss matches or slightly improves downstream task performance. This may be due to the fact that NSP is too simple a task.

ALBERT (A Lite BERT) proposes a new pre-training task to replace NSP: SOP (sentence order prediction), which predicts whether two sentences appear in the correct order. If the correct sentence order is reversed, the model should predict "No".
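
The two objectives differ mainly in how negative pairs are built; a toy sketch (the helper functions are illustrative, not the papers' actual data pipelines):

```python
import random

def make_nsp_pair(doc_sentences, all_sentences):
    """NSP: 50% real next sentence, 50% a random sentence from the corpus."""
    i = random.randrange(len(doc_sentences) - 1)
    s1 = doc_sentences[i]
    if random.random() < 0.5:
        return s1, doc_sentences[i + 1], 1          # IsNext
    return s1, random.choice(all_sentences), 0      # NotNext (easy: topic usually differs)

def make_sop_pair(doc_sentences):
    """SOP: both segments always come from the same document; only the order changes."""
    i = random.randrange(len(doc_sentences) - 1)
    s1, s2 = doc_sentences[i], doc_sentences[i + 1]
    if random.random() < 0.5:
        return s1, s2, 1                            # in order
    return s2, s1, 0                                # swapped (hard: requires modeling coherence)

doc = ["The cat sat down.", "Then it fell asleep.", "It woke up at noon."]
corpus = doc + ["Stock prices rose sharply today."]
print(make_nsp_pair(doc, corpus))
print(make_sop_pair(doc))
```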


BART

https://arxiv.org/pdf/1910.13461.pdf

From Facebook, 2019

BERT cannot talk: text generation is still done mostly in an autoregressive manner, which differs from BERT's pre-training task, so BERT's generation ability is weak. [Maybe BERT would suit a non-autoregressive generation setting.]

BART: Bidirectional and Auto-Regressive Transformer (a typical seq2seq structure).

Pre-training tasks for BART (ways to corrupt the input): Token Masking, Token Deletion, Text Infilling, Sentence Permutation, Document Rotation.


Comparison of pre-training tasks: pre-training with Text Infilling obtains the best performance, while Sentence Permutation and Document Rotation lead to a clear drop in performance.
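
A rough sketch of the Text Infilling corruption (span lengths drawn from a Poisson distribution with λ = 3 as in the paper, each span replaced by a single mask token; the function is a simplified illustration, not BART's actual preprocessing):

```python
import numpy as np

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0, mask_token="<mask>"):
    """Replace spans of tokens with a single <mask> until ~mask_ratio of tokens are corrupted."""
    tokens = list(tokens)
    n_to_mask = int(round(mask_ratio * len(tokens)))
    masked = 0
    while masked < n_to_mask and len(tokens) > 1:
        span = min(np.random.poisson(poisson_lambda), n_to_mask - masked)
        start = np.random.randint(0, max(1, len(tokens) - span))
        tokens[start:start + span] = [mask_token]   # a 0-length span simply inserts <mask>
        masked += max(span, 1)
    return tokens

np.random.seed(0)
print(text_infilling("my dog is cute and he likes playing".split()))
```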

BART can be used for downstream tasks: Sequence Classification Tasks, Token Classification Tasks, Sequence Generation Tasks, Machine Translation.



RoBERTa

https://arxiv.org/pdf/1907.11692.pdf

From Facebook, 2019

Optimize BERT: Bigger batch size, more training data, longer input sequence.

Several updates compared to BERT:

  • dynamic mask

    The original BERT performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for every training instance in every epoch, the training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training; thus each training sequence was seen with the same mask four times.

    RoBERTa introduces dynamic masking: the masking pattern is generated every time a sequence is fed to the model. That said, dynamic masking does not show an absolute advantage in the ablation.
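
A minimal sketch of the difference (the masking function is simplified: no 80/10/10 replacement rule, whole words instead of subword tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Simplified MLM masking: every token is masked independently with mask_prob.
    return [mask_token if random.random() < mask_prob else t for t in tokens]

sentence = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the pattern is fixed once at preprocessing time.
static_view = mask_tokens(sentence)
for epoch in range(3):
    print("static :", static_view)

# Dynamic masking: a fresh pattern is drawn every time the sequence is used.
for epoch in range(3):
    print("dynamic:", mask_tokens(sentence))
```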


  • No NSP pre-training task

RoBERTa removes the NSP pre-training task, which matches or slightly improves downstream task performance.

  • byte-level BPE

The original BERT uses a character-level BPE vocabulary of size 30K, which is learned after preprocessing the input with heuristic tokenization rules.

RoBERTa: following Radford et al. (2019), it instead trains with a larger byte-level BPE vocabulary containing 50K subword units, without any additional preprocessing or tokenization of the input.
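
Assuming the Hugging Face transformers library and the two public checkpoints are available, the practical difference is easy to see: BERT's tokenizer needs an [UNK] fallback for characters outside its vocabulary, while byte-level BPE can encode any string as bytes:

```python
# Assumes the `transformers` library and access to the two public checkpoints.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # ~30K vocab (BERT)
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")     # ~50K byte-level BPE vocab

text = "Unbelievable 🤗 results"
print(bert_tok.vocab_size, bert_tok.tokenize(text))       # the emoji typically becomes [UNK]
print(roberta_tok.vocab_size, roberta_tok.tokenize(text))  # byte-level BPE needs no [UNK]
```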


ELECTRA

https://arxiv.org/pdf/2003.10555.pdf

From Google and Stanford, 2020

ELECTRA: Efficiently Learning an Encoder that Classifies Token Replacements Accurately

Predicting masked tokens is computationally costly and sample-inefficient: the model only learns from the ~15% of tokens that are masked in each example. ELECTRA proposes a new pre-training task, replaced token detection (RTD), which predicts whether each token has been replaced by another one. It is a binary classification task.


Typically, it uses a small BERT as the generator to predict the masked tokens. The sampled predictions then replace the original tokens, and the ELECTRA discriminator has to tell, for every token, whether it has been replaced.

The generator and the discriminator are trained jointly: the generator is trained with the MLM loss, the discriminator with the RTD loss, and the two are summed into a single objective. The RTD loss cannot be back-propagated into the generator, because the discriminator's inputs are sampled from the generator through a non-differentiable step.
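
Schematically, a hedged sketch of the joint objective (random tensors stand in for the two models' outputs; λ = 50 is the weight reported in the paper):

```python
import torch
import torch.nn.functional as F

def electra_loss(gen_logits, mlm_labels, disc_logits, replaced_labels, lam=50.0):
    # Generator: standard MLM cross-entropy, computed only on masked positions.
    mlm_loss = F.cross_entropy(gen_logits.flatten(0, 1), mlm_labels.flatten(),
                               ignore_index=-100)
    # Discriminator: per-token binary classification (original vs. replaced).
    rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, replaced_labels.float())
    # The discriminator's input tokens are *sampled* from the generator (non-differentiable),
    # so in practice the RTD gradient never reaches the generator.
    return mlm_loss + lam * rtd_loss

gen_logits = torch.randn(2, 8, 30522)                      # (batch, seq, vocab)
mlm_labels = torch.full((2, 8), -100, dtype=torch.long)    # -100 = position not masked
mlm_labels[:, 3] = 42                                      # one masked position per sequence
disc_logits = torch.randn(2, 8)                            # (batch, seq)
replaced = torch.randint(0, 2, (2, 8))                     # 1 = token was replaced
print(electra_loss(gen_logits, mlm_labels, disc_logits, replaced))
```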

Note: the authors found that performance is better if the discriminator and the generator share the same token embeddings. In that case, the token embeddings are trained on the combined MLM + RTD loss. Later, DeBERTa v3 proposes an optimization: Gradient-Disentangled Embedding Sharing.

ELECTRA achieves competitive performance on GLUE with much less pre-training compute, roughly 1/4 of the FLOPs.



ALBERT

https://arxiv.org/pdf/1909.11942.pdf

From Google, 2020

ALBERT presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

  • factorization of the embedding matrix: instead of projecting the one-hot vectors directly into the hidden space of size H, we first project them into a lower-dimensional embedding space of size E, and then project that into the hidden space. This decomposition reduces the embedding parameters from O(V × H) to O(V × E + E × H); see the sketch after this list.
  • Cross-layer parameter sharing: The default decision for ALBERT is to share all parameters across layers.
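
For the first technique, a quick back-of-the-envelope computation with sizes roughly matching ALBERT-xxlarge (V ≈ 30K, H = 4096, E = 128):

```python
# Parameter count of the embedding layer with and without ALBERT's factorization.
V, H, E = 30_000, 4_096, 128

direct = V * H                 # one (V, H) lookup table
factorized = V * E + E * H     # (V, E) lookup table followed by an (E, H) projection

print(f"direct:     {direct:,} parameters")      # 122,880,000
print(f"factorized: {factorized:,} parameters")  # 4,364,288
print(f"reduction:  {direct / factorized:.1f}x")
```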


The transitions from layer to layer are much smoother for ALBERT than for BERT. Even though there is some drop in performance compared to BERT, ALBERT has 3-4 times fewer parameters.

Another contribution:

  • the SOP (sentence order prediction) pre-training task, which focuses on modeling inter-sentence coherence. Positive examples: two consecutive segments from the same document; negative examples: the same two segments with their order swapped.

T5: Text-to-Text Transfer Transformer

From Google, 2020

https://arxiv.org/pdf/1910.10683.pdf

c.f. https://zhuanlan.zhihu.com/p/88438851

No radically new ideas, but an impressive amount of compute! It introduces a unified framework that converts all text-based language problems into a text-to-text format, and it systematically compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.

Comparing architectures

  • Encoder-decoder (fully-visible encoder, causal decoder): a typical seq2seq Transformer. E.g. BART
  • Language model (causal): a decoder-only stack. E.g. GPT
  • Prefix language model (causal with a fully-visible prefix): a single stack that attends bidirectionally over the prefix and causally over the rest. E.g. UniLM


It turns out that the encoder-decoder architecture with a denoising pre-training objective (BERT-style MLM) achieves the best performance. Surprisingly, sharing parameters across the encoder and decoder performs nearly as well while halving the total parameter count.

Comparing unsupervised objectives


  • Prefix LM objective: predict from left to right
  • BERT-style denoising objective: predict masked tokens
  • Deshuffling: convert the shuffled text to the original


The BERT-style denoising objective performs best. Moreover, the best corruption strategy is "replace spans": each consecutive span of dropped-out tokens is replaced by a single sentinel token, and each sentinel token is assigned a token ID that is unique within the sequence.

A corruption rate of 15% wins (the same rate as in the original BERT), and the best corrupted-span length is 3.
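
A rough sketch of the winning replace-spans corruption (`<extra_id_0>`, `<extra_id_1>`, … are the sentinel tokens T5 uses; the function itself is a simplified illustration, not T5's actual preprocessing):

```python
import random

def span_corruption(tokens, corruption_rate=0.15, mean_span_len=3):
    """Drop ~15% of tokens in spans of average length 3; each span becomes one sentinel."""
    n_corrupt = max(1, round(corruption_rate * len(tokens)))
    n_spans = max(1, round(n_corrupt / mean_span_len))
    starts = sorted(random.sample(range(len(tokens)), n_spans))
    inputs, targets, consumed = [], [], 0
    for sid, start in enumerate(starts):
        if start < consumed:            # skip spans that would overlap a previous one
            continue
        sentinel = f"<extra_id_{sid}>"
        span_len = max(1, n_corrupt // n_spans)
        inputs += tokens[consumed:start] + [sentinel]
        targets += [sentinel] + tokens[start:start + span_len]
        consumed = start + span_len
    inputs += tokens[consumed:]
    return inputs, targets

random.seed(1)
src, tgt = span_corruption("thank you for inviting me to your party last week".split())
print("input :", src)
print("target:", tgt)
```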

Other takeaway

The basic approach of updating all of a pre-trained model's parameters during fine-tuning outperformed methods that are designed to update fewer parameters, although updating all parameters is the most expensive option.


DeBERTa

From Microsoft, 2021

https://arxiv.org/pdf/2006.03654.pdf

cf. https://blog.csdn.net/weixin_42437114/article/details/127017933

DeBERTa: Decoding-enhanced BERT with disentangled attention. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better.

Contributions:

  • disentangled attention mechanism: each word is represented using two vectors that encode its content and its position, respectively (for comparison, BERT simply sums them up), and the attention weights between words are computed with disentangled matrices over their contents and relative positions, respectively (see the sketch after this list).
  • an enhanced mask decoder: absolute positions are incorporated with the contextual embeddings right before the softmax layer that predicts the masked tokens during MLM pre-training.
  • Scale-invariant Fine-Tuning (SiFT): a new virtual adversarial training method for fine-tuning. It aims at improving a model's robustness to adversarial examples (created by making small perturbations to the input).
    For NLP tasks, the perturbation is applied to the word embeddings instead of the original word sequence, but the norms of the embedding vectors vary across words and models. SiFT therefore applies the perturbations to the normalized word embeddings.
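
A single-head sketch of the disentangled attention score, i.e. the sum of content-to-content, content-to-position and position-to-content terms (the relative-distance bucketing and sign conventions are simplified here, so this is not DeBERTa's exact implementation):

```python
import math
import torch

def disentangled_attention_scores(Hc, Wq, Wk, Wq_r, Wk_r, rel_emb, max_rel=4):
    """Hc: (seq, d) content states; rel_emb: (2*max_rel+1, d) relative-position embeddings."""
    seq, d = Hc.shape
    Qc, Kc = Hc @ Wq, Hc @ Wk                 # content queries / keys
    Qr, Kr = rel_emb @ Wq_r, rel_emb @ Wk_r   # relative-position queries / keys
    idx = torch.arange(seq)
    # delta[i, j]: clipped relative distance between positions i and j, shifted to >= 0
    delta = (idx[None, :] - idx[:, None]).clamp(-max_rel, max_rel) + max_rel
    c2c = Qc @ Kc.T                                   # content-to-content
    c2p = torch.einsum("id,ijd->ij", Qc, Kr[delta])   # content-to-position
    p2c = torch.einsum("jd,jid->ij", Kc, Qr[delta])   # position-to-content
    return (c2c + c2p + p2c) / math.sqrt(3 * d)

d, seq, max_rel = 16, 6, 4
Hc = torch.randn(seq, d)
Wq, Wk, Wq_r, Wk_r = (torch.randn(d, d) for _ in range(4))
rel_emb = torch.randn(2 * max_rel + 1, d)
print(disentangled_attention_scores(Hc, Wq, Wk, Wq_r, Wk_r, rel_emb).shape)  # (6, 6)
```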

DeBERTa v3

From Microsoft, 2021

https://arxiv.org/pdf/2111.09543.pdf

  • replacing masked language modeling (MLM) with replaced token detection (RTD)

  • gradient-disentangled embedding sharing method (GDES) (c)

    The authors argue that the embedding sharing in ELECTRA pulls the token embeddings in very different directions, causing a "tug-of-war" (a): MLM tries to map semantically similar tokens to embedding vectors that are close to each other, while RTD tries to discriminate semantically similar tokens, pulling their embeddings as far apart as possible to optimize the classification accuracy.

    But using completely separate token embeddings for G & D hurts the performance (b).

    GDES: the MLM loss trains the shared token embedding $E_G$ (used by both G and D), while the RTD loss only trains a residual embedding $E_{\Delta}$, stored separately during training; the discriminator uses $E_D = \mathrm{sg}(E_G) + E_{\Delta}$, so G and D still effectively share embeddings. After training, the two are summed to obtain the final $E_D$. In effect, the gradient of the RTD loss is disentangled from $E_G$ during training, hence the name.
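
A minimal sketch of the GDES idea in (c): the stop-gradient (detach) is what "disentangles" the RTD gradient from $E_G$ (the tensors and fake losses below are only there to show where the gradients flow):

```python
import torch
import torch.nn as nn

vocab, hidden = 128, 16
E_G = nn.Parameter(torch.randn(vocab, hidden))        # generator embeddings, trained by MLM
E_delta = nn.Parameter(torch.zeros(vocab, hidden))    # residual trained only by the RTD loss

def discriminator_embeddings():
    # E_D = sg(E_G) + E_delta : the RTD gradient reaches E_delta but never E_G.
    return E_G.detach() + E_delta

mlm_loss = E_G.sum()                          # stands in for the generator's MLM loss
rtd_loss = discriminator_embeddings().sum()   # stands in for the discriminator's RTD loss
(mlm_loss + 50.0 * rtd_loss).backward()

print(E_G.grad.abs().sum())       # gradient only from the MLM term
print(E_delta.grad.abs().sum())   # gradient only from the RTD term
```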


Network compression

  • Network Pruning

    Removes unnecessary parts of the network after training. This includes weight-magnitude pruning, attention-head pruning, dropping whole layers, and more.

  • Weight Factorization

    Approximates parameter matrices by factorizing them into a product of two smaller matrices, which imposes a low-rank constraint on the matrix. Weight factorization can be applied to the token embeddings (which saves a lot of memory on disk) or to the parameters of the feed-forward / self-attention layers (for some speed improvement).

  • Knowledge Distillation

    Aka "student-teacher": train a much smaller Transformer from scratch on the pre-training / downstream data. Normally this would fail, but utilizing the soft labels (pseudo labels) from a full-sized model improves the optimization, for reasons that are not fully understood.

  • Weight Sharing

    Some weights in the model share the same values as other parameters in the model. For example, ALBERT uses the same weight matrices for every self-attention layer.

  • Parameter Quantization

    Truncates floating-point numbers so that they use only a few bits, which causes round-off error (see the sketch after this list). The quantization values can also be learned either during or after training.
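
To make the round-off error of quantization concrete, a sketch of symmetric per-tensor int8 quantization of one weight matrix (plain PyTorch arithmetic, not any particular library's quantization API):

```python
import torch

w = torch.randn(768, 768)                 # a full-precision weight matrix

scale = w.abs().max() / 127               # symmetric int8: map [-max, max] to [-127, 127]
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
w_dequant = w_int8.float() * scale        # what the model "sees" at inference time

print("storage:", w.numel() * 4, "->", w.numel(), "bytes")   # float32 = 4 bytes, int8 = 1 byte
print("mean round-off error:", (w - w_dequant).abs().mean().item())
```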

c.f. All The Ways You Can Compress BERT

Examples: DistilBERT, TinyBERT, ALBERT…


Ways of fine-tuning

How to fine-tune models:

  1. Freeze the whole pre-trained model (use it as a feature extractor) and only train several added layers.
  2. Fine-tune the pre-trained model together with the added layers.
  3. Insert some small layers into the pre-trained model (called adapters) and only train these adapters plus the added layers.
  4. Instead of using the output of the last layer of the pre-trained model, compute a weighted sum of the outputs of different layers or blocks (typically useful for Transformer blocks, because every block's output has the same shape). The weights can be learned during fine-tuning.

Generally speaking, method 2 is the most performant (cf. the takeaway from T5 above), but it is also the most expensive and requires storing a new full model for each downstream task. Method 1 is much cheaper, but usually less performant. Method 3 tries to solve the storage problem, but where to insert the adapters remains an open question; a sketch follows below.
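
As a sketch of method 3: a generic bottleneck adapter with a residual connection, inserted into each (frozen) Transformer layer so that only the adapters and the task head are trained (illustrative, not a specific paper's exact design):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: hidden -> small -> hidden, plus a residual connection."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual keeps the frozen features

# Usage sketch: freeze the pre-trained encoder, train only adapters + task head, e.g.
#   for p in pretrained_encoder.parameters():
#       p.requires_grad = False
adapter = Adapter()
print(sum(p.numel() for p in adapter.parameters()))   # ~100K params vs ~110M for BERT-base
```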

