This post is a reading note on the DistilBERT paper.
ArXiv: https://arxiv.org/abs/1910.01108
Train Loss:
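The student is trained with a linear combination of three terms: a distillation loss on the teacher's temperature-softened output distribution, the usual masked language modeling loss, and a cosine embedding loss that aligns the student's and teacher's hidden states. A rough sketch in LaTeX, where the mixing weights α, β, γ and the temperature T are placeholders (the paper only states that the terms are combined linearly):

```latex
% Sketch of the DistilBERT objective; \alpha, \beta, \gamma are placeholder weights.
L_{\text{total}} = \alpha\, L_{ce} + \beta\, L_{mlm} + \gamma\, L_{\cos},
\qquad
L_{ce} = -\sum_i t_i \log s_i
```

Here t_i and s_i are the teacher's and student's output probabilities, both softened with the same temperature T during training; at inference the temperature is set back to 1.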
DistilBERT:
DistilBERT has the same general architecture as BERT, with the number of layers cut in half and the token-type embeddings and pooler removed. The student is initialized from the teacher by taking one layer out of every two.
From the paper: "The token-type embeddings and the pooler are removed while the number of layers is reduced by a factor of 2. Most of the operations used in the Transformer architecture (linear layer and layer normalisation) are highly optimized in modern linear algebra frameworks. [...] we initialize the student from the teacher by taking one layer out of two."
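As a rough illustration of "taking one layer out of two", here is a minimal sketch using the Hugging Face transformers library. It uses a BERT-shaped student for simplicity (the real DistilBERT architecture additionally drops the token-type embeddings and the pooler), and the checkpoint name bert-base-uncased is just an example:

```python
from transformers import BertConfig, BertModel

# Teacher: a pretrained 12-layer BERT (example checkpoint).
teacher = BertModel.from_pretrained("bert-base-uncased")

# Student: same hidden size, half the layers (6 instead of 12).
student_config = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertModel(student_config)

# Copy the embedding weights, then initialize each student layer from
# one teacher layer out of two (teacher layers 0, 2, 4, 6, 8, 10).
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for student_idx, teacher_idx in enumerate(range(0, teacher.config.num_hidden_layers, 2)):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )
```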
Training uses very large batches (around 4K examples per batch), dynamic masking, and drops the next sentence prediction (NSP) objective.
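As a small illustration of dynamic masking (not the paper's training code), the Hugging Face DataCollatorForLanguageModeling masks tokens when each batch is assembled, so the same sentence gets a different mask on every pass; the 15% masking probability follows BERT's setup:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Masking is applied here, at collation time, so re-collating the same
# example in a later epoch produces a different set of [MASK] positions.
example = tokenizer("knowledge distillation compresses large language models")
batch = collator([example])
print(batch["input_ids"])   # some tokens replaced by [MASK]
print(batch["labels"])      # original ids at masked positions, -100 elsewhere
```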
Training data: the same corpus as the original BERT (a concatenation of English Wikipedia and the Toronto Book Corpus).