This post is Part 11 of AI/computer-vision lecturer 叶梓's series on MAE, a self-supervised learning model for computer vision. I hope developers find it a useful reference.
Continued from the previous post.
Slides P24–P25
The MAE Encoder (excerpts from the paper; a code sketch follows the list)
- Our encoder is a ViT but applied only on visible, unmasked patches.
- Just as in a standard ViT, our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks.
- However, our encoder only operates on a small subset (e.g., 25%) of the full set. Masked patches are removed; no mask tokens are used.
- This allows us to train very large encoders with only a fraction of compute and memory.
- The full set is handled by a lightweight decoder, described next.
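To make the visible-patch-only design concrete, here is a minimal PyTorch sketch of the idea. It is not the official implementation: the names (`MAEEncoderSketch`, `random_masking`), the dimensions, and the use of `nn.TransformerEncoderLayer` are illustrative assumptions of mine; the paper's released code builds on timm's ViT blocks instead.

```python
import torch
import torch.nn as nn

class MAEEncoderSketch(nn.Module):
    """Illustrative MAE-style encoder: a ViT applied only to visible patches."""
    def __init__(self, patch_dim=768, embed_dim=1024, depth=4,
                 num_heads=16, num_patches=196):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)  # linear patch embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def random_masking(self, x, mask_ratio=0.75):
        # Sample a random subset of patches to keep (e.g., 25%); the rest are
        # simply dropped -- the encoder never sees mask tokens.
        N, L, D = x.shape
        len_keep = int(L * (1 - mask_ratio))
        noise = torch.rand(N, L, device=x.device)        # one score per patch
        ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
        ids_keep = ids_shuffle[:, :len_keep]
        x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return x_visible, ids_shuffle

    def forward(self, patches, mask_ratio=0.75):
        x = self.proj(patches) + self.pos_embed          # embed + positions
        x, ids_shuffle = self.random_masking(x, mask_ratio)
        return self.blocks(x), ids_shuffle               # encode the small visible set
```

Because only ~25% of the tokens pass through the Transformer blocks, the quadratic self-attention cost drops sharply, which is what makes very large encoders affordable:

```python
patches = torch.randn(2, 196, 768)      # 2 images, 14x14 grid of flattened patches
enc = MAEEncoderSketch()
latents, ids_shuffle = enc(patches)
print(latents.shape)                    # torch.Size([2, 49, 1024]) -- 25% of 196 patches
```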
The MAE Decoder (excerpts from the paper; a code sketch follows the list)
- The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens.
- Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted.
- We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image.
- The decoder has another series of Transformer blocks.
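The decoder's input construction is the subtle part: copies of a single learned mask token are appended to the encoded visible tokens, the full set is restored to its original patch order, and positional embeddings are added to all of them. Below is a minimal sketch continuing the encoder example above; again, all names and sizes (`MAEDecoderSketch`, `dec_dim=512`, etc.) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MAEDecoderSketch(nn.Module):
    """Illustrative MAE-style decoder: full set = visible tokens + mask tokens."""
    def __init__(self, enc_dim=1024, dec_dim=512, depth=2, num_heads=8,
                 num_patches=196, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dec_dim)   # project encoder output
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))  # one shared, learned vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        layer = nn.TransformerEncoderLayer(dec_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.pred = nn.Linear(dec_dim, patch_dim)  # reconstruct pixels per patch

    def forward(self, latents, ids_shuffle):
        x = self.embed(latents)
        N, len_keep, D = x.shape
        L = ids_shuffle.shape[1]
        # One shared mask token stands in for every missing patch.
        mask_tokens = self.mask_token.expand(N, L - len_keep, D)
        x = torch.cat([x, mask_tokens], dim=1)     # full set, still in shuffled order
        # Unshuffle so each token returns to its original patch position.
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        # Positional embeddings on ALL tokens: without them, a mask token
        # carries no information about where in the image it sits.
        x = x + self.pos_embed
        return self.pred(self.blocks(x))           # per-patch predictions
```

Fed the outputs of the encoder sketch, it produces one prediction per patch of the full image:

```python
dec = MAEDecoderSketch()
recon = dec(latents, ids_shuffle)
print(recon.shape)                      # torch.Size([2, 196, 768]) -- one per patch
```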
To be continued in the next post…