sentence similarity vs text (multi-sentence) similarity

2023-10-18 18:59

本文主要是介绍sentence similarity vs text (multi-sentence) similarity,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

1. sentence similarity

1.1 方法列举

BERT
Universal Sentence Encoder
ELECTRA embedding

1.2 介绍

1.2.1 BERT
With the advancement in language models, representation of sentences into vectors has been getting better lately. That might give some good result in your case. For example, BERT can be used to get the sentence embedding.

Supervised:BERT for sentence similarity if you have labelled set of data
在这里插入图片描述

You can use the pre-trained BERT model and you can pass two sentences and you can let the vector obtained at [CLS] pass through a feed forward neural network to decide whether the sentences are similar. This approach can work if you have labelled set of data. If you don’t have, consider the following :

Unsupervised:BERT for single sentence

在这里插入图片描述

You pass the variable length sentences to the BERT network and the vector obtained at the token [CLS] becomes the vector for the sentence. You can then use cosine similarity the way you have been using.

1.2.2 Universal Sentence Encoder

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

1.2.3 ELECTRA: PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS

https://arxiv.org/pdf/2003.10555.pdf

a more sample-efficient pre-training task called replaced token detection.

Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.

在这里插入图片描述

  • 1.3 实践

Easy sentence similarity with BERT Sentence Embeddings using John Snow Labs NLU:
https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf

利用Bert构建句向量并计算相似度:
https://netycc.com/2018/12/05/%E5%88%A9%E7%94%A8bert%E6%9E%84%E5%BB%BA%E5%8F%A5%E5%90%91%E9%87%8F%E5%B9%B6%E8%AE%A1%E7%AE%97%E7%9B%B8%E4%BC%BC%E5%BA%A6/

bert-as-service框架:require only two lines of code to get sentence/token-level encodes.
Finally, bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.

2. text similarity

2.1 方法:
WDM (for word-level, WDM, for sentence-level, SDM)
Sentence Mover’s Similarity is a variation of Word Mover’s Similarity.

2.2 介绍:

2.2.1 WDM

One approach is using Word Mover’s Distance (WMD). WMD is an algorithm for finding the distance between texts of different lengths, where each word is represented as a word embedding vector.

The WMD distance measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to “travel” to reach the embedded words of another document.

For example:
在这里插入图片描述
Source: “From Word Embeddings To Document Distances” Paper

WMD can be modified to Sentence Mover’s Distance, comparing how far apart different sentence embeddings are to each other.

2.2.2 SDM

Sentence Mover’s Similarity:
https://homes.cs.washington.edu/~nasmith/papers/clark+celikyilmaz+smith.acl19.pdf

这篇关于sentence similarity vs text (multi-sentence) similarity的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/234539

相关文章

MySQL 筛选条件放 ON后 vs 放 WHERE 后的区别解析

《MySQL筛选条件放ON后vs放WHERE后的区别解析》文章解释了在MySQL中,将筛选条件放在ON和WHERE中的区别,文章通过几个场景说明了ON和WHERE的区别,并总结了ON用于关... 今天我们来讲讲数据库筛选条件放 ON 后和放 WHERE 后的区别。ON 决定如何 "连接" 表,WHERE

VS Code中的Python代码格式化插件示例讲解

《VSCode中的Python代码格式化插件示例讲解》在Java开发过程中,代码的规范性和可读性至关重要,一个团队中如果每个开发者的代码风格各异,会给代码的维护、审查和协作带来极大的困难,这篇文章主... 目录前言如何安装与配置使用建议与技巧如何选择总结前言在 VS Code 中,有几款非常出色的 pyt

MySQL中VARCHAR和TEXT的区别小结

《MySQL中VARCHAR和TEXT的区别小结》MySQL中VARCHAR和TEXT用于存储字符串,VARCHAR可变长度存储在行内,适合短文本;TEXT存储在溢出页,适合大文本,下面就来具体的了解... 目录一、VARCHAR 和 TEXT 基本介绍1. VARCHAR2. TEXT二、VARCHAR

VS配置好Qt环境之后但无法打开ui界面的问题解决

《VS配置好Qt环境之后但无法打开ui界面的问题解决》本文主要介绍了VS配置好Qt环境之后但无法打开ui界面的问题解决,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要... 目UKeLvb录找到Qt安装目录中designer.UKeLvBexe的路径找到vs中的解决方案资源

mysqld_multi在Linux服务器上运行多个MySQL实例

《mysqld_multi在Linux服务器上运行多个MySQL实例》在Linux系统上使用mysqld_multi来启动和管理多个MySQL实例是一种常见的做法,这种方式允许你在同一台机器上运行多个... 目录1. 安装mysql2. 配置文件示例配置文件3. 创建数据目录4. 启动和管理实例启动所有实例

Android平台播放RTSP流的几种方案探究(VLC VS ExoPlayer VS SmartPlayer)

技术背景 好多开发者需要遴选Android平台RTSP直播播放器的时候,不知道如何选的好,本文针对常用的方案,做个大概的说明: 1. 使用VLC for Android VLC Media Player(VLC多媒体播放器),最初命名为VideoLAN客户端,是VideoLAN品牌产品,是VideoLAN计划的多媒体播放器。它支持众多音频与视频解码器及文件格式,并支持DVD影音光盘,VCD影

2014 Multi-University Training Contest 8小记

1002 计算几何 最大的速度才可能拥有无限的面积。 最大的速度的点 求凸包, 凸包上的点( 注意不是端点 ) 才拥有无限的面积 注意 :  凸包上如果有重点则不满足。 另外最大的速度为0也不行的。 int cmp(double x){if(fabs(x) < 1e-8) return 0 ;if(x > 0) return 1 ;return -1 ;}struct poin

2014 Multi-University Training Contest 7小记

1003   数学 , 先暴力再解方程。 在b进制下是个2 , 3 位数的 大概是10000进制以上 。这部分解方程 2-10000 直接暴力 typedef long long LL ;LL n ;int ok(int b){LL m = n ;int c ;while(m){c = m % b ;if(c == 3 || c == 4 || c == 5 ||

2014 Multi-University Training Contest 6小记

1003  贪心 对于111...10....000 这样的序列,  a 为1的个数,b为0的个数,易得当 x= a / (a + b) 时 f最小。 讲串分成若干段  1..10..0   ,  1..10..0 ,  要满足x非递减 。  对于 xi > xi+1  这样的合并 即可。 const int maxn = 100008 ;struct Node{int

【Python报错已解决】AttributeError: ‘list‘ object has no attribute ‘text‘

🎬 鸽芷咕:个人主页  🔥 个人专栏: 《C++干货基地》《粉丝福利》 ⛺️生活的理想,就是为了理想的生活! 文章目录 前言一、问题描述1.1 报错示例1.2 报错分析1.3 解决思路 二、解决方法2.1 方法一:检查属性名2.2 步骤二:访问列表元素的属性 三、其他解决方法四、总结 前言 在Python编程中,属性错误(At