sentence similarity vs text (multi-sentence) similarity

2023-10-18 18:59


1. sentence similarity

1.1 Methods

BERT
Universal Sentence Encoder
ELECTRA embedding

1.2 Overview

1.2.1 BERT
As language models have advanced, vector representations of sentences have steadily improved, and they may give good results for sentence similarity. For example, BERT can be used to obtain a sentence embedding.

Supervised: BERT for sentence-pair similarity, if you have a labelled dataset

You can take a pre-trained BERT model, pass the two sentences in together, and feed the vector obtained at [CLS] through a feed-forward neural network that decides whether the sentences are similar. This approach works if you have a labelled dataset. If you don't, consider the following:
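As a rough sketch of the input format for this supervised setup: the two sentences are packed into one sequence, with [CLS] at the front and [SEP] closing each sentence, and segment ids mark which sentence each token belongs to. The helper name `build_pair_input` and the whitespace tokenization are assumptions made for illustration; a real BERT pipeline would use its WordPiece tokenizer and feed the resulting ids to the model.

```python
def build_pair_input(sentence_a, sentence_b):
    """Pack two sentences into a single BERT-style input sequence.

    Whitespace splitting stands in for BERT's WordPiece tokenizer (assumption).
    """
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment id 0 covers [CLS], sentence A and its [SEP];
    # segment id 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input("How old are you", "What is your age")
print(tokens)
print(segment_ids)
```

The output vector at position 0 (the [CLS] token) is the one that would be passed to the feed-forward classifier.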

Unsupervised: BERT for a single sentence


Pass each variable-length sentence through the BERT network; the vector obtained at the [CLS] token becomes the vector for that sentence. You can then apply cosine similarity to these vectors, as you have been doing.
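The comparison step can be sketched in plain Python. The vectors below are made-up 4-dimensional stand-ins for [CLS] vectors, not real BERT output, and returning 0.0 for a zero vector is a convention chosen here rather than anything the source prescribes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for zero vectors (assumption)
    return dot / (norm_u * norm_v)


# Toy stand-ins for [CLS] vectors (made-up numbers, not BERT output).
cls_a = [0.2, 0.7, -0.1, 0.4]
cls_b = [0.25, 0.6, 0.0, 0.5]
cls_c = [-0.6, 0.1, 0.8, -0.3]

sim_ab = cosine_similarity(cls_a, cls_b)
sim_ac = cosine_similarity(cls_a, cls_c)
print(sim_ab, sim_ac)
```

A higher value means the two sentence vectors point in more similar directions; here `cls_a` is closer to `cls_b` than to `cls_c`.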

1.2.2 Universal Sentence Encoder

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

1.2.3 ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

https://arxiv.org/pdf/2003.10555.pdf

ELECTRA introduces a more sample-efficient pre-training task called replaced token detection. The paper's abstract describes it as follows:

“Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not.”
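To make the training signal concrete, here is a toy sketch of how one replaced-token-detection example could be built. The function name `corrupt_for_rtd`, the tiny vocabulary, and the random "generator" are all hypothetical stand-ins; in ELECTRA the replacements come from a small trained generator network.

```python
import random


def corrupt_for_rtd(tokens, positions, sample_replacement):
    """Build a replaced-token-detection example.

    tokens: original token list.
    positions: indices chosen for replacement.
    sample_replacement: callable standing in for the generator network (assumption).
    Returns (corrupted_tokens, labels), where label 1 means "replaced".
    """
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = sample_replacement(tokens, i)
    # A token counts as replaced only if it differs from the original,
    # so a sample that happens to match the original is labeled 0.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


# Hypothetical toy "generator": picks a word from a tiny vocabulary at random.
rng = random.Random(0)
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]


def toy_generator(tokens, i):
    return rng.choice(vocab)


original = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = corrupt_for_rtd(original, [0, 2], toy_generator)
print(corrupted)
print(labels)
```

The discriminator is then trained to predict `labels` from `corrupted`, which gives it a learning signal at every position rather than only at the masked ones.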


1.3 Practice

Easy sentence similarity with BERT Sentence Embeddings using John Snow Labs NLU:
https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf

Building sentence vectors with BERT and computing their similarity:
https://netycc.com/2018/12/05/%E5%88%A9%E7%94%A8bert%E6%9E%84%E5%BB%BA%E5%8F%A5%E5%90%91%E9%87%8F%E5%B9%B6%E8%AE%A1%E7%AE%97%E7%9B%B8%E4%BC%BC%E5%BA%A6/

The bert-as-service framework: bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length sentence-level or token-level encodings in just two lines of code.

2. text similarity

2.1 Methods
Word Mover's Distance (WMD) works at the word level; Sentence Mover's Distance (SMD) works at the sentence level.
Sentence Mover’s Similarity is a variation of Word Mover’s Similarity.

2.2 Overview

2.2.1 WMD

One approach is to use Word Mover’s Distance (WMD). WMD is an algorithm for finding the distance between texts of different lengths, where each word is represented as a word embedding vector.

WMD measures the dissimilarity between two text documents as the minimum total distance that the embedded words of one document need to “travel” to reach the embedded words of the other document.
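Computing exact WMD means solving an optimal transport problem, so the sketch below swaps in the simpler relaxed WMD lower bound, in which every word sends all of its weight to the nearest word in the other document and the larger of the two one-sided costs is taken. The function names and the 2-dimensional embeddings are made up for illustration; the sketch assumes non-empty documents whose words all appear in `embeddings`.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nbow(words):
    """Normalized bag-of-words weights: word -> relative frequency."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def relaxed_wmd(doc_a, doc_b, embeddings):
    """Relaxed WMD: a lower bound on exact WMD, not the exact distance."""
    weights_a, weights_b = nbow(doc_a), nbow(doc_b)

    def one_sided(src, dst):
        # Each source word moves all of its weight to its nearest target word.
        return sum(
            weight * min(euclidean(embeddings[w], embeddings[x]) for x in dst)
            for w, weight in src.items()
        )

    return max(one_sided(weights_a, weights_b), one_sided(weights_b, weights_a))


# Made-up 2-d embeddings, for illustration only.
embeddings = {
    "obama": [1.0, 0.0], "president": [1.0, 0.2],
    "speaks": [0.0, 1.0], "greets": [0.1, 1.0],
    "banana": [5.0, 5.0], "ripens": [5.0, 4.0],
}
d1 = ["obama", "speaks"]
d2 = ["president", "greets"]
d3 = ["banana", "ripens"]

near = relaxed_wmd(d1, d2, embeddings)
far = relaxed_wmd(d1, d3, embeddings)
print(near, far)
```

Documents with no words in common can still come out close when their words lie near each other in embedding space, which is the main appeal of this family of distances.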

For an example, see the illustration in the paper “From Word Embeddings To Document Distances” (figure not reproduced here).

WMD can be adapted into Sentence Mover’s Distance, which compares how far apart the sentence embeddings of two texts are.

2.2.2 SMD

Sentence Mover’s Similarity:
https://homes.cs.washington.edu/~nasmith/papers/clark+celikyilmaz+smith.acl19.pdf

