This article introduces semantic similarity computation and is intended as a practical reference for developers working on this kind of problem.
1. Matching a small amount of content (when it fits in memory)
Reference: NLP实践——基于SBERT的语义搜索,语义相似度计算,SimCSE、GenQ等无监督训练 - CSDN博客
The sentence-transformers GitHub repository: GitHub - UKPLab/sentence-transformers: Multilingual Sentence & Image Embeddings with BERT
The official sentence-transformers documentation: Pretrained Models — Sentence-Transformers documentation
The query code:
from sentence_transformers import SentenceTransformer, util
import json
import torch

# Create the model.
# The encoder here can be swapped for mpnet-base-v2 or another pretrained model.
# The model is downloaded automatically and cached under /root/.cache; to use a model
# stored in a local directory, just change the argument to that directory.
# Loading a local pretrained model works like huggingface's from_pretrained method:
# pass the local model path as the argument.
encoder = SentenceTransformer('paraphrase-MiniLM-L12-v2')
# encoder = SentenceTransformer('path-to-your-pretrained-model/paraphrase-MiniLM-L12-v2/')

answer_list = []

def encoding_all_data():
    base_data_path = ""  # fill in: path to the JSON QA data file
    input_question_list = []
    with open(base_data_path, 'r', encoding='utf-8') as fi:
        json_data = json.loads(fi.read())
        for each in json_data:
            input_data, output_data = each["input"], each["output"]
            input_question_list.append(input_data)
            answer_list.append(output_data)
    # Encode all questions into one embedding matrix held in memory.
    matrix = encoder.encode(input_question_list, convert_to_tensor=True)
    return matrix

matrix = encoding_all_data()

# Encode the query and return the answer with the highest cosine similarity.
def get_similar_query(query):
    sentence_vec = encoder.encode(query, convert_to_tensor=True)
    cos_scores = util.cos_sim(sentence_vec, matrix)
    max_score, max_index = torch.max(cos_scores, dim=1)
    index = max_index.cpu().numpy().tolist()[0]
    score = max_score.cpu().numpy().tolist()[0]
    return answer_list[index], score

if __name__ == '__main__':
    print("data ok")
    while True:
        print("begin compute sim")
        inputs = input()
        result = get_similar_query(inputs)
        print(result)
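For reference, encoding_all_data above assumes the data file is a JSON array of objects, each with an "input" (question) and an "output" (answer) field. A minimal sketch that writes such a file; the file name and the QA pairs are purely illustrative:

import json

# Illustrative QA pairs; replace with your own data.
sample_data = [
    {"input": "How do I reset my password?", "output": "Click 'Forgot password' on the login page."},
    {"input": "What are the opening hours?", "output": "Weekdays from 9:00 to 18:00."},
]

with open("qa_data.json", "w", encoding="utf-8") as fo:
    json.dump(sample_data, fo, ensure_ascii=False, indent=2)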
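If the top-k most similar questions are wanted instead of only the single best match, sentence-transformers also provides util.semantic_search. A sketch reusing the encoder, matrix and answer_list defined above; the function name and top_k=3 are arbitrary choices:

def get_top_k_answers(query, top_k=3):
    query_vec = encoder.encode(query, convert_to_tensor=True)
    # semantic_search returns one result list per query;
    # each hit is a dict with 'corpus_id' and 'score'.
    hits = util.semantic_search(query_vec, matrix, top_k=top_k)[0]
    return [(answer_list[hit["corpus_id"]], hit["score"]) for hit in hits]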
That concludes this article on semantic similarity computation; hopefully it is useful to other developers.