Relevance and Relevance Scoring in Elasticsearch (explain, boosting)

2024-01-02 14:40

This article introduces relevance and relevance scoring in Elasticsearch, using the explain search parameter and the boosting query.

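Use the explain parameter of a search request to inspect how a score is computed. The output further down reports idf and tf terms with the parameters k1 and b, which is the BM25 formula (Elasticsearch's default similarity) rather than classic TF-IDF: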

Dataset: see the CSDN blog post by IT之一小佬 on using Python to generate a large amount of data, write it into Elasticsearch (python helpers.bulk), and query it, part 2.

 

In a search request, explain defaults to false.

With explain set to false or omitted, the query is:

GET /personal_info_100000/_search
{"explain": false,"query": {"match": {"character": "学习"}}
}

Result:

"hits" : {"total" : {"value" : 10000,"relation" : "gte"},"max_score" : 4.277235,"hits" : [{"_index" : "personal_info_100000","_type" : "doc","_id" : "15","_score" : 4.277235,"_source" : {"id" : 15,"name" : "刘一","sex" : "男","age" : 25,"character" : "肯学习,有问题不逃避,愿意虚心向他人学习","subject" : "生物","grade" : 69,"create_time" : "2022-11-01 21:44:12"}},{"_index" : "personal_info_100000","_type" : "doc","_id" : "29","_score" : 4.277235,"_source" : {"id" : 29,"name" : "刘一","sex" : "男","age" : 32,"character" : "肯学习,有问题不逃避,愿意虚心向他人学习","subject" : "英语","grade" : 85,"create_time" : "2022-11-01 21:44:12"}},
......

With explain set to true, the query is:

GET /personal_info_100000/_search
{"explain": true,"query": {"match": {"character": "学习"}}
}

Result:

"hits" : {"total" : {"value" : 10000,"relation" : "gte"},"max_score" : 4.277235,"hits" : [{"_shard" : "[personal_info_100000][0]","_node" : "9xCKv5RGRNecuoPworyaUg","_index" : "personal_info_100000","_type" : "doc","_id" : "15","_score" : 4.277235,"_source" : {"id" : 15,"name" : "刘一","sex" : "男","age" : 25,"character" : "肯学习,有问题不逃避,愿意虚心向他人学习","subject" : "生物","grade" : 69,"create_time" : "2022-11-01 21:44:12"},"_explanation" : {"value" : 4.277235,"description" : "sum of:","details" : [{"value" : 1.6575089,"description" : "weight(character:学 in 2) [PerFieldSimilarity], result of:","details" : [{"value" : 1.6575089,"description" : "score(freq=2.0), computed as boost * idf * tf from:","details" : [{"value" : 2.2,"description" : "boost","details" : [ ]},{"value" : 1.1837717,"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details" : [{"value" : 30612,"description" : "n, number of documents containing term","details" : [ ]},{"value" : 100000,"description" : "N, total number of documents with field","details" : [ ]}]},{"value" : 0.63645136,"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details" : [{"value" : 2.0,"description" : "freq, occurrences of term within document","details" : [ ]},{"value" : 1.2,"description" : "k1, term saturation parameter","details" : [ ]},{"value" : 0.75,"description" : "b, length normalization parameter","details" : [ ]},{"value" : 18.0,"description" : "dl, length of field","details" : [ ]},{"value" : 19.23022,"description" : "avgdl, average length of field","details" : [ ]}]}]}]},{"value" : 2.6197262,"description" : "weight(character:习 in 2) [PerFieldSimilarity], result of:","details" : [{"value" : 2.6197262,"description" : "score(freq=2.0), computed as boost * idf * tf from:","details" : [{"value" : 2.2,"description" : "boost","details" : [ ]},{"value" : 1.870975,"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details" : [{"value" : 
15397,"description" : "n, number of documents containing term","details" : [ ]},{"value" : 100000,"description" : "N, total number of documents with field","details" : [ ]}]},{"value" : 0.63645136,"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details" : [{"value" : 2.0,"description" : "freq, occurrences of term within document","details" : [ ]},{"value" : 1.2,"description" : "k1, term saturation parameter","details" : [ ]},{"value" : 0.75,"description" : "b, length normalization parameter","details" : [ ]},{"value" : 18.0,"description" : "dl, length of field","details" : [ ]},{"value" : 19.23022,"description" : "avgdl, average length of field","details" : [ ]}]}]}]}]}},{"_shard" : "[personal_info_100000][0]","_node" : "9xCKv5RGRNecuoPworyaUg","_index" : "personal_info_100000","_type" : "doc","_id" : "29","_score" : 4.277235,"_source" : {"id" : 29,"name" : "刘一","sex" : "男","age" : 32,"character" : "肯学习,有问题不逃避,愿意虚心向他人学习","subject" : "英语","grade" : 85,"create_time" : "2022-11-01 21:44:12"},"_explanation" : {"value" : 4.277235,"description" : "sum of:","details" : [{"value" : 1.6575089,"description" : "weight(character:学 in 15) [PerFieldSimilarity], result of:","details" : [{"value" : 1.6575089,"description" : "score(freq=2.0), computed as boost * idf * tf from:","details" : [{"value" : 2.2,"description" : "boost","details" : [ ]},{"value" : 1.1837717,"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details" : [{"value" : 30612,"description" : "n, number of documents containing term","details" : [ ]},{"value" : 100000,"description" : "N, total number of documents with field","details" : [ ]}]},{"value" : 0.63645136,"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details" : [{"value" : 2.0,"description" : "freq, occurrences of term within document","details" : [ ]},{"value" : 1.2,"description" : "k1, term saturation parameter","details" : [ ]},{"value" : 
0.75,"description" : "b, length normalization parameter","details" : [ ]},{"value" : 18.0,"description" : "dl, length of field","details" : [ ]},{"value" : 19.23022,"description" : "avgdl, average length of field","details" : [ ]}]}]}]},{"value" : 2.6197262,"description" : "weight(character:习 in 15) [PerFieldSimilarity], result of:","details" : [{"value" : 2.6197262,"description" : "score(freq=2.0), computed as boost * idf * tf from:","details" : [{"value" : 2.2,"description" : "boost","details" : [ ]},{"value" : 1.870975,"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details" : [{"value" : 15397,"description" : "n, number of documents containing term","details" : [ ]},{"value" : 100000,"description" : "N, total number of documents with field","details" : [ ]}]},{"value" : 0.63645136,"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details" : [{"value" : 2.0,"description" : "freq, occurrences of term within document","details" : [ ]},{"value" : 1.2,"description" : "k1, term saturation parameter","details" : [ ]},{"value" : 0.75,"description" : "b, length normalization parameter","details" : [ ]},{"value" : 18.0,"description" : "dl, length of field","details" : [ ]},{"value" : 19.23022,"description" : "avgdl, average length of field","details" : [ ]}]}]}]}]}},
......
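The _explanation object is a tree in which every node has a value, a description, and a list of details, so it is easier to read once indented. The sketch below is one way to do that; render_explanation is a hypothetical helper (not part of any Elasticsearch client), and the sample dictionary is a trimmed copy of the explanation for document 15 with the deeper levels omitted. It also shows that the match query on "学习" was scored as two separate terms, 学 and 习, whose weights are summed to give the document's _score.

```python
def render_explanation(node, depth=0):
    """Return the explanation tree as a list of indented 'value  description' lines."""
    lines = ["  " * depth + f"{node['value']}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(render_explanation(child, depth + 1))
    return lines


# Trimmed copy of the _explanation for document 15 (deeper levels omitted).
explanation = {
    "value": 4.277235,
    "description": "sum of:",
    "details": [
        {
            "value": 1.6575089,
            "description": "weight(character:学 in 2) [PerFieldSimilarity], result of:",
            "details": [],
        },
        {
            "value": 2.6197262,
            "description": "weight(character:习 in 2) [PerFieldSimilarity], result of:",
            "details": [],
        },
    ],
}

print("\n".join(render_explanation(explanation)))

# The per-term weights add up to the document's _score (up to float rounding).
total = sum(child["value"] for child in explanation["details"])
print(total)
```

The same function can be applied to the full _explanation of any hit, since every level uses the same three keys.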

Bulk-insert a few documents using Kibana:

PUT /test_score/_bulk
{"index": {"_id": 1}}
{"content": "we use Elasticsearch to power the search"}
{"index": {"_id": 2}}
{"content": "we like elasticsearch"}
{"index": {"_id": 3}}
{"content": "The scoring of documents is calculated by the scoring formula"}
{"index": {"_id": 4}}
{"content": "you know ,for search"}
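A bulk body is newline-delimited JSON: each document takes one action line followed by one source line, and the body must end with a newline. The sketch below builds such a body in Python for the first two documents; build_bulk_body is a hypothetical helper written for illustration, and in practice the helpers.bulk function of the official Python client (mentioned in the dataset post) handles this for you.

```python
import json


def build_bulk_body(index_name, docs):
    """Build an NDJSON _bulk body: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = {
    1: {"content": "we use Elasticsearch to power the search"},
    2: {"content": "we like elasticsearch"},
}
body = build_bulk_body("test_score", docs)
print(body, end="")
```

Because each action line carries its own _index here, a body like this could be sent to the /_bulk endpoint without naming the index in the URL.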

With explain set to false or omitted, the query is:

GET /test_score/_search
{"explain": false,"query": {"match": {"content": "elasticsearch"}}
}

Result:

{"took" : 1,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.8713851,"hits" : [{"_index" : "test_score","_type" : "_doc","_id" : "2","_score" : 0.8713851,"_source" : {"content" : "we like elasticsearch"}},{"_index" : "test_score","_type" : "_doc","_id" : "1","_score" : 0.6489038,"_source" : {"content" : "we use Elasticsearch to power the search"}}]}
}

With explain set to true, the query is:

GET /test_score/_search
{"explain": true,"query": {"match": {"content": "elasticsearch"}}
}

Result:

{"took" : 1,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.8713851,"hits" : [{"_shard" : "[test_score][0]","_node" : "9xCKv5RGRNecuoPworyaUg","_index" : "test_score","_type" : "_doc","_id" : "2","_score" : 0.8713851,"_source" : {"content" : "we like elasticsearch"},"_explanation" : {"value" : 0.8713851,"description" : "weight(content:elasticsearch in 1) [PerFieldSimilarity], result of:","details" : [{"value" : 0.8713851,"description" : "score(freq=1.0), computed as boost * idf * tf from:","details" : [{"value" : 2.2,"description" : "boost","details" : [ ]},{"value" : 0.6931472,"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details" : [{"value" : 2,"description" : "n, number of documents containing term","details" : [ ]},{"value" : 4,"description" : "N, total number of documents with field","details" : [ ]}]},{"value" : 0.5714286,"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details" : [{"value" : 1.0,"description" : "freq, occurrences of term within document","details" : [ ]},{"value" : 1.2,"description" : "k1, term saturation parameter","details" : [ ]},{"value" : 0.75,"description" : "b, length normalization parameter","details" : [ ]},{"value" : 3.0,"description" : "dl, length of field","details" : [ ]},{"value" : 6.0,"description" : "avgdl, average length of field","details" : [ ]}]}]}]}},{"_shard" : "[test_score][0]","_node" : "9xCKv5RGRNecuoPworyaUg","_index" : "test_score","_type" : "_doc","_id" : "1","_score" : 0.6489038,"_source" : {"content" : "we use Elasticsearch to power the search"},"_explanation" : {"value" : 0.6489038,"description" : "weight(content:elasticsearch in 0) [PerFieldSimilarity], result of:","details" : [{"value" : 0.6489038,"description" : "score(freq=1.0), computed as boost * idf * tf from:","details" : [{"value" : 2.2,"description" : "boost","details" : [ 
]},{"value" : 0.6931472,"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details" : [{"value" : 2,"description" : "n, number of documents containing term","details" : [ ]},{"value" : 4,"description" : "N, total number of documents with field","details" : [ ]}]},{"value" : 0.42553192,"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details" : [{"value" : 1.0,"description" : "freq, occurrences of term within document","details" : [ ]},{"value" : 1.2,"description" : "k1, term saturation parameter","details" : [ ]},{"value" : 0.75,"description" : "b, length normalization parameter","details" : [ ]},{"value" : 7.0,"description" : "dl, length of field","details" : [ ]},{"value" : 6.0,"description" : "avgdl, average length of field","details" : [ ]}]}]}]}}]}
}
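The two scores can be reproduced by hand from the formulas printed in the explanation. The sketch below plugs in the values reported above; the function names are made up for illustration, and the boost of 2.2 is taken as-is from the output (it presumably combines the default query boost of 1.0 with a k1 + 1 factor, which is an assumption). Both documents share the same idf and freq, so the only difference is the field length dl: the shorter document 2 (dl = 3) gets a larger tf than document 1 (dl = 7) and therefore ranks first.

```python
import math


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), as printed in the explanation."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as printed in the explanation."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(boost, n, N, freq, dl, avgdl):
    """score = boost * idf * tf."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Values copied from the _explanation output for the test_score index.
score_doc2 = bm25_score(boost=2.2, n=2, N=4, freq=1.0, dl=3.0, avgdl=6.0)
score_doc1 = bm25_score(boost=2.2, n=2, N=4, freq=1.0, dl=7.0, avgdl=6.0)

print(score_doc2, score_doc1)
```

The results should agree with the reported 0.8713851 and 0.6489038 to about six decimal places; small differences in the last digit are expected because Elasticsearch computes these values with 32-bit floats.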

Boosting Relevance: Adjusting Relevance Scores

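Boosting is a way to control relevance. The boost parameter works as follows:

  • boost > 1: the clause's contribution to the score is raised
  • 0 < boost < 1: the clause's contribution to the score is lowered
  • boost < 0: contributed a negative score in older versions; Elasticsearch 7.0 and later reject negative boosts, so use a value between 0 and 1 to demote instead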

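The boosting query returns the documents that match the positive query and lowers the relevance score of those that also match the negative query. This lets you demote certain documents without excluding them: they still appear in the search results, only with a lower score than a normal match.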

Use case: results that contain certain content should still appear, but rank lower.

Query 1, with negative_boost set to 0.2:

GET /test_score/_search
{
  "query": {
    "boosting": {
      "positive": {"term": {"content": {"value": "elasticsearch"}}},
      "negative": {"term": {"content": {"value": "like"}}},
      "negative_boost": 0.2
    }
  }
}

Result:

#! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security.
{"took" : 1,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.6489038,"hits" : [{"_index" : "test_score","_type" : "_doc","_id" : "1","_score" : 0.6489038,"_source" : {"content" : "we use Elasticsearch to power the search"}},{"_index" : "test_score","_type" : "_doc","_id" : "2","_score" : 0.17427702,"_source" : {"content" : "we like elasticsearch"}}]}
}

Query 2, with negative_boost set to 0.8:

GET /test_score/_search
{
  "query": {
    "boosting": {
      "positive": {"term": {"content": {"value": "elasticsearch"}}},
      "negative": {"term": {"content": {"value": "like"}}},
      "negative_boost": 0.8
    }
  }
}

Result:

{"took" : 0,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.6971081,"hits" : [{"_index" : "test_score","_type" : "_doc","_id" : "2","_score" : 0.6971081,"_source" : {"content" : "we like elasticsearch"}},{"_index" : "test_score","_type" : "_doc","_id" : "1","_score" : 0.6489038,"_source" : {"content" : "we use Elasticsearch to power the search"}}]}
}
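The two results are consistent with a simple rule: a document that matches the negative query keeps its positive-query score multiplied by negative_boost, while other documents keep their score unchanged. The sketch below reproduces that arithmetic with the scores from the earlier results; boosting_score and rank are hypothetical names used only for illustration and do not call Elasticsearch.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Demote only the documents that also match the negative clause."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# Positive-query scores taken from the earlier results on test_score.
positive_scores = {"1": 0.6489038, "2": 0.8713851}
# Only document 2 ("we like elasticsearch") contains the negative term "like".
matches_negative = {"1": False, "2": True}


def rank(negative_boost):
    """Return (doc_id, score) pairs sorted by descending score."""
    scored = {
        doc_id: boosting_score(score, matches_negative[doc_id], negative_boost)
        for doc_id, score in positive_scores.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


print(rank(0.2))  # document 2 drops below document 1
print(rank(0.8))  # document 2 is demoted but still ranks first
```

With negative_boost at 0.2, document 2 falls to roughly 0.174 and ranks last; at 0.8 it only falls to roughly 0.697, which is still above document 1's 0.649, matching the order in the two results above.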

