ElasticSearch Aggregation (6)

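Contents

- Metrics aggregations
  - Percentiles
    - Script
  - Rate aggregation
    - Syntax
  - Scripted metric aggregation
  - Stats aggregation
    - Script
    - Missing value
  - String stats aggregation
    - Character distribution
    - Script
    - Missing value
  - Sum aggregation
  - Top metrics aggregation
    - size
    - Example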

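Metrics aggregations

Percentiles

The percentiles aggregation is a multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents. These values can be extracted from specific numeric or histogram fields in the documents.

When a range of percentiles is retrieved, they can be used to estimate the data distribution and determine whether the data is skewed, bimodal, and so on.

Let's look at a range of percentiles representing load time: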

curl -X GET "localhost:9200/latency/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time"
      }
    }
  }
}
'

By default, the percentiles metric generates a range of percentiles: [ 1, 5, 25, 50, 75, 95, 99 ]. The response looks like this:

{
  ...
  "aggregations": {
    "load_time_outlier": {
      "values": {
        "1.0": 5.0,
        "5.0": 25.0,
        "25.0": 165.0,
        "50.0": 445.0,
        "75.0": 725.0,
        "95.0": 945.0,
        "99.0": 985.0
      }
    }
  }
}

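As you can see, the aggregation returns a calculated value for each percentile in the default range. If we assume response times are in milliseconds, it is immediately obvious that the page normally loads in 10-725 ms but occasionally spikes to 945-985 ms. In other words, 95% of responses take at most 945 ms and 99% take at most 985 ms.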
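To make the reading of these numbers concrete, here is a minimal Python sketch of an exact nearest-rank percentile. The load times are made-up sample data and `nearest_rank_percentile` is a hypothetical helper, not part of any Elasticsearch client. Elasticsearch itself returns approximate percentiles (TDigest is the default algorithm, as far as I know), so this only illustrates what "the p-th percentile" means.

```python
import math


def nearest_rank_percentile(values, percent):
    """Return the smallest value such that at least `percent`% of values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank method: the rank is 1-based, so subtract 1 when indexing.
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical load times in milliseconds: 1, 2, ..., 100 (not from the article).
load_times = list(range(1, 101))

for p in (50, 95, 99):
    print(p, nearest_rank_percentile(load_times, p))
```

With this sample, the 95th percentile is the value below which 95 of the 100 measurements fall, which is the same reading applied to the 945 ms figure above.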

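Often, administrators are only interested in outliers, that is, the extreme percentiles. We can specify just the percents we are interested in (requested percentiles must be values between 0 and 100):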

curl -X GET "localhost:9200/latency/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "percents": [ 95, 99, 99.9 ]
      }
    }
  }
}
'

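Script

If you need to run the aggregation against values that are not indexed, use a runtime field. For example, if our load times are in milliseconds but you want the percentiles calculated in seconds: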

curl -X GET "localhost:9200/latency/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "runtime_mappings": {
    "load_time.seconds": {
      "type": "long",
      "script": {
        "source": "emit(doc[\u0027load_time\u0027].value / params.timeUnit)",
        "params": {
          "timeUnit": 1000
        }
      }
    }
  },
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time.seconds"
      }
    }
  }
}
'
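One thing worth checking in the request above: the runtime field is declared as long, so if load_time is itself an integer field, the division is presumably an integer division and sub-second values collapse to 0. The Python sketch below only mimics that arithmetic for non-negative values. `to_whole_seconds` is a hypothetical helper, and the integer-field assumption is mine rather than the article's.

```python
def to_whole_seconds(load_time_ms, time_unit=1000):
    """Integer division, mimicking long / int arithmetic for non-negative inputs."""
    return load_time_ms // time_unit


# Sample millisecond values; 445, 945 and 985 appear in the earlier response.
for ms in (445, 945, 985, 1500):
    print(ms, "ms ->", to_whole_seconds(ms), "s")
```

If fractional seconds matter, a runtime field of type double with a floating-point divisor would be the variation to try.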

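Rate aggregation

A rate metrics aggregation can only be used inside a date_histogram, and it calculates the rate of documents or of a field in each date_histogram bucket. The field values can be extracted from specific numeric or histogram fields in the documents.

Syntax

On its own, a rate aggregation looks like this: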

{
  "rate": {
    "unit": "month",
    "field": "requests"
  }
}

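The following request groups all sales records into monthly buckets and then converts the number of sales transactions in each bucket into an annual sales rate.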

curl -X GET "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "by_date": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "my_rate": {
          "rate": {
            "unit": "year"
          }
        }
      }
    }
  }
}
'

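The response returns the annual rate of transactions in each bucket. Since there are 12 months in a year, the annual rate is calculated automatically by multiplying the monthly rate by 12.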

{
  ...
  "aggregations" : {
    "by_date" : {
      "buckets" : [
        {
          "key_as_string" : "2015/01/01 00:00:00",
          "key" : 1420070400000,
          "doc_count" : 3,
          "my_rate" : {
            "value" : 36.0
          }
        },
        {
          "key_as_string" : "2015/02/01 00:00:00",
          "key" : 1422748800000,
          "doc_count" : 2,
          "my_rate" : {
            "value" : 24.0
          }
        },
        {
          "key_as_string" : "2015/03/01 00:00:00",
          "key" : 1425168000000,
          "doc_count" : 2,
          "my_rate" : {
            "value" : 24.0
          }
        }
      ]
    }
  }
}
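The monthly-to-annual conversion can be checked by hand. The following Python sketch reproduces the my_rate values from the doc_count values in the response. `MONTHS_PER_UNIT` and `rescale_monthly_count` are names invented for this illustration and only cover calendar units that are whole multiples of a month.

```python
# Number of months in each target unit (illustrative subset only).
MONTHS_PER_UNIT = {"month": 1, "quarter": 3, "year": 12}


def rescale_monthly_count(doc_count, unit):
    """Convert a per-month document count into a rate per `unit`."""
    return float(doc_count * MONTHS_PER_UNIT[unit])


# doc_count values copied from the response above.
monthly_doc_counts = {"2015/01": 3, "2015/02": 2, "2015/03": 2}

annual_rates = {
    month: rescale_monthly_count(count, "year")
    for month, count in monthly_doc_counts.items()
}
print(annual_rates)
```

Three documents in January become a rate of 36.0 per year, matching the first bucket.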

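Instead of counting documents, it is also possible to calculate the sum of all values of a field in the documents of each bucket, or the number of values in each bucket. The following request groups all sales records into monthly buckets, then calculates the total monthly sales and converts it into average daily sales.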

curl -X GET "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "by_date": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "avg_price": {
          "rate": {
            "field": "price",
            "unit": "day"
          }
        }
      }
    }
  }
}
'

The response contains the average daily sales for each month.

{
  ...
  "aggregations" : {
    "by_date" : {
      "buckets" : [
        {
          "key_as_string" : "2015/01/01 00:00:00",
          "key" : 1420070400000,
          "doc_count" : 3,
          "avg_price" : {
            "value" : 17.741935483870968
          }
        },
        {
          "key_as_string" : "2015/02/01 00:00:00",
          "key" : 1422748800000,
          "doc_count" : 2,
          "avg_price" : {
            "value" : 2.142857142857143
          }
        },
        {
          "key_as_string" : "2015/03/01 00:00:00",
          "key" : 1425168000000,
          "doc_count" : 2,
          "avg_price" : {
            "value" : 12.096774193548388
          }
        }
      ]
    }
  }
}
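The daily figures appear to be the monthly total divided by the number of days in that calendar month. The Python sketch below checks this. The monthly totals (550, 60, 375) are not given in the article; I inferred them by multiplying each returned value by the month length, so treat them as assumptions. `daily_rate` is a hypothetical helper.

```python
import calendar


def daily_rate(monthly_total, year, month):
    """Divide a monthly total by the number of days in that calendar month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return monthly_total / days_in_month


# Assumed monthly price totals, inferred from the response (value * days in month).
monthly_totals = {(2015, 1): 550.0, (2015, 2): 60.0, (2015, 3): 375.0}

for (year, month), total in monthly_totals.items():
    print(year, month, daily_rate(total, year, month))
```

February 2015 has 28 days while January and March have 31, which is why months with similar totals can still show different daily rates.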

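By adding the mode parameter with the value value_count, we can change the calculation from the sum of the field's values to the number of values: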

curl -X GET "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "by_date": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "avg_number_of_sales_per_year": {
          "rate": {
            "field": "price",
            "unit": "year",
            "mode": "value_count"
          }
        }
      }
    }
  }
}
'

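Scripted metric aggregation

A metric aggregation that executes using scripts. Skipped for now.

Stats aggregation

A multi-value metrics aggregation that computes statistics over numeric values extracted from the aggregated documents. The statistics returned are min, max, sum, count, and avg.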

curl -X POST "localhost:9200/exams/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "grades_stats": { "stats": { "field": "grade" } }
  }
}
'

Response:

{
  ...
  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0
    }
  }
}
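For reference, the five statistics are easy to reproduce locally. This is a small Python sketch, not how Elasticsearch computes them internally. The two grades are inferred from the response (count 2, min 50, max 100), and `stats` is a hypothetical helper.

```python
def stats(values):
    """Compute the five statistics that the stats aggregation reports."""
    if not values:
        raise ValueError("values must not be empty")
    total = float(sum(values))
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": total / len(values),
        "sum": total,
    }


# Grades inferred from the response above: two values, 50 and 100.
grades = [50, 100]
print(stats(grades))
```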

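Script

Omitted.

Missing value

Omitted.

String stats aggregation

A multi-value metrics aggregation that computes statistics over string values extracted from the aggregated documents. These values can be retrieved from specific keyword fields.

The string stats aggregation returns the following results:

count: the number of non-empty fields counted.
min_length: the length of the shortest term.
max_length: the length of the longest term.
avg_length: the average length computed over all terms.
entropy: the Shannon entropy value computed over all terms collected by the aggregation. Shannon entropy quantifies the amount of information contained in the field. It is a very useful metric for measuring a wide range of properties of a data set, such as diversity, similarity, and randomness.

For example: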

curl -X POST "localhost:9200/my-index-000001/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "message_stats": { "string_stats": { "field": "message.keyword" } }
  }
}
'

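The aggregation above computes string statistics for the message field across all documents. The aggregation type is string_stats, and the field parameter names the field the statistics are computed on. The response: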

{
  ...
  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791
    }
  }
}

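Character distribution

The entropy value is computed from the probability of each character appearing in all terms collected by the aggregation. To view the probability distribution for all characters, add the show_distribution parameter (default: false).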

curl -X POST "localhost:9200/my-index-000001/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "message_stats": {
      "string_stats": {
        "field": "message.keyword",
        "show_distribution": true
      }
    }
  }
}
'

With show_distribution set to true, the response includes the probability distribution for all characters:

{
  ...
  "aggregations": {
    "message_stats": {
      "count": 5,
      "min_length": 24,
      "max_length": 30,
      "avg_length": 28.8,
      "entropy": 3.94617750050791,
      "distribution": {
        " ": 0.1527777777777778,
        "e": 0.14583333333333334,
        "s": 0.09722222222222222,
        "m": 0.08333333333333333,
        "t": 0.0763888888888889,
        "h": 0.0625,
        "a": 0.041666666666666664,
        "i": 0.041666666666666664,
        "r": 0.041666666666666664,
        "g": 0.034722222222222224,
        "n": 0.034722222222222224,
        "o": 0.034722222222222224,
        "u": 0.034722222222222224,
        "b": 0.027777777777777776,
        "w": 0.027777777777777776,
        "c": 0.013888888888888888,
        "E": 0.006944444444444444,
        "l": 0.006944444444444444,
        "1": 0.006944444444444444,
        "2": 0.006944444444444444,
        "3": 0.006944444444444444,
        "4": 0.006944444444444444,
        "y": 0.006944444444444444
      }
    }
  }
}

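Tip: the distribution object shows the probability of each character appearing in all terms. The characters are sorted by descending probability.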
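The relationship between the character distribution and the entropy value can be sketched in a few lines of Python. The terms below are made up, and `char_distribution` and `shannon_entropy` are hypothetical helpers. The base-2 logarithm is an inference: the reported entropy of about 3.95 for 23 distinct characters exceeds ln(23), so it is presumably measured in bits.

```python
import math
from collections import Counter


def char_distribution(terms):
    """Probability of each character across all terms, most frequent first."""
    counts = Counter("".join(terms))
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.most_common()}


def shannon_entropy(distribution):
    """Shannon entropy in bits: the sum of -p * log2(p) over all characters."""
    return -sum(p * math.log2(p) for p in distribution.values())


# Hypothetical terms (not from the article): two characters, equally likely.
terms = ["aab", "b"]
distribution = char_distribution(terms)
print(distribution)
print(shannon_entropy(distribution))
```

Two equally likely characters carry exactly one bit of information per character, which makes this a convenient sanity check before applying the same formula to a real distribution.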

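Script

Omitted.

Missing value

Omitted.

Sum aggregation

Omitted.

Top metrics aggregation

The top_metrics aggregation selects metrics from the document with the largest or smallest "sort" value. For example, the following gets the value of the m field on the document with the largest value of s: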

curl -X POST "localhost:9200/test/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index": {}}
{"s": 1, "m": 3.1415}
{"index": {}}
{"s": 2, "m": 1.0}
{"index": {}}
{"s": 3, "m": 2.71828}
'
curl -X POST "localhost:9200/test/_search?filter_path=aggregations&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": {"field": "m"},
        "sort": {"s": "desc"}
      }
    }
  }
}
'

Response:

{
  "aggregations": {
    "tm": {
      "top": [ {"sort": [3], "metrics": {"m": 2.718280076980591 } } ]
    }
  }
}

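Tip: top_metrics is fairly similar to top_hits in spirit, but because it is more limited it can do its job using less memory and is often faster.

size

top_metrics can return the metrics of the top few documents using the size parameter: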

curl -X POST "localhost:9200/test/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index": {}}
{"s": 1, "m": 3.1415}
{"index": {}}
{"s": 2, "m": 1.0}
{"index": {}}
{"s": 3, "m": 2.71828}
'
curl -X POST "localhost:9200/test/_search?filter_path=aggregations&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": {"field": "m"},
        "sort": {"s": "desc"},
        "size": 3
      }
    }
  }
}
'

Response:

{
  "aggregations": {
    "tm": {
      "top": [
        {"sort": [3], "metrics": {"m": 2.718280076980591 } },
        {"sort": [2], "metrics": {"m": 1.0 } },
        {"sort": [1], "metrics": {"m": 3.1414999961853027 } }
      ]
    }
  }
}

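The default size is 1. The default maximum size is 10 because the aggregation's working storage is "dense", meaning we allocate size slots for every bucket. 10 is a very conservative default maximum, and you can raise it if you need to by changing the top_metrics_max_size index setting. Be aware, though, that large sizes can take a fair amount of memory, especially when they sit inside an aggregation that creates many buckets, such as a large terms aggregation. If you still want to raise it, use something like: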

curl -X PUT "localhost:9200/test/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "top_metrics_max_size": 100
}
'

Tip: if size is greater than 1, the top_metrics aggregation cannot be the target of a sort.
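The selection logic and the slightly odd decimals in the responses (2.71828 coming back as 2.718280076980591) can both be imitated in Python. The sketch assumes the m field was dynamically mapped as a 32-bit float, which would explain the rounding. `top_metrics` and `as_float32` are hypothetical helpers that mirror the request parameters, not client API calls.

```python
import struct


def as_float32(value):
    """Round a Python float to 32-bit precision, then widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def top_metrics(docs, metric_field, sort_field, order="desc", size=1):
    """Return the metric of the `size` documents with the largest or smallest sort value."""
    ordered = sorted(docs, key=lambda d: d[sort_field], reverse=(order == "desc"))
    return [
        {"sort": [d[sort_field]], "metrics": {metric_field: as_float32(d[metric_field])}}
        for d in ordered[:size]
    ]


# The three documents indexed by the bulk request above.
docs = [{"s": 1, "m": 3.1415}, {"s": 2, "m": 1.0}, {"s": 3, "m": 2.71828}]

for row in top_metrics(docs, "m", "s", size=3):
    print(row)
```

Leaving size at its default of 1 returns only the document with s = 3, as in the first response.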

Example

This aggregation should be quite useful inside a terms aggregation, for example to find the last value reported by each server.

curl -X PUT "localhost:9200/node?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "ip": {"type": "ip"},
      "date": {"type": "date"}
    }
  }
}
'
curl -X POST "localhost:9200/node/_bulk?refresh&pretty" -H 'Content-Type: application/json' -d'
{"index": {}}
{"ip": "192.168.0.1", "date": "2020-01-01T01:01:01", "m": 1}
{"index": {}}
{"ip": "192.168.0.1", "date": "2020-01-01T02:01:01", "m": 2}
{"index": {}}
{"ip": "192.168.0.2", "date": "2020-01-01T02:01:01", "m": 3}
'
curl -X POST "localhost:9200/node/_search?filter_path=aggregations&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "ip": {
      "terms": {
        "field": "ip"
      },
      "aggs": {
        "tm": {
          "top_metrics": {
            "metrics": {"field": "m"},
            "sort": {"date": "desc"}
          }
        }
      }
    }
  }
}
'

Response:

{
  "aggregations": {
    "ip": {
      "buckets": [
        {
          "key": "192.168.0.1",
          "doc_count": 2,
          "tm": {
            "top": [ {"sort": ["2020-01-01T02:01:01.000Z"], "metrics": {"m": 2 } } ]
          }
        },
        {
          "key": "192.168.0.2",
          "doc_count": 1,
          "tm": {
            "top": [ {"sort": ["2020-01-01T02:01:01.000Z"], "metrics": {"m": 3 } } ]
          }
        }
      ],
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0
    }
  }
}
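What the terms plus top_metrics combination does here can be expressed as a plain "latest value per key" loop. The Python sketch below runs over the same three documents. `last_value_per_key` is a hypothetical helper, and it ignores everything the real aggregation handles for you (sharding, bucket limits, tie-breaking).

```python
from datetime import datetime


def last_value_per_key(docs, key_field, date_field, metric_field):
    """For each key, keep the metric of the document with the most recent date."""
    latest = {}
    for doc in docs:
        key = doc[key_field]
        when = datetime.fromisoformat(doc[date_field])
        # Replace the stored entry only when this document is strictly newer.
        if key not in latest or when > latest[key][0]:
            latest[key] = (when, doc[metric_field])
    return {key: value for key, (_, value) in latest.items()}


# The documents indexed by the bulk request above.
docs = [
    {"ip": "192.168.0.1", "date": "2020-01-01T01:01:01", "m": 1},
    {"ip": "192.168.0.1", "date": "2020-01-01T02:01:01", "m": 2},
    {"ip": "192.168.0.2", "date": "2020-01-01T02:01:01", "m": 3},
]

print(last_value_per_key(docs, "ip", "date", "m"))
```

Each server ends up with the m value from its most recent document, which matches the two buckets in the response.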