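Contents
- Elasticsearch Aggregation (Part 7)
- Pipeline aggregations
- Average bucket aggregation
- Syntax
- Parameters
- Response body
- Example
- Bucket script aggregation
- Syntax
- Bucket selector aggregation
- Syntax
- Bucket sort aggregation
- Syntax
- Truncating without sorting
- Cumulative cardinality aggregation
- Syntax
- Incremental cumulative cardinality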
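Elasticsearch Aggregation (Part 7)
Pipeline aggregations
Pipeline aggregations work on the output of other aggregations rather than on document sets. There are many types of pipeline aggregation, each computing different information from other aggregations, but they fall into two families:
Parent: pipeline aggregations that are provided with the output of their parent aggregation and can compute new buckets or new aggregations to add to existing buckets.
Sibling: pipeline aggregations that are provided with the output of a sibling aggregation and can compute a new aggregation at the same level as that sibling.
Pipeline aggregations reference the aggregations they need through the buckets_path parameter, whose value follows the buckets_path syntax described below.
Pipeline aggregations cannot have sub-aggregations, but depending on its type, a pipeline aggregation can reference another pipeline in its buckets_path, which allows pipeline aggregations to be chained.
Note: because pipeline aggregations only add to the output, when pipeline aggregations are chained the output of each one is included in the final output.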
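buckets_path syntax
Most pipeline aggregations require another aggregation as their input. The input aggregation is defined through the buckets_path parameter, which follows this format: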
AGG_SEPARATOR = `>` ;
METRIC_SEPARATOR = `.` ;
AGG_NAME = <the name of the aggregation> ;
METRIC = <the name of the metric (in case of multi-value metrics aggregation)> ;
MULTIBUCKET_KEY = `[<KEY_NAME>]`
PATH = <AGG_NAME><MULTIBUCKET_KEY>? (<AGG_SEPARATOR>, <AGG_NAME> )* ( <METRIC_SEPARATOR>, <METRIC> ) ;
For example, the path my_bucket>my_stats.avg points to the avg value of the my_stats metric, which is contained in the my_bucket bucket aggregation.
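To make the grammar more concrete, here is a minimal Python sketch that splits a buckets_path string into its parts. It is an illustration only, not how Elasticsearch parses paths: the names parse_buckets_path and PathStep are hypothetical, and the sketch assumes that bucket keys and aggregation names contain no `>`, `.`, or quote characters.

```python
import re
from typing import NamedTuple, Optional


class PathStep(NamedTuple):
    agg_name: str
    bucket_key: Optional[str]  # key selected with the [KEY_NAME] syntax, if any


def parse_buckets_path(path):
    """Split a buckets_path string into aggregation steps and an optional metric."""
    parts = path.split(">")  # AGG_SEPARATOR
    steps = []
    metric = None
    for index, part in enumerate(parts):
        is_last = index == len(parts) - 1
        # Only the final element may carry ".metric" (METRIC_SEPARATOR).
        if is_last and "." in part and "[" not in part:
            part, metric = part.split(".", 1)
        match = re.fullmatch(r"([^\[\]]+)(?:\[['\"]?([^'\"\]]+)['\"]?\])?", part)
        if match is None:
            raise ValueError(f"malformed path element: {part!r}")
        steps.append(PathStep(match.group(1), match.group(2)))
    return steps, metric


steps, metric = parse_buckets_path("my_bucket>my_stats.avg")
print(steps, metric)
```

For the path above, the sketch yields two steps (my_bucket, then my_stats) and the metric avg; a path such as sale_type['hat']>sales yields the key hat on the first step and no metric.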
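Paths are relative to the position of the pipeline aggregation; they are not absolute. For example, the following moving average is embedded inside a date histogram and refers to a sibling metric, the_sum: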
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{"aggs": {"my_date_histo": {"date_histogram": {"field": "timestamp","calendar_interval": "day"},"aggs": {"the_sum": {"sum": { "field": "lemmings" } },"the_movavg": {"moving_avg": { "buckets_path": "the_sum" } }}}}
}
'
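The example above applies moving_avg to the the_sum metric across the buckets of the date histogram. Note that moving_avg is deprecated since Elasticsearch 6.4 and was removed in 8.0, where moving_fn is its replacement.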
buckets_path can also be used in sibling pipeline aggregations:
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{"aggs": {"sales_per_month": {"date_histogram": {"field": "date","calendar_interval": "month"},"aggs": {"sales": {"sum": {"field": "price"}}}},"max_monthly_sales": {"max_bucket": {"buckets_path": "sales_per_month>sales" }}}
}
'
Here buckets_path tells max_bucket to take the maximum value of the sales aggregation inside the sales_per_month date histogram.
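As a rough illustration of what a sibling aggregation such as max_bucket computes, the sketch below walks a list of already-computed buckets and picks the largest sales value. The function name, the shape of the returned dict, and the sample figures (borrowed from the monthly sales responses later in this article) are assumptions made for the example.

```python
def max_bucket(buckets, metric_name):
    """Return the largest metric value and the keys of the buckets that hold it."""
    values = [
        (bucket["key_as_string"], bucket[metric_name]["value"])
        for bucket in buckets
        if bucket[metric_name]["value"] is not None  # ignore empty buckets
    ]
    if not values:
        return {"keys": [], "value": None}
    top = max(value for _, value in values)
    return {"keys": [key for key, value in values if value == top], "value": top}


sales_per_month = [
    {"key_as_string": "2015/01/01 00:00:00", "sales": {"value": 550.0}},
    {"key_as_string": "2015/02/01 00:00:00", "sales": {"value": 60.0}},
    {"key_as_string": "2015/03/01 00:00:00", "sales": {"value": 375.0}},
]

max_monthly_sales = max_bucket(sales_per_month, "sales")
print(max_monthly_sales)
```

The result sits next to sales_per_month rather than inside each bucket, which is what distinguishes the sibling family from the parent family.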
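If a sibling pipeline aggregation references a multi-bucket aggregation, such as a terms aggregation, it can also select specific keys from that multi-bucket. For example, a bucket_script can select specific buckets by key and use them in its calculation: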
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{"aggs": {"sales_per_month": {"date_histogram": {"field": "date","calendar_interval": "month"},"aggs": {"sale_type": {"terms": {"field": "type"},"aggs": {"sales": {"sum": {"field": "price"}}}},"hat_vs_bag_ratio": {"bucket_script": {"buckets_path": {"hats": "sale_type['hat']>sales", "bags": "sale_type['bag']>sales" },"script": "params.hats / params.bags"}}}}}
}
'
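Special paths
Instead of pointing to a metric, buckets_path can use the special _count path. This tells the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document count of each bucket rather than on a specific metric: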
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{"aggs": {"my_date_histo": {"date_histogram": {"field": "timestamp","calendar_interval": "day"},"aggs": {"the_movavg": {"moving_avg": { "buckets_path": "_count" } }}}}
}
'
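Average bucket aggregation
A sibling pipeline aggregation that calculates the mean value of a specified metric in a sibling aggregation. The specified metric must be numeric, and the sibling aggregation must be a multi-bucket aggregation.
Syntax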
"avg_bucket": {"buckets_path": "sales_per_month>sales","gap_policy": "skip","format": "#,##0.00;(#,##0.00)"
}
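Parameters
buckets_path: (Required, string) Path to the buckets to average.
gap_policy: (Optional, string) Policy to apply when gaps are found in the data. Defaults to skip.
format: (Optional, string) DecimalFormat pattern for the output value. If specified, the formatted value is returned in the aggregation's value_as_string property.
Response body
value: (float) Mean of the metric specified in buckets_path.
value_as_string: (string) Formatted output value of the aggregation. This property is only provided if a format is specified in the request.
Example
The following avg_monthly_sales aggregation uses avg_bucket to calculate the average sales per month: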
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{"size": 0,"aggs": {"sales_per_month": {"date_histogram": {"field": "date","calendar_interval": "month"},"aggs": {"sales": {"sum": {"field": "price"}}}},"avg_monthly_sales": { "avg_bucket": {"buckets_path": "sales_per_month>sales","gap_policy": "skip","format": "#,##0.00;(#,##0.00)"} }}
}
'
Response:
{"took": 11,"timed_out": false,"_shards": ...,"hits": ...,"aggregations": {"sales_per_month": {"buckets": [{"key_as_string": "2015/01/01 00:00:00","key": 1420070400000,"doc_count": 3,"sales": {"value": 550.0}},{"key_as_string": "2015/02/01 00:00:00","key": 1422748800000,"doc_count": 2,"sales": {"value": 60.0}},{"key_as_string": "2015/03/01 00:00:00","key": 1425168000000,"doc_count": 2,"sales": {"value": 375.0}}]},"avg_monthly_sales": {"value": 328.33333333333333,"value_as_string": "328.33"}}
}
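The arithmetic behind the response above can be checked with a short Python sketch. It assumes the three monthly sums from the response and models gap_policy skip as simply ignoring buckets whose value is missing; the function name avg_bucket and the use of Python's format syntax in place of a DecimalFormat pattern are choices made for this illustration.

```python
def avg_bucket(buckets, metric_name):
    """Average a metric over buckets, skipping buckets without a value."""
    values = [
        bucket[metric_name]["value"]
        for bucket in buckets
        if bucket[metric_name]["value"] is not None  # gap_policy: skip
    ]
    if not values:
        return None
    return sum(values) / len(values)


sales_per_month = [
    {"key_as_string": "2015/01/01 00:00:00", "sales": {"value": 550.0}},
    {"key_as_string": "2015/02/01 00:00:00", "sales": {"value": 60.0}},
    {"key_as_string": "2015/03/01 00:00:00", "sales": {"value": 375.0}},
]

avg_monthly_sales = avg_bucket(sales_per_month, "sales")
value_as_string = f"{avg_monthly_sales:,.2f}"  # rough stand-in for "#,##0.00"
print(avg_monthly_sales, value_as_string)
```

(550 + 60 + 375) / 3 is about 328.33, which matches the value and value_as_string fields in the response.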
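Bucket script aggregation
A parent pipeline aggregation that executes a script once for each bucket of the parent multi-bucket aggregation, with the metrics named in buckets_path passed to the script as parameters.
Syntax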
{"bucket_script": {"buckets_path": {"my_var1": "the_sum", "my_var2": "the_value_count"},"script": "params.my_var1 / params.my_var2"}
}
The following snippet calculates the ratio of t-shirt sales to total sales for each month, as a percentage:
curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{"size": 0,"aggs": {"sales_per_month": {"date_histogram": {"field": "date","calendar_interval": "month"},"aggs": {"total_sales": {"sum": {"field": "price"}},"t-shirts": {"filter": {"term": {"type": "t-shirt"}},"aggs": {"sales": {"sum": {"field": "price"}}}},"t-shirt-percentage": {"bucket_script": {"buckets_path": {"tShirtSales": "t-shirts>sales","totalSales": "total_sales"},"script": "params.tShirtSales / params.totalSales * 100"}}}}}
}
'
Response:
{"took": 11,"timed_out": false,"_shards": ...,"hits": ...,"aggregations": {"sales_per_month": {"buckets": [{"key_as_string": "2015/01/01 00:00:00","key": 1420070400000,"doc_count": 3,"total_sales": {"value": 550.0},"t-shirts": {"doc_count": 1,"sales": {"value": 200.0}},"t-shirt-percentage": {"value": 36.36363636363637}},{"key_as_string": "2015/02/01 00:00:00","key": 1422748800000,"doc_count": 2,"total_sales": {"value": 60.0},"t-shirts": {"doc_count": 1,"sales": {"value": 10.0}},"t-shirt-percentage": {"value": 16.666666666666664}},{"key_as_string": "2015/03/01 00:00:00","key": 1425168000000,"doc_count": 2,"total_sales": {"value": 375.0},"t-shirts": {"doc_count": 1,"sales": {"value": 175.0}},"t-shirt-percentage": {"value": 46.666666666666664}}]}}
}
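The per-bucket evaluation can be mimicked in a few lines of Python. This is only a sketch of the idea: resolve and bucket_script are hypothetical helpers, a Python lambda stands in for the Painless script, and the bucket data is copied from the response above.

```python
def resolve(bucket, path):
    """Follow a '>'-separated path through nested dicts and return the metric value."""
    node = bucket
    for name in path.split(">"):
        node = node[name]
    return node["value"]


def bucket_script(buckets, buckets_path, script):
    """Run the script once per bucket, passing the resolved path values as params."""
    results = []
    for bucket in buckets:
        params = {var: resolve(bucket, path) for var, path in buckets_path.items()}
        results.append(script(params))
    return results


monthly = [
    {"total_sales": {"value": 550.0}, "t-shirts": {"sales": {"value": 200.0}}},
    {"total_sales": {"value": 60.0}, "t-shirts": {"sales": {"value": 10.0}}},
    {"total_sales": {"value": 375.0}, "t-shirts": {"sales": {"value": 175.0}}},
]

percentages = bucket_script(
    monthly,
    {"tShirtSales": "t-shirts>sales", "totalSales": "total_sales"},
    lambda params: params["tShirtSales"] / params["totalSales"] * 100,
)
print([round(p, 2) for p in percentages])  # [36.36, 16.67, 46.67]
```

Each variable in buckets_path becomes a key in params, which is why the script in the request refers to params.tShirtSales and params.totalSales.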
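Bucket selector aggregation
A parent pipeline aggregation that executes a script to determine whether the current bucket is retained in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value. If the script language is expression, a numeric return value is permitted; in that case 0.0 is evaluated as false and all other values as true.
Note: like all pipeline aggregations, bucket_selector runs after all other sibling aggregations. Using bucket_selector to filter the buckets returned in the response therefore does not save any execution time when running the aggregations.
Syntax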
{"bucket_selector": {"buckets_path": {"my_var1": "the_sum", "my_var2": "the_value_count"},"script": "params.my_var1 > params.my_var2"}
}
The following snippet keeps only the buckets where the total sales for the month exceed 200:
curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{"size": 0,"aggs": {"sales_per_month": {"date_histogram": {"field": "date","calendar_interval": "month"},"aggs": {"total_sales": {"sum": {"field": "price"}},"sales_bucket_filter": {"bucket_selector": {"buckets_path": {"totalSales": "total_sales"},"script": "params.totalSales > 200"}}}}}
}
'
Response:
{"took": 11,"timed_out": false,"_shards": ...,"hits": ...,"aggregations": {"sales_per_month": {"buckets": [{"key_as_string": "2015/01/01 00:00:00","key": 1420070400000,"doc_count": 3,"total_sales": {"value": 550.0}},{"key_as_string": "2015/03/01 00:00:00","key": 1425168000000,"doc_count": 2,"total_sales": {"value": 375.0}}]}}
}
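The filtering step amounts to keeping the buckets for which the script returns true. The Python sketch below assumes the three monthly buckets from the earlier examples; bucket_selector here is a hypothetical helper, and a lambda stands in for the script.

```python
def bucket_selector(buckets, buckets_path, predicate):
    """Keep only the buckets for which the predicate returns True."""
    kept = []
    for bucket in buckets:
        params = {var: bucket[path]["value"] for var, path in buckets_path.items()}
        if predicate(params):
            kept.append(bucket)
    return kept


sales_per_month = [
    {"key": 1420070400000, "doc_count": 3, "total_sales": {"value": 550.0}},
    {"key": 1422748800000, "doc_count": 2, "total_sales": {"value": 60.0}},
    {"key": 1425168000000, "doc_count": 2, "total_sales": {"value": 375.0}},
]

kept = bucket_selector(
    sales_per_month,
    {"totalSales": "total_sales"},
    lambda params: params["totalSales"] > 200,
)
print([bucket["key"] for bucket in kept])
```

February, with total sales of 60, is dropped. All three buckets were computed before the filter ran, which mirrors the note above that bucket_selector does not reduce aggregation time.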
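Bucket sort aggregation
A parent pipeline aggregation that sorts the buckets of its parent multi-bucket aggregation. Zero or more sort fields may be specified. Each bucket may be sorted by its _key, its _count, or its sub-aggregations. In addition, the from and size parameters can be set to truncate the resulting buckets.
Note: like all pipeline aggregations, bucket_sort runs after all other non-pipeline aggregations. Sorting therefore applies only to the buckets already returned by the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, bucket_sort only sorts those 10 returned term buckets.
Syntax
A bucket_sort aggregation looks like this: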
{"bucket_sort": {"sort": [{ "sort_field_1": { "order": "asc" } }, { "sort_field_2": { "order": "desc" } },"sort_field_3"],"from": 1,"size": 3}
}
The following snippet returns the buckets of the 3 months with the highest total sales, in descending order:
curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{"size": 0,"aggs": {"sales_per_month": {"date_histogram": {"field": "date","calendar_interval": "month"},"aggs": {"total_sales": {"sum": {"field": "price"}},"sales_bucket_sort": {"bucket_sort": {"sort": [{ "total_sales": { "order": "desc" } } ],"size": 3 }}}}}
}
'
Response:
{"took": 82,"timed_out": false,"_shards": ...,"hits": ...,"aggregations": {"sales_per_month": {"buckets": [{"key_as_string": "2015/01/01 00:00:00","key": 1420070400000,"doc_count": 3,"total_sales": {"value": 550.0}},{"key_as_string": "2015/03/01 00:00:00","key": 1425168000000,"doc_count": 2,"total_sales": {"value": 375.0}},{"key_as_string": "2015/02/01 00:00:00","key": 1422748800000,"doc_count": 2,"total_sales": {"value": 60.0}}]}}
}
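Truncating without sorting
This aggregation can also be used to truncate the result buckets without sorting them. To do so, set the from and/or size parameters and omit sort. The following example returns only the second bucket: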
curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{"size": 0,"aggs": {"sales_per_month": {"date_histogram": {"field": "date","calendar_interval": "month"},"aggs": {"bucket_truncate": {"bucket_sort": {"from": 1,"size": 1}}}}}
}
'
Response:
{"took": 11,"timed_out": false,"_shards": ...,"hits": ...,"aggregations": {"sales_per_month": {"buckets": [{"key_as_string": "2015/02/01 00:00:00","key": 1422748800000,"doc_count": 2}]}}
}
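Both uses of bucket_sort, sorting and plain truncation, can be sketched with list sorting and slicing. The helper names below are hypothetical, the parameter is spelled from_ because from is a Python keyword, and the buckets reuse the monthly figures from the responses above.

```python
def sort_value(bucket, field):
    """Map a sort field to a comparable value from the bucket."""
    if field == "_key":
        return bucket["key"]
    if field == "_count":
        return bucket["doc_count"]
    return bucket[field]["value"]  # a metric sub-aggregation


def bucket_sort(buckets, sort=None, from_=0, size=None):
    """Sort buckets by (field, order) pairs, then keep the slice [from_, from_ + size)."""
    ordered = list(buckets)
    # Python's sort is stable, so applying the keys from last to first
    # gives the first sort field the highest priority.
    for field, order in reversed(sort or []):
        ordered.sort(key=lambda b: sort_value(b, field), reverse=(order == "desc"))
    end = None if size is None else from_ + size
    return ordered[from_:end]


sales_per_month = [
    {"key": 1420070400000, "doc_count": 3, "total_sales": {"value": 550.0}},
    {"key": 1422748800000, "doc_count": 2, "total_sales": {"value": 60.0}},
    {"key": 1425168000000, "doc_count": 2, "total_sales": {"value": 375.0}},
]

top_three = bucket_sort(sales_per_month, sort=[("total_sales", "desc")], size=3)
second_only = bucket_sort(sales_per_month, from_=1, size=1)
print([b["key"] for b in top_three], [b["key"] for b in second_only])
```

With no sort fields the buckets keep the order produced by the parent aggregation, so from 1 and size 1 select the February bucket, as in the truncation response.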
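Cumulative cardinality aggregation
A parent pipeline aggregation that calculates the cumulative cardinality in a parent histogram (or date_histogram) aggregation. The specified metric must be a cardinality aggregation, and the enclosing histogram must have min_doc_count set to 0 (the default for histogram aggregations).
The cumulative_cardinality aggregation is useful for finding "total new items", such as the number of new visitors to your website each day. A regular cardinality aggregation tells you how many unique visitors came each day, but it does not distinguish between new and returning visitors. The cumulative cardinality aggregation can be used to determine how many of each day's unique visitors are new.
Syntax
A cumulative_cardinality aggregation looks like this: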
{"cumulative_cardinality": {"buckets_path": "my_cardinality_agg"}
}
The following snippet calculates the cumulative cardinality of total daily users:
curl -X GET "localhost:9200/user_hits/_search?pretty" -H 'Content-Type: application/json' -d'
{"size": 0,"aggs": {"users_per_day": {"date_histogram": {"field": "timestamp","calendar_interval": "day"},"aggs": {"distinct_users": {"cardinality": {"field": "user_id"}},"total_new_users": {"cumulative_cardinality": {"buckets_path": "distinct_users" }}}}}
}
'
The response might look like this:
{"took": 11,"timed_out": false,"_shards": ...,"hits": ...,"aggregations": {"users_per_day": {"buckets": [{"key_as_string": "2019-01-01T00:00:00.000Z","key": 1546300800000,"doc_count": 2,"distinct_users": {"value": 2},"total_new_users": {"value": 2}},{"key_as_string": "2019-01-02T00:00:00.000Z","key": 1546387200000,"doc_count": 2,"distinct_users": {"value": 2},"total_new_users": {"value": 3}},{"key_as_string": "2019-01-03T00:00:00.000Z","key": 1546473600000,"doc_count": 3,"distinct_users": {"value": 3},"total_new_users": {"value": 4}}]}}
}
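Note that on the second day, 2019-01-02, there are two distinct users, but the total_new_users metric generated by the cumulative pipeline aggregation only increases to 3. This means that only one of the two users that day was new; the other had already been seen on the previous day. The same thing happens on the third day, where only one of the three users is completely new.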
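The counts in this response can be reproduced with plain Python sets. The user IDs below are invented so that the daily and cumulative counts match the response, and the sketch counts exactly, whereas the Elasticsearch cardinality aggregation is approximate.

```python
def cumulative_cardinality(daily_user_ids):
    """Return the running count of distinct users seen up to and including each day."""
    seen = set()
    totals = []
    for users in daily_user_ids:
        seen.update(users)  # returning users are already in the set
        totals.append(len(seen))
    return totals


# Hypothetical user IDs per day.
user_hits = [
    {"alice", "bob"},          # 2019-01-01: two new users
    {"alice", "carol"},        # 2019-01-02: carol is new
    {"bob", "carol", "dave"},  # 2019-01-03: dave is new
]

distinct_users = [len(users) for users in user_hits]
total_new_users = cumulative_cardinality(user_hits)
print(distinct_users, total_new_users)  # [2, 2, 3] [2, 3, 4]
```

The daily distinct counts alone cannot show that only one user on each of the last two days was new; the running union does.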
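Incremental cumulative cardinality
The cumulative_cardinality aggregation shows the total distinct count since the beginning of the queried time period. Sometimes, however, it is useful to see the "incremental" count, meaning how many new users are added each day rather than the cumulative total.
This can be done by adding a derivative aggregation to the query: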
curl -X GET "localhost:9200/user_hits/_search?pretty" -H 'Content-Type: application/json' -d'
{"size": 0,"aggs": {"users_per_day": {"date_histogram": {"field": "timestamp","calendar_interval": "day"},"aggs": {"distinct_users": {"cardinality": {"field": "user_id"}},"total_new_users": {"cumulative_cardinality": {"buckets_path": "distinct_users"}},"incremental_new_users": {"derivative": {"buckets_path": "total_new_users"}}}}}
}
'
The response might look like this:
{"took": 11,"timed_out": false,"_shards": ...,"hits": ...,"aggregations": {"users_per_day": {"buckets": [{"key_as_string": "2019-01-01T00:00:00.000Z","key": 1546300800000,"doc_count": 2,"distinct_users": {"value": 2},"total_new_users": {"value": 2}},{"key_as_string": "2019-01-02T00:00:00.000Z","key": 1546387200000,"doc_count": 2,"distinct_users": {"value": 2},"total_new_users": {"value": 3},"incremental_new_users": {"value": 1.0}},{"key_as_string": "2019-01-03T00:00:00.000Z","key": 1546473600000,"doc_count": 3,"distinct_users": {"value": 3},"total_new_users": {"value": 4},"incremental_new_users": {"value": 1.0}}]}}
}
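The derivative step is a difference between consecutive cumulative totals. The sketch below uses the total_new_users values from the response; the function name is hypothetical, and None marks the first bucket, which has no predecessor and therefore no incremental_new_users entry in the response.

```python
def derivative(values):
    """Return the difference between each value and the one before it."""
    if not values:
        return []
    result = [None]  # the first bucket has nothing to subtract from
    for previous, current in zip(values, values[1:]):
        result.append(float(current - previous))
    return result


total_new_users = [2, 3, 4]
incremental_new_users = derivative(total_new_users)
print(incremental_new_users)  # [None, 1.0, 1.0]
```

This is an example of chaining: derivative takes the output of another pipeline aggregation, total_new_users, as its buckets_path.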