Aggregations
Metrics and Bucket Aggregations
Aggregations
- A flexible and powerful capability for analyzing data
- summarizes your data as metrics, statistics, or other analytics
- results are typically computed values that can be grouped
| Aggregation | Capability |
|---|---|
| Metric | calculate metrics, such as a sum or average, from field values |
| Bucket | group documents into buckets based on field values, ranges, or other criteria |
| Pipeline | take input from other aggregations instead of documents or fields |
Basic Structure of Aggregations
- Run aggregations as part of a search request
- specify using the search API’s aggs parameter
GET blogs/_search
{
"aggs": {
"my_agg_name": {
"AGG_TYPE": {
...
} } } }
Aggregation Results
- Aggregation results are in the response’s “aggregations” object
Request:
GET blogs/_search
{
"aggs": {
"first_blog": {
"min": {
"field": "publish_date"
}
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {...},
"hits" : {
...
"hits" : [
...
]
},
"aggregations" : {
"first_blog" : {
...
}
}
}
Return Only Aggregation Results
- To return only aggregation results, set size to 0
- faster responses and smaller payload
GET blogs/_search
{
"size": 0,
"aggs": {
"first_blog": {
"min": {
"field": "publish_date"
}
}
}
}
GET blogs/_search?size=0
{
"aggs": {
"first_blog": {
"min": {
"field": "publish_date"
}
}
}
}
Metrics Aggregations
- Metrics compute numeric values based on your dataset
- field values
- values generated by custom script
- Most metrics output a single value:
- count, avg, sum, min, max, median, cardinality
- Some metrics output multiple values:
- stats, percentiles, percentile_ranks
min
- Returns the minimum value among numeric values extracted from the aggregated documents
Request:
GET blogs/_search?size=0
{
"aggs": {
"first_blog": {
"min": {
"field": "publish_date"
}
}
}
}
Response:
"aggregations" : {
"first_blog" : {
"value": 1265658554000,
"value_as_string": "2010-02-08T19:49:14.000Z"
}
}
value_count
- Counts the number of values that are extracted from the aggregated documents
- if a field has duplicates, each value will be counted individually
Request:
GET blogs/_search?size=0
{
"aggs": {
"no_of_authors": {
"value_count": {
"field":
"authors.last_name.keyword"
}
}
}
}
Response:
"aggregations" : {
"no_of_authors" : {
"value" : 4967
}
}
cardinality
- Counts the number of distinct values
- The result is approximate and may not be exact for large datasets
- based on HyperLogLog++ algorithm
- trades accuracy for speed
Request:
GET blogs/_search?size=0
{
"aggs": {
"no_of_authors": {
"cardinality": {
"field": "authors.last_name.keyword"
}
}
}
}
Response:
"aggregations" : {
"no_of_authors" : {
"value" : 956
}
}
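The difference between value_count and cardinality can be sketched in plain Python. This is only an illustration: the sample documents are invented, and an exact `set`-based count stands in for the approximate HyperLogLog++ estimate that Elasticsearch computes.

```python
# Hypothetical sample documents; each may list several authors.
docs = [
    {"authors": [{"last_name": "Smith"}, {"last_name": "Lee"}]},
    {"authors": [{"last_name": "Smith"}]},
    {"authors": [{"last_name": "Garcia"}, {"last_name": "Lee"}]},
]

def extract_values(documents):
    """Collect every last_name value, one entry per occurrence."""
    return [a["last_name"] for d in documents for a in d["authors"]]

def value_count(documents):
    # Like value_count: every extracted value counts, duplicates included.
    return len(extract_values(documents))

def cardinality(documents):
    # Like cardinality, but exact: a set keeps each distinct value once.
    return len(set(extract_values(documents)))

print(value_count(docs))  # 5
print(cardinality(docs))  # 3
```

The same gap shows up in the responses above: 4967 values but only 956 distinct last names.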
Bucket Aggregations
- Group documents according to a given criterion
| Bucket by | Aggregation |
|---|---|
| Time Period | Date Range, Date Histogram |
| Numerics | Range, Histogram |
| Keyword | Terms, Significant Terms |
| IP Address | IPv4 Range |
date_histogram
- Bucket aggregation used with time-based data
- Specify the interval in one of two ways:
- calendar_interval: calendar unit name such as day, month, or year (1d, 1M, or 1y)
- fixed_interval: SI unit name such as seconds, minutes, hours, or days (s, m, h, or d)
Request:
GET blogs/_search?size=0
{
"aggs": {
"blogs_by_month": {
"date_histogram": {
"field": "publish_date",
"calendar_interval": "month"
}
}
}
}
Response:
"aggregations" : {
"blogs_by_month" : {
"buckets" : [
{
"key_as_string" : "2010-02-01T00...",
"key" : 1264982400000,
"doc_count" : 4
},
{
"key_as_string" : "2010-03-01T00...",
"key" : 1267401600000,
"doc_count" : 1
},
...
]
}
}
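How a document lands in a monthly bucket can be sketched in plain Python, assuming UTC and no offset. The function name is hypothetical; the timestamp is the `first_blog` value from the min example, and the expected key is the first bucket key in the response above.

```python
from datetime import datetime, timezone

def month_bucket_key(epoch_millis):
    """Return the epoch-millis key of the UTC calendar month containing the timestamp."""
    ts = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    # Truncate to the first instant of the month.
    start = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return int(start.timestamp() * 1000)

# 2010-02-08T19:49:14Z falls into the bucket keyed 2010-02-01T00:00:00Z.
print(month_bucket_key(1265658554000))  # 1264982400000
```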
histogram
- Bucket aggregation that builds a histogram
- on a given field
- using a specified interval
- Similar to date histogram
Request:
GET sample_data_logs/_search
{
"size": 0,
"aggs": {
"logs_histogram": {
"histogram": {
"field": "runtime_ms",
"interval": 100
}
}
}
}
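The bucketing rule behind a histogram with interval 100 can be sketched in plain Python: each value is rounded down to the nearest multiple of the interval, assuming an offset of 0. The helper names and the sample `runtime_ms` values are invented for illustration.

```python
import math
from collections import Counter

def bucket_key(value, interval):
    """Round a value down to the start of its fixed-width bucket."""
    return math.floor(value / interval) * interval

def histogram(values, interval):
    """Return {bucket_key: doc_count}, sorted by key in ascending order."""
    counts = Counter(bucket_key(v, interval) for v in values)
    return dict(sorted(counts.items()))

runtimes_ms = [12, 95, 100, 180, 250, 299]  # hypothetical runtime_ms values
print(histogram(runtimes_ms, 100))  # {0: 2, 100: 2, 200: 2}
```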
Bucket Sorting
- Some aggregations enable you to specify the sorting order
| Aggregation | Default Sort Order |
|---|---|
| terms | _count in descending order |
| histogram, date_histogram | _key in ascending order |
Request:
GET blogs/_search
{
"size": 0,
"aggs": {
"blogs_by_month": {
"date_histogram": {
"field": "publish_date",
"calendar_interval": "month",
"order": {
"_key": "desc"
}
}
}
}
}
terms
- Dynamically create a new bucket for every unique term of a specified field
Request:
GET blogs/_search
{
"size": 0,
"aggs": {
"author_buckets": {
"terms": {
"field": "authors.job_title.keyword",
"size": 5
}
}
}
}
Response:
- key represents the distinct value of the field
- doc_count is the number of documents in the bucket
- sum_other_doc_count is the number of documents not in any of the top buckets
"aggregations": {
"author_buckets": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 2316,
"buckets": [
{
"key": "",
"doc_count": 1554
},
{
"key": "Software Engineer",
"doc_count": 231
},
{
"key": "Stack Team Lead",
"doc_count": 181
},
...
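The relationship between the returned buckets and sum_other_doc_count can be sketched in plain Python. This is a single-process illustration with invented job titles, one per document; it ignores the per-shard approximation that doc_count_error_upper_bound reports in a real cluster.

```python
from collections import Counter

def terms_agg(values, size):
    """Return the top `size` buckets plus the count of everything left out."""
    counts = Counter(values)
    top = counts.most_common(size)  # sorted by count, descending
    buckets = [{"key": k, "doc_count": c} for k, c in top]
    other = sum(counts.values()) - sum(c for _, c in top)
    return {"buckets": buckets, "sum_other_doc_count": other}

# Hypothetical job titles, one per document.
titles = ["Engineer"] * 4 + ["Writer"] * 3 + ["Designer"] * 2 + ["Intern"]
result = terms_agg(titles, size=2)
print(result["buckets"])
# [{'key': 'Engineer', 'doc_count': 4}, {'key': 'Writer', 'doc_count': 3}]
print(result["sum_other_doc_count"])  # 3
```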
Combining Aggregations
Working with Aggregations
- Combine aggregations
- Specify different aggregations in a single request
- Extract multiple insights from your data
- Change the aggregation’s scope
- Use queries to limit the documents on which an aggregation runs
- Focus on specific or relevant data
- Nest aggregations
- Create a hierarchy of aggregation levels, or sub-aggregations, by nesting bucket aggregations within bucket aggregations
- Use metric aggregations to calculate values over fields at any sub-aggregation level in the hierarchy
Reducing the Scope of an Aggregation
- By default, aggregations are performed on all documents in the index
- Combine with a query to reduce the scope
GET blogs/_search?size=0
{
"query": {
"match": {
"locale":"fr-fr"
}
},
"aggs": {
"no_of_authors": {
"cardinality": {
"field": "authors.last_name.keyword"
}
}
}
}
Run Multiple Aggregations
- You can specify multiple aggregations in the same request
Request:
GET blogs/_search?size=0
{
"aggs": {
"no_of_authors": {
"cardinality": {
"field": "authors.last_name.keyword" }
},
"first_name_stats": {
"string_stats": {
"field": "authors.first_name.keyword" }
}
}
}
Response:
"aggregations": {
"no_of_authors" : {
"value" : 956
},
"first_name_stats": {
"count" : 4961,
"min_length" : 2,
"max_length" : 41,
"avg_length" : 5.66539...,
"entropy" : 4.752609555991666
}
}
Sub-Aggregations
- Embed aggregations inside other aggregations
- separate groups based on criteria
- apply metrics at various levels in the aggregation hierarchy
- No hard limit on the depth of nested sub-aggregations
Run Sub-Aggregations
- Bucket aggregations support bucket or metric sub-aggregations
Request:
GET blogs/_search?size=0
{
"aggs": {
"blogs_by_month": {
"date_histogram": {
"field": "publish_date",
"calendar_interval": "month" },
"aggs": {
"no_of_authors": {
"cardinality": {
"field":
"authors.last_name.keyword" }
} } } } }
Response:
"aggregations" : {
"blogs_by_month" : {
"buckets" : [
{
"key_as_string" : "2010-02...",
"key" : 1264982400000,
"doc_count" : 4,
"no_of_authors" : {"value" : 2}
},
{
"key_as_string" : "2010-03...",
"key" : 1267401600000,
"doc_count" : 1,
"no_of_authors" : {"value" : 2}
},
...
] } }
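The shape of this result can be reproduced in a small plain-Python sketch: documents are first bucketed by month, then a metric is computed inside each bucket. The sample data is invented, and the exact distinct count stands in for the approximate cardinality metric.

```python
from collections import defaultdict

# Hypothetical documents: (publish month, author last name).
docs = [
    ("2010-02", "Smith"),
    ("2010-02", "Lee"),
    ("2010-02", "Smith"),
    ("2010-03", "Garcia"),
]

def authors_per_month(documents):
    """Bucket by month, then compute doc_count and distinct authors per bucket."""
    groups = defaultdict(list)
    for month, author in documents:
        groups[month].append(author)
    return [
        {"key": month, "doc_count": len(names), "no_of_authors": len(set(names))}
        for month, names in sorted(groups.items())
    ]

for bucket in authors_per_month(docs):
    print(bucket)
# {'key': '2010-02', 'doc_count': 3, 'no_of_authors': 2}
# {'key': '2010-03', 'doc_count': 1, 'no_of_authors': 1}
```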
Pipeline Aggregations
- Work on output produced from other aggregations
- Examples:
- bucket min/max/sum/avg
- cumulative_sum
- moving_fn (moving_avg in older versions)
- bucket_sort
- Use the buckets_path parameter to point a pipeline aggregation at the aggregation whose output it consumes
Request:
GET blogs/_search?size=0
{
"aggs": {
"blogs_by_month": {
"date_histogram": {
"field": "publish_date",
"calendar_interval": "month" },
"aggs": {
"no_of_authors": {
"cardinality": {
"field":"authors.last_name.keyword" }},
"diff_author_ct": {
"derivative": {
"buckets_path": "no_of_authors" }}
} } } }
Response:
"aggregations" : {
"blogs_by_month" : {
"buckets" : [
...
{"key_as_string" : "2019-11...",
"key" : 1572566400000,
"doc_count" : 26,
"no_of_authors" : {"value" : 22},
"diff_author_ct": {"value" : -32}
},
{"key_as_string" : "2019-12...",
"key" : 1575158400000,
"doc_count" : 46,
"no_of_authors" : {"value" : 44},
"diff_author_ct": {"value" : 22}
},
...
] } }
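What the derivative computes here can be sketched in plain Python: each bucket's value minus the previous bucket's value. The function is illustrative only; 22 and 44 are the no_of_authors values from the response above, and 54 is the preceding month's value implied by the reported difference of -32.

```python
def derivative(values):
    """Return each value minus the previous one; the first bucket has no predecessor."""
    if not values:
        return []
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]

# no_of_authors per month: 54 is implied, 22 and 44 appear in the response.
no_of_authors = [54, 22, 44]
print(derivative(no_of_authors))  # [None, -32, 22]
```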
Transforming Data
Transform Your Data for Better Insights
- Summarize existing Elasticsearch indices using aggregations to create more efficient datasets
- pivot event-centric data into entity-centric indices for improved analysis
- retrieve the latest document based on a unique key, simplifying time-series data
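As a rough illustration of pivoting event-centric data into an entity-centric summary, the plain-Python sketch below groups events by an entity key and aggregates each group into one document. The field names (`customer_id`, `amount`) and the sample events are hypothetical.

```python
from collections import defaultdict

# Hypothetical event-centric documents: one per order.
events = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 30.0},
]

def pivot_by_customer(event_docs):
    """Group events by customer_id and summarize each group into one document."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for event in event_docs:
        summary = totals[event["customer_id"]]
        summary["order_count"] += 1
        summary["total_amount"] += event["amount"]
    return {cid: dict(s) for cid, s in sorted(totals.items())}

print(pivot_by_customer(events))
# {'c1': {'order_count': 2, 'total_amount': 50.0}, 'c2': {'order_count': 1, 'total_amount': 5.0}}
```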
Cluster-efficient Aggregations
- Elasticsearch aggregations provide powerful insights but can be resource-intensive with large datasets
- complex aggregations on large volumes of data may lead to memory issues or performance bottlenecks
- Common challenges:
- need for a complete feature index
- need to sort aggregation results using pipeline aggregations
- want to create summary tables to optimize query performance
- Solution:
- transform your data to create more efficient and scalable summaries for faster, optimized querying
Configuring Transform Settings
- Continuous Mode: transforms run continuously, processing new data as it arrives
- Retention Policy: identify and manage out-of-date documents in the destination index
- Checkpoints: created each time new source data is ingested and transformed
- Frequency: advanced option to set the interval between checkpoints
Destination Index
- Pre-create the destination index with custom settings for performance
- use the Preview transform API to review generated_dest_index
- optimize index mappings and settings for efficient storage and querying
- disable _source to reduce storage usage
- use index sorting if grouping by multiple fields
Latest Transforms
- Use Latest transforms to copy the most recent documents into a new index
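The idea behind a latest transform can be sketched in plain Python: for each unique key, keep only the document with the most recent value of a sort field. The function, field names, and sample documents are hypothetical and do not reflect how Elasticsearch implements transforms.

```python
# Hypothetical documents: several readings per host_id over time.
docs = [
    {"host_id": "a", "timestamp": 100, "status": "up"},
    {"host_id": "b", "timestamp": 110, "status": "up"},
    {"host_id": "a", "timestamp": 120, "status": "down"},
]

def latest_by_key(documents, unique_key, sort_field):
    """Keep, for each unique key, the document with the largest sort_field."""
    latest = {}
    for doc in documents:
        key = doc[unique_key]
        if key not in latest or doc[sort_field] > latest[key][sort_field]:
            latest[key] = doc
    return latest

result = latest_by_key(docs, "host_id", "timestamp")
print(result["a"]["status"])  # down
print(result["b"]["status"])  # up
```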