Aggregations

Metrics and Bucket Aggregations

Aggregations

  • A flexible and powerful capability for analyzing data
    • summarizes your data as metrics, statistics, or other analytics
    • results are typically computed values that can be grouped
Aggregation   Capability
Metric        calculate metrics, such as a sum or average, from field values
Bucket        group documents into buckets based on field values, ranges, or other criteria
Pipeline      take input from other aggregations instead of documents or fields

Basic Structure of Aggregations

  • Run aggregations as part of a search request
    • specify using the search API’s aggs parameter
GET blogs/_search
{
    "aggs": {
        "my_agg_name": {
            "AGG_TYPE": {
                ...
} } } }

Aggregation Results

  • Aggregation results are in the response’s “aggregations” object

Request:

GET blogs/_search
{
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

Response:

{
    "took" : 2,
    "timed_out" : false,
    "_shards" : {...},
    "hits" : {
        ...
        "hits" : [
            ...
        ]
    },
    "aggregations" : {
        "first_blog" : {
            ...
        }
    }
}

Return Only Aggregation Results

  • To return only aggregation results, set size to 0
    • faster responses and smaller payload
GET blogs/_search
{
    "size": 0,
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

  • size can also be passed as a URL parameter
GET blogs/_search?size=0
{
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

Metrics Aggregations

  • Metrics compute numeric values based on your dataset
    • field values
    • values generated by custom script
  • Most metrics output a single value:
    • count, avg, sum, min, max, median, cardinality
  • Some metrics output multiple values:
    • stats, percentiles, percentile_ranks

min

  • Returns the minimum value among the numeric values extracted from the aggregated documents

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

Response:

"aggregations" : {
    "first_blog" : {
        "value": 1265658554000,
        "value_as_string": "2010-02-08T19:49:14.000Z"
    }
}
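
To make the link between value and value_as_string concrete, here is a minimal Python sketch (not Elasticsearch code) that takes the minimum of a few epoch-millisecond timestamps and renders it as an ISO 8601 string. The first timestamp is the one from the response above; the others are made up for illustration.

```python
from datetime import datetime, timezone

def min_agg(values_ms):
    """Mimic the shape of a min aggregation result over epoch-millisecond dates."""
    if not values_ms:
        # Simplification for this sketch: no values means a null result
        return {"value": None}
    smallest = min(values_ms)
    as_dt = datetime.fromtimestamp(smallest // 1000, tz=timezone.utc)
    millis = smallest % 1000
    return {
        "value": smallest,
        "value_as_string": as_dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z",
    }

# Hypothetical publish_date values, in milliseconds since the epoch
publish_dates = [1265658554000, 1267401600000, 1572566400000]
print(min_agg(publish_dates))
# {'value': 1265658554000, 'value_as_string': '2010-02-08T19:49:14.000Z'}
```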

value_count

  • Counts the number of values that are extracted from the aggregated documents
    • if a field has duplicates, each value will be counted individually

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "no_of_authors": {
            "value_count": {
                "field":
                    "authors.last_name.keyword"
            }
        }
    }
}

Response:

"aggregations" : {
    "no_of_authors" : {
        "value" : 4967
    }
}

cardinality

  • Counts the number of distinct values
  • The result is approximate and may not be exact for large datasets
    • based on the HyperLogLog++ algorithm
    • trades accuracy for speed

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "no_of_authors": {
            "cardinality": {
                "field": "authors.last_name.keyword"
            }
        }
    }
}

Response:

"aggregations" : {
    "no_of_authors" : {
        "value" : 956
    }
}
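
The difference between value_count and cardinality is easiest to see side by side. The sketch below is plain Python over a handful of hypothetical documents; it counts distinct values exactly with a set, whereas Elasticsearch approximates the distinct count with HyperLogLog++.

```python
# Hypothetical documents: "last_names" stands in for authors.last_name.keyword
docs = [
    {"last_names": ["Smith", "Lee"]},
    {"last_names": ["Smith"]},
    {"last_names": []},  # a document without values contributes nothing
    {"last_names": ["Lee", "Okafor"]},
]

def value_count(docs, field):
    """Count every extracted value, duplicates included."""
    return sum(len(doc.get(field, [])) for doc in docs)

def exact_cardinality(docs, field):
    """Count distinct values exactly (a stand-in for the approximate cardinality)."""
    return len({value for doc in docs for value in doc.get(field, [])})

print(value_count(docs, "last_names"))        # 5
print(exact_cardinality(docs, "last_names"))  # 3
```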

Bucket Aggregations

  • Group documents according to a given criterion
Bucket by     Aggregation
Time Period   Date Range, Date Histogram
Numerics      Range, Histogram
Keyword       Terms, Significant Terms
IP Address    IPv4 Range

date_histogram

  • Bucket aggregation used with time-based data
  • The interval is specified in one of two ways:
    • calendar_interval: calendar unit name such as day, month, or year (1d, 1M, or 1y)
    • fixed_interval: SI unit name such as seconds, minutes, hours, or days (s, m, h, or d)

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "blogs_by_month": {
            "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "month"
            }
        }
    }
}

Response:

"aggregations" : {
    "blogs_by_month" : {
        "buckets" : [
            {
                "key_as_string" : "2010-02-01T00...",
                "key" : 1264982400000,
                "doc_count" : 4
            },
            {
                "key_as_string" : "2010-03-01T00...",
                "key" : 1267401600000,
                "doc_count" : 1
            },
            ...
        ]
    }
}
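
As a rough illustration of what a monthly calendar_interval does, the Python sketch below rounds each timestamp down to the start of its UTC month and counts documents per month. It is a simplification: it only emits months that contain documents and ignores time zones other than UTC. The sample timestamps are hypothetical apart from the first.

```python
from collections import Counter
from datetime import datetime, timezone

def month_start_ms(ts_ms):
    """Round an epoch-millisecond timestamp down to the start of its UTC month."""
    dt = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
    start = datetime(dt.year, dt.month, 1, tzinfo=timezone.utc)
    return int(start.timestamp()) * 1000

def monthly_histogram(timestamps_ms):
    """Return buckets sorted by key ascending, the default order for date_histogram."""
    counts = Counter(month_start_ms(ts) for ts in timestamps_ms)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]

# Two timestamps in February 2010 and one in March 2010
publish_dates = [1265658554000, 1266000000000, 1267500000000]
print(monthly_histogram(publish_dates))
# [{'key': 1264982400000, 'doc_count': 2}, {'key': 1267401600000, 'doc_count': 1}]
```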

histogram

  • Bucket aggregation that builds a histogram
    • on a given field
    • using a specified interval
  • Similar to date histogram

Request:

GET sample_data_logs/_search
{
    "size": 0,
    "aggs": {
        "logs_histogram": {
            "histogram": {
                "field": "runtime_ms",
                "interval": 100
            }
        }
    }
}
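
The bucket a value falls into follows from simple arithmetic: the key is the value rounded down to the nearest multiple of the interval. A small Python sketch of that rule, using hypothetical runtime_ms values and omitting empty buckets for brevity:

```python
import math
from collections import Counter

def histogram(values, interval, offset=0):
    """Bucket numeric values: key = floor((value - offset) / interval) * interval + offset."""
    counts = Counter(
        math.floor((v - offset) / interval) * interval + offset for v in values
    )
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]

runtimes_ms = [12, 99, 100, 250, 275, 299]  # hypothetical runtime_ms values
print(histogram(runtimes_ms, 100))
# [{'key': 0, 'doc_count': 2}, {'key': 100, 'doc_count': 1}, {'key': 200, 'doc_count': 3}]
```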

Bucket Sorting

  • Some aggregations enable you to specify the sorting order
Aggregation                 Default Sort Order
terms                       _count in descending order
histogram, date_histogram   _key in ascending order

Request:

GET blogs/_search
{
    "size": 0,
    "aggs": {
        "blogs_by_month": {
            "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "month",
                "order": {
                    "_key": "desc"
                }
            }
        }
    }
}

terms

  • Dynamically create a new bucket for every unique term of a specified field

Request:

GET blogs/_search
{
    "size": 0,
    "aggs": {
        "author_buckets": {
            "terms": {
                "field": "authors.job_title.keyword",
                "size": 5
            }
        }
    }
}

Response:

  • key is a distinct value of the field
  • doc_count is the number of documents in the bucket
  • sum_other_doc_count is the number of documents not in any of the top buckets
"aggregations": {
    "author_buckets": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 2316,
        "buckets": [
            {
                "key": "",
                "doc_count": 1554
            },
            {
                "key": "Software Engineer",
                "doc_count": 231
            },
            {
                "key": "Stack Team Lead",
                "doc_count": 181
            },
...
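
The relationship between size, the returned buckets, and sum_other_doc_count can be sketched in a few lines of Python. This is a single-node simplification with one value per document and hypothetical job titles; it does not model the per-shard estimation behind doc_count_error_upper_bound.

```python
from collections import Counter

def terms_agg(values, size):
    """Return the top `size` terms by doc_count plus the count left over."""
    counts = Counter(values)
    # Order by doc_count descending; ties are broken by key ascending in this sketch
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "sum_other_doc_count": sum(count for _, count in ranked[size:]),
        "buckets": [{"key": key, "doc_count": count} for key, count in ranked[:size]],
    }

job_titles = (
    ["Software Engineer"] * 3
    + ["Stack Team Lead"] * 2
    + ["Developer Advocate", "Product Manager"]
)
print(terms_agg(job_titles, 2))
# sum_other_doc_count is 2: the two documents outside the top two buckets
```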

Combining Aggregations

Working with Aggregations

  • Combine aggregations
    • Specify different aggregations in a single request
    • Extract multiple insights from your data
  • Change the aggregation’s scope
    • Use queries to limit the documents on which an aggregation runs
    • Focus on specific or relevant data
  • Nest aggregations
    • Create a hierarchy of aggregation levels, or sub-aggregations, by nesting bucket aggregations within bucket aggregations
    • Use metric aggregations to calculate values over fields at any sub-aggregation level in the hierarchy

Reducing the Scope of an Aggregation

  • By default, aggregations are performed on all documents in the index
  • Combine with a query to reduce the scope
GET blogs/_search?size=0
{
    "query": {
        "match": {
            "locale":"fr-fr"
        }
    },
    "aggs": {
        "no_of_authors": {
            "cardinality": {
                "field": "authors.last_name.keyword"
            }
        }
    }
}
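
Conceptually, the query runs first and the aggregation only sees the documents that matched. The Python sketch below shows that two-step flow on hypothetical documents; it uses exact string equality where a real match query would analyze the text.

```python
# Hypothetical blog documents
docs = [
    {"locale": "fr-fr", "last_names": ["Dupont", "Martin"]},
    {"locale": "en-en", "last_names": ["Smith"]},
    {"locale": "fr-fr", "last_names": ["Martin"]},
]

def scoped_cardinality(docs, field, locale=None):
    """Count distinct values of `field`, optionally restricted to one locale."""
    # Step 1, the "query": keep only matching documents (all of them if no locale is given)
    matching = [d for d in docs if locale is None or d["locale"] == locale]
    # Step 2, the "aggs": aggregate over the matching documents only
    return len({value for d in matching for value in d[field]})

print(scoped_cardinality(docs, "last_names"))           # 3
print(scoped_cardinality(docs, "last_names", "fr-fr"))  # 2
```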

Run Multiple Aggregations

  • You can specify multiple aggregations in the same request

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "no_of_authors": {
            "cardinality": {
                "field": "authors.last_name.keyword" }
        },
        "first_name_stats": {
            "string_stats": {
                "field": "authors.first_name.keyword" }
        }
    }
}

Response:

"aggregations": {
    "no_of_authors" : {
        "value" : 956
    },
    "first_name_stats": {
        "count" : 4961,
        "min_length" : 2,
        "max_length" : 41,
        "avg_length" : 5.66539...,
        "entropy" : 4.752609555991666
    }
}

Sub-Aggregations

  • Embed aggregations inside other aggregations
    • separate groups based on criteria
    • apply metrics at various levels in the aggregation hierarchy
  • No depth limit for nesting sub-aggregations

Run Sub-Aggregations

  • Bucket aggregations support bucket or metric sub-aggregations

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "blogs_by_month": {
            "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "month" },
            "aggs": {
                "no_of_authors": {
                    "cardinality": {
                        "field": "authors.last_name.keyword" }
} } } } }

Response:

"aggregations" : {
    "blogs_by_month" : {
        "buckets" : [
            {
                "key_as_string" : "2010-02...",
                "key" : 1264982400000,
                "doc_count" : 4,
                "no_of_authors" : {"value" : 2}
            },
            {
                "key_as_string" : "2010-03...",
                "key" : 1267401600000,
                "doc_count" : 1,
                "no_of_authors" : {"value" : 2}
            },
            ...
] } }
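
A sub-aggregation runs once per parent bucket. The following Python sketch imitates the request above on hypothetical blog documents: it buckets by month, then counts distinct authors inside each bucket (exactly, unlike the approximate cardinality aggregation).

```python
from collections import defaultdict

# Hypothetical blogs: four in February 2010, one in March 2010
blogs = [
    {"publish_date": "2010-02-08", "authors": ["Lee"]},
    {"publish_date": "2010-02-12", "authors": ["Lee"]},
    {"publish_date": "2010-02-16", "authors": ["Smith"]},
    {"publish_date": "2010-02-20", "authors": ["Lee", "Smith"]},
    {"publish_date": "2010-03-02", "authors": ["Lee", "Okafor"]},
]

def authors_per_month(blogs):
    """Bucket blogs by month, then compute a distinct-author metric per bucket."""
    buckets = defaultdict(lambda: {"doc_count": 0, "authors": set()})
    for blog in blogs:
        month = blog["publish_date"][:7]  # "YYYY-MM" prefix of the ISO date
        buckets[month]["doc_count"] += 1
        buckets[month]["authors"].update(blog["authors"])
    return [
        {
            "key_as_string": month,
            "doc_count": bucket["doc_count"],
            "no_of_authors": {"value": len(bucket["authors"])},
        }
        for month, bucket in sorted(buckets.items())
    ]

for bucket in authors_per_month(blogs):
    print(bucket)
```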

Pipeline Aggregations

  • Work on output produced from other aggregations
  • Examples:
    • min_bucket, max_bucket, sum_bucket, avg_bucket
    • cumulative_sum
    • moving_fn
    • bucket_sort
  • Reference the input aggregation with the buckets_path parameter

Request:

"aggs": {
    "blogs_by_month": {
        "date_histogram": {
            "field": "publish_date",
            "calendar_interval": "month" },
        "aggs": {
            "no_of_authors": {
                "cardinality": {
                    "field":"authors.last_name.keyword" }},
            "diff_author_ct": {
                "derivative": {
                    "buckets_path": "no_of_authors" }}
} } }

Response:

"aggregations" : {
    "blogs_by_month" : {
        "buckets" : [
        ...
        {"key_as_string" : "2019-11...",
        "key" : 1572566400000,
        "doc_count" : 26,
        "no_of_authors" : {"value" : 22},
        "diff_author_ct": {"value" : -32}
        },
        {"key_as_string" : "2019-12...",
        "key" : 1575158400000,
        "doc_count" : 46,
        "no_of_authors" : {"value" : 44},
        "diff_author_ct": {"value" : 22}
        },
        ...
] } }
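
The derivative here is just the difference between each bucket's metric and the previous bucket's. A minimal Python sketch of that calculation follows; the November and December values come from the response above, and the October value of 54 is inferred from the -32 derivative, so treat it as an assumption.

```python
def derivative(buckets, buckets_path, name):
    """Add `name` to every bucket except the first: current value minus previous value."""
    previous = None
    for bucket in buckets:
        current = bucket[buckets_path]["value"]
        if previous is not None:
            bucket[name] = {"value": current - previous}
        previous = current
    return buckets

buckets = [
    {"key_as_string": "2019-10", "no_of_authors": {"value": 54}},  # inferred value
    {"key_as_string": "2019-11", "no_of_authors": {"value": 22}},
    {"key_as_string": "2019-12", "no_of_authors": {"value": 44}},
]

for bucket in derivative(buckets, "no_of_authors", "diff_author_ct"):
    print(bucket["key_as_string"], bucket.get("diff_author_ct"))
# 2019-10 None
# 2019-11 {'value': -32}
# 2019-12 {'value': 22}
```

The first bucket has no derivative because there is no earlier bucket to compare against.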

Transforming Data

Transform Your Data for Better Insights

  • Summarize existing Elasticsearch indices using aggregations to create more efficient datasets
    • pivot event-centric data into entity-centric indices for improved analysis
    • retrieve the latest document based on a unique key, simplifying time-series data
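
To illustrate what pivoting event-centric data into an entity-centric shape means, here is a small Python sketch. The event fields (client_ip, bytes) and the summary field names are hypothetical; a real pivot transform would express the same grouping and metrics as aggregations and write the result to a destination index.

```python
from collections import defaultdict

# Hypothetical event-centric documents: one per request
events = [
    {"client_ip": "10.0.0.1", "bytes": 500},
    {"client_ip": "10.0.0.2", "bytes": 1200},
    {"client_ip": "10.0.0.1", "bytes": 300},
]

def pivot_by(events, group_field):
    """Produce one summary document per entity (here: per client IP)."""
    summary = defaultdict(lambda: {"event_count": 0, "total_bytes": 0})
    for event in events:
        entity = summary[event[group_field]]
        entity["event_count"] += 1
        entity["total_bytes"] += event["bytes"]
    return [{group_field: key, **stats} for key, stats in sorted(summary.items())]

for doc in pivot_by(events, "client_ip"):
    print(doc)
# {'client_ip': '10.0.0.1', 'event_count': 2, 'total_bytes': 800}
# {'client_ip': '10.0.0.2', 'event_count': 1, 'total_bytes': 1200}
```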

Cluster-efficient Aggregations

  • Elasticsearch aggregations provide powerful insights but can be resource-intensive on large datasets
    • complex aggregations on large volumes of data may lead to memory issues or performance bottlenecks
  • Common challenges:
    • need for a complete feature index
    • need to sort aggregation results using pipeline aggregations
    • want to create summary tables to optimize query performance
  • Solution:
    • transform your data to create more efficient and scalable summaries for faster, optimized querying

Configuring Transform Settings

  • Continuous Mode: transforms run continuously, processing new data as it arrives
  • Retention Policy: identify and manage out-of-date documents in the destination index
  • Checkpoints: created each time new source data is ingested and transformed
  • Frequency: advanced option to set the interval between checkpoints

Destination Index

  • Pre-create the destination index with custom settings for performance
    • use the Preview transform API to review the generated_dest_index object in its response
    • optimize index mappings and settings for efficient storage and querying
    • disable _source to reduce storage usage
    • use index sorting if grouping by multiple fields

Latest Transforms

  • Use Latest transforms to copy the most recent documents into a new index
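
The idea behind a latest transform can be sketched in plain Python: for each unique key, keep only the document with the most recent timestamp. The field names and documents below are hypothetical, and the sketch assumes timestamps share one ISO 8601 format so that string comparison matches chronological order.

```python
def latest_by(docs, unique_key, sort_field):
    """Keep the newest document per unique key, judged by sort_field."""
    newest = {}
    for doc in docs:
        key = doc[unique_key]
        if key not in newest or doc[sort_field] > newest[key][sort_field]:
            newest[key] = doc
    return list(newest.values())

# Hypothetical time-series documents
docs = [
    {"host": "web-1", "timestamp": "2024-01-01T10:00:00Z", "status": "green"},
    {"host": "web-2", "timestamp": "2024-01-01T10:05:00Z", "status": "yellow"},
    {"host": "web-1", "timestamp": "2024-01-01T10:10:00Z", "status": "red"},
]

for doc in latest_by(docs, "host", "timestamp"):
    print(doc["host"], doc["status"])
# web-1 red
# web-2 yellow
```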