Aggregations

Metrics and Bucket Aggregations

Aggregations

A flexible and powerful capability for analyzing data
- summarizes your data as metrics, statistics, or other analytics
- results are typically computed values that can be grouped

Aggregation	Capability
Metric	calculate metrics, such as a sum or average, from field values
Bucket	group documents into buckets based on field values, ranges, or other criteria
Pipeline	take input from other aggregations instead of documents or fields

Basic Structure of Aggregations

Run aggregations as part of a search request
- specify using the search API’s aggs parameter

GET blogs/_search
{
    "aggs": {
        "my_agg_name": {
            "AGG_TYPE": {
                ...
} } } }

Aggregation Results

Aggregation results are in the response’s “aggregations” object

Request:

GET blogs/_search
{
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

Response:

{
    "took" : 2,
    "timed_out" : false,
    "_shards" : {...},
    "hits" : {
        ...
        "hits" : [
            ...
        ]
    },
    "aggregations" : {
        "first_blog" : {
            ...
        }
    }
}

Return Only Aggregation Results

To return only aggregation results, set size to 0
- faster responses and smaller payload

GET blogs/_search
{
    "size": 0,
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

Alternatively, pass size as a query parameter:

GET blogs/_search?size=0
{
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

Metrics Aggregations

Metrics aggregations compute numeric values from your dataset, using either:
- field values
- values generated by a custom script
Most metrics output a single value:
- count, avg, sum, min, max, median, cardinality
Some metrics output multiple values:
- stats, percentiles, percentile_ranks
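
To make the single-value versus multi-value distinction concrete, here is a minimal plain-Python sketch. It is not how Elasticsearch computes these metrics; the sample values are hypothetical, and the dictionary only mirrors the kind of fields a stats aggregation reports (count, min, max, avg, sum).

```python
# Plain-Python sketch of single-value vs multi-value metrics.
# The sample values are hypothetical, not taken from the blogs index.
values = [120, 80, 100, 100]

# Single-value metrics: each one returns a single number.
total = sum(values)
average = total / len(values)

# A multi-value metric such as stats bundles several numbers together.
stats = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "avg": average,
    "sum": total,
}
print(stats)
```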

min

Returns the minimum value among the numeric values extracted from the aggregated documents

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "first_blog": {
            "min": {
                "field": "publish_date"
            }
        }
    }
}

Response:

"aggregations" : {
    "first_blog" : {
        "value": 1265658554000,
        "value_as_string": "2010-02-08T19:49:14.000Z"
    }
}
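
For a date field, value is the number of milliseconds since the Unix epoch and value_as_string is the same instant formatted as a date. A small Python sketch of that relationship, using the value from the response above:

```python
from datetime import datetime, timezone

# "value" from the response above: milliseconds since the Unix epoch.
value_ms = 1265658554000

# Convert to a timezone-aware UTC datetime.
first_blog = datetime.fromtimestamp(value_ms / 1000, tz=timezone.utc)
print(first_blog.isoformat())
```

The printed timestamp is the same instant as value_as_string in the response.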

value_count

Counts the number of values that are extracted from the aggregated documents
- if a field has duplicates, each value will be counted individually

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "no_of_authors": {
            "value_count": {
                "field":
                    "authors.last_name.keyword"
            }
        }
    }
}

Response:

"aggregations" : {
    "no_of_authors" : {
        "value" : 4967
    }
}

cardinality

Counts the number of distinct values
The result is approximate and may not be exact for large datasets
- based on the HyperLogLog++ algorithm
- trades accuracy for speed

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "no_of_authors": {
            "cardinality": {
                "field": "authors.last_name.keyword"
            }
        }
    }
}

Response:

"aggregations" : {
    "no_of_authors" : {
        "value" : 956
    }
}
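
The difference between value_count and cardinality can be sketched in plain Python. The names below are hypothetical sample data, and the set-based count is exact, whereas the cardinality aggregation returns an approximation on large datasets.

```python
# Hypothetical last names extracted from a handful of documents.
last_names = ["Smith", "Lee", "Smith", "Garcia", "Lee", "Smith"]

# value_count: every extracted value counts, duplicates included.
value_count = len(last_names)

# cardinality: distinct values only (exact here, approximate in Elasticsearch).
cardinality = len(set(last_names))

print(value_count, cardinality)
```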

Bucket Aggregations

Group documents according to a given criterion

Bucket by	Aggregation
Time Period	Date Range
	Date Histogram
Numerics	Range
	Histogram
Keyword	Terms
	Significant Terms
IP Address	IPv4 Range

date_histogram

Bucket aggregation used with time-based data
Specify the interval in one of two ways:
- calendar_interval: calendar unit name such as day, month, or year (1d, 1M, or 1y)
- fixed_interval: fixed number of SI units such as seconds, minutes, hours, or days (s, m, h, or d)

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "blogs_by_month": {
            "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "month"
            }
        }
    }
}

Response:

"aggregations" : {
    "blogs_by_month" : {
        "buckets" : [
            {
                "key_as_string" : "2010-02-01T00...",
                "key" : 1264982400000,
                "doc_count" : 4
            },
            {
                "key_as_string" : "2010-03-01T00...",
                "key" : 1267401600000,
                "doc_count" : 1
            },
            ...
        ]
    }
}
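
As an illustration of what a monthly calendar interval does, the following plain-Python sketch rounds each date down to the start of its month (in UTC) and counts documents per key. The publish dates are hypothetical, and this is only a sketch of the bucketing idea, not Elasticsearch's implementation.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical publish dates (UTC); not the real blogs data.
publish_dates = [
    datetime(2010, 2, 8, 19, 49, tzinfo=timezone.utc),
    datetime(2010, 2, 20, 9, 0, tzinfo=timezone.utc),
    datetime(2010, 3, 3, 12, 30, tzinfo=timezone.utc),
]

def month_key(dt):
    """Round a datetime down to the start of its calendar month, as epoch millis."""
    start = dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return int(start.timestamp() * 1000)

counts = Counter(month_key(dt) for dt in publish_dates)

# Sort by key in ascending order, the default order for date_histogram.
buckets = [{"key": key, "doc_count": n} for key, n in sorted(counts.items())]
print(buckets)
```

The two keys produced here are the same epoch-millisecond keys that appear in the response above for February and March 2010.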

histogram

Bucket aggregation that builds a histogram
- on a given field
- using a specified interval
Similar to date histogram

Request:

GET sample_data_logs/_search
{
    "size": 0,
    "aggs": {
        "logs_histogram": {
            "histogram": {
                "field": "runtime_ms",
                "interval": 100
            }
        }
    }
}
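
Each value is assigned to a bucket by rounding it down to the nearest multiple of the interval. A minimal Python sketch of that rule follows; the runtime values are hypothetical, and the offset parameter defaults to 0, which matches the request above because it sets no offset.

```python
import math
from collections import Counter

def bucket_key(value, interval, offset=0):
    """Round a value down to the lower bound of its bucket."""
    return math.floor((value - offset) / interval) * interval + offset

# Hypothetical runtime_ms values.
runtimes = [12, 99, 100, 250, 299]

# Count how many values fall into each bucket of width 100.
counts = Counter(bucket_key(v, 100) for v in runtimes)
print(sorted(counts.items()))
```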

Bucket Sorting

Some aggregations enable you to specify the sorting order

Aggregation	Default Sort Order
terms	`_count` in descending order
histogram, date_histogram	`_key` in ascending order

Request:

GET blogs/_search
{
    "size": 0,
    "aggs": {
        "blogs_by_month": {
            "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "month",
                "order": {
                    "_key": "desc"
                }
            }
        }
    }
}

terms

Dynamically create a new bucket for every unique term of a specified field

Request:

GET blogs/_search
{
    "size": 0,
    "aggs": {
        "author_buckets": {
            "terms": {
                "field": "authors.job_title.keyword",
                "size": 5
            }
        }
    }
}

Response:

key is the distinct field value that defines the bucket
doc_count is the number of documents in the bucket
sum_other_doc_count is the sum of the document counts of all buckets not returned in the response

"aggregations": {
    "author_buckets": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 2316,
        "buckets": [
            {
                "key": "",
                "doc_count": 1554
            },
            {
                "key": "Software Engineer",
                "doc_count": 231
            },
            {
                "key": "Stack Team Lead",
                "doc_count": 181
            },
...
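
The relationship between the returned buckets and sum_other_doc_count can be sketched in plain Python. The job titles are hypothetical, one per document, and the sketch ignores the shard-level approximation that doc_count_error_upper_bound reports.

```python
from collections import Counter

# Hypothetical job titles, one per document; some are empty strings.
job_titles = [
    "Software Engineer", "", "Stack Team Lead", "",
    "Software Engineer", "", "Developer Advocate",
]

size = 2  # number of buckets to return, like "size" in the terms aggregation

counts = Counter(job_titles)
top = counts.most_common(size)  # highest doc_count first, the default order

buckets = [{"key": key, "doc_count": n} for key, n in top]

# Documents that belong to buckets outside the top `size`.
sum_other_doc_count = sum(counts.values()) - sum(n for _, n in top)
print(buckets, sum_other_doc_count)
```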

Combining Aggregations

Working with Aggregations

Combine aggregations
- Specify different aggregations in a single request
- Extract multiple insights from your data
Change the aggregation’s scope
- Use queries to limit the documents on which an aggregation runs
- Focus on specific or relevant data
Nest aggregations
- Create a hierarchy of aggregation levels, or sub-aggregations, by nesting bucket aggregations within bucket aggregations
- Use metric aggregations to calculate values over fields at any sub-aggregation level in the hierarchy

Reducing the Scope of an Aggregation

By default, a search matches every document, so aggregations run on all documents in the index
Add a query to limit the documents the aggregations run on

GET blogs/_search?size=0
{
    "query": {
        "match": {
            "locale":"fr-fr"
        }
    },
    "aggs": {
        "no_of_authors": {
            "cardinality": {
                "field": "authors.last_name.keyword"
            }
        }
    }
}

Run Multiple Aggregations

You can specify multiple aggregations in the same request

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "no_of_authors": {
            "cardinality": {
                "field": "authors.last_name.keyword" }
        },
        "first_name_stats": {
            "string_stats": {
                "field": "authors.first_name.keyword" }
        }
    }
}

Response:

"aggregations": {
    "no_of_authors" : {
        "value" : 956
    },
    "first_name_stats": {
        "count" : 4961,
        "min_length" : 2,
        "max_length" : 41,
        "avg_length" : 5.66539...,
        "entropy" : 4.752609555991666
    }
}

Sub-Aggregations

Embed aggregations inside other aggregations
- separate groups based on criteria
- apply metrics at various levels in the aggregation hierarchy
No depth limit for nesting sub-aggregations

Run Sub-Aggregations

Bucket aggregations support bucket or metric sub-aggregations

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "blogs_by_month": {
            "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "month" },
            "aggs": {
                "no_of_authors": {
                    "cardinality": {
                        "field": "authors.last_name.keyword" }
} } } } }

Response:

"aggregations" : {
    "blogs_by_month" : {
        "buckets" : [
            {
                "key_as_string" : "2010-02...",
                "key" : 1264982400000,
                "doc_count" : 4,
                "no_of_authors" : {"value" : 2}
            },
            {
                "key_as_string" : "2010-03...",
                "key" : 1267401600000,
                "doc_count" : 1,
                "no_of_authors" : {"value" : 2}
            },
            ...
] } }
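
A plain-Python sketch of the same idea: the parent aggregation groups documents into monthly buckets, and the sub-aggregation computes a metric inside each bucket. The documents are hypothetical, and the distinct count is exact here rather than approximate.

```python
from collections import defaultdict

# Hypothetical documents: (publish month, author last names).
docs = [
    ("2010-02", ["Smith"]),
    ("2010-02", ["Lee", "Smith"]),
    ("2010-03", ["Garcia", "Lee"]),
]

doc_counts = defaultdict(int)
authors = defaultdict(set)
for month, names in docs:
    doc_counts[month] += 1        # parent bucket: documents per month
    authors[month].update(names)  # values the sub-aggregation sees in that bucket

buckets = [
    {
        "key_as_string": month,
        "doc_count": doc_counts[month],
        "no_of_authors": {"value": len(authors[month])},
    }
    for month in sorted(doc_counts)
]
print(buckets)
```

As in the response above, a bucket can hold one document and still report two authors, because a single document may list several authors.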

Pipeline Aggregations

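Work on output produced by other aggregations instead of on documents
Examples:
- min_bucket, max_bucket, sum_bucket, avg_bucket
- cumulative_sum
- moving_fn (successor to the deprecated moving_avg)
- derivative
- bucket_sort
Use the buckets_path parameter to point a pipeline aggregation at its input aggregation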

Request:

GET blogs/_search?size=0
{
    "aggs": {
        "blogs_by_month": {
            "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "month" },
            "aggs": {
                "no_of_authors": {
                    "cardinality": {
                        "field": "authors.last_name.keyword" }},
                "diff_author_ct": {
                    "derivative": {
                        "buckets_path": "no_of_authors" }}
} } } }

Response:

"aggregations" : {
    "blogs_by_month" : {
        "buckets" : [
        ...
        {"key_as_string" : "2019-11...",
        "key" : 1572566400000,
        "doc_count" : 26,
        "no_of_authors" : {"value" : 22},
        "diff_author_ct": {"value" : -32}
        },
        {"key_as_string" : "2019-12...",
        "key" : 1575158400000,
        "doc_count" : 46,
        "no_of_authors" : {"value" : 44},
        "diff_author_ct": {"value" : 22}
        },
        ...
] } }
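
The derivative here is the difference between a bucket's value and the value of the previous bucket. A minimal Python sketch of that calculation follows. The values 22 and 44 come from the response above; 54 is inferred from the -32 difference and is not shown in the response.

```python
def derivative(values):
    """Difference between each bucket's value and the previous bucket's value."""
    if not values:
        return []
    out = [None]  # the first bucket has no predecessor, so no derivative
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

# 22 and 44 appear in the response above; 54 is inferred from the -32 difference.
no_of_authors = [54, 22, 44]
diff_author_ct = derivative(no_of_authors)
print(diff_author_ct)
```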

Transforming Data

Transform Your Data for Better Insights

Summarize existing Elasticsearch indices using aggregations to create more efficient datasets
- pivot event-centric data into entity-centric indices for improved analysis
- retrieve the latest document based on a unique key, simplifying time-series data
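
Pivoting event-centric data into entity-centric data amounts to grouping by an entity key and aggregating within each group. The plain-Python sketch below illustrates the idea only; the field names (user, bytes) and the sample events are hypothetical.

```python
from collections import defaultdict

# Hypothetical event-centric documents: one per page view.
events = [
    {"user": "alice", "bytes": 100},
    {"user": "bob", "bytes": 50},
    {"user": "alice", "bytes": 300},
]

# Pivot: group by the entity key, then aggregate within each group.
summary = defaultdict(lambda: {"views": 0, "total_bytes": 0})
for event in events:
    entity = summary[event["user"]]
    entity["views"] += 1
    entity["total_bytes"] += event["bytes"]

# One entity-centric document per user.
entity_docs = [{"user": user, **fields} for user, fields in sorted(summary.items())]
print(entity_docs)
```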

Cluster-efficient Aggregations

Elasticsearch aggregations provide powerful insights but can be resource-intensive with large datasets
- complex aggregations on large volumes of data may lead to memory issues or performance bottlenecks
Common challenges:
- need a complete feature index
- need to sort aggregation results using pipeline aggregations
- want to create summary tables to optimize query performance
Solution:
- transform your data to create more efficient and scalable summaries for faster, optimized querying

Configuring Transform Settings

Continuous Mode: transforms run continuously, processing new data as it arrives
Retention Policy: identify and manage out-of-date documents in the destination index
Checkpoints: created each time new source data is ingested and transformed
Frequency: advanced option to set the interval between checkpoints
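
The settings above map onto fields in the body of a create transform request (PUT _transform/<transform_id>). The sketch below builds such a body as a Python dictionary. All index and field names are hypothetical, and the layout reflects my understanding of the transform API, so check it against the reference for your Elasticsearch version before relying on it.

```python
import json

# Sketch of a create transform request body.
# All index and field names below are hypothetical.
transform = {
    "source": {"index": "web_logs"},
    "dest": {"index": "web_logs_by_user"},
    "pivot": {
        "group_by": {"user": {"terms": {"field": "user.keyword"}}},
        "aggregations": {
            "total_bytes": {"sum": {"field": "bytes"}},
            "last_seen": {"max": {"field": "@timestamp"}},
        },
    },
    # Continuous mode: sync on a timestamp field, allowing for ingest delay.
    "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
    # Frequency: interval between checks for changes in the source index.
    "frequency": "5m",
    # Retention policy: remove destination documents older than max_age.
    "retention_policy": {"time": {"field": "last_seen", "max_age": "30d"}},
}

body = json.dumps(transform, indent=2)
print(body)
```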

Destination Index

Pre-create the destination index with custom settings for performance
- use the Preview transform API to review generated_dest_index
- optimize index mappings and settings for efficient storage and querying
- disable _source to reduce storage usage
- use index sorting if grouping by multiple fields

Latest Transforms

Use Latest transforms to copy the most recent documents into a new index
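
Conceptually, a latest transform keeps one document per unique key: the one with the most recent timestamp. A plain-Python sketch of that selection, with hypothetical documents and host as the assumed unique key:

```python
# Hypothetical time-series documents; "host" is the unique key.
docs = [
    {"host": "web-1", "@timestamp": "2024-01-01T10:00:00Z", "status": "green"},
    {"host": "web-2", "@timestamp": "2024-01-01T10:05:00Z", "status": "yellow"},
    {"host": "web-1", "@timestamp": "2024-01-01T11:00:00Z", "status": "red"},
]

latest = {}
for doc in docs:
    current = latest.get(doc["host"])
    # ISO 8601 UTC timestamps in the same format sort chronologically as strings.
    if current is None or doc["@timestamp"] > current["@timestamp"]:
        latest[doc["host"]] = doc

print(list(latest.values()))
```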
