
Data Management

Data Management Concepts

Managing Data

  • Data management needs differ depending on the type of data you are collecting:

    Static                          Time Series Data
    Data grows slowly               Data grows fast
    Updates may happen              Updates never happen
    Old data is read frequently     Old data is read infrequently

Index Aliases

Scaling Indices

  • Indices scale by adding more shards
    • increasing the number of shards of an index is expensive
  • Solution: create a new index

Using Aliases

  • Use index aliases to simplify your access to the growing number of indices

An Alias to Multiple Indices

  • Use the _aliases endpoint to create an alias
    • specify the write index using is_write_index
  • Define an alias at index creation
POST _aliases
{
    "actions": [
        {
            "add": {
                "index": "my_logs-*",
                "alias": "my_logs"
            }
        },
        {
            "add": {
                "index": "my_logs-2021-07",
                "alias": "my_logs",
                "is_write_index": true
            }
        }
    ]
}
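With the alias in place, a search against my_logs fans out across all matching indices, while an indexing request is routed to the write index (the document fields are illustrative):

GET my_logs/_search

POST my_logs/_doc
{
    "@timestamp": "2021-07-06T12:00:00Z",
    "message": "an example log line"
}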

Index Templates

What are Index Templates?

  • If you need to create multiple indices with the same settings and mappings, use an index template
    • templates match an index pattern
    • if a new index matches the pattern, then the template is applied

Elements of an Index Template

  • An index template can contain the following sections:
    • component templates
    • settings
    • mappings
    • aliases
  • Component templates are reusable building blocks that can contain:
    • settings, mappings or aliases
    • components are reused across multiple templates

Defining an Index Template

  • This logs-template:
    • overrides the default setting of 1 replica
    • for any new indices with a name that begins with logs:
PUT _index_template/logs-template
{
    "index_patterns": [ "logs*" ],
    "template": {
        "settings": {
            "number_of_replicas": 2
        }
    }
}

Applying an Index Template

  • Create an index that matches the index pattern of one of your index templates:
PUT logs1
  • Check the index settings to verify that the template was applied:
GET logs1/_settings
{
    "logs1" : {
        "settings" : {
            "index" : {
                ...
                "number_of_replicas" : 2,
                ...
            }
        }
    }
}

Component Template Example

  • A common setting across many indices may be to auto expand replica shards as more nodes become available
    • put this setting into a component template:

(figure: a component template defining the replica auto-expand setting)
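A minimal sketch of such a component template, using the auto_expand_replicas index setting (the template name is illustrative):

PUT _component_template/my-replica-settings
{
    "template": {
        "settings": {
            "index.auto_expand_replicas": "0-all"
        }
    }
}

With "0-all", Elasticsearch expands the replica count automatically as data nodes join or leave the cluster.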

  • Use the component in an index template:

(figure: an index template reusing the component template)
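The index template then references the component by name in composed_of (names here are illustrative):

PUT _index_template/my-template
{
    "index_patterns": ["my-data*"],
    "composed_of": ["my-replica-settings"]
}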

Resolving Template Match Conflicts

  • One and only one template will be applied to a newly created index
  • If more than one template defines a matching index pattern, the priority setting is used to determine which template applies
    • the highest priority is applied, others are not used
    • set a priority over 200 to override auto-created index templates
    • use the _simulate tool to test how an index would match
POST /_index_template/_simulate_index/logs2

Data Streams

Time Series Data Management

  • Time series data typically grows quickly and is almost never updated

Data Streams

  • A data stream lets you store time-series data across multiple indices, while giving you a single named resource for requests
    • indexing and search requests are sent to the data stream
    • the stream routes the request to the appropriate backing index

Backing Indices

  • Every data stream is made up of hidden backing indices
    • with a single write index
  • A rollover creates a new backing index
    • which becomes the stream’s new write index

Choosing the right Data Stream

  • Use the index.mode setting to control how your time series data will be ingested
    • Optimize the storage of your documents
    index.mode     Use case                 _source      Storage saving
    standard       for default settings     persisted    -
    time_series    for storing metrics      synthetic    up to 70%
    logsdb         for storing logs         synthetic    ~ 2.5 times

Data Stream Naming Convention

  • Data streams are named using the pattern type-dataset-namespace:
    • type: to describe the generic data type
    • dataset: to describe the specific subset of data
    • namespace: for user-specific details
  • Each data stream should include constant_keyword fields for:
    • data_stream.type
    • data_stream.dataset
    • data_stream.namespace
  • constant_keyword has the same value for all documents
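These fields can be defined once in a component template; a sketch, assuming the stream's type is logs (template name is illustrative):

PUT _component_template/my-data-stream-fields
{
    "template": {
        "mappings": {
            "properties": {
                "data_stream": {
                    "properties": {
                        "type": { "type": "constant_keyword", "value": "logs" },
                        "dataset": { "type": "constant_keyword" },
                        "namespace": { "type": "constant_keyword" }
                    }
                }
            }
        }
    }
}

If "value" is omitted, constant_keyword takes its value from the first document indexed.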

Example Use of Data Streams

  • Log data separated by app and env
  • Each data stream can have separate lifecycles
  • Different datasets can have different fields
GET logs-*-*/_search
{
    "query": {
        "bool": {
            "filter": {
                "term": {
                    "data_stream.namespace": "prod"
                }
            }
        }
    }
}

Creating a Data Stream

  • Step 1: create component templates
    • make sure you have a @timestamp field
  • Step 2: create a data stream-enabled index template
  • Step 3: create the data stream by indexing documents

Step 1

PUT _component_template/my-mappings
{
    "template": {
        "mappings": {
            "properties": {
                "@timestamp": {
                    "type": "date",
                    "format": "date_optional_time||epoch_millis"
                }
            }
        }
    }
}

Step 2

PUT _index_template/my-index-template
{
    "index_patterns": ["logs-myapp-default"],
    "data_stream": { },
    "composed_of": [ "my-mappings"],
    "priority": 500
}

Step 3

  • Use POST <stream>/_doc or PUT <stream>/_create/<doc_id>
    • if you use _bulk, you must use the create action
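With _bulk, each document is paired with a create action line; a sketch against the same stream (message content is illustrative):

POST logs-myapp-default/_bulk
{ "create": { } }
{ "@timestamp": "2099-05-06T16:21:15.000Z", "message": "first log line" }
{ "create": { } }
{ "@timestamp": "2099-05-06T16:21:16.000Z", "message": "second log line" }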

Request:

POST logs-myapp-default/_doc
{
    "@timestamp": "2099-05-06T16:21:15.000Z",
    "message": "192.0.2.42 -[06/May/2099:16:21:15] \"GET /images/bg.jp..."
}

Response:

{
    "_index": ".ds-logs-myapp-default-2024.10.22-000001",
    "_id": "XZPRtZIBS7arFsx0_FAp",
    ...
}

Rollover a Data Stream

  • The rollover API creates a new index for a data stream
    • Every new document will be indexed into the new index
    • You cannot add new documents to other backing indices

Request:

POST logs-myapp-default/_rollover

Response:

{
    ...
    "old_index": ".ds-logs-myapp-default-2024.10.22-000001",
    "new_index": ".ds-logs-myapp-default-2024.10.22-000002",
    ...
}

Changing a Data Stream

  • Changes should be made to the index template associated with the stream
    • new backing indices will get the changes when they are created
    • older backing indices can have limited changes applied
  • Changes to static mappings still require a reindex
  • Before reindexing, use the resolve API to check for conflicting names:
GET /_resolve/index/logs-myapp-new*

Reindexing a Data Stream

  • Set up a new data stream template
    • use the data stream API to create an empty data stream:
PUT /_data_stream/logs-myapp-new
  • Reindex with op_type of create:
    • you can also reindex from individual backing indices one at a time to preserve document order
POST /_reindex
{
    "source": {
        "index": "logs-myapp-default"
    },
    "dest": {
        "index": "logs-myapp-new",
        "op_type": "create"
    }
}

Index Lifecycle Management

Data Tiers

What is a data tier?

  • A data tier is a collection of nodes with the same data role
    • that typically share the same hardware profile
  • There are five types of data tiers:
    • content
    • hot
    • warm
    • cold
    • frozen

Overview of the Five Data Tiers

  • The content tier is useful for static datasets
  • Implementing a hot -> warm -> cold -> frozen architecture can be achieved using the following data tiers:
    • hot tier: has the fastest storage, for writing data and for frequent searching
    • warm tier: for read-only data that is searched less often
    • cold tier: for data that is searched sparingly
    • frozen tier: for data that is accessed rarely and never updated

Data Tiers, Nodes, and Indices

  • Every node belongs to all data tiers by default
    • change using the node.roles parameter
    • node roles are handled for you automatically on Elastic Cloud
  • Move indices to colder tiers as the data gets older
    • define an index lifecycle management policy to manage this

Configuring an Index to Prefer a Data Tier

  • Set the data tier preference of an index using the routing.allocation.include._tier_preference property
    • data_content is the default for all indices
    • data_hot is the default for all data streams
    • you can update the property at any time
    • ILM can manage this setting for you
PUT logs-2021-03
{
    "settings": {
        "index.routing.allocation.include._tier_preference" : "data_hot"
    }
}

Index Lifecycle Management

(figure: index lifecycle management overview)

ILM Actions

  • ILM consists of policies that trigger actions, such as:
    Action                Description
    rollover              create a new index based on age, size, or doc count
    shrink                reduce the number of primary shards
    force merge           optimize storage space
    searchable snapshot   saves memory on rarely used indices
    delete                permanently remove an index

ILM Policy Example

  • During the hot phase you might:
    • create a new index every two weeks
  • In the warm phase you might:
    • make the index read-only and move to warm for one week
  • In the cold phase you might:
    • convert to a fully-mounted index, decrease the number of replicas, and move to cold for three weeks
  • In the delete phase:
    • the only action allowed is delete, which removes the 28-day-old index

Define the Hot Phase

  • You want indices in the hot phase for two weeks:
PUT _ilm/policy/my-hwcd-policy
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_age": "14d"
                    }
                }
            },

Define the Warm Phase

  • You want the old index to move to the warm tier immediately and set the index as read-only:
    • data age is calculated from the time of rollover
            "warm": {
                "min_age": "0d",
                "actions": {
                    "readonly": {}
                }
            },

Define the Cold Phase

  • After one week of warm, move the index to the cold phase, and convert the index:
            "cold": {
                "min_age": "7d",
                "actions": {
                    "searchable_snapshot" : {
                        "snapshot_repository" : "my_snapshot"
                    }       
                }
            } },

Define the Delete Phase

  • Delete the data four weeks after rollover:
    • which means the documents lived for 14 days in hot
    • then 7 days in warm
    • then 21 days in cold
            "delete": {
                "min_age": "28d",
                "actions": {
                    "delete": {}
                }
            }

Applying the Policy

  • Create a component template
  • Link your ILM policy using the setting:
    • index.lifecycle.name
PUT _component_template/my-ilm-settings
{
    "template": {
        "settings": {
            "index.lifecycle.name": "my-hwcd-policy"
        }
    }
}

Create an Index Template

  • Use other components relevant to your stream
PUT _index_template/my-ilm-index-template
{
    "index_patterns": ["my-data-stream"],
    "data_stream": { },
    "composed_of": [ "my-mappings", "my-ilm-settings"],
    "priority": 500
}

Start Indexing Documents

  • ILM takes over from here
  • When a rollover happens, the number of indices is incremented
    • the new index is set as the write index of the data stream
    • old indices will automatically move to other tiers

Troubleshooting Lifecycle Rollovers

  • If an index is not healthy, it will not move to the next phase
  • The default poll interval for a cluster is 10 minutes
    • can change with indices.lifecycle.poll_interval
  • Check the server log for errors
  • Make sure you have the appropriate data tiers for migration
  • Reminder: use a template to apply a policy to new indices
  • Get detailed information about ILM status with:
GET <data-stream>/_ilm/explain

Agent and ILM

  • Agent uses ILM policies to manage rollover
  • By default, Agent policies:
    • remain in the hot phase forever
    • never delete
    • indices are rolled over after 30 days or 50GB
  • The default Agent policies can be edited with Kibana

Searchable Snapshots

Cost Effective Storage

  • As your data streams and time series data grow, your storage and memory needs increase
    • at the same time, the utility of that older data decreases
  • You could delete this older data
    • but if it remains valuable, it is preferable to keep it available
  • There is an action available called searchable snapshot

Snapshots

Disaster Recovery

  • You already know about replica shards:
    • they provide redundant copies of your documents
    • that is not the same as a backup
  • Replicas do not protect you against catastrophic failure
    • you will need to keep a complete backup of your data

Snapshot and Restore

  • Snapshot and restore allows you to create and manage backups taken from a running ES cluster
    • takes the current state and data in your cluster and saves it to a repository
  • Repos can be on a local shared file system or in the cloud
    • the ES Service performs snapshots automatically

Types of Repos

  • The backup process starts with the creation of a repository
    • different types are supported
    Shared file system       define path.repo in every node
    Read-only URL            used when multiple clusters share a repo
    AWS S3                   for AWS S3 repos
    Azure                    for Microsoft Azure Blob storage
    GCS                      for Google Cloud Storage
    repository-hdfs plugin   store snapshots in Hadoop
    Source-only repo         take minimal snapshots

Setting Up a Repo

  • Cloud deployments come with free repos preconfigured
  • Use Kibana to register a repo
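Registering a shared file system repo through the API might look like this (repo name and path are illustrative; the path must be listed under path.repo on every node):

PUT _snapshot/my_repo
{
    "type": "fs",
    "settings": {
        "location": "/mnt/backups/my_repo"
    }
}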

Taking a Snapshot Manually

  • Once the repo is configured, you can take a snapshot
    • using the _snapshot endpoint or the UI
    • snapshots are incremental, “point-in-time” copies of the data
  • Can back up only certain indices
  • Can include cluster state
PUT _snapshot/my_repo/my_logs_snapshot_1
{
    "indices": "logs-*",
    "ignore_unavailable": true
}

Automating Snapshots

  • The _snapshot endpoint can be called manually
    • every time you want to take a snapshot
    • at regular intervals using an external tool
  • Or, you can automate snapshots with Snapshot Lifecycle Management (SLM) policies
    • policies can be created in Kibana
    • or using the _slm API
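A sketch of an SLM policy that snapshots the logs indices nightly (policy name, schedule, and retention are illustrative):

PUT _slm/policy/nightly-snapshots
{
    "schedule": "0 30 1 * * ?",
    "name": "<nightly-snap-{now/d}>",
    "repository": "my_repo",
    "config": {
        "indices": "logs-*"
    },
    "retention": {
        "expire_after": "30d"
    }
}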

Restoring from a Snapshot

  • Use the _restore endpoint on the snapshot ID to restore all indices from that snapshot:
POST _snapshot/my_repo/my_logs_snapshot_1/_restore
  • Can also restore using Kibana

Searchable Snapshots

  • There is an action called searchable snapshots
  • Benefits include:
    • search old data in a very cost-effective fashion
    • reduce storage costs
    • use the same snapshot mechanism you are already using

How Searchable Snapshots Work

  • Searching a searchable snapshot index is the same as searching any other index
    • when a snapshot of an index is searched, the index must first be mounted locally as a new index
    • the shards of the index are allocated to data nodes in the cluster

Setting up Searchable Snapshots

  • In the cold or frozen phase, you configure a searchable snapshot by selecting a registered repository

Add Searchable Snapshots to ILM

  • Edit your ILM policy to add a searchable snapshot to your cold or frozen phase
    • ILM will automatically handle the index mounting
    • the hot and cold phases use fully mounted indices
    • the frozen phase uses partially mounted indices
  • If the delete phase is active, it will delete the searchable snapshot by default:
    • turn off with "delete_searchable_snapshot": false
  • If your policy applies to a data stream, the searchable snapshot will be included in searches by default
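In the policy JSON, that option is a parameter of the delete action (a sketch, reusing the timing from the earlier example):

            "delete": {
                "min_age": "28d",
                "actions": {
                    "delete": {
                        "delete_searchable_snapshot": false
                    }
                }
            }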