Data Modelling
Strings
Modelling
Analysis Makes Text Searchable
- By default, text analysis breaks a text string into individual words (tokens) and lowercases those words
Analyzers
- Text analysis is done by an analyzer
- By default, Elasticsearch applies the standard analyzer
- There are many other built-in analyzers, including:
- whitespace, stop, pattern, simple, language-specific analyzers, and more
- The built-in analyzers work great for many use cases
- you can also define your own custom analyzers
Anatomy of an Analyzer
- An analyzer consists of:
- zero or more character filters
- exactly one tokenizer
- zero or more token filters
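The three stages above can be sketched as a small pipeline. This is only a toy model in plain Python, not how Elasticsearch implements analysis; the function names (`strip_html`, `word_tokenizer`, `make_stop_filter`, `analyze`) and the stop word set are made up for illustration.

```python
import re
from typing import Callable, Iterable, List


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[str]:
    """Tokenizer: a rough stand-in that emits runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def make_stop_filter(stop_words: Iterable[str]) -> Callable[[List[str]], List[str]]:
    """Build a token filter that drops the given stop words."""
    stops = set(stop_words)
    return lambda tokens: [t for t in tokens if t not in stops]


def analyze(text, char_filters, tokenizer, token_filters) -> List[str]:
    """Run zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>Tuning Go Apps in a Beat</p>",
    char_filters=[strip_html],
    tokenizer=word_tokenizer,
    token_filters=[lowercase, make_stop_filter({"in", "a"})],
)
print(tokens)  # ['tuning', 'go', 'apps', 'beat']
```

Note that this sketch does no stemming, so it keeps `tuning` and `apps` where a language analyzer would reduce them further.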
Standard Analyzer
- The default analyzer
- No character filters
- Uses the standard tokenizer
- Lowercases all tokens
- Optionally removes stop words
Testing an Analyzer
- Use the _analyze API to test what an analyzer will do to text
Request:
GET _analyze
{
"analyzer": "english",
"text": "Tuning Go Apps in a Beat"
}
flowchart LR
A["Tuning Go Apps in a Beat"]
B[<b>english</b><br>analyzer]
C[tune<br>go<br>app<br>beat]
A --> B --> C
Text and Keyword
Keyword vs. Text
- Elasticsearch has two kinds of string data types:
- text, for full-text search:
- text fields are analyzed
- keyword, for aggregations, sorting, and exact searches:
- keyword fields are not analyzed
- the original strings, as they occur in the documents
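A toy model may help show why the two types behave differently in searches. The sketch below is an assumption-laden simplification in plain Python: `ToyIndex` and its methods are hypothetical names, and the analysis step is just "split on non-word characters and lowercase".

```python
import re
from collections import defaultdict


def analyze(text):
    """Very rough stand-in for analysis: split into words and lowercase them."""
    return [t.lower() for t in re.findall(r"\w+", text)]


class ToyIndex:
    """Indexes one string value per document, both as text and as keyword."""

    def __init__(self):
        self.text_terms = defaultdict(set)     # analyzed token -> doc ids
        self.keyword_terms = defaultdict(set)  # original string -> doc ids

    def index(self, doc_id, value):
        for token in analyze(value):
            self.text_terms[token].add(doc_id)
        self.keyword_terms[value].add(doc_id)  # stored exactly as it occurs

    def full_text_search(self, query):
        """Documents containing any of the analyzed query tokens."""
        hits = set()
        for token in analyze(query):
            hits |= self.text_terms.get(token, set())
        return hits

    def exact_search(self, value):
        """Documents whose original string equals the value exactly."""
        return set(self.keyword_terms.get(value, set()))


idx = ToyIndex()
idx.index(1, "United States")
idx.index(2, "United Kingdom")

print(sorted(idx.full_text_search("united")))    # [1, 2]
print(sorted(idx.exact_search("united")))        # []
print(sorted(idx.exact_search("United States"))) # [1]
```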
Mapping
- A mapping is a per-index schema definition that contains:
- names of fields
- data types of fields
- how the field should be indexed and stored
- Elasticsearch will happily index any document without knowing its details
- however, behind the scenes, Elasticsearch assigns data types to your fields in a mapping
Data Types for Fields
- Simple Types:
- text: for full-text strings
- keyword: for exact value strings and aggregations
- date and date_nanos: strings formatted as dates, or numeric dates
- numbers: byte, short, integer, long, float, double, half_float
- boolean
- geo types
- Hierarchical types: object, nested
Defining a Mapping
- In many cases, you will need to define your own mapping
- Defined in the mappings section of an index
PUT my_index
{
"mappings": {
define mappings here
}
}
PUT my_index/_mapping
{
additional mappings here
}
When not Defining a Mapping
- When you index a document with unmapped fields, Elasticsearch dynamically creates the mapping for those fields
- fields not already defined in a mapping are added
POST my_blogs/_doc
{
"username": "kimchy",
"comment": "Search is something that any application should have",
"details": {
"created_at": "2024-08-23T15:48:50",
"version": 8.15,
"employee": true
}
}
… turns into:
"my_blogs" : {
"mappings" : {
"properties" : {
...
"details" : {
"properties" : {
"created_at" : {
"type" : "date"
},
"employee" : {
"type" : "boolean"
},
"version" : {
"type" : "float"
}}},
"username" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}}}}}
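The inference shown above can be approximated in a few lines. This is a simplified sketch of the default dynamic mapping rules, not Elasticsearch's actual logic: `infer_field_mapping` is a hypothetical helper, date detection is approximated with Python's `datetime.fromisoformat`, and arrays and null values are not handled.

```python
from datetime import datetime

# Sub-field that dynamic mapping adds to every string mapped as text.
KEYWORD_SUBFIELD = {"keyword": {"type": "keyword", "ignore_above": 256}}


def infer_field_mapping(value):
    """Guess a mapping for one JSON value, roughly as dynamic mapping would."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, dict):
        return {"properties": {k: infer_field_mapping(v) for k, v in value.items()}}
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # assumption: stands in for date detection
            return {"type": "date"}
        except ValueError:
            return {"type": "text", "fields": KEYWORD_SUBFIELD}
    raise TypeError(f"unsupported value in this sketch: {value!r}")


doc = {
    "username": "kimchy",
    "comment": "Search is something that any application should have",
    "details": {"created_at": "2024-08-23T15:48:50", "version": 8.15, "employee": True},
}

mapping = infer_field_mapping(doc)
print(mapping["properties"]["details"]["properties"]["created_at"])  # {'type': 'date'}
print(mapping["properties"]["username"]["type"])                     # text
```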
Multi-fields
Text and Keyword in Mapping
- Elasticsearch will give you both text and keyword by default
POST my_index/_doc
{
"country_name": "United States"
}
- country_name is analyzed
- country_name.keyword is not analyzed
Multi-fields in the Mapping
- The country_name field is of type text
- country_name.keyword is the keyword version of the country_name field
Request:
GET my_index/_mapping
Response:
{
"my_index" : {
"mappings" : {
"properties" : {
"country_name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Mapping-Optimization
Dynamic Mapping rarely optimal
- for example, the default for an integer is long
- not always appropriate for the content
- A more tailored type can help save on memory and speed
Can you change a Mapping?
- No - not without reindexing your documents
- adding new fields is possible
- all other mapping changes require reindexing
- Why not?
- if you could switch a field’s data type, all the values that were already indexed before the switch would become unsearchable on that field
- Invest the time to create a great mapping before you go to production
Fixing Mappings
- Create a new index with the updated mapping
PUT blogs_v2
{
"mappings": {
"properties": {
"publish_date": {
"type": "date"
}
}
}
}
Reindex API
- To populate the new index, use the reindex API
- reads data from one index and indexes them into another
- use it to modify your mappings
POST _reindex
{
"source": {
"index": "blogs"
},
"dest": {
"index": "blogs_v2"
}
}
Defining your own Mapping
- Kibana’s file uploader does an excellent job of guessing data types
- allows you to customize the mapping before index creation
Defining your own Mapping manually
- if not using the file uploader, to define an explicit mapping, follow these steps:
- Index a sample document that contains the fields you want defined in the mapping
- Get the dynamic mapping that was created automatically by Elasticsearch
- Modify the mapping definition
- Create your index using your custom mapping
Step 1
- Start by indexing a document into a dummy index
- Use values that will map closely to the data types you want
PUT blogs_temp/_doc/1
{
"date": "November 22, 2024",
"author": "Firstname Lastname",
"title": "Elastic is Open Source",
"seo_title": "A Good SEO Title",
"url": "/blog/some-url",
"content": "blog content",
"locale": "ja-jp",
"@timestamp": "2024-11-22T07:00:00.000Z",
"category": "Engineering"
}
Step 2
- GET the mapping, then copy-paste it into Console
- in Kibana’s file uploader, this is the Advanced section after Import
"blogs_temp": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"category": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
...
Step 3
- Define the mappings according to your use case:
- keyword might work well for category
- content may only need to be text
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"content": {
"type": "text"
},
"category": {
"type": "keyword"
}
...
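If you keep the dynamic mapping from Step 2 as a JSON document, the edits in Step 3 can also be scripted. The sketch below is one possible approach in plain Python; `tailor_mapping` is a hypothetical helper, and the input only contains the three fields shown above.

```python
import copy
import json


def tailor_mapping(dynamic_properties, overrides):
    """Return a copy of the properties with selected field definitions replaced."""
    tailored = copy.deepcopy(dynamic_properties)
    for field, definition in overrides.items():
        if field not in tailored:
            raise KeyError(f"field not present in the dynamic mapping: {field}")
        tailored[field] = definition  # replace wholesale, dropping any sub-fields
    return tailored


text_with_keyword = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}

# Dynamic mapping as returned in Step 2 (only the fields shown above).
dynamic = {
    "@timestamp": {"type": "date"},
    "content": copy.deepcopy(text_with_keyword),
    "category": copy.deepcopy(text_with_keyword),
}

tailored = tailor_mapping(
    dynamic,
    {"content": {"type": "text"}, "category": {"type": "keyword"}},
)
print(json.dumps({"mappings": {"properties": tailored}}, indent=2))
```

The printed JSON can then be used as the request body when creating the new index.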
Step 4
- new_blogs is now a new index with our explicit mappings
- Documents can now be indexed
PUT new_blogs
{
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"category": {
"type": "keyword"
},
"content": {
"type": "text"
},
...
Types and Parameters
Mapping Parameters
- In addition to the type, fields in a mapping can be configured with additional parameters
- for example, to set the analyzer for a text field:
"mappings": {
"properties": {
...
"content": {
"type": "text",
"analyzer": "english"
},
...
Date Formats
- Use format to set the date format used for date fields
- defaults to ISO 8601
- Choose from built-in date formats or define your own custom format
"properties": {
"my_date_field" : {
"type": "date",
"format": "dd/MM/yyyy||epoch_millis"
}
}
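The `||` separator lists alternative formats that are tried in order. A rough Python equivalent of the `dd/MM/yyyy||epoch_millis` example might look like this; `parse_date` is a hypothetical helper, and `strptime` is slightly more lenient than the Elasticsearch pattern (it also accepts days and months that are not zero-padded).

```python
from datetime import datetime, timezone


def parse_date(value):
    """Try dd/MM/yyyy first, then epoch_millis; return a UTC datetime."""
    if isinstance(value, str):
        try:
            parsed = datetime.strptime(value, "%d/%m/%Y")
            return parsed.replace(tzinfo=timezone.utc)
        except ValueError:
            pass  # not dd/MM/yyyy, fall through to the next format
    try:
        millis = int(value)
    except (TypeError, ValueError):
        raise ValueError(f"value matches none of the formats: {value!r}") from None
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


print(parse_date("22/11/2024").isoformat())   # 2024-11-22T00:00:00+00:00
print(parse_date(1732258800000).isoformat())  # 2024-11-22T07:00:00+00:00
```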
Coercing Data
- By default, Elasticsearch attempts to coerce data to match the data type of the field
- for example, suppose the rating field is a long:
PUT ratings/_doc/1
{
"rating": 4
}
PUT ratings/_doc/2
{
"rating": "3"
}
PUT ratings/_doc/3
{
"rating": 4.5
}
- You can disable coercion if you want Elasticsearch to reject documents that have unexpected values:
"mappings": {
"properties": {
"rating": {
"type": "long",
"coerce": false
}
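To make the effect concrete, here is a small sketch of what happens to the three `rating` values above with and without coercion. It models only the documented behaviour (numeric strings are converted, floating point values are truncated for integer types); `index_long` is a hypothetical function, and in Elasticsearch the `_source` still keeps the value as it was sent.

```python
def index_long(value, coerce=True):
    """Return the long that would be indexed, or raise ValueError if rejected."""
    if isinstance(value, bool):
        raise ValueError("booleans are not numeric values")
    if isinstance(value, int):
        return value  # already an integer, nothing to coerce
    if not coerce:
        raise ValueError(f"rejected {value!r}: not an integer and coerce is false")
    if isinstance(value, str):
        value = float(value)  # raises ValueError for non-numeric strings
    if isinstance(value, float):
        return int(value)  # truncates the fractional part
    raise ValueError(f"cannot coerce {value!r} to long")


for rating in (4, "3", 4.5):
    print(repr(rating), "->", index_long(rating))
# 4 -> 4
# '3' -> 3
# 4.5 -> 4

for rating in ("3", 4.5):
    try:
        index_long(rating, coerce=False)
    except ValueError as err:
        print(err)
```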
Not Storing Doc Values
- By default, Elasticsearch creates a doc values data structure for many fields during indexing
- doc values enable you to aggregate/sort on those fields
- but take up disk space
- Fields that won’t be used for aggregations or sorting:
- set doc_values to false
"url" : {
"type": "keyword",
"doc_values": false
}
Not Indexing a Field
- By default, for every field, Elasticsearch creates a data structure that enables fast queries
- inverted index or BKD tree
- takes up disk space
- Set index to false for fields that do not require fast querying
- fields with doc values still support slower queries
"display_name": {
"type": "keyword",
"index": false
}
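The trade-off between the two structures can be pictured with a toy model. The sketch below is a conceptual simplification in plain Python, not the on-disk format: the inverted index maps each term to the documents that contain it, while doc values map each document to its value. The sample documents and function names are made up.

```python
from collections import Counter, defaultdict

docs = {
    1: {"category": "Engineering"},
    2: {"category": "News"},
    3: {"category": "Engineering"},
}

inverted_index = defaultdict(list)  # term -> doc ids (fast queries)
doc_values = {}                     # doc id -> value (aggregations, sorting)

for doc_id, doc in docs.items():
    inverted_index[doc["category"]].append(doc_id)
    doc_values[doc_id] = doc["category"]


def search(term):
    """Fast query: a single lookup in the inverted index."""
    return inverted_index.get(term, [])


def terms_aggregation(doc_ids):
    """Aggregation: read each document's value from doc values and count."""
    return Counter(doc_values[d] for d in doc_ids)


def slow_search(term):
    """Query without an index: scan the doc values of every document."""
    return [d for d, value in doc_values.items() if value == term]


print(search("Engineering"))             # [1, 3]
print(dict(terms_aggregation(docs)))     # {'Engineering': 2, 'News': 1}
print(slow_search("Engineering"))        # [1, 3]
```

Dropping `inverted_index` corresponds to setting index to false: `slow_search` still works, but it has to touch every document. Dropping `doc_values` corresponds to setting doc_values to false: `search` still works, but `terms_aggregation` has nothing to read.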
Disabling a Field
- A field that won’t be used at all and should just be stored in _source:
- set enabled to false
"display_name": {
"enabled": false
}
copy_to Parameter
- Consider a document with three location fields:
POST locations/_doc
{
"region_name": "Victoria",
"country_name": "Australia",
"city_name": "Surrey Hills"
}
- You could use a bool/multi_match query to search all three fields
- Or you could copy all three values to a single field during indexing using copy_to
"properties": {
"region_name": {
"type": "keyword",
"index": false,
"copy_to": "locations_combined"
},
"country_name": {
"type": "keyword",
"index": false,
"copy_to": "locations_combined"
},
"city_name": {
"type": "keyword",
"index": false,
"copy_to": "locations_combined"
},
"locations_combined": {
"type": "text"
}
- The locations_combined field is not stored in the _source
- but it is indexed, so you can query it
Request:
GET locations/_search
{
"query": {
"match": {
"locations_combined": "victoria australia"
}
}
}
Response:
"hits": [
{
"_index": "locations",
"_id": "1",
"_score": 0.5753642,
"_source": {
"region_name": "Victoria",
"country_name": "Australia",
"city_name": "Surrey Hills"
}
}
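The behaviour above (values are copied at index time, the combined field is searchable, and `_source` is left untouched) can be mimicked with a toy model. This is a sketch under simplifying assumptions: `index_with_copy_to` and `match` are hypothetical helpers, analysis is reduced to splitting and lowercasing, and the match logic only models the default OR between query terms.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: split into words and lowercase them."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def index_with_copy_to(source, copy_fields, target):
    """Copy the listed field values into one analyzed field; keep _source as sent."""
    combined_tokens = []
    for field in copy_fields:
        combined_tokens.extend(analyze(source[field]))
    return {"_source": dict(source), "indexed": {target: combined_tokens}}


def match(indexed_doc, field, query):
    """True if any analyzed query term occurs in the indexed field."""
    terms = set(indexed_doc["indexed"].get(field, []))
    return any(term in terms for term in analyze(query))


doc = index_with_copy_to(
    {"region_name": "Victoria", "country_name": "Australia", "city_name": "Surrey Hills"},
    copy_fields=["region_name", "country_name", "city_name"],
    target="locations_combined",
)

print(match(doc, "locations_combined", "victoria australia"))  # True
print("locations_combined" in doc["_source"])                  # False
```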
Dynamic Data
Use Case
- Manually defining a mapping can be tedious when you:
- have documents with a large number of fields
- or don’t know the fields ahead of time
- or want to change the default mapping for certain field types
- Use dynamic templates to define a field’s mapping based on one of the following:
- the field’s data type
- the name of the field
- the path of the field
- Map any string field with a name that starts with ip (pattern ip*) as type ip:
PUT my_index
{
"mappings": {
"dynamic_templates": [
{
"strings_as_ip": {
"match_mapping_type": "string",
"match": "ip*",
"mapping": {
"type": "ip"
}
}
}
]
}
}
Request:
POST my_index/_doc
{
"ip_address": "157.97.192.70"
}
GET my_index/_mapping
Response:
"properties" : {
"ip_address" : {
"type" : "ip"
}
}
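The matching logic of the template above can be sketched as follows. This is an illustrative approximation, not Elasticsearch's implementation: `detected_type` and `resolve_mapping` are hypothetical helpers, only the `match_mapping_type` and `match` conditions are modelled, and `fnmatchcase` accepts a few more wildcard characters than the simple `*` patterns used in templates.

```python
from fnmatch import fnmatchcase

templates = [
    {
        "strings_as_ip": {
            "match_mapping_type": "string",
            "match": "ip*",
            "mapping": {"type": "ip"},
        }
    }
]


def detected_type(value):
    """JSON type as seen by match_mapping_type (date detection is ignored here)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    return "string"


def resolve_mapping(field_name, value, templates):
    """Return the mapping of the first matching template, or None if none match."""
    for template in templates:
        for rule in template.values():
            wanted_type = rule.get("match_mapping_type", "*")
            type_ok = wanted_type in ("*", detected_type(value))
            name_ok = fnmatchcase(field_name, rule.get("match", "*"))
            if type_ok and name_ok:
                return rule["mapping"]
    return None  # caller falls back to the default dynamic mapping


print(resolve_mapping("ip_address", "157.97.192.70", templates))  # {'type': 'ip'}
print(resolve_mapping("username", "kimchy", templates))           # None
print(resolve_mapping("ip_count", 3, templates))                  # None
```

The last call shows that both conditions must hold: the name matches `ip*`, but the value is a number rather than a string, so the template does not apply.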