This post collects best practices, and things to avoid, when working with Elasticsearch and feeding data into it. Knowing them up front tells us what to take care of before we even start working with this excellent search engine.

Elasticsearch Best Practices

We will start with the best practices to follow with Elasticsearch and the problems that arise when they are ignored. Let’s get started.

Always define ES Mappings

Elasticsearch (ES) can work without mappings. When you start feeding JSON data into an ES index, it iterates over the fields of each document and creates a suitable mapping on its own. This seems direct and easy because ES selects the data type itself, but depending on your data you might need a field to have a specific data type, and the automatic guess can get in the way.

For example, suppose you index the following document:

{
  "id" : 1,
  "title" : "Install ElasticSearch on Ubuntu",
  "link" : "https://linuxhint.com/install-elasticsearch-ubuntu/",
  "date" : "2018-03-25"
}

Elasticsearch will map the “date” field as the “date” type. But then you index the following document:

{
  "id" : 1,
  "title" : "ES Best Practices and Performance",
  "date" : "Pending"
}

This time, the value of the date field cannot be parsed as a date, so ES throws an error and refuses to index the document. To make things easy, you can index a few documents, see which fields ES has mapped, and grab the mapping from this URL:

GET /index_name/doc_type/_mapping

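This way, you won’t have to construct the complete mapping by hand: you can use the generated mapping as a starting point and adjust the field types you care about. Note that on ES 7.x and later, where mapping types were removed, the request is GET /index_name/_mapping.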
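As a rough illustration of an explicit mapping for the documents above, here is a minimal Python sketch that builds a mapping body and prints it as JSON. It assumes the typeless mapping layout of ES 7.x and later (on older versions the properties sit under the document type name). The field types, the date format and the use of ignore_malformed are assumptions chosen for this example, not something the original documents prescribe.

```python
import json


def build_article_mapping():
    """Return an explicit mapping body for the example article documents."""
    return {
        "mappings": {
            "properties": {
                "id": {"type": "integer"},
                "title": {"type": "text"},
                "link": {"type": "keyword"},
                "date": {
                    "type": "date",
                    "format": "yyyy-MM-dd",
                    # Assumption: skip unparsable values such as "Pending"
                    # instead of rejecting the whole document.
                    "ignore_malformed": True,
                },
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON could be used as the body of an index creation request.
    print(json.dumps(build_article_mapping(), indent=2))
```

Whether a malformed date should be ignored or should fail the document is a design decision; dropping ignore_malformed keeps the strict behaviour described above.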

Production Flags

By default, ES starts with the cluster name elasticsearch. When you have a lot of nodes in your cluster, it is a good idea to keep the naming flags as consistent as possible, like:

cluster.name: app_es_production
node.name: app_es_node_001

Apart from this, recovery settings for nodes matter a lot as well. Suppose some of the nodes in a cluster restart due to a failure, and some of them come back a little later than others. If recovery starts while nodes are still missing, the cluster begins re-creating and moving shards that the late nodes already hold. To avoid this, tell the cluster to wait until a minimum number of nodes have joined before it starts recovery:

gateway.recover_after_nodes: 10

It also helps to tell the cluster in advance how many nodes are expected in the cluster and how long it should wait for them before recovering:

gateway.expected_nodes: 20
gateway.recover_after_time: 7m

With the correct config, a recovery that would have taken hours can take as little as a minute, which can save a company a lot of money.
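To make the interplay of the three gateway settings easier to follow, here is a small Python sketch of the decision they describe. It is a simplified model, not Elasticsearch's actual implementation: the function name is hypothetical, and seconds_waited is assumed to count from the moment the minimum node count was reached.

```python
def should_start_recovery(nodes_joined, seconds_waited,
                          recover_after_nodes=10,
                          expected_nodes=20,
                          recover_after_seconds=7 * 60):
    """Simplified model of the gateway recovery settings shown above."""
    # Below the minimum node count, recovery never starts.
    if nodes_joined < recover_after_nodes:
        return False
    # All expected nodes are present: start immediately.
    if nodes_joined >= expected_nodes:
        return True
    # Otherwise wait until the configured time has passed.
    return seconds_waited >= recover_after_seconds


if __name__ == "__main__":
    print(should_start_recovery(nodes_joined=12, seconds_waited=60))
    print(should_start_recovery(nodes_joined=12, seconds_waited=420))
```

With the defaults taken from the config lines above, a cluster of 12 nodes keeps waiting after one minute but starts recovering once seven minutes have passed.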

Capacity Provisioning

It is important to know how much space your data will take and the rate at which it flows into Elasticsearch, because that decides the amount of RAM you will need on each node of the cluster, including the master node.

Of course, there are no fixed guidelines for arriving at these numbers, but some steps give a good idea. One of them is to simulate the use case: build an ES cluster and feed it data at almost the same rate you expect in your production setup. Starting big and scaling down also helps you settle on how much space is really needed.
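Before running such a simulation, a back-of-the-envelope estimate can give a starting size. The Python sketch below is only that: the function names, the overhead factor of 1.25 and the sample figures are assumptions made for illustration, and it covers disk space only, not RAM.

```python
import math


def estimate_cluster_storage_gb(daily_ingest_gb, retention_days,
                                replicas=1, overhead_factor=1.25):
    """Rough disk estimate: primaries plus replicas, padded for overhead."""
    primary_gb = daily_ingest_gb * retention_days
    return primary_gb * (1 + replicas) * overhead_factor


def estimate_data_nodes(total_storage_gb, usable_disk_per_node_gb):
    """Number of data nodes needed to hold the estimated storage."""
    return math.ceil(total_storage_gb / usable_disk_per_node_gb)


if __name__ == "__main__":
    # Hypothetical workload: 50 GB/day kept for 30 days with one replica.
    total = estimate_cluster_storage_gb(daily_ingest_gb=50, retention_days=30)
    print(total, estimate_data_nodes(total, usable_disk_per_node_gb=1000))
```

The result should be treated as the "start big" figure and then corrected with measurements from the simulated cluster.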

Large Templates

When you define large index templates, you will face issues syncing the template across the nodes of the cluster. Note also that the template has to be redefined whenever the data model changes. It is a much better idea to keep the templates dynamic. Dynamic templates automatically add mappings for new fields based on rules we defined earlier. Even so, there is no substitute for keeping templates as small as possible.
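As an illustration of what a small, dynamic mapping can look like, the following Python sketch builds a mapping body with two dynamic template rules. The rule names and the "*_at" field naming convention are hypothetical; match_mapping_type and match are the standard dynamic template options, and the layout assumes ES 7.x or later.

```python
import json


def build_dynamic_template_mapping():
    """Return a mapping body that relies on dynamic templates, not a field list."""
    return {
        "mappings": {
            "dynamic_templates": [
                {
                    # Hypothetical rule name: map every new string field as keyword.
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword"},
                    }
                },
                {
                    # Hypothetical convention: fields ending in _at hold dates.
                    "timestamps_as_dates": {
                        "match": "*_at",
                        "mapping": {"type": "date"},
                    }
                },
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_dynamic_template_mapping(), indent=2))
```

Two rules like these cover any number of future fields, so the template does not grow with the data model.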

Using mlockall on Ubuntu Servers

Linux swaps pages out to disk when it needs memory for new pages. Swapping makes things slow because disks are slower than memory. The mlockall property in the ES configuration tells ES not to swap its pages out of memory, even if they aren’t required right now. This property can be set in the YAML file:

bootstrap.mlockall: true

In ES v5.x and later, this property has been renamed to:

bootstrap.memory_lock: true

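If you’re using this property, just make sure that you give ES a big enough heap using the -Xmx JVM option or the ES_HEAP_SIZE environment variable (ES_HEAP_SIZE applies to versions before 5.x; later versions take the heap size from ES_JAVA_OPTS or the jvm.options file).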
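Setting the property does not guarantee that the lock succeeded, so it is worth checking. The nodes info API (GET _nodes?filter_path=**.mlockall) reports an mlockall flag per node. The Python sketch below only parses such a response; the helper name is hypothetical and the sample response is hand-written for illustration, not captured from a real cluster.

```python
def nodes_without_memory_lock(nodes_info):
    """Return the IDs of nodes whose process.mlockall flag is not true."""
    unlocked = []
    for node_id, info in nodes_info.get("nodes", {}).items():
        if not info.get("process", {}).get("mlockall", False):
            unlocked.append(node_id)
    return sorted(unlocked)


if __name__ == "__main__":
    # Hand-written sample shaped like the nodes info response.
    sample_response = {
        "nodes": {
            "node_a": {"process": {"mlockall": True}},
            "node_b": {"process": {"mlockall": False}},
        }
    }
    print(nodes_without_memory_lock(sample_response))
```

A node that shows up in the result usually lacks the permission to lock memory, which on Linux is governed by the memlock limit of the user running ES.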

Minimize Mapping Updates

Every mapping update request puts some load on your ES cluster and slightly affects its performance. If you can’t avoid frequent mapping updates, you can set the following property in the ES YAML config file:

indices.cluster.send_refresh_mapping: false

When a mapping update is still pending in the master node’s queue and data arrives with the old mapping, an additional mapping refresh request has to be exchanged between the master and the other nodes later, which slows things down. Setting the property above to false suppresses that extra refresh request. Note that this is only helpful if you change your mappings regularly and often.

Optimized Thread-pool

ES nodes have many thread pools in order to improve how threads are managed within a node. But there are limits on how much work each pool can take on. To control this limit for bulk requests, we can use an ES property:

threadpool.bulk.queue_size: 2000

This tells ES how many shard-level requests can be queued for execution on the node when no thread is available to process them. If the number of pending tasks exceeds this value, you will get a RemoteTransportException. The higher this value, the more JVM heap space the node needs to hold the queued requests. Your code should also be ready to handle this exception when it is thrown.
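One common way to keep client code ready for rejected bulk requests is to retry with exponential backoff. The Python sketch below shows the idea without depending on any client library: BulkRejectedError and the send_bulk callable are hypothetical stand-ins for whatever exception and request function your client actually provides.

```python
import time


class BulkRejectedError(Exception):
    """Hypothetical stand-in for the rejection error raised by a client."""


def send_with_backoff(send_bulk, payload, max_attempts=5,
                      base_delay=0.5, sleep=time.sleep):
    """Call send_bulk(payload), retrying with exponential backoff on rejection."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send_bulk(payload)
        except BulkRejectedError:
            # Give up after the last attempt and let the caller decide.
            if attempt == max_attempts:
                raise
            # Wait 0.5s, 1s, 2s, ... so the queue has time to drain.
            sleep(base_delay * 2 ** (attempt - 1))
```

Backing off gives the bulk queue time to drain, which is usually a better first response than raising queue_size and paying for it in heap.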

Conclusion

In this lesson, we looked at how we can improve Elasticsearch performance by avoiding common and not-so-common mistakes people make.
