Let's say the health status of Elasticsearch on OpenShift is yellow or red.
~]# oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- health
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700541655 04:40:55  elasticsearch red    3          3         1056   529 0    0    2        0             -                  99.8%
Often, there will be UNASSIGNED shards. Notice that the unassigned.reason (in this example) is ALLOCATION_FAILED. In the prirep column:
- prirep "r" means replica shard
- prirep "p" means primary shard
~]$ oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- es_util --query="_cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state" | grep -i unassign
index        shard prirep state      node unassigned.reason
app-019996   2     r      UNASSIGNED      ALLOCATION_FAILED
app-019996   1     r      UNASSIGNED      ALLOCATION_FAILED
app-019996   0     r      UNASSIGNED      ALLOCATION_FAILED
infra-019990 2     r      UNASSIGNED      ALLOCATION_FAILED
infra-019990 1     r      UNASSIGNED      ALLOCATION_FAILED
infra-019990 0     r      UNASSIGNED      ALLOCATION_FAILED
infra-019989 2     r      UNASSIGNED      ALLOCATION_FAILED
infra-019989 0     r      UNASSIGNED      ALLOCATION_FAILED
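If there are many unassigned shards, a quick per-index summary can help. As a sketch, the same _cat/shards output can be piped through awk to count UNASSIGNED shards per index (the awk pipeline is an addition here, not something es_util provides).

```shell
# Count UNASSIGNED shards per index. The h= list deliberately omits the node
# column (which is empty for unassigned shards) so the awk field positions
# are stable: $1=index, $2=shard, $3=prirep, $4=state, $5=unassigned.reason
oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 \
   --container elasticsearch \
   --namespace openshift-logging \
   -- es_util --query="_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
   | awk '$4 == "UNASSIGNED" { count[$1]++ } END { for (i in count) print i, count[i] }'
```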
The following command may explain why the shard is unassigned and why the allocation failed.
~]$ oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- es_util --query=_cluster/allocation/explain?pretty
{
  "index" : "app-004812",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-10-30T22:13:06.401Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [ZtWHX-icSYaXS8N_YC-ntg]: failed to create shard, failure IOException[No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "ZtWHX-icSYaXS8N_YC-ntg",
      "node_name" : "elasticsearch-cdm-nc0yql38-1",
      "transport_address" : "10.128.10.6:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "ntVKVLSoT5qfNLZuw0B6nQ"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-10-30T22:13:06.401Z], failed_attempts[5], delayed=false, details[failed shard on node [ZtWHX-icSYaXS8N_YC-ntg]: failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id" : "hTauU8PASjCAaJ8f84rAUg",
      "node_name" : "elasticsearch-cdm-nc0yql38-2",
      "transport_address" : "10.131.8.8:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "zZDAC6tURe6RnWbKOxqhpg",
      "node_name" : "elasticsearch-cdm-nc0yql38-3",
      "transport_address" : "10.130.8.8:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}
Notice that the above output contains the following explanation.
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-10-30T22:13:06.401Z], failed_attempts[5], delayed=false, details[failed shard on node [ZtWHX-icSYaXS8N_YC-ntg]: failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]
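Since the underlying failure in this example was "No space left on device", it may be worth confirming that the nodes have free disk space again before retrying the allocation. One way to sketch this check is with the _cat/allocation API; the trailing awk filter, which flags nodes at or above 85% disk usage (Elasticsearch's default low disk watermark), is an illustrative addition and the threshold can be adjusted.

```shell
# List per-node shard counts and disk usage, then flag any node at or above
# 85% disk usage. Field positions follow the h= list:
# $1=node, $2=shards, $3=disk.used, $4=disk.avail, $5=disk.percent
oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 \
   --container elasticsearch \
   --namespace openshift-logging \
   -- es_util --query="_cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent" \
   | awk 'NR > 1 && $5 + 0 >= 85 { print $1, "disk usage:", $5 "%" }'
```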
The explanation mentions calling /_cluster/reroute?retry_failed=true. Here is an example of how you could call /_cluster/reroute?retry_failed=true.
~]$ oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- curl --tlsv1.2 --silent --insecure --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key --request POST --url "https://localhost:9200/_cluster/reroute?retry_failed=true"
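The reroute response is a large JSON document. As a sketch, grep can be used to pull out just the acknowledged field, to confirm the retry request was accepted (the grep filter is an addition here; jq may not be available in the elasticsearch container).

```shell
# Submit the retry and print only the "acknowledged" field from the (large)
# compact JSON response. "acknowledged":true means the request was accepted.
oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 \
   --container elasticsearch \
   --namespace openshift-logging \
   -- curl --tlsv1.2 --silent --insecure \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      --request POST \
      --url "https://localhost:9200/_cluster/reroute?retry_failed=true" \
   | grep --only-matching '"acknowledged":[a-z]*'
```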
New elasticsearch-cdm pods should be created.
~]$ oc get pods --namespace openshift-logging
NAME READY STATUS RESTARTS AGE
elasticsearch-cdm-nc0yql38-1-574787fbf7-xdhxs 2/2 Running 0 28m
elasticsearch-cdm-nc0yql38-2-58846f758-7r4np 2/2 Running 0 27m
elasticsearch-cdm-nc0yql38-3-7b7f4d84b7-twp4g 2/2 Running 0 26m
And the health will hopefully now be green.
~]$ oc exec elasticsearch-cdm-nc0yql38-3-7b7f4d84b7-twp4g --container elasticsearch --namespace openshift-logging -- health
Thu Mar 28 11:26:56 UTC 2024
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1711625216 11:26:56  elasticsearch green  3          3         366    183 0    0    0        0             -                  100.0%
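For more detail while the cluster recovers, the _cluster/health API reports, among other things, how many shards are still initializing or unassigned. The grep filter shown here is just an optional addition to trim the pretty-printed output to the interesting lines.

```shell
# Show cluster status plus the initializing and unassigned shard counts;
# unassigned_shards should fall to 0 as replicas are re-allocated.
oc exec elasticsearch-cdm-nc0yql38-3-7b7f4d84b7-twp4g \
   --container elasticsearch \
   --namespace openshift-logging \
   -- es_util --query="_cluster/health?pretty" \
   | grep -E '"(status|initializing_shards|unassigned_shards)"'
```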