Let's say the health status of Elasticsearch on OpenShift is yellow or red.
~]# oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- health
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700541655 04:40:55  elasticsearch red    3          3         1056   529 0    0    2        0             -                  99.8%
Often, there will be UNASSIGNED shards. Notice that the unassigned.reason (in this example) is ALLOCATION_FAILED. In the prirep column:
- prirep "r" means replica shard
- prirep "p" means primary shard
~]$ oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- es_util --query="_cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state" | grep -i unassign
index        shard prirep state      node unassigned.reason
app-019996   2     r      UNASSIGNED      ALLOCATION_FAILED
app-019996   1     r      UNASSIGNED      ALLOCATION_FAILED
app-019996   0     r      UNASSIGNED      ALLOCATION_FAILED
infra-019990 2     r      UNASSIGNED      ALLOCATION_FAILED
infra-019990 1     r      UNASSIGNED      ALLOCATION_FAILED
infra-019990 0     r      UNASSIGNED      ALLOCATION_FAILED
infra-019989 2     r      UNASSIGNED      ALLOCATION_FAILED
infra-019989 0     r      UNASSIGNED      ALLOCATION_FAILED
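If there are many unassigned shards, a quick per-index summary can help. As a sketch, the same _cat/shards output can be piped through awk to count UNASSIGNED shards per index (the awk pipeline is an addition here, not something es_util provides).

```shell
# Count UNASSIGNED shards per index. The h= list deliberately omits the node
# column (which is empty for unassigned shards) so the awk field positions
# are stable: $1=index, $2=shard, $3=prirep, $4=state, $5=unassigned.reason
oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 \
   --container elasticsearch \
   --namespace openshift-logging \
   -- es_util --query="_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
   | awk '$4 == "UNASSIGNED" { count[$1]++ } END { for (i in count) print i, count[i] }'
```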
The following command may explain why the shard is unassigned and why the allocation failed.
~]$ oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- es_util --query=_cluster/allocation/explain?pretty
{
  "index" : "app-004812",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-10-30T22:13:06.401Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [ZtWHX-icSYaXS8N_YC-ntg]: failed to create shard, failure IOException[No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "ZtWHX-icSYaXS8N_YC-ntg",
      "node_name" : "elasticsearch-cdm-nc0yql38-1",
      "transport_address" : "10.128.10.6:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "ntVKVLSoT5qfNLZuw0B6nQ"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-10-30T22:13:06.401Z], failed_attempts[5], delayed=false, details[failed shard on node [ZtWHX-icSYaXS8N_YC-ntg]: failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id" : "hTauU8PASjCAaJ8f84rAUg",
      "node_name" : "elasticsearch-cdm-nc0yql38-2",
      "transport_address" : "10.131.8.8:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "zZDAC6tURe6RnWbKOxqhpg",
      "node_name" : "elasticsearch-cdm-nc0yql38-3",
      "transport_address" : "10.130.8.8:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}
Notice that the above output contains the following explanation.
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-10-30T22:13:06.401Z], failed_attempts[5], delayed=false, details[failed shard on node [ZtWHX-icSYaXS8N_YC-ntg]: failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]
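Since the underlying failure in this example was "No space left on device", it may be worth confirming that the nodes have free disk space again before retrying the allocation. One way to sketch this check is with the _cat/allocation API; the trailing awk filter, which flags nodes at or above 85% disk usage (Elasticsearch's default low disk watermark), is an illustrative addition and the threshold can be adjusted.

```shell
# List per-node shard counts and disk usage, then flag any node at or above
# 85% disk usage. Field positions follow the h= list:
# $1=node, $2=shards, $3=disk.used, $4=disk.avail, $5=disk.percent
oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 \
   --container elasticsearch \
   --namespace openshift-logging \
   -- es_util --query="_cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent" \
   | awk 'NR > 1 && $5 + 0 >= 85 { print $1, "disk usage:", $5 "%" }'
```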
The explanation mentions calling /_cluster/reroute?retry_failed=true. Here is an example of how you could call /_cluster/reroute?retry_failed=true.
~]$ oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 --container elasticsearch --namespace openshift-logging -- curl --tlsv1.2 --silent --insecure --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key --request POST --url "https://localhost:9200/_cluster/reroute?retry_failed=true"
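The reroute response is a large JSON document. As a sketch, grep can be used to pull out just the acknowledged field, to confirm the retry request was accepted (the grep filter is an addition here; jq may not be available in the elasticsearch container).

```shell
# Submit the retry and print only the "acknowledged" field from the (large)
# compact JSON response. "acknowledged":true means the request was accepted.
oc exec elasticsearch-cdm-mrpf7eom-3-566bd5f5cb-lkdz9 \
   --container elasticsearch \
   --namespace openshift-logging \
   -- curl --tlsv1.2 --silent --insecure \
      --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      --request POST \
      --url "https://localhost:9200/_cluster/reroute?retry_failed=true" \
   | grep --only-matching '"acknowledged":[a-z]*'
```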
New elasticsearch-cdm pods should be created.
~]$ oc get pods --namespace openshift-logging
NAME READY STATUS RESTARTS AGE
elasticsearch-cdm-nc0yql38-1-574787fbf7-xdhxs 2/2 Running 0 28m
elasticsearch-cdm-nc0yql38-2-58846f758-7r4np 2/2 Running 0 27m
elasticsearch-cdm-nc0yql38-3-7b7f4d84b7-twp4g 2/2 Running 0 26m
And the health will hopefully now be green.
~]$ oc exec elasticsearch-cdm-nc0yql38-3-7b7f4d84b7-twp4g --container elasticsearch --namespace openshift-logging -- health
Thu Mar 28 11:26:56 UTC 2024
epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1711625216 11:26:56  elasticsearch green  3          3         366    183 0    0    0        0             -                  100.0%
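For more detail while the cluster recovers, the _cluster/health API reports, among other things, how many shards are still initializing or unassigned. The grep filter shown here is just an optional addition to trim the pretty-printed output to the interesting lines.

```shell
# Show cluster status plus the initializing and unassigned shard counts;
# unassigned_shards should fall to 0 as replicas are re-allocated.
oc exec elasticsearch-cdm-nc0yql38-3-7b7f4d84b7-twp4g \
   --container elasticsearch \
   --namespace openshift-logging \
   -- es_util --query="_cluster/health?pretty" \
   | grep -E '"(status|initializing_shards|unassigned_shards)"'
```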