
Let's say an alert like this is firing.
KubeletDown
Kubelet has disappeared from Prometheus target discovery
Target disappeared from Prometheus target discovery
In the openshift-monitoring namespace, there should be a Prometheus rule files config map.
~]$ oc get configmaps --namespace openshift-monitoring
NAME DATA AGE
prometheus-k8s-rulefiles-0 46 96d
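If you are not sure which of the rule files in the config map contains the alert, something like this grep of the describe output (using the config map name from the output above) is a quick way to find it.
oc describe configmap prometheus-k8s-rulefiles-0 --namespace openshift-monitoring | grep -B 2 -A 10 KubeletDown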
And I was able to see the KubeletDown alert configuration in one of the Prometheus rule files config maps.
~]$ oc describe configmap prometheus-custom-rulefiles-0 --namespace openshift-monitoring
- alert: KubeletDown
  annotations:
    description: Kubelet has disappeared from Prometheus target discovery.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeletDown.md
    summary: Target disappeared from Prometheus target discovery.
  expr: |
    absent(up{job="kubelet", metrics_path="/metrics"} == 1)
  for: 15m
  labels:
    namespace: kube-system
    severity: critical
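The expr is the key piece here - the alert fires when, for 15 minutes, there is no kubelet target with up equal to 1, in other words Prometheus is not successfully scraping any kubelet. It can be worth running that query yourself to see what Prometheus sees. Here is a sketch that assumes the default thanos-querier route in the openshift-monitoring namespace and that your user is allowed to query cluster metrics.
TOKEN=$(oc whoami --show-token)
HOST=$(oc get route thanos-querier --namespace openshift-monitoring --output jsonpath='{.spec.host}')
curl --silent --insecure --get "https://${HOST}/api/v1/query" --header "Authorization: Bearer ${TOKEN}" --data-urlencode 'query=up{job="kubelet", metrics_path="/metrics"}'
If the query returns no series, or only series with a value of 0, the absent() expression evaluates to 1 and the alert fires.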
I've seen this alert twice. Here is the cause of each occurrence.
- A rotated certificate was not picked up by certain resources (prometheus-custom and alertmanager-custom)
- The Persistent Volumes for the cluster's Prometheus instances reached 100% capacity, so Prometheus was not able to pull in metrics from the cluster nodes to evaluate
Persistent Volumes for the cluster's Prometheus instances reached 100% capacity
The oc get persistentvolumes command can be used to list the Persistent Volumes. Notice there are two Prometheus Persistent Volumes and two Alert Manager Persistent Volumes.
~]$ oc get persistentvolumes
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-53b2f9a5-1cc3-4945-a619-00d82490055a 300Gi RWO Delete Bound openshift-monitoring/prometheusk8s-db-prometheus-k8s-0 gp3-csi 254d
pvc-9f311f27-5b80-4afb-8e39-be05b88bc309 10Gi RWO Delete Bound openshift-monitoring/alertmanagermain-db-alertmanager-main-0 gp3-csi 254d
pvc-cf6a37aa-c619-4e1b-95e7-1449928401a6 10Gi RWO Delete Bound openshift-monitoring/alertmanagermain-db-alertmanager-main-1 gp3-csi 254d
pvc-d2c305c6-847c-4b76-bdf1-e0015ac40b84 300Gi RWO Delete Bound openshift-monitoring/prometheusk8s-db-prometheus-k8s-1 gp3-csi 254d
The oc get persistentvolumeclaims command can be used to list the Persistent Volume Claims associated with the Persistent Volumes.
~]$ oc get pvc --all-namespaces
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
openshift-monitoring alertmanagermain-db-alertmanager-main-0 Bound pvc-9f311f27-5b80-4afb-8e39-be05b88bc309 10Gi RWO gp3-csi 254d
openshift-monitoring alertmanagermain-db-alertmanager-main-1 Bound pvc-cf6a37aa-c619-4e1b-95e7-1449928401a6 10Gi RWO gp3-csi 254d
openshift-monitoring prometheusk8s-db-prometheus-k8s-0 Bound pvc-53b2f9a5-1cc3-4945-a619-00d82490055a 300Gi RWO gp3-csi 254d
openshift-monitoring prometheusk8s-db-prometheus-k8s-1 Bound pvc-d2c305c6-847c-4b76-bdf1-e0015ac40b84 300Gi RWO gp3-csi 254d
Notice the Prometheus and Alert Manager Persistent Volume Claims are in the openshift-monitoring namespace. The oc get events command can be used to see if there are any interesting events in the openshift-monitoring namespace.
oc get events --namespace openshift-monitoring
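If the list of events is long, sorting by timestamp makes it easier to spot events that line up with when the alert started firing.
oc get events --namespace openshift-monitoring --sort-by=.lastTimestamp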
Likewise, you can describe the Persistent Volume Claims to see if there are any interesting events associated with a particular claim.
oc describe persistentvolumeclaim prometheusk8s-db-prometheus-k8s-0 --namespace openshift-monitoring
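To confirm the Persistent Volumes have actually filled up, the disk usage inside the Prometheus pods can be checked. This sketch assumes the Prometheus data is mounted at /prometheus in the prometheus container of the prometheus-k8s pods, which is the usual layout, but verify the container name and mount path in your cluster.
for pod in prometheus-k8s-0 prometheus-k8s-1; do echo $pod; oc exec $pod --container prometheus --namespace openshift-monitoring -- df -h /prometheus; done;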
Rotated Certificate was not picked up by certain resources
And I was able to see that a stateful set was configured to use the config map.
~]$ oc get statefulset prometheus-custom --output yaml
spec:
  template:
    spec:
      containers:
      - name: prometheus
        volumeMounts:
        - mountPath: /etc/prometheus/rules/prometheus-custom-rulefiles-0
          name: prometheus-custom-rulefiles-0
And the stateful set was managed by the cluster-monitoring-operator.
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: custom
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: custom-monitoring
    app.kubernetes.io/version: 2.46.0
    operator.prometheus.io/name: custom
    operator.prometheus.io/shard: "0"
The first time I saw this alert, there were a bunch of "certificate has expired" errors in the kubelet logs on the nodes, like this. These errors started near the same time the KubeletDown alert started firing.
Apr 19 03:37:50.211885 ip-10-29-86-25 kubenswrapper[1507]: E0419 03:37:50.211852 1507 server.go:299] "Unable to authenticate the request due to an error" err="verifying certificate SN=14489139211812342166385763ABCD073300112, SKID=, AKID=67:3A:A2:57:2C:18:52:21:04:BC:36:3B:3A:F4:3A:C8:1B:B3:01:7B failed: x509: certificate has expired or is not yet valid: current time 2024-04-19T03:37:50Z is after 2024-04-19T01:12:29Z"
And I noticed new "cert" config maps were created near the same time the alerts started. In this example, the 564 config map is the one that existed before the issue started, and the 565 config map was created at nearly the same time the issue started.
~]$ oc get configmaps --namespace openshift-kube-apiserver | grep cert
kube-apiserver-cert-syncer-kubeconfig-564 1 7d11h
kube-apiserver-cert-syncer-kubeconfig-565 1 4d
I used the oc get configmap command with the --output jsonpath option to view the content of the ca-bundle.crt file in the 564 config map (the config map before the issue started).
oc get configmap kubelet-serving-ca-564 --namespace openshift-kube-apiserver --output jsonpath="{.data.ca-bundle\.crt}"
I took the content of the ca-bundle.crt key and placed it in a file named ca-bundle.crt, and then used the openssl command to show the certificate details. The certificate in the 564 config map was indeed the certificate that expired.
~]$ openssl x509 -in ca-bundle.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 4867864711386769100 (0x438e2129f8de26cc)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=openshift-kube-controller-manager-operator_csr-signer-signer@1709471471
        Validity
            Not Before: Mar 20 01:12:28 2024 GMT
            Not After : Apr 19 01:12:29 2024 GMT   <- expired when the issue started
        Subject: CN=kube-csr-signer_@1710897148
        X509v3 extensions:
            X509v3 Subject Key Identifier:
                67:3A:A2:57:2C:18:52:21:04:BC:36:3B:3A:F4:3A:C8:1B:B3:01:7B   <- this matches the AKID in the log
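As a shortcut, the config map content can be piped straight into openssl instead of saving it to a file first. Keep in mind that openssl x509 only prints the first certificate in the bundle, so this only tells the whole story when the bundle contains a single certificate.
oc get configmap kubelet-serving-ca-564 --namespace openshift-kube-apiserver --output jsonpath="{.data.ca-bundle\.crt}" | openssl x509 -noout -dates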
I created a Must Gather and opened a case with Red Hat.
oc adm must-gather
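While waiting on the case, one thing worth considering is getting the prometheus-custom and alertmanager-custom pods to reload the rotated certificate by deleting them so the stateful sets recreate them. This is only a sketch - it assumes the usual stateful set pod naming of <stateful set name>-<ordinal>, that you run it in the namespace the custom monitoring stack lives in, and that restarting these pods is acceptable in your environment.
oc delete pod prometheus-custom-0
oc delete pod alertmanager-custom-0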
Moving on, the oc get service command can be used to see that there is a kubelet service in the kube-system namespace (the namespace in the alert's labels).
~]$ oc get service --namespace kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 2y72d
The oc describe service command can be used to see that the kubelet service forwards requests to the nodes over HTTP, HTTPS, and on port 4194 for cAdvisor (Container Advisor).
~]$ oc describe service kubelet --namespace kube-system
Name: kubelet
Namespace: kube-system
Labels: app.kubernetes.io/managed-by=prometheus-operator
app.kubernetes.io/name=kubelet
k8s-app=kubelet
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP Family Policy: RequireDualStack
IP Families: IPv4,IPv6
IP: None
IPs: None
Port: https-metrics 10250/TCP
TargetPort: 10250/TCP
Endpoints: 10.29.146.118:10250,10.29.148.190:10250,10.29.149.96:10250 + 6 more...
Port: http-metrics 10255/TCP
TargetPort: 10255/TCP
Endpoints: 10.29.146.118:10255,10.29.148.190:10255,10.29.149.96:10255 + 6 more...
Port: cadvisor 4194/TCP
TargetPort: 4194/TCP
Endpoints: 10.29.146.118:4194,10.29.148.190:4194,10.29.149.96:4194 + 6 more...
Session Affinity: None
Events: <none>
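To verify that a kubelet is actually serving metrics on port 10250, one of the endpoint addresses from the output above can be probed directly. This sketch assumes it is run from a host with network access to the node IPs and that your token is allowed to hit the kubelet's /metrics endpoint (a cluster-admin token is).
TOKEN=$(oc whoami --show-token)
curl --silent --insecure --header "Authorization: Bearer ${TOKEN}" https://10.29.146.118:10250/metrics | head
A certificate or authentication error here would line up with the "certificate has expired" errors shown above.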
You can start a debug pod on one of the nodes that the kubelet service forwards requests onto.
~]# oc debug node/my-node-5n4fj
Starting pod/my-node-5n4fj-debug ...
sh-4.4#
Typically you will first issue the chroot /host command to set /host as the root directory, because the node's root file system is mounted at /host in the debug pod.
sh-4.4# chroot /host
systemctl can be used to determine if the kubelet service is running.
sh-5.1# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─01-kubens.conf, 10-mco-default-madv.conf, 20-aws-node-name.conf, 20-aws-providerid.conf, 20-logging.conf
Active: active (running) since Tue 2024-04-23 04:14:54 UTC; 1 day 1h ago
Here is a one liner that I use to loop through each node and return the status of the kubelet service.
for node in $(oc get nodes | grep -v ^NAME | awk '{print $1}'); do echo $node; oc debug node/$node -- chroot /host /usr/bin/systemctl status kubelet | grep Active:; done;
Which should return something like this. In this scenario, I can see that the kubelet has been running since before the certificate expired, meaning the kubelet service was not restarted after the certificate was renewed.
infra-node-1
Active: active (running) since Thu 2024-04-04 16:31:39 UTC; 2 weeks 5 days ago
infra-node-2
Active: active (running) since Thu 2024-04-04 16:18:48 UTC; 2 weeks 5 days ago
infra-node-3
Active: active (running) since Thu 2024-04-04 16:26:03 UTC; 2 weeks 5 days ago
master-node-1
Active: active (running) since Thu 2024-04-04 16:34:01 UTC; 2 weeks 5 days ago
master-node-2
Active: active (running) since Thu 2024-04-04 16:21:05 UTC; 2 weeks 5 days ago
master-node-3
Active: active (running) since Thu 2024-04-04 16:27:32 UTC; 2 weeks 5 days ago
worker-node-1
Active: active (running) since Tue 2024-04-04 16:34:01 UTC; 1 day 1h ago
worker-node-2
Active: active (running) since Tue 2024-04-04 16:21:05 UTC; 1 day 1h ago
worker-node-3
Active: active (running) since Tue 2024-04-04 16:27:32 UTC; 1 day 2h ago
This one liner can be used to restart the kubelet service on each node, if you want to see whether restarting the kubelet resolves the issue.
for node in $(oc get nodes | grep -v ^NAME | awk '{print $1}'); do echo $node; oc debug node/$node -- chroot /host /usr/bin/systemctl restart kubelet; done;
The default directory for the kubelet SSL certificates and keys is /var/lib/kubelet/pki.
sh-5.1# ls -l /var/lib/kubelet/pki
total 32
-rw-------. 1 root root 1183 Feb 29 05:33 kubelet-client-2024-02-29-05-33-41.pem
-rw-------. 1 root root 1183 Mar 15 00:19 kubelet-client-2024-03-15-00-19-13.pem
-rw-------. 1 root root 1183 Mar 30 14:40 kubelet-client-2024-03-30-14-40-24.pem
-rw-------. 1 root root 1183 Apr 13 11:20 kubelet-client-2024-04-13-11-20-57.pem
lrwxrwxrwx. 1 root root 59 Apr 13 11:20 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2024-04-13-11-20-57.pem
-rw-------. 1 root root 1269 Feb 29 05:33 kubelet-server-2024-02-29-05-33-42.pem
-rw-------. 1 root root 1269 Mar 14 12:49 kubelet-server-2024-03-14-12-49-47.pem
-rw-------. 1 root root 1269 Mar 28 21:48 kubelet-server-2024-03-28-21-48-50.pem
-rw-------. 1 root root 1269 Apr 12 22:10 kubelet-server-2024-04-12-22-10-56.pem
lrwxrwxrwx. 1 root root 59 Apr 12 22:10 kubelet-server-current.pem -> /var/lib/kubelet/pki/kubelet-server-2024-04-12-22-10-56.pem
The openssl command can be used to display the details of each PEM. I noticed the prior kubelet client PEM also contained the expired certificate.
sh-5.1# openssl x509 -in /var/lib/kubelet/pki/kubelet-client-2024-03-30-14-40-24.pem -text -noout
Certificate:
    Data:
        Issuer: CN = kube-csr-signer_@1712193161
        Validity
            Not Before: Mar 30 14:35:24 2024 GMT
            Not After : Apr 19 01:12:29 2024 GMT   <- expired when the issue started
        Subject: O = system:nodes, CN = system:node:ip-10-29-148-190.us-east-2.compute.internal
        X509v3 extensions:
            X509v3 Authority Key Identifier:
                67:3A:A2:57:2C:18:52:21:04:BC:36:3B:3A:F4:3A:C8:1B:B3:01:7B   <- this matches the AKID in the log
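Rather than inspecting one PEM at a time, here is a one liner that can be run on the node to print just the expiry date of each PEM in /var/lib/kubelet/pki.
for pem in /var/lib/kubelet/pki/*.pem; do echo $pem; openssl x509 -in $pem -noout -enddate; done;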
The which command should return the absolute path to the kubelet CLI.
sh-5.1# which kubelet
/usr/bin/kubelet
The kubelet --version command can be used to ensure the kubelet CLI is working as expected and returning output.
sh-5.1# kubelet --version
Kubernetes v1.26.13+8f85140
The oc adm node-logs command can be used to return the node logs pertaining to kubelet. The logs will be extensive, which is why I redirect the output to a file.
oc adm node-logs --role=master --unit=kubelet >> kubelet_node.log
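Since the "certificate has expired" errors in this scenario showed up in the kubelet unit logs, grepping the node logs for them is a quick way to see which nodes are affected.
oc adm node-logs --role=master --unit=kubelet | grep -i "certificate has expired"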