
Let's say an alert like this is firing.
KubeletDown
Kubelet has disappeared from Prometheus target discovery
Target disappeared from Prometheus target discovery
In the openshift-monitoring namespace, there should be a Prometheus rule files config map.
~]$ oc get configmaps --namespace openshift-monitoring
NAME DATA AGE
prometheus-k8s-rulefiles-0 46 96d
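If you are not sure which of the rule files in the config map contains the alert, something like this grep of the describe output (using the config map name from the output above) is a quick way to find it.
oc describe configmap prometheus-k8s-rulefiles-0 --namespace openshift-monitoring | grep -B 2 -A 10 KubeletDown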
And I was able to see the KubeletDown alert configuration in one of the Prometheus rule files config maps.
~]$ oc describe configmap prometheus-custom-rulefiles-0 --namespace openshift-monitoring
- alert: KubeletDown
  annotations:
    description: Kubelet has disappeared from Prometheus target discovery.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeletDown.md
    summary: Target disappeared from Prometheus target discovery.
  expr: |
    absent(up{job="kubelet", metrics_path="/metrics"} == 1)
  for: 15m
  labels:
    namespace: kube-system
    severity: critical
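The expr is the key piece here - the alert fires when, for 15 minutes, there is no kubelet target with up equal to 1, in other words Prometheus is not successfully scraping any kubelet. It can be worth running that query yourself to see what Prometheus sees. Here is a sketch that assumes the default thanos-querier route in the openshift-monitoring namespace and that your user is allowed to query cluster metrics.
TOKEN=$(oc whoami --show-token)
HOST=$(oc get route thanos-querier --namespace openshift-monitoring --output jsonpath='{.spec.host}')
curl --silent --insecure --get "https://${HOST}/api/v1/query" --header "Authorization: Bearer ${TOKEN}" --data-urlencode 'query=up{job="kubelet", metrics_path="/metrics"}'
If the query returns no series, or only series with a value of 0, the absent() expression evaluates to 1 and the alert fires.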
I've seen this alert twice. Here is the cause of each occurrence.
- A rotated certificate was not picked up by certain resources (prometheus-custom and alertmanager-custom)
- The Persistent Volumes for the cluster's Prometheus instances reached 100% capacity, so Prometheus was not able to pull in metrics from the cluster nodes to evaluate
Persistent Volumes for the cluster's Prometheus instances reached 100% capacity
The oc get persistentvolumes command can be used to list the Persistent Volumes. Notice there are two Prometheus Persistent Volumes and two Alert Manager Persistent Volumes.
~]$ oc get persistentvolumes
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-53b2f9a5-1cc3-4945-a619-00d82490055a 300Gi RWO Delete Bound openshift-monitoring/prometheusk8s-db-prometheus-k8s-0 gp3-csi 254d
pvc-9f311f27-5b80-4afb-8e39-be05b88bc309 10Gi RWO Delete Bound openshift-monitoring/alertmanagermain-db-alertmanager-main-0 gp3-csi 254d
pvc-cf6a37aa-c619-4e1b-95e7-1449928401a6 10Gi RWO Delete Bound openshift-monitoring/alertmanagermain-db-alertmanager-main-1 gp3-csi 254d
pvc-d2c305c6-847c-4b76-bdf1-e0015ac40b84 300Gi RWO Delete Bound openshift-monitoring/prometheusk8s-db-prometheus-k8s-1 gp3-csi 254d
The oc get persistentvolumeclaims command can be used to list the Persistent Volume Claims associated with the Persistent Volumes.
~]$ oc get pvc --all-namespaces
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
openshift-monitoring alertmanagermain-db-alertmanager-main-0 Bound pvc-9f311f27-5b80-4afb-8e39-be05b88bc309 10Gi RWO gp3-csi 254d
openshift-monitoring alertmanagermain-db-alertmanager-main-1 Bound pvc-cf6a37aa-c619-4e1b-95e7-1449928401a6 10Gi RWO gp3-csi 254d
openshift-monitoring prometheusk8s-db-prometheus-k8s-0 Bound pvc-53b2f9a5-1cc3-4945-a619-00d82490055a 300Gi RWO gp3-csi 254d
openshift-monitoring prometheusk8s-db-prometheus-k8s-1 Bound pvc-d2c305c6-847c-4b76-bdf1-e0015ac40b84 300Gi RWO gp3-csi 254d
Notice the Prometheus and Alert Manager Persistent Volume Claims are in the openshift-monitoring namespace. The oc get events command can be used to see if there are any interesting events in the openshift-monitoring namespace.
oc get events --namespace openshift-monitoring
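If the list of events is long, sorting by timestamp makes it easier to spot events that line up with when the alert started firing.
oc get events --namespace openshift-monitoring --sort-by=.lastTimestamp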
Likewise, you can describe the Persistent Volume Claims to see if there are any interesting events associated with a particular claim.
oc describe persistentvolumeclaim prometheusk8s-db-prometheus-k8s-0 --namespace openshift-monitoring
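To confirm the Persistent Volumes have actually filled up, the disk usage inside the Prometheus pods can be checked. This sketch assumes the Prometheus data is mounted at /prometheus in the prometheus container of the prometheus-k8s pods, which is the usual layout, but verify the container name and mount path in your cluster.
for pod in prometheus-k8s-0 prometheus-k8s-1; do echo $pod; oc exec $pod --container prometheus --namespace openshift-monitoring -- df -h /prometheus; done;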
Rotated Certificate was not picked up by certain resources
And I was able to see that a stateful set was configured to use the config map.
~]$ oc get statefulset prometheus-custom --output yaml
spec:
  template:
    spec:
      containers:
      - name: prometheus
        volumeMounts:
        - mountPath: /etc/prometheus/rules/prometheus-custom-rulefiles-0
          name: prometheus-custom-rulefiles-0
And the stateful set was managed by the cluster-monitoring-operator.
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: custom
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: custom-monitoring
    app.kubernetes.io/version: 2.46.0
    operator.prometheus.io/name: custom
    operator.prometheus.io/shard: "0"
The first time I saw this alert, there were a bunch of "certificate has expired" errors in the kubelet logs on the nodes, like this. These errors started near the same time the KubeletDown alert started firing.
Apr 19 03:37:50.211885 ip-10-29-86-25 kubenswrapper[1507]: E0419 03:37:50.211852 1507 server.go:299] "Unable to authenticate the request due to an error" err="verifying certificate SN=14489139211812342166385763ABCD073300112, SKID=, AKID=67:3A:A2:57:2C:18:52:21:04:BC:36:3B:3A:F4:3A:C8:1B:B3:01:7B failed: x509: certificate has expired or is not yet valid: current time 2024-04-19T03:37:50Z is after 2024-04-19T01:12:29Z"
And I noticed new "cert" config maps were created near the same time the alerts started. In this example, the 564 config map is the one that existed before the issue started, and the 565 config map was created at nearly the same time the issue started.
~]$ oc get configmaps --namespace openshift-kube-apiserver | grep cert
kube-apiserver-cert-syncer-kubeconfig-564 1 7d11h
kube-apiserver-cert-syncer-kubeconfig-565 1 4d
I used the oc get configmap command with the --output jsonpath option to view the content of the ca-bundle.crt file in the 564 config map (the config map before the issue started).
oc get configmap kubelet-serving-ca-564 --namespace openshift-kube-apiserver --output jsonpath="{.data.ca-bundle\.crt}"
I took the content of the ca-bundle.crt key and placed it in a file named ca-bundle.crt, and then used the openssl command to show the certificate details. The certificate in the 564 config map was indeed the certificate that expired.
~]$ openssl x509 -in ca-bundle.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 4867864711386769100 (0x438e2129f8de26cc)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=openshift-kube-controller-manager-operator_csr-signer-signer@1709471471
        Validity
            Not Before: Mar 20 01:12:28 2024 GMT
            Not After : Apr 19 01:12:29 2024 GMT   <- expired when the issue started
        Subject: CN=kube-csr-signer_@1710897148
        X509v3 extensions:
            X509v3 Subject Key Identifier:
                67:3A:A2:57:2C:18:52:21:04:BC:36:3B:3A:F4:3A:C8:1B:B3:01:7B   <- this matches the AKID in the log
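As a shortcut, the config map content can be piped straight into openssl instead of saving it to a file first. Keep in mind that openssl x509 only prints the first certificate in the bundle, so this only tells the whole story when the bundle contains a single certificate.
oc get configmap kubelet-serving-ca-564 --namespace openshift-kube-apiserver --output jsonpath="{.data.ca-bundle\.crt}" | openssl x509 -noout -dates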
I created a Must Gather and opened a case with Red Hat.
oc adm must-gather
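While waiting on the case, one thing worth considering is getting the prometheus-custom and alertmanager-custom pods to reload the rotated certificate by deleting them so the stateful sets recreate them. This is only a sketch - it assumes the usual stateful set pod naming of <stateful set name>-<ordinal>, that you run it in the namespace the custom monitoring stack lives in, and that restarting these pods is acceptable in your environment.
oc delete pod prometheus-custom-0
oc delete pod alertmanager-custom-0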
Moving on, the oc get service command can be used to see that there is a kubelet service in the kube-system namespace (the namespace in the alert's labels).
~]$ oc get service --namespace kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 2y72d
The oc describe service command can be used to see that the kubelet service forwards requests to the nodes over HTTP, HTTPS, and on port 4194 for cAdvisor (Container Advisor).
~]$ oc describe service kubelet --namespace kube-system
Name: kubelet
Namespace: kube-system
Labels: app.kubernetes.io/managed-by=prometheus-operator
app.kubernetes.io/name=kubelet
k8s-app=kubelet
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP Family Policy: RequireDualStack
IP Families: IPv4,IPv6
IP: None
IPs: None
Port: https-metrics 10250/TCP
TargetPort: 10250/TCP
Endpoints: 10.29.146.118:10250,10.29.148.190:10250,10.29.149.96:10250 + 6 more...
Port: http-metrics 10255/TCP
TargetPort: 10255/TCP
Endpoints: 10.29.146.118:10255,10.29.148.190:10255,10.29.149.96:10255 + 6 more...
Port: cadvisor 4194/TCP
TargetPort: 4194/TCP
Endpoints: 10.29.146.118:4194,10.29.148.190:4194,10.29.149.96:4194 + 6 more...
Session Affinity: None
Events: <none>
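To verify that a kubelet is actually serving metrics on port 10250, one of the endpoint addresses from the output above can be probed directly. This sketch assumes it is run from a host with network access to the node IPs and that your token is allowed to hit the kubelet's /metrics endpoint (a cluster-admin token is).
TOKEN=$(oc whoami --show-token)
curl --silent --insecure --header "Authorization: Bearer ${TOKEN}" https://10.29.146.118:10250/metrics | head
A certificate or authentication error here would line up with the "certificate has expired" errors shown above.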
You can start a debug pod on one of the nodes that the kubelet service forwards requests onto.
~]# oc debug node/my-node-5n4fj
Starting pod/my-node-5n4fj-debug ...
sh-4.4#
Typically you will first issue the chroot /host command to set /host as the root directory, because the node's root file system is mounted at /host in the debug pod.
sh-4.4# chroot /host
systemctl can be used to determine if the kubelet service is running.
sh-5.1# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─01-kubens.conf, 10-mco-default-madv.conf, 20-aws-node-name.conf, 20-aws-providerid.conf, 20-logging.conf
Active: active (running) since Tue 2024-04-23 04:14:54 UTC; 1 day 1h ago
Here is a one liner that I use to loop through each node and return the status of the kubelet service.
for node in $(oc get nodes | grep -v ^NAME | awk '{print $1}'); do echo $node; oc debug node/$node -- chroot /host /usr/bin/systemctl status kubelet | grep Active:; done;
Which should return something like this. In this scenario, I can see that the kubelet has been running since before the certificate expired, meaning the kubelet service was not restarted after the certificate was renewed.
infra-node-1
Active: active (running) since Thu 2024-04-04 16:31:39 UTC; 2 weeks 5 days ago
infra-node-2
Active: active (running) since Thu 2024-04-04 16:18:48 UTC; 2 weeks 5 days ago
infra-node-3
Active: active (running) since Thu 2024-04-04 16:26:03 UTC; 2 weeks 5 days ago
master-node-1
Active: active (running) since Thu 2024-04-04 16:34:01 UTC; 2 weeks 5 days ago
master-node-2
Active: active (running) since Thu 2024-04-04 16:21:05 UTC; 2 weeks 5 days ago
master-node-3
Active: active (running) since Thu 2024-04-04 16:27:32 UTC; 2 weeks 5 days ago
worker-node-1
Active: active (running) since Tue 2024-04-04 16:34:01 UTC; 1 day 1h ago
worker-node-2
Active: active (running) since Tue 2024-04-04 16:21:05 UTC; 1 day 1h ago
worker-node-3
Active: active (running) since Tue 2024-04-04 16:27:32 UTC; 1 day 2h ago
This one liner can be used to restart the kubelet service on each node, if you want to see whether restarting the kubelet resolves the issue.
for node in $(oc get nodes | grep -v ^NAME | awk '{print $1}'); do echo $node; oc debug node/$node -- chroot /host /usr/bin/systemctl restart kubelet; done;
The default directory for the kubelet SSL certificates and keys is /var/lib/kubelet/pki.
sh-5.1# ls -l /var/lib/kubelet/pki
total 32
-rw-------. 1 root root 1183 Feb 29 05:33 kubelet-client-2024-02-29-05-33-41.pem
-rw-------. 1 root root 1183 Mar 15 00:19 kubelet-client-2024-03-15-00-19-13.pem
-rw-------. 1 root root 1183 Mar 30 14:40 kubelet-client-2024-03-30-14-40-24.pem
-rw-------. 1 root root 1183 Apr 13 11:20 kubelet-client-2024-04-13-11-20-57.pem
lrwxrwxrwx. 1 root root 59 Apr 13 11:20 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2024-04-13-11-20-57.pem
-rw-------. 1 root root 1269 Feb 29 05:33 kubelet-server-2024-02-29-05-33-42.pem
-rw-------. 1 root root 1269 Mar 14 12:49 kubelet-server-2024-03-14-12-49-47.pem
-rw-------. 1 root root 1269 Mar 28 21:48 kubelet-server-2024-03-28-21-48-50.pem
-rw-------. 1 root root 1269 Apr 12 22:10 kubelet-server-2024-04-12-22-10-56.pem
lrwxrwxrwx. 1 root root 59 Apr 12 22:10 kubelet-server-current.pem -> /var/lib/kubelet/pki/kubelet-server-2024-04-12-22-10-56.pem
The openssl command can be used to display the details of each PEM. I noticed the prior kubelet client PEM also contained the expired certificate.
sh-5.1# openssl x509 -in /var/lib/kubelet/pki/kubelet-client-2024-03-30-14-40-24.pem -text -noout
Certificate:
    Data:
        Issuer: CN = kube-csr-signer_@1712193161
        Validity
            Not Before: Mar 30 14:35:24 2024 GMT
            Not After : Apr 19 01:12:29 2024 GMT   <- expired when the issue started
        Subject: O = system:nodes, CN = system:node:ip-10-29-148-190.us-east-2.compute.internal
        X509v3 extensions:
            X509v3 Authority Key Identifier:
                67:3A:A2:57:2C:18:52:21:04:BC:36:3B:3A:F4:3A:C8:1B:B3:01:7B   <- this matches the AKID in the log
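Rather than inspecting one PEM at a time, here is a one liner that can be run on the node to print just the expiry date of each PEM in /var/lib/kubelet/pki.
for pem in /var/lib/kubelet/pki/*.pem; do echo $pem; openssl x509 -in $pem -noout -enddate; done;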
The which command should return the absolute path to the kubelet CLI.
sh-5.1# which kubelet
/usr/bin/kubelet
The kubelet --version command can be used to ensure the kubelet CLI is working as expected and returning output.
sh-5.1# kubelet --version
Kubernetes v1.26.13+8f85140
The oc adm node-logs command can be used to return the node logs pertaining to kubelet. The logs will be extensive, which is why I redirect the output to a file.
oc adm node-logs --role=master --unit=kubelet >> kubelet_node.log
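Since the "certificate has expired" errors in this scenario showed up in the kubelet unit logs, grepping the node logs for them is a quick way to see which nodes are affected.
oc adm node-logs --role=master --unit=kubelet | grep -i "certificate has expired"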