
Let's say you get an alert like this.
alertname = FluentdNodeDown
container = collector
endpoint = metrics
instance = 10.129.20.191:24231
job = collector
namespace = openshift-logging
openshiftCluster = openshift.example.com
openshift_io_alert_source = platform
pod = collector-bkzk2
prometheus = openshift-monitoring/k8s
service = collector
severity = critical
message = Prometheus could not scrape fluentd collector for more than 10m.
summary = Fluentd cannot be scraped
These alerts come from the alertmanager pods in the openshift-monitoring namespace. The oc get pods command can be used to list those pods.
~]$ oc get pods --namespace openshift-monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 6/6 Running 1 (3d19h ago) 3d19h
alertmanager-main-1 6/6 Running 1 (3d19h ago) 3d19h
Notice in the above example that the alert references pod collector-bkzk2 in the openshift-logging namespace. Listing the collector pods in that namespace shows that the pod named in the alert does not exist, so it appears the underlying issue is that alerts are firing for collector pods that no longer exist.
~]$ oc get pods --namespace openshift-logging
NAME READY STATUS RESTARTS AGE
collector-dbdt2 2/2 Running 0 4d8h
collector-g27xg 2/2 Running 0 4d8h
The alerts fire because the FluentdNodeDown alerting rule's expression evaluates to true, and it evaluates to true because Prometheus still has scrape targets for collector pods that no longer exist.
up{container="collector",job="collector"} == 0 or absent(up{container="collector",job="collector"}) == 1
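If you want to see which stale instances are driving this expression, the first half of it can be evaluated directly against the Prometheus query API. Here is one way to do that with oc exec and curl (curl URL-encodes the expression for us); any series returned belongs to a target that Prometheus believes is down.
oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{container="collector",job="collector"} == 0'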
The alerts trigger once every 30 minutes. The oc exec command can be used to issue the amtool silence add command in one of the Alertmanager pods to temporarily silence the alerts.
oc exec pod/alertmanager-main-0 \
--namespace openshift-monitoring -- \
amtool silence add 'alertname=FluentdNodeDown' \
--alertmanager.url=http://localhost:9093 \
--duration="1w" \
--comment="temporarily silencing this alert for 1 week" \
--author john.doe
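The amtool silence query command can then be run in the same pod to confirm the silence was created and to note its ID, which is needed later to expire it. For example:
oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool silence query --alertmanager.url=http://localhost:9093 alertname=FluentdNodeDown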
The oc get pods command can be used to list the Prometheus pods in the openshift-monitoring namespace.
~]$ oc get pods --namespace openshift-monitoring
NAME READY STATUS RESTARTS AGE
prometheus-adapter-6b98c646c7-m4g76 1/1 Running 0 8d
prometheus-adapter-6b98c646c7-tczr2 1/1 Running 0 8d
prometheus-k8s-0 6/6 Running 0 11d
prometheus-k8s-1 6/6 Running 0 11d
prometheus-operator-6766f68555-mkfv9 2/2 Running 0 11d
prometheus-operator-admission-webhook-8589888cbc-mq2jx 1/1 Running 0 11d
prometheus-operator-admission-webhook-8589888cbc-t62mt 1/1 Running 0 11d
The oc exec and curl commands can be used to issue a GET request to the Prometheus /api/v1/targets endpoint inside each of the Prometheus k8s pods. This returns far too much data to read comfortably on stdout, so the output is almost always redirected to a JSON file.
oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl 'http://localhost:9090/api/v1/targets' | python -m json.tool > prometheus-k8s-0-targets.json
oc exec pod/prometheus-k8s-1 --container prometheus --namespace openshift-monitoring -- curl 'http://localhost:9090/api/v1/targets' | python -m json.tool > prometheus-k8s-1-targets.json
The output JSON should have something like this.
{
"data": {
"activeTargets": [
{
"discoveredLabels": {
"__address__": "10.128.3.131:24231",
"__meta_kubernetes_endpoint_address_target_kind": "Pod",
"__meta_kubernetes_endpoint_address_target_name": "collector-s8sxp",
"__meta_kubernetes_endpoint_node_name": "stg001-worker-gt4wr",
"__meta_kubernetes_endpoint_port_name": "metrics",
"__meta_kubernetes_endpoint_port_protocol": "TCP",
"__meta_kubernetes_endpoint_ready": "true",
"__meta_kubernetes_endpoints_label_logging_infra": "support",
"__meta_kubernetes_endpoints_labelpresent_logging_infra": "true",
"__meta_kubernetes_endpoints_name": "collector",
"__meta_kubernetes_namespace": "openshift-logging",
"__meta_kubernetes_pod_annotation_k8s_v1_cni_cncf_io_network_status": "[{\n \"name\": \"openshift-sdn\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.128.3.131\"\n ],\n \"default\": true,\n \"dns\": {}\n}]",
"__meta_kubernetes_pod_annotation_k8s_v1_cni_cncf_io_networks_status": "[{\n \"name\": \"openshift-sdn\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.128.3.131\"\n ],\n \"default\": true,\n \"dns\": {}\n}]",
"__meta_kubernetes_pod_annotation_openshift_io_scc": "log-collector-scc",
"__meta_kubernetes_pod_annotation_seccomp_security_alpha_kubernetes_io_pod": "runtime/default",
"__meta_kubernetes_pod_annotationpresent_k8s_v1_cni_cncf_io_network_status": "true",
"__meta_kubernetes_pod_annotationpresent_k8s_v1_cni_cncf_io_networks_status": "true",
"__meta_kubernetes_pod_annotationpresent_openshift_io_scc": "true",
"__meta_kubernetes_pod_annotationpresent_scheduler_alpha_kubernetes_io_critical_pod": "true",
"__meta_kubernetes_pod_annotationpresent_seccomp_security_alpha_kubernetes_io_pod": "true",
"__meta_kubernetes_pod_container_image": "registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:7ea3a69f306baf09775c68d97a5be637e2604e405cbe3d6b3e5110f821fc753f",
"__meta_kubernetes_pod_container_name": "collector",
"__meta_kubernetes_pod_container_port_name": "metrics",
"__meta_kubernetes_pod_container_port_number": "24231",
"__meta_kubernetes_pod_container_port_protocol": "TCP",
"__meta_kubernetes_pod_controller_kind": "DaemonSet",
"__meta_kubernetes_pod_controller_name": "collector",
"__meta_kubernetes_pod_host_ip": "10.85.188.107",
"__meta_kubernetes_pod_ip": "10.128.3.131",
"__meta_kubernetes_pod_label_component": "collector",
"__meta_kubernetes_pod_label_controller_revision_hash": "fb778c8f6",
"__meta_kubernetes_pod_label_logging_infra": "collector",
"__meta_kubernetes_pod_label_pod_template_generation": "17",
"__meta_kubernetes_pod_label_provider": "openshift",
"__meta_kubernetes_pod_labelpresent_component": "true",
"__meta_kubernetes_pod_labelpresent_controller_revision_hash": "true",
"__meta_kubernetes_pod_labelpresent_logging_infra": "true",
"__meta_kubernetes_pod_labelpresent_pod_template_generation": "true",
"__meta_kubernetes_pod_labelpresent_provider": "true",
"__meta_kubernetes_pod_name": "collector-s8sxp",
"__meta_kubernetes_pod_node_name": "stg001-worker-gt4wr",
"__meta_kubernetes_pod_phase": "Running",
"__meta_kubernetes_pod_ready": "true",
"__meta_kubernetes_pod_uid": "edb2225e-61a9-4c8f-a24a-d2a3067fbbac",
"__meta_kubernetes_service_annotation_service_alpha_openshift_io_serving_cert_secret_name": "collector-metrics",
"__meta_kubernetes_service_annotation_service_alpha_openshift_io_serving_cert_signed_by": "openshift-service-serving-signer@1602707828",
"__meta_kubernetes_service_annotation_service_beta_openshift_io_serving_cert_signed_by": "openshift-service-serving-signer@1602707828",
"__meta_kubernetes_service_annotationpresent_service_alpha_openshift_io_serving_cert_secret_name": "true",
"__meta_kubernetes_service_annotationpresent_service_alpha_openshift_io_serving_cert_signed_by": "true",
"__meta_kubernetes_service_annotationpresent_service_beta_openshift_io_serving_cert_signed_by": "true",
"__meta_kubernetes_service_label_logging_infra": "support",
"__meta_kubernetes_service_labelpresent_logging_infra": "true",
"__meta_kubernetes_service_name": "collector",
"__metrics_path__": "/metrics",
"__scheme__": "https",
"__scrape_interval__": "30s",
"__scrape_timeout__": "10s",
"__tmp_hash": "0"
},
"globalUrl": "https://10.128.3.131:24231/metrics",
"health": "down",
"labels": {
"container": "collector",
"endpoint": "metrics",
"instance": "10.128.3.131:24231",
"job": "collector",
"namespace": "openshift-logging",
"pod": "collector-s8sxp",
"service": "collector"
},
"lastError": "Get \"https://10.128.3.131:24231/metrics\": dial tcp 10.128.3.131:24231: connect: no route to host",
"lastScrape": "2024-01-25T03:56:52.930707233Z",
"lastScrapeDuration": 3.06031493,
"scrapeInterval": "30s",
"scrapePool": "serviceMonitor/openshift-logging/collector/0",
"scrapeTimeout": "10s",
"scrapeUrl": "https://10.128.3.131:24231/metrics"
}
]
}
}
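If jq happens to be installed on your workstation, it offers a quick way to see just the unhealthy targets in the saved file before reaching for a full script. For example:
jq '.data.activeTargets[] | select(.health == "down") | {pod: .labels.pod, instance: .labels.instance, lastError: .lastError}' prometheus-k8s-0-targets.json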
Since the output JSON has far too much data to parse manually, I used the following Python script to parse the JSON file and print the IP address associated with each of the openshift-logging collector pods.
#!/usr/bin/python3
import argparse
import json
import logging
import re
import sys

# Log to both a file and stdout.
logger = logging.getLogger()
logger.setLevel(logging.INFO)
formatter = logging.Formatter(fmt="[%(asctime)s %(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
fileHandler = logging.FileHandler("log.log")
fileHandler.setFormatter(formatter)
logger.addHandler(fileHandler)
consoleHandler = logging.StreamHandler(sys.stdout)
consoleHandler.setFormatter(formatter)
logger.addHandler(consoleHandler)

# The --file option is the targets JSON file created by curl against /api/v1/targets.
parser = argparse.ArgumentParser()
parser.add_argument("-f", "--file", action="store", required=True)
args = parser.parse_args()

try:
    with open(args.file) as file:
        parsed_json = json.load(file)
except Exception as exception:
    logger.error(f"Got the following exception when attempting json.load(file): {exception}")
    sys.exit(1)

# Collect the pod name and instance (IP:port) of every collector target.
results = []
for target in parsed_json['data']['activeTargets']:
    try:
        pod = target['labels']['pod']
    except KeyError:
        continue
    if re.match('collector', pod, re.IGNORECASE):
        results.append(f"pod {pod} instance = {target['labels']['instance']}")

results.sort()
for result in results:
    print(result)
Running this Python script outputs something like this. Notice in this example that the collector pod that cannot be scraped is included in the output and has IP address 10.129.20.191.
~]$ python3 parser.py --file prometheus-k8s-1.targets.01-24-2024_22:02:25.json
pod collector-26tk2 instance = 10.129.4.2:2112
pod collector-26tk2 instance = 10.129.4.2:24231
pod collector-2mjhm instance = 10.129.17.50:2112
pod collector-2mjhm instance = 10.129.17.50:24231
pod collector-2qs52 instance = 10.129.27.177:2112
pod collector-2qs52 instance = 10.129.27.177:24231
pod collector-45gzj instance = 10.130.0.168:2112
pod collector-45gzj instance = 10.130.0.168:24231
pod collector-4j26g instance = 10.131.3.158:2112
pod collector-4j26g instance = 10.131.3.158:24231
pod collector-4jgwk instance = 10.128.26.142:2112
pod collector-4jgwk instance = 10.128.26.142:24231
pod collector-4w69w instance = 10.130.27.75:2112
pod collector-4w69w instance = 10.130.27.75:24231
pod collector-52z9t instance = 10.130.7.38:2112
pod collector-52z9t instance = 10.130.7.38:24231
pod collector-5zw9c instance = 10.131.23.242:2112
pod collector-5zw9c instance = 10.131.23.242:24231
pod collector-62zrn instance = 10.131.17.209:2112
pod collector-62zrn instance = 10.131.17.209:24231
pod collector-6m9kb instance = 10.131.24.64:2112
pod collector-6m9kb instance = 10.131.24.64:24231
pod collector-6snss instance = 10.128.22.144:2112
pod collector-6snss instance = 10.128.22.144:24231
pod collector-74t62 instance = 10.128.19.117:2112
pod collector-74t62 instance = 10.128.19.117:24231
pod collector-7z5cd instance = 10.130.18.86:2112
pod collector-7z5cd instance = 10.130.18.86:24231
pod collector-8bbzf instance = 10.129.16.174:2112
pod collector-8bbzf instance = 10.129.16.174:24231
pod collector-8r4qh instance = 10.130.19.69:2112
pod collector-8r4qh instance = 10.130.19.69:24231
pod collector-9fncz instance = 10.129.25.223:2112
pod collector-9fncz instance = 10.129.25.223:24231
pod collector-9rz6p instance = 10.129.27.95:2112
pod collector-9rz6p instance = 10.129.27.95:24231
pod collector-9sq77 instance = 10.128.25.251:2112
pod collector-9sq77 instance = 10.128.25.251:24231
pod collector-bkzk2 instance = 10.129.20.191:2112
pod collector-bkzk2 instance = 10.129.20.191:24231
Then use the oc get pods command with the --output wide option to list the collector pods in the openshift-logging namespace. If the pod that cannot be scraped is missing from this output, or is listed with a different IP address than the one recorded in the /api/v1/targets JSON pulled from the Prometheus k8s pods, Prometheus is still tracking a stale target, and you can try reloading the Prometheus configuration (a quick way to diff the two lists is sketched after the listing below).
~]$ oc get pods --namespace openshift-logging --output wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cluster-logging-operator-647865f875-rtkj2 1/1 Running 0 6d6h 10.128.6.197 stg001-infra-5h56w <none> <none>
collector-26tk2 2/2 Running 0 34h 10.129.4.2 stg001-datadog-worker-95xbq <none> <none>
collector-2mjhm 2/2 Running 0 6d6h 10.129.17.50 stg001-worker-dwkfh <none> <none>
collector-2qs52 2/2 Running 0 6d6h 10.129.27.177 stg001-worker-4b84w <none> <none>
collector-45gzj 2/2 Running 0 6d6h 10.130.0.168 stg001-master-0 <none> <none>
collector-52z9t 2/2 Running 0 6d6h 10.130.7.38 stg001-edge-tgsrw <none> <none>
collector-62zrn 2/2 Running 0 6d6h 10.131.17.209 stg001-worker-xp76w <none> <none>
collector-6snss 2/2 Running 0 6d6h 10.128.22.144 stg001-worker-d4rdw <none> <none>
collector-74t62 2/2 Running 0 6d6h 10.128.19.117 stg001-worker-zkrbn <none> <none>
collector-7z5cd 2/2 Running 0 6d6h 10.130.18.86 stg001-worker-jnhl2 <none> <none>
collector-dbdt2 2/2 Running 0 6d6h 10.128.16.196 stg001-worker-bkfph <none> <none>
collector-g27xg 2/2 Running 0 6d6h 10.128.0.194 stg001-master-2 <none> <none>
collector-gs9s4 2/2 Running 0 6d6h 10.128.2.50 stg001-worker-gt4wr <none> <none>
collector-gzqst 2/2 Running 0 6d6h 10.129.21.10 stg001-worker-xxxr2 <none> <none>
collector-h65tv 2/2 Running 0 6d6h 10.130.23.105 stg001-worker-chwbx <none> <none>
collector-h7lwt 2/2 Running 0 6d6h 10.130.20.227 stg001-worker-678fb <none> <none>
collector-hbpgp 2/2 Running 0 6d6h 10.131.24.233 stg001-worker-kf9td <none> <none>
collector-j4svl 2/2 Running 0 6d6h 10.128.27.5 stg001-worker-lbstf <none> <none>
collector-jptz9 2/2 Running 0 6d6h 10.131.20.106 stg001-worker-kgkz7 <none> <none>
collector-lpgx7 2/2 Running 0 6d6h 10.130.27.107 stg001-worker-ntq65 <none> <none>
collector-mprj4 2/2 Running 0 6d6h 10.130.17.176 stg001-worker-rxgwv <none> <none>
collector-nhkvs 2/2 Running 0 6d6h 10.128.6.199 stg001-infra-5h56w <none> <none>
collector-ns9m8 2/2 Running 0 6d6h 10.129.19.108 stg001-worker-msmjk <none> <none>
collector-nw8q4 2/2 Running 0 6d6h 10.129.7.38 stg001-edge-d7mtx <none> <none>
collector-qfpd4 2/2 Running 0 6d6h 10.128.5.210 stg001-infra-wzwzn <none> <none>
collector-r8lpn 2/2 Running 0 6d6h 10.129.0.187 stg001-master-1 <none> <none>
collector-rbs5n 2/2 Running 0 34h 10.130.4.4 stg001-datadog-worker-bsnc9 <none> <none>
collector-rxmpq 2/2 Running 0 34h 10.130.2.2 stg001-datadog-worker-zs7ht <none> <none>
collector-rzndh 2/2 Running 0 6d6h 10.131.22.92 stg001-worker-bv2f5 <none> <none>
collector-tq5l2 2/2 Running 0 6d6h 10.129.23.157 stg001-worker-dlc6w <none> <none>
collector-tqxpc 2/2 Running 0 6d6h 10.131.3.193 stg001-infra-7jb2f <none> <none>
collector-v4sn2 2/2 Running 0 6d6h 10.131.18.118 stg001-worker-qfm8f <none> <none>
collector-vbxxv 2/2 Running 0 6d6h 10.128.21.125 stg001-worker-rznsf <none> <none>
collector-wrf8r 2/2 Running 0 6d6h 10.131.14.170 stg001-worker-d5c4g <none> <none>
collector-xgw8s 2/2 Running 0 6d6h 10.130.25.161 stg001-worker-xprqc <none> <none>
collector-xnj6x 2/2 Running 0 6d6h 10.131.1.226 stg001-worker-qfbbn <none> <none>
collector-z7kzk 2/2 Running 0 34h 10.131.4.2 stg001-datadog-worker-8qhp4 <none> <none>
collector-z8jcn 2/2 Running 0 6d6h 10.128.24.82 stg001-worker-nbrdd <none> <none>
collector-zsg47 2/2 Running 0 6d6h 10.129.24.115 stg001-worker-xr66b <none> <none>
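Rather than eyeballing the two lists, you can diff the instance IPs from the targets JSON against the live pod IPs. This is just a rough sketch; it assumes the parser.py script shown above and that the RESTARTS column is a bare count (as in the listing above) so the IP lands in awk column 6.
# collector pod IPs currently known to the API server
oc get pods --namespace openshift-logging --output wide --no-headers | awk '/^collector/ {print $6}' | sort > pod-ips.txt
# collector instance IPs that Prometheus is still scraping
python3 parser.py --file prometheus-k8s-1-targets.json | awk '{print $5}' | cut -d: -f1 | sort -u > target-ips.txt
# IPs present in the targets file but with no matching pod (stale targets)
comm -23 target-ips.txt pod-ips.txt
Any IP printed by comm belongs to a target that Prometheus is still scraping but that no longer maps to a running collector pod.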
The oc exec and curl commands can be used to issue a POST request to the /-/reload endpoint inside one of the Prometheus pods to reload the Prometheus configuration.
oc exec prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --request POST --url http://localhost:9090/-/reload
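After the reload, you can pull /api/v1/targets again and run it back through the parser script to confirm the stale collector target is gone (the output file name below is just an example).
oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl 'http://localhost:9090/api/v1/targets' | python -m json.tool > prometheus-k8s-0-targets-after-reload.json
python3 parser.py --file prometheus-k8s-0-targets-after-reload.json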
After this is done, I would issue the amtool silence expire command in one of the Alertmanager pods to remove the silence, so that the alerts fire again if the issue persists. If alerts no longer come in, then you know reloading the Prometheus configuration resolved the issue.
oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool silence --alertmanager.url=http://localhost:9093 expire d86e8aa8-91f3-463d-a34a-daf530682f38