
Let's say you get an alert like this.
alertname = FluentdNodeDown
container = collector
endpoint = metrics
instance = 10.129.20.191:24231
job = collector
namespace = openshift-logging
openshiftCluster = openshift.example.com
openshift_io_alert_source = platform
pod = collector-bkzk2
prometheus = openshift-monitoring/k8s
service = collector
severity = critical
message = Prometheus could not scrape fluentd collector for more than 10m.
summary = Fluentd cannot be scraped
These alerts come from the alertmanager pods in the openshift-monitoring namespace. The oc get pods command can be used to list those pods.
~]$ oc get pods --namespace openshift-monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 6/6 Running 1 (3d19h ago) 3d19h
alertmanager-main-1 6/6 Running 1 (3d19h ago) 3d19h
Notice in the above example that the alert references pod collector-bkzk2 in the openshift-logging namespace. Listing the collector pods in that namespace shows that the pod named in the alert does not exist, so it appears the underlying issue is that alerts are firing for collector pods that no longer exist.
~]$ oc get pods --namespace openshift-logging
NAME READY STATUS RESTARTS AGE
collector-dbdt2 2/2 Running 0 4d8h
collector-g27xg 2/2 Running 0 4d8h
The alerts fire because the FluentdNodeDown alerting rule's expression evaluates to true, and it evaluates to true because Prometheus still has scrape targets for collector pods that no longer exist.
up{container="collector",job="collector"} == 0 or absent(up{container="collector",job="collector"}) == 1
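If you want to see which stale instances are driving this expression, the first half of it can be evaluated directly against the Prometheus query API. Here is one way to do that with oc exec and curl (curl URL-encodes the expression for us); any series returned belongs to a target that Prometheus believes is down.
oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{container="collector",job="collector"} == 0'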
The alerts trigger once every 30 minutes. The oc exec command can be used to issue the amtool silence add command in one of the Alertmanager pods to temporarily silence the alerts.
oc exec pod/alertmanager-main-0 \
--namespace openshift-monitoring -- \
amtool silence add 'alertname=FluentdNodeDown' \
--alertmanager.url=http://localhost:9093 \
--duration="1w" \
--comment="temporarily silencing this alert for 1 week" \
--author john.doe
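The amtool silence query command can then be run in the same pod to confirm the silence was created and to note its ID, which is needed later to expire it. For example:
oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool silence query --alertmanager.url=http://localhost:9093 alertname=FluentdNodeDown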
The oc get pods command can be used to list the Prometheus pods in the openshift-monitoring namespace.
~]$ oc get pods --namespace openshift-monitoring
NAME READY STATUS RESTARTS AGE
prometheus-adapter-6b98c646c7-m4g76 1/1 Running 0 8d
prometheus-adapter-6b98c646c7-tczr2 1/1 Running 0 8d
prometheus-k8s-0 6/6 Running 0 11d
prometheus-k8s-1 6/6 Running 0 11d
prometheus-operator-6766f68555-mkfv9 2/2 Running 0 11d
prometheus-operator-admission-webhook-8589888cbc-mq2jx 1/1 Running 0 11d
prometheus-operator-admission-webhook-8589888cbc-t62mt 1/1 Running 0 11d
The oc exec and curl commands can be used to issue a GET request to the Prometheus /api/v1/targets endpoint inside each of the Prometheus k8s pods. This returns far too much data to read comfortably on stdout, so the output is almost always redirected to a JSON file.
oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl 'http://localhost:9090/api/v1/targets' | python -m json.tool > prometheus-k8s-0-targets.json
oc exec pod/prometheus-k8s-1 --container prometheus --namespace openshift-monitoring -- curl 'http://localhost:9090/api/v1/targets' | python -m json.tool > prometheus-k8s-1-targets.json
The output JSON should have something like this.
{
"data": {
"activeTargets": [
{
"discoveredLabels": {
"__address__": "10.128.3.131:24231",
"__meta_kubernetes_endpoint_address_target_kind": "Pod",
"__meta_kubernetes_endpoint_address_target_name": "collector-s8sxp",
"__meta_kubernetes_endpoint_node_name": "stg001-worker-gt4wr",
"__meta_kubernetes_endpoint_port_name": "metrics",
"__meta_kubernetes_endpoint_port_protocol": "TCP",
"__meta_kubernetes_endpoint_ready": "true",
"__meta_kubernetes_endpoints_label_logging_infra": "support",
"__meta_kubernetes_endpoints_labelpresent_logging_infra": "true",
"__meta_kubernetes_endpoints_name": "collector",
"__meta_kubernetes_namespace": "openshift-logging",
"__meta_kubernetes_pod_annotation_k8s_v1_cni_cncf_io_network_status": "[{\n \"name\": \"openshift-sdn\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.128.3.131\"\n ],\n \"default\": true,\n \"dns\": {}\n}]",
"__meta_kubernetes_pod_annotation_k8s_v1_cni_cncf_io_networks_status": "[{\n \"name\": \"openshift-sdn\",\n \"interface\": \"eth0\",\n \"ips\": [\n \"10.128.3.131\"\n ],\n \"default\": true,\n \"dns\": {}\n}]",
"__meta_kubernetes_pod_annotation_openshift_io_scc": "log-collector-scc",
"__meta_kubernetes_pod_annotation_seccomp_security_alpha_kubernetes_io_pod": "runtime/default",
"__meta_kubernetes_pod_annotationpresent_k8s_v1_cni_cncf_io_network_status": "true",
"__meta_kubernetes_pod_annotationpresent_k8s_v1_cni_cncf_io_networks_status": "true",
"__meta_kubernetes_pod_annotationpresent_openshift_io_scc": "true",
"__meta_kubernetes_pod_annotationpresent_scheduler_alpha_kubernetes_io_critical_pod": "true",
"__meta_kubernetes_pod_annotationpresent_seccomp_security_alpha_kubernetes_io_pod": "true",
"__meta_kubernetes_pod_container_image": "registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:7ea3a69f306baf09775c68d97a5be637e2604e405cbe3d6b3e5110f821fc753f",
"__meta_kubernetes_pod_container_name": "collector",
"__meta_kubernetes_pod_container_port_name": "metrics",
"__meta_kubernetes_pod_container_port_number": "24231",
"__meta_kubernetes_pod_container_port_protocol": "TCP",
"__meta_kubernetes_pod_controller_kind": "DaemonSet",
"__meta_kubernetes_pod_controller_name": "collector",
"__meta_kubernetes_pod_host_ip": "10.85.188.107",
"__meta_kubernetes_pod_ip": "10.128.3.131",
"__meta_kubernetes_pod_label_component": "collector",
"__meta_kubernetes_pod_label_controller_revision_hash": "fb778c8f6",
"__meta_kubernetes_pod_label_logging_infra": "collector",
"__meta_kubernetes_pod_label_pod_template_generation": "17",
"__meta_kubernetes_pod_label_provider": "openshift",
"__meta_kubernetes_pod_labelpresent_component": "true",
"__meta_kubernetes_pod_labelpresent_controller_revision_hash": "true",
"__meta_kubernetes_pod_labelpresent_logging_infra": "true",
"__meta_kubernetes_pod_labelpresent_pod_template_generation": "true",
"__meta_kubernetes_pod_labelpresent_provider": "true",
"__meta_kubernetes_pod_name": "collector-s8sxp",
"__meta_kubernetes_pod_node_name": "stg001-worker-gt4wr",
"__meta_kubernetes_pod_phase": "Running",
"__meta_kubernetes_pod_ready": "true",
"__meta_kubernetes_pod_uid": "edb2225e-61a9-4c8f-a24a-d2a3067fbbac",
"__meta_kubernetes_service_annotation_service_alpha_openshift_io_serving_cert_secret_name": "collector-metrics",
"__meta_kubernetes_service_annotation_service_alpha_openshift_io_serving_cert_signed_by": "openshift-service-serving-signer@1602707828",
"__meta_kubernetes_service_annotation_service_beta_openshift_io_serving_cert_signed_by": "openshift-service-serving-signer@1602707828",
"__meta_kubernetes_service_annotationpresent_service_alpha_openshift_io_serving_cert_secret_name": "true",
"__meta_kubernetes_service_annotationpresent_service_alpha_openshift_io_serving_cert_signed_by": "true",
"__meta_kubernetes_service_annotationpresent_service_beta_openshift_io_serving_cert_signed_by": "true",
"__meta_kubernetes_service_label_logging_infra": "support",
"__meta_kubernetes_service_labelpresent_logging_infra": "true",
"__meta_kubernetes_service_name": "collector",
"__metrics_path__": "/metrics",
"__scheme__": "https",
"__scrape_interval__": "30s",
"__scrape_timeout__": "10s",
"__tmp_hash": "0"
},
"globalUrl": "https://10.128.3.131:24231/metrics",
"health": "down",
"labels": {
"container": "collector",
"endpoint": "metrics",
"instance": "10.128.3.131:24231",
"job": "collector",
"namespace": "openshift-logging",
"pod": "collector-s8sxp",
"service": "collector"
},
"lastError": "Get \"https://10.128.3.131:24231/metrics\": dial tcp 10.128.3.131:24231: connect: no route to host",
"lastScrape": "2024-01-25T03:56:52.930707233Z",
"lastScrapeDuration": 3.06031493,
"scrapeInterval": "30s",
"scrapePool": "serviceMonitor/openshift-logging/collector/0",
"scrapeTimeout": "10s",
"scrapeUrl": "https://10.128.3.131:24231/metrics"
}
]
}
}
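If jq happens to be installed on your workstation, it offers a quick way to see just the unhealthy targets in the saved file before reaching for a full script. For example:
jq '.data.activeTargets[] | select(.health == "down") | {pod: .labels.pod, instance: .labels.instance, lastError: .lastError}' prometheus-k8s-0-targets.json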
Since the output JSON has far too much data to parse manually, I used the following Python script to parse the JSON file and print the IP address associated with each of the openshift-logging collector pods.
#!/usr/bin/python3
import argparse
import json
import logging
import re
import sys

# Log to both a file and stdout.
logger = logging.getLogger()
logger.setLevel(logging.INFO)
formatter = logging.Formatter(fmt="[%(asctime)s %(levelname)s] %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
fileHandler = logging.FileHandler("log.log")
fileHandler.setFormatter(formatter)
logger.addHandler(fileHandler)
consoleHandler = logging.StreamHandler(sys.stdout)
consoleHandler.setFormatter(formatter)
logger.addHandler(consoleHandler)

# The --file option is the targets JSON file created by curl against /api/v1/targets.
parser = argparse.ArgumentParser()
parser.add_argument("-f", "--file", action="store", required=True)
args = parser.parse_args()

try:
    with open(args.file) as file:
        parsed_json = json.load(file)
except Exception as exception:
    logger.error(f"Got the following exception when attempting json.load(file): {exception}")
    sys.exit(1)

# Collect the pod name and instance (IP:port) of every collector target.
results = []
for target in parsed_json['data']['activeTargets']:
    try:
        pod = target['labels']['pod']
    except KeyError:
        continue
    if re.match('collector', pod, re.IGNORECASE):
        results.append(f"pod {pod} instance = {target['labels']['instance']}")

results.sort()
for result in results:
    print(result)
Running this Python script outputs something like this. Notice in this example that the collector pod that cannot be scraped is included in the output and has IP address 10.129.20.191.
~]$ python3 parser.py --file prometheus-k8s-1.targets.01-24-2024_22:02:25.json
pod collector-26tk2 instance = 10.129.4.2:2112
pod collector-26tk2 instance = 10.129.4.2:24231
pod collector-2mjhm instance = 10.129.17.50:2112
pod collector-2mjhm instance = 10.129.17.50:24231
pod collector-2qs52 instance = 10.129.27.177:2112
pod collector-2qs52 instance = 10.129.27.177:24231
pod collector-45gzj instance = 10.130.0.168:2112
pod collector-45gzj instance = 10.130.0.168:24231
pod collector-4j26g instance = 10.131.3.158:2112
pod collector-4j26g instance = 10.131.3.158:24231
pod collector-4jgwk instance = 10.128.26.142:2112
pod collector-4jgwk instance = 10.128.26.142:24231
pod collector-4w69w instance = 10.130.27.75:2112
pod collector-4w69w instance = 10.130.27.75:24231
pod collector-52z9t instance = 10.130.7.38:2112
pod collector-52z9t instance = 10.130.7.38:24231
pod collector-5zw9c instance = 10.131.23.242:2112
pod collector-5zw9c instance = 10.131.23.242:24231
pod collector-62zrn instance = 10.131.17.209:2112
pod collector-62zrn instance = 10.131.17.209:24231
pod collector-6m9kb instance = 10.131.24.64:2112
pod collector-6m9kb instance = 10.131.24.64:24231
pod collector-6snss instance = 10.128.22.144:2112
pod collector-6snss instance = 10.128.22.144:24231
pod collector-74t62 instance = 10.128.19.117:2112
pod collector-74t62 instance = 10.128.19.117:24231
pod collector-7z5cd instance = 10.130.18.86:2112
pod collector-7z5cd instance = 10.130.18.86:24231
pod collector-8bbzf instance = 10.129.16.174:2112
pod collector-8bbzf instance = 10.129.16.174:24231
pod collector-8r4qh instance = 10.130.19.69:2112
pod collector-8r4qh instance = 10.130.19.69:24231
pod collector-9fncz instance = 10.129.25.223:2112
pod collector-9fncz instance = 10.129.25.223:24231
pod collector-9rz6p instance = 10.129.27.95:2112
pod collector-9rz6p instance = 10.129.27.95:24231
pod collector-9sq77 instance = 10.128.25.251:2112
pod collector-9sq77 instance = 10.128.25.251:24231
pod collector-bkzk2 instance = 10.129.20.191:2112
pod collector-bkzk2 instance = 10.129.20.191:24231
Then use the oc get pods command with the --output wide option to list the collector pods in the openshift-logging namespace. If the pod that cannot be scraped is missing from this output, or is listed with a different IP address than the one recorded in the /api/v1/targets JSON pulled from the Prometheus k8s pods, Prometheus is still tracking a stale target, and you can try reloading the Prometheus configuration (a quick way to diff the two lists is sketched after the listing below).
~]$ oc get pods --namespace openshift-logging --output wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cluster-logging-operator-647865f875-rtkj2 1/1 Running 0 6d6h 10.128.6.197 stg001-infra-5h56w <none> <none>
collector-26tk2 2/2 Running 0 34h 10.129.4.2 stg001-datadog-worker-95xbq <none> <none>
collector-2mjhm 2/2 Running 0 6d6h 10.129.17.50 stg001-worker-dwkfh <none> <none>
collector-2qs52 2/2 Running 0 6d6h 10.129.27.177 stg001-worker-4b84w <none> <none>
collector-45gzj 2/2 Running 0 6d6h 10.130.0.168 stg001-master-0 <none> <none>
collector-52z9t 2/2 Running 0 6d6h 10.130.7.38 stg001-edge-tgsrw <none> <none>
collector-62zrn 2/2 Running 0 6d6h 10.131.17.209 stg001-worker-xp76w <none> <none>
collector-6snss 2/2 Running 0 6d6h 10.128.22.144 stg001-worker-d4rdw <none> <none>
collector-74t62 2/2 Running 0 6d6h 10.128.19.117 stg001-worker-zkrbn <none> <none>
collector-7z5cd 2/2 Running 0 6d6h 10.130.18.86 stg001-worker-jnhl2 <none> <none>
collector-dbdt2 2/2 Running 0 6d6h 10.128.16.196 stg001-worker-bkfph <none> <none>
collector-g27xg 2/2 Running 0 6d6h 10.128.0.194 stg001-master-2 <none> <none>
collector-gs9s4 2/2 Running 0 6d6h 10.128.2.50 stg001-worker-gt4wr <none> <none>
collector-gzqst 2/2 Running 0 6d6h 10.129.21.10 stg001-worker-xxxr2 <none> <none>
collector-h65tv 2/2 Running 0 6d6h 10.130.23.105 stg001-worker-chwbx <none> <none>
collector-h7lwt 2/2 Running 0 6d6h 10.130.20.227 stg001-worker-678fb <none> <none>
collector-hbpgp 2/2 Running 0 6d6h 10.131.24.233 stg001-worker-kf9td <none> <none>
collector-j4svl 2/2 Running 0 6d6h 10.128.27.5 stg001-worker-lbstf <none> <none>
collector-jptz9 2/2 Running 0 6d6h 10.131.20.106 stg001-worker-kgkz7 <none> <none>
collector-lpgx7 2/2 Running 0 6d6h 10.130.27.107 stg001-worker-ntq65 <none> <none>
collector-mprj4 2/2 Running 0 6d6h 10.130.17.176 stg001-worker-rxgwv <none> <none>
collector-nhkvs 2/2 Running 0 6d6h 10.128.6.199 stg001-infra-5h56w <none> <none>
collector-ns9m8 2/2 Running 0 6d6h 10.129.19.108 stg001-worker-msmjk <none> <none>
collector-nw8q4 2/2 Running 0 6d6h 10.129.7.38 stg001-edge-d7mtx <none> <none>
collector-qfpd4 2/2 Running 0 6d6h 10.128.5.210 stg001-infra-wzwzn <none> <none>
collector-r8lpn 2/2 Running 0 6d6h 10.129.0.187 stg001-master-1 <none> <none>
collector-rbs5n 2/2 Running 0 34h 10.130.4.4 stg001-datadog-worker-bsnc9 <none> <none>
collector-rxmpq 2/2 Running 0 34h 10.130.2.2 stg001-datadog-worker-zs7ht <none> <none>
collector-rzndh 2/2 Running 0 6d6h 10.131.22.92 stg001-worker-bv2f5 <none> <none>
collector-tq5l2 2/2 Running 0 6d6h 10.129.23.157 stg001-worker-dlc6w <none> <none>
collector-tqxpc 2/2 Running 0 6d6h 10.131.3.193 stg001-infra-7jb2f <none> <none>
collector-v4sn2 2/2 Running 0 6d6h 10.131.18.118 stg001-worker-qfm8f <none> <none>
collector-vbxxv 2/2 Running 0 6d6h 10.128.21.125 stg001-worker-rznsf <none> <none>
collector-wrf8r 2/2 Running 0 6d6h 10.131.14.170 stg001-worker-d5c4g <none> <none>
collector-xgw8s 2/2 Running 0 6d6h 10.130.25.161 stg001-worker-xprqc <none> <none>
collector-xnj6x 2/2 Running 0 6d6h 10.131.1.226 stg001-worker-qfbbn <none> <none>
collector-z7kzk 2/2 Running 0 34h 10.131.4.2 stg001-datadog-worker-8qhp4 <none> <none>
collector-z8jcn 2/2 Running 0 6d6h 10.128.24.82 stg001-worker-nbrdd <none> <none>
collector-zsg47 2/2 Running 0 6d6h 10.129.24.115 stg001-worker-xr66b <none> <none>
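Rather than eyeballing the two lists, you can diff the instance IPs from the targets JSON against the live pod IPs. This is just a rough sketch; it assumes the parser.py script shown above and that the RESTARTS column is a bare count (as in the listing above) so the IP lands in awk column 6.
# collector pod IPs currently known to the API server
oc get pods --namespace openshift-logging --output wide --no-headers | awk '/^collector/ {print $6}' | sort > pod-ips.txt
# collector instance IPs that Prometheus is still scraping
python3 parser.py --file prometheus-k8s-1-targets.json | awk '{print $5}' | cut -d: -f1 | sort -u > target-ips.txt
# IPs present in the targets file but with no matching pod (stale targets)
comm -23 target-ips.txt pod-ips.txt
Any IP printed by comm belongs to a target that Prometheus is still scraping but that no longer maps to a running collector pod.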
The oc exec and curl commands can be used to issue a POST request to the /-/reload endpoint inside one of the Prometheus pods to reload the Prometheus configuration.
oc exec prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --request POST --url http://localhost:9090/-/reload
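After the reload, you can pull /api/v1/targets again and run it back through the parser script to confirm the stale collector target is gone (the output file name below is just an example).
oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl 'http://localhost:9090/api/v1/targets' | python -m json.tool > prometheus-k8s-0-targets-after-reload.json
python3 parser.py --file prometheus-k8s-0-targets-after-reload.json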
After this is done, I would issue the amtool silence expire command in one of the Alertmanager pods to remove the silence, so that the alerts fire again if the issue persists. If alerts no longer come in, then you know reloading the Prometheus configuration resolved the issue.
oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool silence --alertmanager.url=http://localhost:9093 expire d86e8aa8-91f3-463d-a34a-daf530682f38