OpenShift - Resolve "FluentDVeryHighErrorRate"


Let's say you get an alert like this.

alertname = FluentDVeryHighErrorRate
instance = 10.131.17.130:24231
namespace = openshift-logging
openshiftCluster = stg001.op.thrivent.com
openshift_io_alert_source = platform
prometheus = openshift-monitoring/k8s
severity = critical
message = +Inf% of records have resulted in an error by fluentd 10.131.17.130:24231.
summary = FluentD output errors are very high
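
The expression behind this alert is defined in a PrometheusRule resource. Something like this can be used to see the rule definition (this assumes the rule is defined in the openshift-logging namespace, which can vary by cluster and logging operator version).

~]$ oc get prometheusrules --namespace openshift-logging --output yaml | grep -A 10 FluentDVeryHighErrorRate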

 

These alerts come from the alertmanager pods in the openshift-monitoring namespace. The oc get pods command can be used to list those pods.

~]$ oc get pods --namespace openshift-monitoring
NAME                     READY   STATUS    RESTARTS        AGE
alertmanager-main-0      6/6     Running   1 (3d19h ago)   3d19h
alertmanager-main-1      6/6     Running   1 (3d19h ago)   3d19h

 

The oc exec command can be used to run the amtool CLI in one of the alertmanager pods, and the amtool alert command lists the active alerts. If FluentDVeryHighErrorRate is not in the output, the alert is no longer active.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert
Alertname                           Starts At                Summary                                             State   
FluentDVeryHighErrorRate            2024-01-12 23:12:17 UTC  FluentD output errors are very high.                active
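
The amtool alert query command with an alertname matcher can be used to check just this alert. For example, something like this should return only the FluentDVeryHighErrorRate alert if it is still firing.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert query alertname="FluentDVeryHighErrorRate"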

 

The amtool alert command with the --silenced flag can be used to list silenced alerts.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert --silenced
ID                                    Matchers                     Ends At                  Created By  Comment                                     
d86e8aa8-91f3-463d-a34a-daf530682f38  alertname="FluentdNodeDown"  2024-01-20 09:29:29 UTC  john.doe    temporarily silencing this alert for 1 day
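
If you need time to investigate, the amtool silence add command can be used to temporarily silence the alert. Something like this should work, adjusting the duration, author, and comment to suit.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" silence add alertname="FluentDVeryHighErrorRate" --duration="24h" --author="john.doe" --comment="silencing while investigating fluentd output errors"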

 

This command can be used to list the fluentd /metrics scrape URLs that Prometheus is targeting (fluentd exposes metrics on port 24231).

~]$ oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl -s 'http://localhost:9090/api/v1/targets' | python -m json.tool | grep scrapeUrl | grep 24231 | sort
                "scrapeUrl": "https://10.128.0.52:24231/metrics",
                "scrapeUrl": "https://10.128.17.147:24231/metrics",
                "scrapeUrl": "https://10.128.19.37:24231/metrics",
                "scrapeUrl": "https://10.128.2.102:24231/metrics",
                "scrapeUrl": "https://10.128.21.49:24231/metrics",
                "scrapeUrl": "https://10.128.22.185:24231/metrics",
                "scrapeUrl": "https://10.128.25.124:24231/metrics",
                "scrapeUrl": "https://10.128.27.102:24231/metrics",
                "scrapeUrl": "https://10.128.5.230:24231/metrics",
                "scrapeUrl": "https://10.128.6.50:24231/metrics",
                "scrapeUrl": "https://10.129.0.27:24231/metrics",
                "scrapeUrl": "https://10.129.17.141:24231/metrics",
                "scrapeUrl": "https://10.129.18.155:24231/metrics",
                "scrapeUrl": "https://10.129.20.213:24231/metrics",
                "scrapeUrl": "https://10.129.22.243:24231/metrics",
                "scrapeUrl": "https://10.129.25.147:24231/metrics",
                "scrapeUrl": "https://10.129.26.247:24231/metrics",
                "scrapeUrl": "https://10.129.4.33:24231/metrics",
                "scrapeUrl": "https://10.129.6.13:24231/metrics",
                "scrapeUrl": "https://10.130.0.51:24231/metrics",
                "scrapeUrl": "https://10.130.16.195:24231/metrics",
                "scrapeUrl": "https://10.130.19.4:24231/metrics",
                "scrapeUrl": "https://10.130.21.199:24231/metrics",
                "scrapeUrl": "https://10.130.22.213:24231/metrics",
                "scrapeUrl": "https://10.130.2.33:24231/metrics",
                "scrapeUrl": "https://10.130.24.212:24231/metrics",
                "scrapeUrl": "https://10.130.26.222:24231/metrics",
                "scrapeUrl": "https://10.130.5.191:24231/metrics",
                "scrapeUrl": "https://10.130.6.13:24231/metrics",
                "scrapeUrl": "https://10.131.1.0:24231/metrics",
                "scrapeUrl": "https://10.131.14.158:24231/metrics",
                "scrapeUrl": "https://10.131.17.130:24231/metrics",
                "scrapeUrl": "https://10.131.19.49:24231/metrics",
                "scrapeUrl": "https://10.131.20.228:24231/metrics",
                "scrapeUrl": "https://10.131.23.27:24231/metrics",
                "scrapeUrl": "https://10.131.2.38:24231/metrics",
                "scrapeUrl": "https://10.131.25.171:24231/metrics",
                "scrapeUrl": "https://10.131.4.29:24231/metrics",

 

Here is a one-liner to loop through each /metrics endpoint and return the current fluentd_output_status_num_errors counts.

for scrapeurl in $(oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl -s 'http://localhost:9090/api/v1/targets' | python -m json.tool | grep scrapeUrl | grep 24231 | sort | sed 's|.*"scrapeUrl": "||g' | sed 's|".*||g'); do oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --silent --insecure --request GET --url "$scrapeurl" | grep -i ^fluentd_output_status_num_errors; done;

 

Which should return something like this. Notice the counts are all zero except for the elasticsearch output.

fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:7bc",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:80c",type="rewrite_tag_filter"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c3dc",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c3f0",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c404",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c418",type="stdout"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c454",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c47c",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c4a4",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c4b8",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="object:c4cc",type="relabel"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="auditlogs_arcsight",type="remote_syslog"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="retry_default",type="elasticsearch"} 0.0
fluentd_output_status_num_errors{hostname="collector-pn2ck",plugin_id="default",type="elasticsearch"} 67.0 <- Elasticsearch error count
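
Since the non-zero count is on the elasticsearch output, the next step is usually to check the collector pod logs and the Elasticsearch pods in the openshift-logging namespace. Something like this should help, using the collector pod name from the hostname label above (depending on the cluster logging version, the collector pods may instead be named fluentd-xxxxx).

~]$ oc logs pod/collector-pn2ck --namespace openshift-logging | grep -i error

~]$ oc get pods --namespace openshift-logging | grep elasticsearch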

 



