
Prometheus and ELK (Elasticsearch, Logstash, Kibana) are similar tools used to gather, display, and analyze data; gathering this data is sometimes referred to as "scraping metrics". For example, both Prometheus and ELK can be used to display data about servers, virtual machines (VMs), databases, containers (e.g. Docker, OpenShift), messaging (e.g. IBM MQ, RabbitMQ), and so on.
Event "Prometheus has failed to evaluate rules in the last 5m" means that one of the Prometheus pods, which by default are in the openshift-monitoring namespace, could not evaluate one of it's rules. The oc get PrometheusRules command can be used to list the Prometheus Rules in the openshift-monitoring namespace.
~]$ oc get PrometheusRules --namespace openshift-monitoring
NAME                                            AGE
alertmanager-main-rules                         692d
cluster-monitoring-operator-prometheus-rules    692d
kube-state-metrics-rules                        692d
kubernetes-monitoring-rules                     692d
node-exporter-rules                             692d
prometheus-k8s-prometheus-rules                 692d
prometheus-k8s-thanos-sidecar-rules             692d
prometheus-operator-rules                       692d
telemetry                                       692d
thanos-querier                                  692d
Each Prometheus Rule will almost always contain multiple alerts. For example, viewing the YAML of a particular Prometheus Rule (alertmanager-main-rules in this example) shows that the rule defines a number of alerts.
~]$ oc get PrometheusRule alertmanager-main-rules --namespace openshift-monitoring --output yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2023-07-07T15:43:12Z"
  generation: 2
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.26.0
    prometheus: k8s
    role: alert-rules
  name: alertmanager-main-rules
  namespace: openshift-monitoring
  resourceVersion: "433329758"
  uid: 1b98ab31-7439-4f52-9f48-c04a696979c3
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: AlertmanagerFailedReload
      annotations:
        description: Configuration has failed to load for {{ $labels.namespace }}/{{
          $labels.pod}}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedReload.md
        summary: Reloading an Alertmanager configuration has failed.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(alertmanager_config_last_reload_successful{job=~"alertmanager-main|alertmanager-user-workload"}[5m]) == 0
      for: 10m
      labels:
        severity: critical
    - alert: AlertmanagerMembersInconsistent
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} has only
          found {{ $value }} members of the {{$labels.job}} cluster.
        summary: A member of an Alertmanager cluster has not found all other cluster
          members.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
          max_over_time(alertmanager_cluster_members{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        < on (namespace,service) group_left
          count by (namespace,service) (max_over_time(alertmanager_cluster_members{job=~"alertmanager-main|alertmanager-user-workload"}[5m]))
      for: 15m
      labels:
        severity: warning
    - alert: AlertmanagerFailedToSendAlerts
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed
          to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
          }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
        summary: An Alertmanager instance failed to send notifications.
      expr: |
        (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
    - alert: AlertmanagerClusterFailedToSendAlerts
      annotations:
        description: The minimum notification failure rate to {{ $labels.integration
          }} sent from any instance in the {{$labels.job}} cluster is {{ $value |
          humanizePercentage }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerClusterFailedToSendAlerts.md
        summary: All Alertmanager instances in a cluster failed to send notifications
          to a critical integration.
      expr: |
        min by (namespace,service, integration) (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[5m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
    - alert: AlertmanagerConfigInconsistent
      annotations:
        description: Alertmanager instances within the {{$labels.job}} cluster have
          different configurations.
        summary: Alertmanager instances within the same cluster have different configurations.
      expr: |
        count by (namespace,service) (
          count_values by (namespace,service) ("config_hash", alertmanager_config_hash{job=~"alertmanager-main|alertmanager-user-workload"})
        )
        != 1
      for: 20m
      labels:
        severity: warning
    - alert: AlertmanagerClusterDown
      annotations:
        description: '{{ $value | humanizePercentage }} of Alertmanager instances
          within the {{$labels.job}} cluster have been up for less than half of the
          last 5m.'
        summary: Half or more of the Alertmanager instances within the same cluster
          are down.
      expr: |
        (
          count by (namespace,service) (
            avg_over_time(up{job=~"alertmanager-main|alertmanager-user-workload"}[5m]) < 0.5
          )
        /
          count by (namespace,service) (
            up{job=~"alertmanager-main|alertmanager-user-workload"}
          )
        )
        >= 0.5
      for: 5m
      labels:
        severity: warning
Or, in the OpenShift console, the alerts in each Prometheus Rule are listed at Observe > Alerting > Alerting rules.
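From the command line, a rough list of every alert defined across the Prometheus Rules can be pulled by grepping the YAML output for alert entries. This is a minimal sketch that assumes the default YAML layout shown above.
~]$ oc get PrometheusRules --namespace openshift-monitoring --output yaml | grep -E "^ +- alert:" | sort --unique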

The oc get pods command can be used to list the Prometheus pods in the openshift-monitoring namespace.
~]$ oc get pods --namespace openshift-monitoring | grep -i prometheus
prometheus-adapter-8559d6b5fb-42mng   1/1   Running   0   13h
prometheus-adapter-8559d6b5fb-ppcxf   1/1   Running   0   13h
prometheus-k8s-0                      6/6   Running   1   67d
prometheus-k8s-1                      6/6   Running   1   67d
prometheus-operator-5956c5d77-84qzq   2/2   Running   0   68d
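Notice in the output above that the prometheus-k8s pods have each restarted once. If a restart looks suspicious, the oc describe command can be used to see the state of each container in the pod and the reason for the most recent restart.
~]$ oc describe pod prometheus-k8s-0 --namespace openshift-monitoring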
Use the oc logs command to look for interesting events in the logs of the Prometheus pods.
~]$ oc logs pod/prometheus-k8s-0 --namespace openshift-monitoring --container prometheus
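These logs can be quite verbose. Since the rule manager logs failed evaluations with the message "Evaluating rule failed" (as seen in the event below), grep can be used to filter the logs down to just those events.
~]$ oc logs pod/prometheus-k8s-0 --namespace openshift-monitoring --container prometheus | grep "Evaluating rule failed"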
Here is one such interesting event. Notice the event contains "query timed out". This may be related to this Red Hat Bugzilla, which indicates the rule was removed starting with version 4.6.9 of OpenShift.
~]$ oc logs pod/prometheus-k8s-0 --namespace openshift-monitoring
level=warn ts=2021-09-14T09:20:29.235Z caller=manager.go:598 component="rule manager" group=kube-apiserver-availability.rules msg="Evaluating rule failed" rule="record: apiserver_request:availability30d\nexpr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~\"POST|PUT|PATCH|DELETE\"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le=\"1\",verb=~\"POST|PUT|PATCH|DELETE\"}[30d]))) + (sum(increase(apiserver_request_duration_seconds_count{verb=~\"LIST|GET\"}[30d])) - ((sum(increase(apiserver_request_duration_seconds_bucket{le=\"0.1\",scope=~\"resource|\",verb=~\"LIST|GET\"}[30d])) or vector(0)) + sum(increase(apiserver_request_duration_seconds_bucket{le=\"0.5\",scope=\"namespace\",verb=~\"LIST|GET\"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{le=\"5\",scope=\"cluster\",verb=~\"LIST|GET\"}[30d])))) + sum(code:apiserver_request_total:increase30d{code=~\"5..\"} or vector(0))) / sum(code:apiserver_request_total:increase30d)\nlabels:\n verb: all\n" err="query timed out in expression evaluation"
Or, in the OpenShift console, the active, firing alerts are listed at Observe > Alerting.
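Prometheus also exposes metrics about its own rule evaluations. As a sketch, a query like the one below, run at Observe > Metrics in the OpenShift console, should show which rule groups (if any) have had failed evaluations in the last 5 minutes. The rule_group label is an assumption based on the labels Prometheus typically attaches to this counter.
# failed rule evaluations over the last 5 minutes, grouped by rule group
sum by (rule_group) (increase(prometheus_rule_evaluation_failures_total[5m])) > 0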
