
If you are not familiar with the oc command, refer to OpenShift - Getting Started with the oc command.
Alert Manager, as the name implies, receives alerts from a client (such as Prometheus) and routes them to targets such as an SMTP email server or OpsGenie. Alert Manager also has some additional features, such as being able to silence alerts for a period of time.
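On OpenShift, the Alert Manager routing configuration (the receivers and routes that determine where alerts get sent) is stored in the alertmanager-main secret in the openshift-monitoring namespace. Assuming a default cluster monitoring setup, something like the following should dump the current configuration.
~]$ oc get secret alertmanager-main --namespace openshift-monitoring --output jsonpath='{.data.alertmanager\.yaml}' | base64 --decode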
In the openshift-monitoring namespace, there should be one or more Prometheus rules.
~]$ oc get PrometheusRules --namespace openshift-monitoring
NAME                                           AGE
alertmanager-main-rules                        606d
cluster-monitoring-operator-prometheus-rules   606d
kube-state-metrics-rules                       606d
kubernetes-monitoring-rules                    606d
node-exporter-rules                            606d
prometheus-k8s-prometheus-rules                606d
prometheus-k8s-thanos-sidecar-rules            606d
prometheus-operator-rules                      606d
telemetry                                      606d
thanos-querier                                 606d
Each Prometheus rule should contain one or more rules. In this example, alertmanager-main-rules contains rules such as:
- AlertmanagerFailedReload
- AlertmanagerFailedToSendAlerts
- AlertmanagerClusterFailedToSendAlerts
- AlertmanagerConfigInconsistent
- AlertmanagerClusterDown
~]$ oc get PrometheusRule alertmanager-main-rules --namespace openshift-monitoring --output yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2023-07-07T15:43:12Z"
  generation: 2
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.26.0
    prometheus: k8s
    role: alert-rules
  name: alertmanager-main-rules
  namespace: openshift-monitoring
  resourceVersion: "433329758"
  uid: 1b98ab31-7439-4f52-9f48-c04a696979c3
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: AlertmanagerFailedReload
      annotations:
        description: Configuration has failed to load for {{ $labels.namespace }}/{{
          $labels.pod}}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedReload.md
        summary: Reloading an Alertmanager configuration has failed.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(alertmanager_config_last_reload_successful{job=~"alertmanager-main|alertmanager-user-workload"}[5m]) == 0
      for: 10m
      labels:
        severity: critical
    - alert: AlertmanagerMembersInconsistent
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} has only
          found {{ $value }} members of the {{$labels.job}} cluster.
        summary: A member of an Alertmanager cluster has not found all other cluster
          members.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
          max_over_time(alertmanager_cluster_members{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        < on (namespace,service) group_left
          count by (namespace,service) (max_over_time(alertmanager_cluster_members{job=~"alertmanager-main|alertmanager-user-workload"}[5m]))
      for: 15m
      labels:
        severity: warning
    - alert: AlertmanagerFailedToSendAlerts
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed
          to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
          }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
        summary: An Alertmanager instance failed to send notifications.
      expr: |
        (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
    - alert: AlertmanagerClusterFailedToSendAlerts
      annotations:
        description: The minimum notification failure rate to {{ $labels.integration
          }} sent from any instance in the {{$labels.job}} cluster is {{ $value |
          humanizePercentage }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerClusterFailedToSendAlerts.md
        summary: All Alertmanager instances in a cluster failed to send notifications
          to a critical integration.
      expr: |
        min by (namespace,service, integration) (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[5m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
    - alert: AlertmanagerConfigInconsistent
      annotations:
        description: Alertmanager instances within the {{$labels.job}} cluster have
          different configurations.
        summary: Alertmanager instances within the same cluster have different configurations.
      expr: |
        count by (namespace,service) (
          count_values by (namespace,service) ("config_hash", alertmanager_config_hash{job=~"alertmanager-main|alertmanager-user-workload"})
        )
        != 1
      for: 20m
      labels:
        severity: warning
    - alert: AlertmanagerClusterDown
      annotations:
        description: '{{ $value | humanizePercentage }} of Alertmanager instances
          within the {{$labels.job}} cluster have been up for less than half of the
          last 5m.'
        summary: Half or more of the Alertmanager instances within the same cluster
          are down.
      expr: |
        (
          count by (namespace,service) (
            avg_over_time(up{job=~"alertmanager-main|alertmanager-user-workload"}[5m]) < 0.5
          )
        /
          count by (namespace,service) (
            up{job=~"alertmanager-main|alertmanager-user-workload"}
          )
        )
        >= 0.5
      for: 5m
      labels:
        severity: warning
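If you only want the alert names rather than the full YAML, a jsonpath query along these lines should do the trick (just a sketch; the alert names are printed as a space-separated list).
~]$ oc get PrometheusRule alertmanager-main-rules --namespace openshift-monitoring --output jsonpath='{.spec.groups[*].rules[*].alert}'
AlertmanagerFailedReload AlertmanagerMembersInconsistent AlertmanagerFailedToSendAlerts AlertmanagerClusterFailedToSendAlerts AlertmanagerConfigInconsistent AlertmanagerClusterDown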
In the openshift-monitoring namespace, there should be one or more alertmanager pods.
~]$ oc get pods --namespace openshift-monitoring
NAME                  READY   STATUS    RESTARTS        AGE
alertmanager-main-0   6/6     Running   0               6d10h
alertmanager-main-1   6/6     Running   1 (6d10h ago)   6d10h
The oc exec command can be used to run the amtool CLI in one of the alertmanager pods. The amtool alert command can be used to list the active alerts.
~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert
Alertname Starts At Summary State
Watchdog 2024-01-12 23:12:17 UTC An alert that should always be firing to certify that Alertmanager is working properly. active
UpdateAvailable 2024-01-12 23:12:23 UTC Your upstream update recommendation service recommends you update your cluster. active
KubeJobFailed 2024-01-12 23:13:36 UTC Job failed to complete. active
KubeJobFailed 2024-01-12 23:13:36 UTC Job failed to complete. active
KubeJobFailed 2024-01-12 23:13:36 UTC Job failed to complete. active
KubeJobFailed 2024-01-12 23:13:36 UTC Job failed to complete. active
ClusterNotUpgradeable 2024-01-12 23:13:42 UTC One or more cluster operators have been blocking minor version cluster upgrades for at least an hour. active
SimpleContentAccessNotAvailable 2024-01-12 23:17:41 UTC Simple content access certificates are not available. active
PodStartupStorageOperationsFailing 2024-01-12 23:17:41 UTC Pods can't start because volume_mount of volume plugin kubernetes.io/secret is permanently failing on node stg001-infra-wzwzn. active
InsightsRecommendationActive 2024-01-12 23:19:11 UTC An Insights recommendation is active for this cluster. active
InsightsRecommendationActive 2024-01-12 23:19:11 UTC An Insights recommendation is active for this cluster. active
InsightsRecommendationActive 2024-01-12 23:19:11 UTC An Insights recommendation is active for this cluster. active
MachineWithNoRunningPhase 2024-01-16 17:51:01 UTC machine stg001-datadog-worker-wwlbv is in phase: active
MachineWithoutValidNode 2024-01-16 17:51:17 UTC machine stg001-datadog-worker-wwlbv does not have valid node reference active
APIRemovedInNextReleaseInUse 2024-01-17 14:32:06 UTC Deprecated API that will be removed in the next version is being used. active
APIRemovedInNextEUSReleaseInUse 2024-01-17 14:32:06 UTC Deprecated API that will be removed in the next EUS version is being used. active
The query subcommand can be used to display only the alerts matching an exact alertname.
~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert query alertname="Watchdog"
Alertname Starts At Summary State
Watchdog 2024-01-12 23:12:17 UTC An alert that should always be firing to certify that Alertmanager is working properly. active
The -o or --output option with the extended value can be used to display additional details, such as the alert labels, annotations, and generator URL.
~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert query alertname="FluentdNodeDown" --output extended
Labels Annotations Starts At Ends At Generator URL State
alertname="FluentdNodeDown" container="collector" endpoint="metrics" instance="10.11.12.13:24231" job="collector" namespace="openshift-logging" openshiftCluster="foo.openshift.example.com" openshift_io_alert_source="platform" pod="collector-5zw9c" prometheus="openshift-monitoring/k8s" service="collector" severity="critical" message="Prometheus could not scrape fluentd collector for more than 10m." summary="Fluentd cannot be scraped" 2024-01-18 22:29:09 UTC 2024-01-23 05:24:09 UTC https://console-openshift-console.apps.foo.openshift.example.com/monitoring/graph?g0.expr=up%7Bcontainer%3D%22collector%22%2Cjob%3D%22collector%22%7D+%3D%3D+0+or+absent%28up%7Bcontainer%3D%22collector%22%2Cjob%3D%22collector%22%7D%29+%3D%3D+1&g0.tab=1 active
The amtool alert command with the --silenced flag can be used to list silenced alerts.
~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert --silenced
ID Matchers Ends At Created By Comment
d86e8aa8-91f3-463d-a34a-daf530682f38 alertname="FluentdNodeDown" 2024-01-20 09:29:29 UTC john.doe temporarily silencing this alert for 1 day
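A silence like the one above can be created with the amtool silence add command. As a sketch, the matcher, duration, author, and comment below are just examples; amtool should print the ID of the new silence, which can later be passed to amtool silence expire to remove the silence before it ends.
~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" silence add --author="john.doe" --duration="1d" --comment="temporarily silencing this alert for 1 day" alertname="FluentdNodeDown"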