OpenShift - List Alert Manager Alerts using amtool


If you are not familiar with the oc command, refer to OpenShift - Getting Started with the oc command.

Alert Manager, as the name implies, receives alerts from a client (such as Prometheus) and routes them to targets such as an SMTP email server or OpsGenie. Alert Manager can also silence alerts for a period of time.
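
In OpenShift, the Alertmanager configuration is typically stored in the alertmanager-main secret in the openshift-monitoring namespace. As a rough sketch, a minimal configuration that routes alerts to an SMTP email receiver would look something like this (the receiver name, hostnames, and addresses below are placeholders):

route:
  receiver: team-email
  group_by: ['alertname']
receivers:
- name: team-email
  email_configs:
  - to: oncall@example.com
    from: alertmanager@example.com
    smarthost: smtp.example.com:587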

In the openshift-monitoring namespace, there should be one or more PrometheusRule resources.

~]$ oc get PrometheusRules --namespace openshift-monitoring
NAME                                           AGE
alertmanager-main-rules                        606d
cluster-monitoring-operator-prometheus-rules   606d
kube-state-metrics-rules                       606d
kubernetes-monitoring-rules                    606d
node-exporter-rules                            606d
prometheus-k8s-prometheus-rules                606d
prometheus-k8s-thanos-sidecar-rules            606d
prometheus-operator-rules                      606d
telemetry                                      606d
thanos-querier                                 606d

 

Each PrometheusRule should contain one or more rules. In this example, alertmanager-main-rules contains rules such as:

  • AlertmanagerFailedReload
  • AlertmanagerFailedToSendAlerts
  • AlertmanagerClusterFailedToSendAlerts
  • AlertmanagerConfigInconsistent
  • AlertmanagerClusterDown
~]$ oc get PrometheusRule alertmanager-main-rules --namespace openshift-monitoring --output yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2023-07-07T15:43:12Z"
  generation: 2
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.26.0
    prometheus: k8s
    role: alert-rules
  name: alertmanager-main-rules
  namespace: openshift-monitoring
  resourceVersion: "433329758"
  uid: 1b98ab31-7439-4f52-9f48-c04a696979c3
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: AlertmanagerFailedReload
      annotations:
        description: Configuration has failed to load for {{ $labels.namespace }}/{{
          $labels.pod}}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedReload.md
        summary: Reloading an Alertmanager configuration has failed.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(alertmanager_config_last_reload_successful{job=~"alertmanager-main|alertmanager-user-workload"}[5m]) == 0
      for: 10m
      labels:
        severity: critical
    - alert: AlertmanagerMembersInconsistent
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} has only
          found {{ $value }} members of the {{$labels.job}} cluster.
        summary: A member of an Alertmanager cluster has not found all other cluster
          members.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
          max_over_time(alertmanager_cluster_members{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        < on (namespace,service) group_left
          count by (namespace,service) (max_over_time(alertmanager_cluster_members{job=~"alertmanager-main|alertmanager-user-workload"}[5m]))
      for: 15m
      labels:
        severity: warning
    - alert: AlertmanagerFailedToSendAlerts
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed
          to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
          }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
        summary: An Alertmanager instance failed to send notifications.
      expr: |
        (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
    - alert: AlertmanagerClusterFailedToSendAlerts
      annotations:
        description: The minimum notification failure rate to {{ $labels.integration
          }} sent from any instance in the {{$labels.job}} cluster is {{ $value |
          humanizePercentage }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerClusterFailedToSendAlerts.md
        summary: All Alertmanager instances in a cluster failed to send notifications
          to a critical integration.
      expr: |
        min by (namespace,service, integration) (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[5m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[5m])
        )
        > 0.01
      for: 5m
      labels:
        severity: warning
    - alert: AlertmanagerConfigInconsistent
      annotations:
        description: Alertmanager instances within the {{$labels.job}} cluster have
          different configurations.
        summary: Alertmanager instances within the same cluster have different configurations.
      expr: |
        count by (namespace,service) (
          count_values by (namespace,service) ("config_hash", alertmanager_config_hash{job=~"alertmanager-main|alertmanager-user-workload"})
        )
        != 1
      for: 20m
      labels:
        severity: warning
    - alert: AlertmanagerClusterDown
      annotations:
        description: '{{ $value | humanizePercentage }} of Alertmanager instances
          within the {{$labels.job}} cluster have been up for less than half of the
          last 5m.'
        summary: Half or more of the Alertmanager instances within the same cluster
          are down.
      expr: |
        (
          count by (namespace,service) (
            avg_over_time(up{job=~"alertmanager-main|alertmanager-user-workload"}[5m]) < 0.5
          )
        /
          count by (namespace,service) (
            up{job=~"alertmanager-main|alertmanager-user-workload"}
          )
        )
        >= 0.5
      for: 5m
      labels:
        severity: warning
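
To print just the alert names defined in a PrometheusRule, a jsonpath query along these lines can be used (a sketch that assumes the spec.groups[].rules[].alert structure shown above), which should print the alert names listed in the rule:

~]$ oc get PrometheusRule alertmanager-main-rules --namespace openshift-monitoring --output jsonpath='{range .spec.groups[*].rules[*]}{.alert}{"\n"}{end}'
AlertmanagerFailedReload
AlertmanagerMembersInconsistent
AlertmanagerFailedToSendAlerts
AlertmanagerClusterFailedToSendAlerts
AlertmanagerConfigInconsistent
AlertmanagerClusterDown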

 

In the openshift-monitoring namespace, there should be one or more alertmanager pods.

~]$ oc get pods --namespace openshift-monitoring
NAME                       READY   STATUS    RESTARTS        AGE
alertmanager-main-0        6/6     Running   0               6d10h
alertmanager-main-1        6/6     Running   1 (6d10h ago)   6d10h
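
If the namespace contains many other pods, a label selector can be used to narrow the listing to just the Alertmanager pods (this assumes the pods carry the app.kubernetes.io/name=alertmanager label):

~]$ oc get pods --namespace openshift-monitoring --selector app.kubernetes.io/name=alertmanager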

 

The oc exec command can be used to run the amtool CLI in one of the alertmanager pods, and the amtool alert command lists the active alerts.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert
Alertname                           Starts At                Summary                                                                                                                         State   
Watchdog                            2024-01-12 23:12:17 UTC  An alert that should always be firing to certify that Alertmanager is working properly.                                         active  
UpdateAvailable                     2024-01-12 23:12:23 UTC  Your upstream update recommendation service recommends you update your cluster.                                                 active  
KubeJobFailed                       2024-01-12 23:13:36 UTC  Job failed to complete.                                                                                                         active  
KubeJobFailed                       2024-01-12 23:13:36 UTC  Job failed to complete.                                                                                                         active  
KubeJobFailed                       2024-01-12 23:13:36 UTC  Job failed to complete.                                                                                                         active  
KubeJobFailed                       2024-01-12 23:13:36 UTC  Job failed to complete.                                                                                                         active  
ClusterNotUpgradeable               2024-01-12 23:13:42 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.                           active  
SimpleContentAccessNotAvailable     2024-01-12 23:17:41 UTC  Simple content access certificates are not available.                                                                           active  
PodStartupStorageOperationsFailing  2024-01-12 23:17:41 UTC  Pods can't start because volume_mount of volume plugin kubernetes.io/secret is permanently failing on node stg001-infra-wzwzn.  active  
InsightsRecommendationActive        2024-01-12 23:19:11 UTC  An Insights recommendation is active for this cluster.                                                                          active  
InsightsRecommendationActive        2024-01-12 23:19:11 UTC  An Insights recommendation is active for this cluster.                                                                          active  
InsightsRecommendationActive        2024-01-12 23:19:11 UTC  An Insights recommendation is active for this cluster.                                                                          active                  
MachineWithNoRunningPhase           2024-01-16 17:51:01 UTC  machine stg001-datadog-worker-wwlbv is in phase:                            active  
MachineWithoutValidNode             2024-01-16 17:51:17 UTC  machine stg001-datadog-worker-wwlbv does not have valid node reference      active  
APIRemovedInNextReleaseInUse        2024-01-17 14:32:06 UTC  Deprecated API that will be removed in the next version is being used.      active  
APIRemovedInNextEUSReleaseInUse     2024-01-17 14:32:06 UTC  Deprecated API that will be removed in the next EUS version is being used.  active

 

The query subcommand can be used to display only the alerts matching an exact alert name.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert query alertname="Watchdog"
Alertname                           Starts At                Summary                                                                                                                         State   
Watchdog                            2024-01-12 23:12:17 UTC  An alert that should always be firing to certify that Alertmanager is working properly.                                         active
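
Matchers can also use regular expressions. As a sketch, something like the following should list every alert whose name begins with Kube (the =~ matcher syntax is the same as in Prometheus; quoting may need to be adjusted for your shell):

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert query 'alertname=~"Kube.*"'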

 

The -o or --output option with the extended value can be used to display additional columns, such as the labels, annotations, and generator URL.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" alert query alertname="FluentdNodeDown" --output extended
Labels                                                                                                                                                                                                                                                                                                                                Annotations                                                                                                     Starts At                Ends At                  Generator URL                                                                                                                                                                                                                                             State   
alertname="FluentdNodeDown" container="collector" endpoint="metrics" instance="10.11.12.13:24231" job="collector" namespace="openshift-logging" openshiftCluster="foo.openshift.example.com" openshift_io_alert_source="platform" pod="collector-5zw9c" prometheus="openshift-monitoring/k8s" service="collector" severity="critical"  message="Prometheus could not scrape fluentd collector for more than 10m." summary="Fluentd cannot be scraped"  2024-01-18 22:29:09 UTC  2024-01-23 05:24:09 UTC  https://console-openshift-console.apps.foo.openshift.example.com/monitoring/graph?g0.expr=up%7Bcontainer%3D%22collector%22%2Cjob%3D%22collector%22%7D+%3D%3D+0+or+absent%28up%7Bcontainer%3D%22collector%22%2Cjob%3D%22collector%22%7D%29+%3D%3D+1&g0.tab=1  active
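
For scripting, the --output json option prints the alerts as JSON. For example, assuming jq is installed on the workstation running oc, something like this should print just the alert names:

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" --output json alert | jq -r '.[].labels.alertname'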

 

The amtool silence query command can be used to list the silences that have been created, for example to temporarily mute an alert.

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" silence query
ID                                    Matchers                     Ends At                  Created By  Comment                                     
d86e8aa8-91f3-463d-a34a-daf530682f38  alertname="FluentdNodeDown"  2024-01-20 09:29:29 UTC  john.doe    temporarily silencing this alert for 1 day
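
To create a silence like the one shown above, the amtool silence add command can be used. As a sketch, reusing the FluentdNodeDown alert name from the earlier example:

~]$ oc exec pod/alertmanager-main-0 --namespace openshift-monitoring -- amtool --alertmanager.url="http://localhost:9093" silence add alertname="FluentdNodeDown" --author="john.doe" --comment="temporarily silencing this alert for 1 day" --duration="24h"

The command prints the ID of the new silence, which can later be passed to amtool silence expire to remove the silence before it ends.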

 



