
Prometheus and ELK (Elasticsearch, Logstash, Kibana) are similar tools used to gather, display, and analyze data. In Prometheus, gathering this data is often referred to as "scraping metrics". For example, both Prometheus and ELK can be used to display data about servers, virtual machines (VMs), databases, containers (e.g. Docker, OpenShift), messaging (e.g. IBM MQ, RabbitMQ), and the list goes on.
Often, Prometheus or ELK sends alerts to some other system, such as Alertmanager, and Alertmanager then routes the alerts to certain targets, such as an SMTP email server or OpsGenie.
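On OpenShift, the routing that Alertmanager uses (for example, an SMTP receiver or an OpsGenie receiver) lives in the alertmanager-main secret in the openshift-monitoring namespace. As a quick sketch of how to see what routing is currently in place, something like the following can be used to decode and print the Alertmanager configuration, which contains the route and receivers (email_configs, opsgenie_configs, and so on).
oc get secret alertmanager-main --namespace openshift-monitoring --template='{{ index .data "alertmanager.yaml" }}' | base64 --decode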
The oc get PrometheusRules command can be used to list the PrometheusRule resources in the openshift-monitoring namespace.
~]$ oc get PrometheusRules --namespace openshift-monitoring
NAME                                           AGE
alertmanager-main-rules                        692d
cluster-monitoring-operator-prometheus-rules   692d
kube-state-metrics-rules                       692d
kubernetes-monitoring-rules                    692d
node-exporter-rules                            692d
prometheus-k8s-prometheus-rules                692d
prometheus-k8s-thanos-sidecar-rules            692d
prometheus-operator-rules                      692d
telemetry                                      692d
thanos-querier                                 692d
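If user workload monitoring is enabled, PrometheusRule resources may also exist in other namespaces. In that case, something like the --all-namespaces flag can be used to list every rule on the cluster.
oc get PrometheusRules --all-namespaces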
For example, let's say you want to make a change to one of the Alertmanager rules. Let's redirect the alertmanager-main-rules PrometheusRule to a YAML file.
oc get PrometheusRule alertmanager-main-rules --namespace openshift-monitoring --output yaml > alertmanager-main-rules.yaml
And then let's make a change to the YAML file. For example, perhaps updating the AlertmanagerFailedToSendAlerts alert from 5m (five minutes) to 10m (ten minutes), both in the rate windows in the expression and in the for duration.
- alert: AlertmanagerFailedToSendAlerts
  annotations:
    description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed
      to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
      }}.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
    summary: An Alertmanager instance failed to send notifications.
  expr: |
    (
      rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
    /
      ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
    )
    > 0.01
  for: 10m
  labels:
    severity: warning
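In this rule, the expression divides the rate of failed notifications by the rate of all notifications (the ignoring (reason) group_left matching is needed because the failed counter carries an extra reason label), and the alert fires when more than 1% of notifications fail for the full for duration. As a sketch, assuming curl is available in the prometheus container (it is used later in this article), part of the expression can be evaluated directly against the Prometheus query API to see what the metric currently returns.
oc exec prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --silent --data-urlencode 'query=rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])' http://localhost:9090/api/v1/query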
Let's apply the updated YAML file to update the alertmanager-main-rules PrometheusRule.
oc apply --filename alertmanager-main-rules.yaml
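If you want to validate a change like this before applying it, something like a server-side dry run can be used; an invalid PrometheusRule should be rejected by the prometheus-operator admission webhook.
oc apply --filename alertmanager-main-rules.yaml --dry-run=server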
This should cause the prometheus-k8s-rulefiles-0 configmap to be updated to 10m (ten minutes) for AlertmanagerFailedToSendAlerts.
~]$ oc get configmap prometheus-k8s-rulefiles-0 --namespace openshift-monitoring --output yaml | grep AlertmanagerFailedToSendAlerts -A 11
    - alert: AlertmanagerFailedToSendAlerts
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed to
          send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
          }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
        summary: An Alertmanager instance failed to send notifications.
      expr: |
        (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
        )
        > 0.01
      for: 10m
      labels:
        severity: warning
The oc get pods command can be used to list the Prometheus pods, which by default run in the openshift-monitoring namespace.
~]$ oc get pods --namespace openshift-monitoring
NAME                                                     READY   STATUS    RESTARTS   AGE
prometheus-adapter-6b98c646c7-m4g76                      1/1     Running   0          8d
prometheus-adapter-6b98c646c7-tczr2                      1/1     Running   0          8d
prometheus-k8s-0                                         6/6     Running   0          11d
prometheus-k8s-1                                         6/6     Running   0          11d
prometheus-operator-6766f68555-mkfv9                     2/2     Running   0          11d
prometheus-operator-admission-webhook-8589888cbc-mq2jx   1/1     Running   0          11d
prometheus-operator-admission-webhook-8589888cbc-t62mt   1/1     Running   0          11d
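The prometheus-k8s pods run several containers. One of them is a config reloader sidecar (typically named config-reloader) that watches the mounted rule files and asks Prometheus to reload when they change. As a sketch, its logs can be reviewed to see whether a reload was triggered after the rule change.
oc logs prometheus-k8s-0 --container config-reloader --namespace openshift-monitoring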
There should be a directory /etc/prometheus/rules/prometheus-k8s-rulefiles-0 in the Prometheus pod, which is where the prometheus-k8s-rulefiles-0 configmap is mounted.
~]$ oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- ls -l /etc/prometheus/rules/
total 12
drwxrwsrwx. 3 root nobody 8192 May 12 20:30 prometheus-k8s-rulefiles-0
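The file names in this directory include a generated UID, so they will differ from cluster to cluster. Something like ls against the directory can be used to find the exact file name on your cluster before viewing it.
oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/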
And there should be a YAML file in the pod that contains AlertmanagerFailedToSendAlerts.
~]$ oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-alertmanager-main-rules-1b98ab31-7439-4f52-9f48-c04a696979c3.yaml
- name: alertmanager.rules
  rules:
  - alert: AlertmanagerFailedToSendAlerts
    annotations:
      description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed to
        send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
        }}.
      runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
      summary: An Alertmanager instance failed to send notifications.
    expr: |
      (
        rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
      /
        ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
      )
      > 0.01
    for: 5m
    labels:
      severity: warning
Notice that the file mounted in the pod above may still show the old 5m values. ConfigMap volumes are refreshed by the kubelet periodically, so it can take a minute or two for the updated rule to appear inside the pod. Once the file has been updated, Prometheus must reload its configuration to pick up the change. The oc exec and curl commands can be used to issue a POST request inside of your Prometheus pod to the /-/reload endpoint to reload the Prometheus configuration.
oc exec prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --request POST --url http://localhost:9090/-/reload
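Afterwards, as a sketch, the Prometheus rules API can be queried to confirm the running instance picked up the change. The grep here simply confirms the alert appears in the (single line of) JSON that is returned; for alerting rules the duration field is expressed in seconds, so AlertmanagerFailedToSendAlerts should now show a duration of 600 (ten minutes).
oc exec prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --silent http://localhost:9090/api/v1/rules | grep AlertmanagerFailedToSendAlerts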