
Prometheus is a system used to
- Collect metrics (e.g. memory, CPU) - This is often referred to as "scraping metrics"
- Store the metrics in its time series database and, with add-ons such as Thanos, in long-term object storage, such as an Amazon Web Services (AWS) S3 bucket
- Configure conditions that should create an alert (e.g. high CPU or high memory usage)
For example, Prometheus can be used to gather metrics from servers, virtual machines (VMs), databases, container platforms (e.g. Docker, OpenShift), messaging systems (e.g. IBM MQ, RabbitMQ), and so on. The metrics could then be stored in object storage, such as an Amazon Web Services (AWS) S3 bucket.
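Just to illustrate how these pieces fit together, here is a minimal sketch of a standalone prometheus.yml. The job name, targets, and paths are hypothetical, not taken from this article.

global:
  scrape_interval: 30s                               # how often to scrape (collect) metrics

scrape_configs:
  - job_name: node                                   # e.g. node_exporter running on a server or VM
    static_configs:
      - targets: ["server1.example.com:9100"]        # hypothetical scrape target

rule_files:
  - /etc/prometheus/rules/*.yaml                     # alerting rules (conditions that should create an alert)

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.example.com:9093"] # hypothetical Alertmanager instance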
Often, an observability tool such as Grafana or Kibana is used to provide a UI for displaying and querying the metrics, for example, to find systems with high CPU or memory usage.
Also, an alerting system such as Alertmanager is often used to create alerts when a certain condition is met, such as high CPU or high memory usage on a system. The alerting system routes alerts to targets such as an SMTP email server or OpsGenie.
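As a sketch, routing alerts to an SMTP email server and OpsGenie could look something like this in an Alertmanager alertmanager.yml. The SMTP host, email addresses, and API key here are placeholders.

route:
  receiver: email-team                   # default receiver for all alerts
  routes:
    - matchers:
        - severity="critical"
      receiver: opsgenie-team            # send critical alerts to OpsGenie instead

receivers:
  - name: email-team
    email_configs:
      - to: team@example.com
        from: alertmanager@example.com
        smarthost: smtp.example.com:587  # SMTP email server
  - name: opsgenie-team
    opsgenie_configs:
      - api_key: replace-with-your-opsgenie-api-key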
The kubectl get PrometheusRules (Kubernetes) or oc get PrometheusRules (OpenShift) command can be used to list the Prometheus Rules.
~]$ oc get PrometheusRules --namespace openshift-monitoring
NAME                                            AGE
alertmanager-main-rules                         692d
cluster-monitoring-operator-prometheus-rules    692d
kube-state-metrics-rules                        692d
kubernetes-monitoring-rules                     692d
node-exporter-rules                             692d
prometheus-k8s-prometheus-rules                 692d
prometheus-k8s-thanos-sidecar-rules             692d
prometheus-operator-rules                       692d
telemetry                                       692d
thanos-querier                                  692d
For example, let's say you want to make a change to one of the Alertmanager rules. Let's redirect the alertmanager-main-rules PrometheusRule to a YAML file.
oc get PrometheusRule alertmanager-main-rules --namespace openshift-monitoring --output yaml > alertmanager-main-rules.yaml
And then let's make a change to the YAML file, for example, updating AlertmanagerFailedToSendAlerts from 5m (five minutes) to 10m (ten minutes).
    - alert: AlertmanagerFailedToSendAlerts
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed
          to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
          }}.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
        summary: An Alertmanager instance failed to send notifications.
      expr: |
        (
          rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
        /
          ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
        )
        > 0.01
      for: 10m
      labels:
        severity: warning
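For context, a rule like this lives inside a PrometheusRule custom resource under spec.groups. Here is a trimmed sketch of the surrounding manifest, based on the resource name used in this example.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alertmanager-main-rules
  namespace: openshift-monitoring
spec:
  groups:
    - name: alertmanager.rules
      rules:
        - alert: AlertmanagerFailedToSendAlerts
          # ... annotations, expr, for, and labels as shown above ...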
Let's apply the updated YAML file to update the alertmanager-main-rules PrometheusRule.
oc apply --filename alertmanager-main-rules.yaml
This should cause the prometheus-k8s-rulefiles-0 configmap to be updated so that AlertmanagerFailedToSendAlerts has 10m (ten minutes).
~]$ oc get configmap prometheus-k8s-rulefiles-0 --namespace openshift-monitoring --output yaml | grep AlertmanagerFailedToSendAlerts -A 11
      - alert: AlertmanagerFailedToSendAlerts
        annotations:
          description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed to
            send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
            }}.
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
          summary: An Alertmanager instance failed to send notifications.
        expr: |
          (
            rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
          /
            ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[10m])
          )
          > 0.01
        for: 10m
        labels:
          severity: warning
The kubectl get pods (Kubernetes) or oc get pods (OpenShift) command can be used to list the Prometheus pods.
~]$ oc get pods --namespace openshift-monitoring
NAME                                                      READY   STATUS    RESTARTS   AGE
prometheus-adapter-6b98c646c7-m4g76                       1/1     Running   0          8d
prometheus-adapter-6b98c646c7-tczr2                       1/1     Running   0          8d
prometheus-k8s-0                                          6/6     Running   0          11d
prometheus-k8s-1                                          6/6     Running   0          11d
prometheus-operator-6766f68555-mkfv9                      2/2     Running   0          11d
prometheus-operator-admission-webhook-8589888cbc-mq2jx    1/1     Running   0          11d
prometheus-operator-admission-webhook-8589888cbc-t62mt    1/1     Running   0          11d
There should be a /etc/prometheus/rules/prometheus-k8s-rulefiles-0 directory in the Prometheus pod.
~]$ oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- ls -l /etc/prometheus/rules/
total 12
drwxrwsrwx. 3 root nobody 8192 May 12 20:30 prometheus-k8s-rulefiles-0
And there should be a YAML file in the pod that contains AlertmanagerFailedToSendAlerts. Notice that in this example the file in the pod still has the old 5m values; it can take a minute or two for the kubelet to sync the updated configmap into the pod.
~]$ oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-alertmanager-main-rules-1b98ab31-7439-4f52-9f48-c04a696979c3.yaml
- name: alertmanager.rules
  rules:
  - alert: AlertmanagerFailedToSendAlerts
    annotations:
      description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed to
        send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration
        }}.
      runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerFailedToSendAlerts.md
      summary: An Alertmanager instance failed to send notifications.
    expr: |
      (
        rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
      /
        ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload"}[5m])
      )
      > 0.01
    for: 5m
    labels:
      severity: warning
The kubectl exec (Kubernetes) or oc exec (OpenShift) and curl commands can be used to issue a POST request to the /-/reload endpoint inside of your Prometheus pod to reload the Prometheus configuration.
oc exec prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --request POST --url http://localhost:9090/-/reload
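To confirm Prometheus picked up the change, you could query the Prometheus rules API and search the output for the alert. This assumes curl is available in the container, which the reload command above already relies on.

oc exec pod/prometheus-k8s-0 --container prometheus --namespace openshift-monitoring -- curl --silent http://localhost:9090/api/v1/rules | grep AlertmanagerFailedToSendAlerts

The response is JSON; after the update, the duration for AlertmanagerFailedToSendAlerts should be 600 (seconds, which is 10 minutes).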