MetalK8s automatically deploys Prometheus, Alertmanager and a set of predefined alert rules. To leverage Prometheus and Alertmanager functionalities, we need to explain in the documentation how to use them. At a later stage, these functionalities will be exposed through various administration and alerting UIs, but for now we want to give administrators enough information to use very basic alerting functionalities.
As a MetalK8s administrator, I want to list the alert rules deployed on the MetalK8s Prometheus cluster, in order to identify the specific rules I want to be alerted on.
As a MetalK8s administrator, I want to set the notification routing and receiver for a specific alert, in order to get notified when that alert fires. The important routes to support are email, Slack and PagerDuty.
As a MetalK8s administrator, I want to update thresholds for a specific alert rule, in order to adapt the alert rule to the specificities and performance of my platform.
As a MetalK8s administrator, I want to add a new alert rule, in order to monitor a specific KPI which is not monitored out of the box by MetalK8s.
As a MetalK8s administrator, I want to inhibit an alert rule, in order to skip alerts in which I am not interested.
As a MetalK8s administrator, I want to silence an alert rule for a certain amount of time, in order to skip alert notifications during a planned maintenance operation.
In all cases, when a MetalK8s administrator upgrades the cluster, all the customizations listed above must be preserved.
Alertmanager configuration documentation is available here
To be able to edit existing rules, add new ones, etc., and in order to keep these changes across restorations, upgrades and downgrades, we need to put in place some mechanisms to configure Prometheus and Alertmanager and persist these configurations.
Extra Prometheus Rules
Adding extra alert and record rules will be done by editing this ConfigMap
spec.extraRules key in
config.yaml as follows:
    - name: <rulesGroupName>
      rules:
      - alert: <AlertName>
        annotations:
          description: description of what this alert is
      - alert: <AnotherAlertName>
      - record: <recordName>
    - name: <anotherRulesGroupName>
PromQL is to be used to define
The spec.extraRules entry will be used by Salt to generate a PrometheusRule object named metalk8s-prometheus-extra-rules in the metalk8s-monitoring namespace, which will be automatically consumed by the Prometheus Operator to generate the new rules.
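The generation step described above can be sketched in Python, the language Salt modules are written in. This is a minimal illustration, not the actual Salt implementation: the function name `build_extra_rules_manifest` is hypothetical, and it assumes the extra rules are available as a list of rule groups. Any labels the Prometheus Operator may need to select the object are not shown.

```python
def build_extra_rules_manifest(rule_groups):
    """Wrap user-provided rule groups in a PrometheusRule manifest.

    `rule_groups` is assumed to be a list of Prometheus rule groups,
    each with a `name` and a list of `rules` (hypothetical input shape).
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            # Name and namespace come from the design text above.
            "name": "metalk8s-prometheus-extra-rules",
            "namespace": "metalk8s-monitoring",
        },
        "spec": {"groups": list(rule_groups)},
    }


example_groups = [
    {
        "name": "example.rules",
        "rules": [
            {
                "alert": "ExampleAlert",
                "expr": "vector(1)",
                "for": "10m",
                "labels": {"severity": "warning"},
                "annotations": {"description": "Always-firing example alert"},
            }
        ],
    }
]
manifest = build_extra_rules_manifest(example_groups)
```

Keeping the user-facing input limited to rule groups means the object name and namespace stay under the control of MetalK8s, which is what lets the configuration survive upgrades.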
CLI and UI tooling will be provided to show and edit this configuration.
Edit Existing Prometheus Alert Rules
To edit existing Prometheus rules, we can’t simply define new PrometheusRules resources: Prometheus Operator does not overwrite existing rules but appends the new ones to the list, so we would end up with two rules with the same name but different parameters.
We also can’t directly edit the PrometheusRules deployed by MetalK8s, otherwise these changes would be lost in case of cluster restoration, upgrade or downgrade.
So, in order to allow the user to customize the alert rules, we will select some of them (the most relevant ones) and expose only a few parts of their configuration (e.g. the threshold) for customization.
This also makes customizing these alert rules easier for users since, for example, they will not need to understand PromQL to adapt the threshold of an alert rule.
Since the same group name and alert rule name can appear more than once in the Prometheus rules, we also need to take the severity into account to identify which specific alert is being edited.
These customizations will be stored in the
ConfigMap with something like:
PrometheusRules object manifests
are to be templatized to consume these customizations through the CSC module.
Default values for customizable alert rules to fall back on, if not defined
in the ConfigMap, will be set in
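The fallback logic, together with the need to key each rule on group name, alert name and severity, can be sketched as follows. This is an illustration only: the function name, the `available` parameter and the threshold values are hypothetical, and the real storage format of the customizations is not shown here.

```python
def resolve_customizations(defaults, overrides):
    """Return the effective settings for each customizable alert rule.

    Both mappings are keyed by (group name, alert name, severity), since
    group and alert names alone are not unique. Overrides for rules that
    are not listed in `defaults` are ignored, because only the selected
    rules are meant to be customizable.
    """
    effective = {}
    for key, default_settings in defaults.items():
        merged = dict(default_settings)
        merged.update(overrides.get(key, {}))
        effective[key] = merged
    return effective


# Hypothetical defaults for two rules sharing a group and alert name.
defaults = {
    ("node-exporter", "NodeFilesystemAlmostOutOfSpace", "warning"): {"available": 5},
    ("node-exporter", "NodeFilesystemAlmostOutOfSpace", "critical"): {"available": 3},
}
# Hypothetical user customization read from the ConfigMap.
overrides = {
    ("node-exporter", "NodeFilesystemAlmostOutOfSpace", "warning"): {"available": 10},
}
effective = resolve_customizations(defaults, overrides)
```

With this input, only the `warning` variant picks up the user value, while the `critical` variant keeps its default.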
Custom Alertmanager Configuration
We will use the already existing
ConfigMap to store the Alertmanager configuration.
A Salt module will be developed to manipulate this object, so the logic can be kept in only one place.
This module must provide the necessary methods to show or edit the configuration in two different ways:
simple mode will only display and allow changing some specific configuration items, such as the receivers or the inhibit rules, in as simple a manner as possible for the user.
advanced mode will allow changing all the configuration points, exposing the whole configuration as plain YAML.
This module will then be exposed through a CLI and a UI.
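The difference between the two modes can be sketched as below. This is a hedged illustration, not the planned Salt module: the function names and the choice of `receivers` and `inhibit_rules` as the only keys exposed in simple mode are assumptions based on the examples given above, and the configuration is handled as an already-parsed dictionary.

```python
import copy

# Assumption: simple mode exposes only these top-level Alertmanager keys.
SIMPLE_MODE_KEYS = ("receivers", "inhibit_rules")


def show_config(config, mode="simple"):
    """Return the part of the Alertmanager configuration visible in `mode`."""
    if mode == "advanced":
        return copy.deepcopy(config)
    if mode != "simple":
        raise ValueError("unknown mode: {}".format(mode))
    return {
        key: copy.deepcopy(config[key])
        for key in SIMPLE_MODE_KEYS
        if key in config
    }


def edit_config(config, changes, mode="simple"):
    """Return a new configuration with `changes` applied at the top level."""
    if mode == "simple":
        forbidden = sorted(set(changes) - set(SIMPLE_MODE_KEYS))
        if forbidden:
            raise ValueError("not editable in simple mode: {}".format(forbidden))
    elif mode != "advanced":
        raise ValueError("unknown mode: {}".format(mode))
    updated = copy.deepcopy(config)
    updated.update(copy.deepcopy(changes))
    return updated


config = {
    "global": {"resolve_timeout": "5m"},
    "route": {"receiver": "default"},
    "receivers": [{"name": "default"}],
    "inhibit_rules": [],
}
simple_view = show_config(config)
```

Keeping both modes in one module, with the mode passed as a parameter, matches the goal of having the logic in only one place while the CLI and UI stay thin wrappers.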
Retrieve Alert Rules List
To retrieve the list of alert rules, we must use the Prometheus API. This can be achieved using the following route:
This API call should be done in a Salt module
which could then be wrapped in a CLI and a UI.
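As an illustration of what such a module could do, the sketch below queries Prometheus and keeps only the alerting rules. It assumes the route is the standard Prometheus `/api/v1/rules` endpoint and that the response follows the documented format of that endpoint. The function names and the `prometheus_url` parameter are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_rules(prometheus_url):
    """Fetch the raw rules payload from Prometheus (assumed /api/v1/rules)."""
    with urlopen("{}/api/v1/rules".format(prometheus_url)) as response:
        return json.load(response)


def list_alert_rules(payload):
    """Extract alerting rules from a /api/v1/rules response payload."""
    alerts = []
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            # Recording rules are reported with type "recording"; skip them.
            if rule.get("type") != "alerting":
                continue
            alerts.append(
                {
                    "group": group["name"],
                    "name": rule["name"],
                    "severity": rule.get("labels", {}).get("severity"),
                    "query": rule["query"],
                }
            )
    return alerts
```

Separating the HTTP call from the parsing keeps the parsing step easy to test without a running Prometheus instance.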
Silence an Alert
To silence an alert, we need to send a query to the Alertmanager API. This can be done using the following route:
With a POST query content formatted as below:
"comment": "Maintenance is planned",
We must also be able to retrieve the existing silences and to remove a silence. This will be done through the API, on the same route, using the GET and DELETE methods respectively:
# GET - to list all silences
# DELETE - to delete a specific silence
We will need to provide these functionalities through a Salt module
metalk8s_monitoring which could then be wrapped in a CLI and a UI.
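The payload handling such a module would need can be sketched as follows. This is a minimal sketch under stated assumptions: the field names (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`, `status.state`) are those of silence objects in the Alertmanager v2 API, which may not be the API version targeted here, and the function names are hypothetical. No HTTP call is made.

```python
from datetime import datetime, timedelta, timezone


def build_silence(alert_name, duration, author, comment, start=None):
    """Build the body of a silence creation request for one alert name."""
    start = start or datetime.now(timezone.utc)
    end = start + duration
    return {
        # Match on the `alertname` label, which Prometheus sets on alerts.
        "matchers": [{"name": "alertname", "value": alert_name, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


def active_silences(silences):
    """Keep only the active entries of a silence listing response."""
    return [
        {"id": s["id"], "comment": s.get("comment", ""), "endsAt": s["endsAt"]}
        for s in silences
        if s.get("status", {}).get("state") == "active"
    ]


body = build_silence(
    "ExampleAlert",
    timedelta(hours=1),
    author="admin",
    comment="Maintenance is planned",
    start=datetime(2020, 1, 1, tzinfo=timezone.utc),
)
```

The identifier returned when listing silences is what a removal request would then reference.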
Extract Rules Tooling
We need to build a tool to extract all alert rules from the Prometheus
Operator rendered chart
Its purpose will be to generate a file (each time this chart is updated) which will then be used to check that what is deployed matches what is expected.
This will also let us see what has changed when updating the Prometheus Operator chart, and in particular whether any customizable alert rule is affected.
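A possible shape for this tool is sketched below. It is only an illustration: it assumes the rendered chart has already been parsed into a list of Kubernetes manifests (dictionaries), and the function name and output fields are hypothetical.

```python
import json


def extract_alert_rules(manifests):
    """Collect alert rules from the PrometheusRule objects in `manifests`."""
    rules = []
    for manifest in manifests:
        # Rendered charts may contain empty documents and other kinds.
        if not manifest or manifest.get("kind") != "PrometheusRule":
            continue
        for group in manifest.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                if "alert" not in rule:
                    continue  # recording rules use `record` instead
                rules.append(
                    {
                        "group": group["name"],
                        "name": rule["alert"],
                        "severity": rule.get("labels", {}).get("severity"),
                        "query": rule["expr"],
                    }
                )
    # Sort for a stable output, so that diffs between chart versions are readable.
    return sorted(
        rules, key=lambda r: (r["group"], r["name"], r["severity"] or "")
    )


def dump_rules(rules):
    """Serialize the extracted rules to the text stored in the generated file."""
    return json.dumps(rules, indent=2, sort_keys=True)
```

Because the output is sorted and deterministic, regenerating the file after a chart update produces a diff that shows exactly which rules were added, removed or changed.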
Rejected Design Choices
Managing alert silences can be done using amtool:
amtool --alertmanager.url=http://localhost:9093 silence add \
    alertname="<alert-name>" --comment 'Maintenance is planned'
amtool --alertmanager.url=http://localhost:9093 silence query
amtool --alertmanager.url=http://localhost:9093 silence expire <silence-id>
This option has been rejected because it requires either installing an extra dependency (the amtool binary) or running the commands inside the Alertmanager container, rather than simply sending HTTP requests to the API.
Add an internal tool to list all Prometheus alert rules from rendered chart
Implement Salt formulas to handle configuration customization (
Provide CLI and UI to wrap the Salt calls
Customization of node-exporter alert group thresholds
Document how to:
Retrieve the list of alert rules
Add a new alert rule
Edit an existing alert rule
Configure notifications (email, Slack and PagerDuty)
Silence an alert
Deactivate an alert
simple mode in Salt formulas
simple mode to both CLI and UI
Update the documentation with the
In the Operational Guide:
Document how to manage silence on alerts (list, create & delete)
Document how to manage alert rules (list, create, edit)
Document how to configure Alertmanager notifications
Document how to deactivate an alert
Add a list of alert rules configured in Prometheus, with a brief explanation for each and what can be customized
Add a new test scenario using the pytest-bdd framework to ensure the correct behavior of this feature. These tests must be run in the post-merge step of the CI and must cover:
Configuration of a receiver in Alertmanager
Configuration of inhibit rules in Alertmanager
Add a new alert rule in Prometheus
Customize an existing alert rule in Prometheus
Alert silences management (add, list and delete)
Deployed Prometheus alert rules must match what is expected from a given list (generated by the tool described in Extract Rules Tooling)
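The last check in this list can be sketched as a simple comparison between two rule lists. This is an assumption-laden illustration: the function name is hypothetical, and both lists are assumed to contain dictionaries with `group`, `name` and `severity` fields.

```python
def diff_rules(expected, deployed):
    """Compare expected and deployed alert rules.

    Rules are identified by (group, name, severity). Returns the
    identifiers missing from the deployment and the unexpected ones.
    """

    def key(rule):
        # Use an empty string when no severity is set so keys stay sortable.
        return (rule["group"], rule["name"], rule.get("severity") or "")

    expected_keys = {key(rule) for rule in expected}
    deployed_keys = {key(rule) for rule in deployed}
    return {
        "missing": sorted(expected_keys - deployed_keys),
        "unexpected": sorted(deployed_keys - expected_keys),
    }
```

A test would then assert that both lists in the result are empty, and report their content on failure.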