Alerting Functionalities¶
Context¶
MetalK8s automatically deploys Prometheus, Alertmanager and a set of predefined alert rules. In order to leverage Prometheus and Alertmanager functionalities, we need to explain, in the documentation, how to use them. At a later stage, these functionalities will be exposed through various administration and alerting UIs, but for now we want to provide administrators with enough information to use the basic alerting functionalities.
Requirements¶
As a MetalK8s administrator, I want to list the alert rules deployed on the MetalK8s Prometheus cluster, in order to identify the specific rules I want to be alerted on.
As a MetalK8s administrator, I want to set the notification routing and receiver for a specific alert, in order to get notified when such an alert is fired. The important receivers to support are email, Slack and PagerDuty.
As a MetalK8s administrator, I want to update the thresholds of a specific alert rule, in order to adapt the alert rule to the specificities and performance of my platform.
As a MetalK8s administrator, I want to add a new alert rule, in order to monitor a specific KPI which is not monitored out of the box by MetalK8s.
As a MetalK8s administrator, I want to inhibit an alert rule, in order to skip alerts in which I am not interested.
As a MetalK8s administrator, I want to silence an alert rule for a certain amount of time, in order to skip alert notifications during a planned maintenance operation.
Warning
In all cases, when the MetalK8s administrator upgrades the cluster, all the customizations listed above must be preserved.
Note
The Alertmanager configuration documentation is available at https://prometheus.io/docs/alerting/configuration/
Design Choices¶
To be able to edit existing rules, add new ones, etc., and to keep these changes across restorations, upgrades and downgrades, we need mechanisms to configure Prometheus and Alertmanager and to persist these configurations.
For the persistence part, we will rely on what has been done for CSC (Cluster and Services Configurations), and use the already defined resources for Alertmanager and Prometheus.
Extra Prometheus Rules¶
We will use the already existing metalk8s-prometheus-config ConfigMap to store the Prometheus configuration customizations.
Adding extra alert and record rules will be done by editing this ConfigMap under the spec.extraRules key in config.yaml, as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: metalk8s-prometheus-config
  namespace: metalk8s-monitoring
data:
  config.yaml: |-
    apiVersion: addons.metalk8s.scality.com
    kind: PrometheusConfig
    spec:
      deployment:
        replicas: 1
      extraRules:
        groups:
          - name: <rulesGroupName>
            rules:
              - alert: <AlertName>
                annotations:
                  description: description of what this alert is
                expr: vector(1)
                for: 10m
                labels:
                  severity: critical
              - alert: <AnotherAlertName>
                [...]
              - record: <recordName>
                [...]
          - name: <anotherRulesGroupName>
            [...]
PromQL is to be used to define the expr field.
This spec.extraRules entry will be used to generate, through Salt, a PrometheusRule object named metalk8s-prometheus-extra-rules in the metalk8s-monitoring namespace, which will be automatically consumed by the Prometheus Operator to generate the new rules.
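For reference, here is a minimal sketch of the object Salt could render from the extraRules example above (the labels required to match the Prometheus Operator rule selector are assumed and omitted here):
# Sketch only: rule selector labels omitted
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: metalk8s-prometheus-extra-rules
  namespace: metalk8s-monitoring
spec:
  groups:
    - name: <rulesGroupName>
      rules:
        - alert: <AlertName>
          expr: vector(1)
          for: 10m
          labels:
            severity: critical
          annotations:
            description: description of what this alert is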
CLI and UI tooling will be provided to show and edit this configuration.
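Until that tooling is available, the ConfigMap can be edited directly; a minimal sketch, assuming access to a node holding the MetalK8s admin kubeconfig and that the matching Salt state is then re-applied (that step is not detailed here):
# Edit the Prometheus CSC ConfigMap in place (sketch, not the final workflow)
kubectl --kubeconfig /etc/kubernetes/admin.conf \
    edit configmap metalk8s-prometheus-config -n metalk8s-monitoring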
Edit Existing Prometheus Alert Rules¶
To edit existing Prometheus rules, we can't simply define new PrometheusRules resources, since the Prometheus Operator will not overwrite those already existing, but will rather append them to the list of rules, ending up with 2 rules with the same name but different parameters.
We also can't edit the PrometheusRules deployed by MetalK8s, otherwise we would lose these changes in case of cluster restoration, upgrade or downgrade.
So, in order to allow the user to customize the alert rules, we will pick some of them (the most relevant ones) and expose only a few parts of their configuration (e.g. thresholds) for customization.
It also makes the customization of these alert rules easier for users as, for example, they will not need to understand PromQL to adapt the threshold of an alert rule.
Since Prometheus rules can share the same group name and alert rule name, we also need to take the severity into account to identify which specific alert is being edited.
These customizations will be stored in the metalk8s-prometheus-config ConfigMap with something like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: metalk8s-prometheus-config
  namespace: metalk8s-monitoring
data:
  config.yaml: |-
    apiVersion: addons.metalk8s.scality.com
    kind: PrometheusConfig
    spec:
      deployment:
        replicas: 1
      rules:
        <alertGroupName>:
          <alertName>:
            warning:
              threshold: 30
            critical:
              threshold: 10
        <anotherAlertGroupName>:
          <anotherAlertName>:
            critical:
              threshold: 20
              anotherThreshold: 10
The PrometheusRules object manifests in salt/metalk8s/addons/prometheus-operator/deployed/chart.sls need to be templatized to consume these customizations through the CSC module.
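To make the templatization concrete, a hypothetical excerpt of what a templatized rule could look like (the metric expression and lookup variable below are illustrative assumptions, not the actual chart.sls content):
{#- Hypothetical Jinja excerpt: threshold read from CSC instead of being hardcoded #}
- alert: <AlertName>
  expr: <metric_expression> < {{ rules['<alertGroupName>']['<alertName>']['warning']['threshold'] }}
  labels:
    severity: warning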
Default values for customizable alert rules to fall back on, if not defined in the ConfigMap, will be set in salt/metalk8s/addons/prometheus-operator/config/prometheus.yaml.
Custom Alertmanager Configuration¶
We will use the already existing metalk8s-alertmanager-config ConfigMap to store the Alertmanager configuration customizations.
A Salt module will be developed to manipulate this object, so the logic can be kept in only one place.
This module must provide the necessary methods to show or edit the configuration in two different modes:
simple
advanced
The simple mode will only display and allow changing some specific configuration entries, such as the receivers or the inhibit rules, in as simple a manner as possible for the user.
The advanced mode will allow changing all the configuration points, exposing the whole configuration as plain YAML.
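For illustration, the plain YAML handled in advanced mode follows the upstream Alertmanager configuration format; a minimal sketch with a Slack receiver routed on critical alerts (receiver names, channel and webhook URL are placeholders, and email or PagerDuty receivers would follow the same pattern):
# Sketch of an Alertmanager configuration: placeholders, not MetalK8s defaults
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: slack-critical
receivers:
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/<token>
        channel: '#alerts-critical'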
This module will then be exposed through a CLI and a UI.
Retrieve Alert Rules List¶
To retrieve the list of alert rules, we must use the Prometheus API. This can be achieved using the following route:
http://<prometheus-ip>:9090/api/v1/rules
This API call should be done in a Salt module metalk8s_monitoring, which could then be wrapped in a CLI and UI.
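For example, a quick manual check from a host that can reach the Prometheus service (the IP and port are placeholders, and curl is assumed to be available):
# List all currently loaded alerting and recording rules
curl -s http://<prometheus-ip>:9090/api/v1/rules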
Silence an Alert¶
To silence an alert, we need to send a query to the Alertmanager API. This can be done using the following route:
http://<alertmanager-ip>:9093/api/v1/silences
With a POST request body formatted as below:
{
  "matchers": [
    {
      "name": "alertname",
      "value": "<alert-name>"
    }
  ],
  "startsAt": "2020-04-10T12:12:12Z",
  "endsAt": "2020-04-10T13:12:12Z",
  "createdBy": "<author>",
  "comment": "Maintenance is planned",
  "status": {
    "state": "active"
  }
}
We must also be able to retrieve silenced alerts and to remove a silence. This will be done through the same API, using GET and DELETE requests respectively:
# GET - to list all silences
http://<alertmanager-ip>:9093/api/v1/silences
# DELETE - to delete a specific silence
http://<alertmanager-ip>:9093/api/v1/silence/<silence-id>
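Putting it together, a hedged example using curl (the Alertmanager IP, the silence ID and the silence.json file holding the payload shown above are placeholders):
# Create a silence from the JSON payload shown above
curl -s -X POST -H 'Content-Type: application/json' \
    -d @silence.json http://<alertmanager-ip>:9093/api/v1/silences

# List all silences
curl -s http://<alertmanager-ip>:9093/api/v1/silences

# Expire a specific silence
curl -s -X DELETE http://<alertmanager-ip>:9093/api/v1/silence/<silence-id>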
We will need to provide these functionalities through a Salt module metalk8s_monitoring, which could then be wrapped in a CLI and UI.
Extract Rules Tooling¶
We need to build a tool to extract all alert rules from the Prometheus Operator rendered chart salt/metalk8s/addons/prometheus-operator/deployed/chart.sls.
Its purpose will be to generate a file (each time this chart is updated) which will then be used to check that what’s deployed matches what was expected.
This way, we will be able to see what has changed when updating the Prometheus Operator chart, and whether any customizable alert rule is affected.
Rejected Design Choices¶
Using amtool vs Alertmanager API¶
Managing alert silences can be done using amtool:
# Add
amtool --alertmanager.url=http://localhost:9093 silence add \
alertname="<alert-name>" --comment 'Maintenance is planned'
# List
amtool --alertmanager.url=http://localhost:9093 silence query
# Delete
amtool --alertmanager.url=http://localhost:9093 silence expire <silence-id>
This option has been rejected because it would require installing an extra dependency (the amtool binary) or running the commands inside the Alertmanager container, rather than simply sending HTTP queries to the API.
Implementation Details¶
Iteration 1¶
Add an internal tool to list all Prometheus alert rules from the rendered chart
Implement Salt formulas to handle configuration customization (advanced mode only)
Provide CLI and UI to wrap the Salt calls
Customization of node-exporter alert group thresholds
Document how to:
Retrieve the list of alert rules
Add a new alert rule
Edit an existing alert rule
Configure notifications (email, Slack and PagerDuty)
Silence an alert
Deactivate an alert
Iteration 2¶
Implement the simple mode in Salt formulas
Add the simple mode to both CLI and UI
Update the documentation with the simple mode
Documentation¶
In the Operational Guide:
Document how to manage silence on alerts (list, create & delete)
Document how to manage alert rules (list, create, edit)
Document how to configure Alertmanager notifications
Document how to deactivate an alert
Add a list of alert rules configured in Prometheus, with a brief explanation for each and what can be customized
Test Plan¶
Add a new test scenario using the pytest-bdd framework to ensure the correct behavior of this feature. These tests must be put in the post-merge step in the CI and must include:
Configuration of a receiver in Alertmanager
Configuration of inhibit rules in Alertmanager
Addition of a new alert rule in Prometheus
Customization of an existing alert rule in Prometheus
Alert silences management (add, list and delete)
Deployed Prometheus alert rules must match what's expected from a given list (generated by the tool described in Extract Rules Tooling)