MetalK8s Predefined Alert Rules and Alert Grouping

Context

As part of MetalK8s, we would like to provide the administrator with built-in rule expressions that can be used to fire alerts and send notifications when one of the High Level entities of the system is degraded or impacted by the degradation of a Low Level component.

As an example, we would like to notify the administrator when the MetalK8s log service is degraded because of some specific observed symptoms:

  • not all log service replicas are scheduled,

  • one of the persistent volumes claimed by a log service replica is getting full,

  • the log DB ingestion rate is near zero.

In this specific example, the goal is to invite the administrator to perform manual tasks in order to avoid a Log Service interruption in the near future.
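For illustration only, such a High Level alert could be expressed as a regular Prometheus alerting rule that fires while any of its atomic alerts is firing, for example by querying the built-in ALERTS series. The sketch below assumes this approach; the group name, the selected atomic alerts and the label matchers are illustrative, not final MetalK8s rule definitions.

    # Minimal sketch, assuming the atomic alerts below exist with these names
    # and carry a namespace label; not the final MetalK8s rules.
    groups:
      - name: logging-service.rules
        rules:
          - alert: LoggingServiceDegraded
            expr: |
              sum(
                ALERTS{alertstate="firing", namespace="metalk8s-logging",
                       alertname=~"KubeStatefulSetReplicasMismatch|KubeDaemonSetNotScheduled|KubePersistentVolumeFillingUp"}
              ) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: The logging service is degraded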

Vocabulary

Atomic Alert: An Alert which is based on existing metrics in Prometheus and which is linked to a specific symptom.

High Level Alert: An Alert which is based on other atomic alerts or High Level alerts.

Requirements

When receiving such High Level alerts, we would like the system to guide the administrator to find and understand the root cause of the alert as well as the path to resolve it. Accessing the list of observed low level symptoms will help the administrator’s investigation.

Having the High Level Alerts also helps the administrator better understand which part/layer/component of the system is currently impacted (without having to build a mental model to guess the impact of each existing atomic alert in the system).

[Figure: MetalK8s overview dashboard (../../_images/metalk8s-overview.jpg)]

A number of atomic alerts are already deployed, but we do not yet have the High Level Alerts needed to build the MetalK8s dashboard shown above. Defining the impact of each atomic alert is a way to build those High Level Alerts.

It is impossible to model all possible causes through this kind of impact tree. However, when an alert is received, the system shall suggest other alerts that may be linked to it (for example, using matching labels).
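As a sketch of what matching labels could mean in practice, every fired alert is exported by Prometheus as an ALERTS series carrying the alert's labels, so other firing alerts sharing a label value can be listed with a simple query; the node label and its placeholder value below are assumptions used only for illustration.

    # Hypothetical query to suggest alerts possibly linked to an alert fired
    # for a given node; assumes related alerts share a "node" label.
    ALERTS{alertstate="firing", node="<nodename>"}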

Also, when accessing the alert tab of the Node or Volume page, the administrator should be able to visualise all the fired alerts described under the Nodes or Volumes entities.

In the end, the way to visualise the impact of an atomic alert on the alert page is shown in the screenshot below:

[Figure: alert page with the impact of an atomic alert (../../_images/alertes.jpg)]

The High Level alerts should be easily identifiable in order to filter them out in the UI views. Indeed, in a first iteration we might want to display only the atomic alerts until all High Level alerts are implemented and deployed.
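One possible way to make them identifiable, shown here only as a sketch, is to attach a dedicated label to every High Level alert so that UI views (and Alertmanager routes) can include or exclude them; the label name below is an assumption, not an existing MetalK8s convention.

    # Hypothetical label carried only by High Level alerts.
    labels:
      severity: warning
      alert_level: high-level
    # A first-iteration UI view limited to atomic alerts could then select
    # firing alerts where this label is absent, for instance:
    #   ALERTS{alertstate="firing", alert_level=""}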

Severity Classification

  • Critical Alert = Red = Service Offline or At Risk, requires immediate intervention

  • Warning Alert = Yellow = Service Degraded, requires intervention planned within 1 week

  • No Active Alert = Green = Service Healthy

Notifications are either an email, a Slack message, or any other routing supported by Alertmanager, or a decorated icon in the UI.
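As a sketch of the notification side, the severity classification can map directly onto Alertmanager routing; the receiver names below (and the mail or Slack configuration they would carry) are placeholders.

    # Minimal Alertmanager routing sketch keyed on the severity label.
    route:
      receiver: default
      routes:
        - match:
            severity: critical
          receiver: oncall-mail
        - match:
            severity: warning
          receiver: team-slack
    receivers:
      - name: default
      - name: oncall-mail
      - name: team-slack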

Data Model

We consider that Nodes and Volumes do not impact the Platform directly; as such, they do not belong to the Platform. The data model is therefore organized around three top-level entities:

  • Volumes

  • Nodes

  • Platform

Platform

PlatformAtRisk

  Severity: Critical
  Summary: The Platform is at risk
  Parent: none
  Sub alerts:
    • PlatformServicesAtRisk (Critical)

PlatformDegraded

  Severity: Warning
  Summary: The Platform is degraded
  Parent: none
  Sub alerts:
    • PlatformServicesDegraded (Warning)
    • ControlPlaneNetworkDegraded (Warning)
    • WorkloadPlaneNetworkDegraded (Warning)

Nodes

NodeAtRisk

  Severity: Critical
  Summary: Node <nodename> is at risk
  Parent: none
  Sub alerts:
    • KubeletClientCertificateExpiration (Critical)
    • NodeRAIDDegraded (Critical)
    • SystemPartitionAtRisk (Critical)

NodeDegraded

  Severity: Warning
  Summary: Node <nodename> is degraded
  Parent: none
  Sub alerts:
    • KubeNodeNotReady (Warning)
    • KubeNodeReadinessFlapping (Warning)
    • KubeNodeUnreachable (Warning)
    • KubeletClientCertificateExpiration (Warning)
    • KubeletClientCertificateRenewalErrors (Warning)
    • KubeletPlegDurationHigh (Warning)
    • KubeletPodStartUpLatencyHigh (Warning)
    • KubeletServerCertificateExpiration (Warning)
    • KubeletServerCertificateRenewalErrors (Warning)
    • KubeletTooManyPods (Warning)
    • NodeClockNotSynchronising (Warning)
    • NodeClockSkewDetected (Warning)
    • NodeRAIDDiskFailure (Warning)
    • NodeTextFileCollectorScrapeError (Warning)
    • SystemPartitionDegraded (Warning)

Currently, no atomic alert is defined yet for the following:

  • system units (kubelet, containerd, salt-minion, ntp): this would require enriching the node exporter

  • RAM

  • CPU

System Partitions

SystemPartitionAtRisk

  Severity: Critical
  Summary: The partition <mountpoint> on node <nodename> is at risk
  Parent: NodeAtRisk
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Critical)
    • NodeFilesystemAlmostOutOfFiles (Critical)
    • NodeFilesystemFilesFillingUp (Critical)
    • NodeFilesystemSpaceFillingUp (Critical)

SystemPartitionDegraded

  Severity: Warning
  Summary: The partition <mountpoint> on node <nodename> is degraded
  Parent: NodeDegraded
  Sub alerts:
    • NodeFilesystemAlmostOutOfSpace (Warning)
    • NodeFilesystemAlmostOutOfFiles (Warning)
    • NodeFilesystemFilesFillingUp (Warning)
    • NodeFilesystemSpaceFillingUp (Warning)

Volumes

VolumeAtRisk

  Severity: Critical
  Summary: The volume <volumename> on node <nodename> is at risk
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeErrors (Warning)
    • KubePersistentVolumeFillingUp (Critical)

VolumeDegraded

  Severity: Warning
  Summary: The volume <volumename> on node <nodename> is degraded
  Parent: multiple parents
  Sub alerts:
    • KubePersistentVolumeFillingUp (Warning)

Platform Services

PlatformServicesAtRisk

  Severity: Critical
  Summary: The Platform services are at risk
  Parent: PlatformAtRisk
  Sub alerts:
    • CoreServicesAtRisk (Critical)
    • ObservabilityServicesAtRisk (Critical)

PlatformServicesDegraded

  Severity: Warning
  Summary: The Platform services are degraded
  Parent: PlatformDegraded
  Sub alerts:
    • CoreServicesDegraded (Warning)
    • ObservabilityServicesDegraded (Warning)
    • AccessServicesDegraded (Warning)

Core

CoreServicesAtRisk

  Severity: Critical
  Summary: The Core services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • K8sMasterServicesAtRisk (Critical)

CoreServicesDegraded

  Severity: Warning
  Summary: The Core services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • K8sMasterServicesDegraded (Critical)
    • BootstrapServicesDegraded (Critical)

K8sMasterServicesAtRisk

  Severity: Critical
  Summary: The Kubernetes master services are at risk
  Parent: CoreServicesAtRisk
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Critical)
    • etcdHighNumberOfFailedGRPCRequests (Critical)
    • etcdGRPCRequestsSlow (Critical)
    • etcdHighNumberOfFailedHTTPRequests (Critical)
    • etcdInsufficientMembers (Critical)
    • etcdMembersDown (Critical)
    • etcdNoLeader (Critical)
    • KubeStateMetricsListErrors (Critical)
    • KubeStateMetricsWatchErrors (Critical)
    • KubeAPIDown (Critical)
    • KubeClientCertificateExpiration (Critical)
    • KubeControllerManagerDown (Critical)
    • KubeletDown (Critical)
    • KubeSchedulerDown (Critical)

K8sMasterServicesDegraded

  Severity: Warning
  Summary: The Kubernetes master services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubeAPIErrorBudgetBurn (Warning)
    • etcdHighNumberOfFailedGRPCRequests (Warning)
    • etcdHTTPRequestsSlow (Warning)
    • etcdHighCommitDurations (Warning)
    • etcdHighFsyncDurations (Warning)
    • etcdHighNumberOfFailedHTTPRequests (Warning)
    • etcdHighNumberOfFailedProposals (Warning)
    • etcdHighNumberOfLeaderChanges (Warning)
    • etcdMemberCommunicationSlow (Warning)
    • KubeCPUOvercommit (Warning)
    • KubeCPUQuotaOvercommit (Warning)
    • KubeMemoryOvercommit (Warning)
    • KubeMemoryQuotaOvercommit (Warning)
    • KubeClientCertificateExpiration (Warning)
    • KubeClientErrors (Warning)
    • KubeVersionMismatch (Warning)
    • KubeDeploymentReplicasMismatch (Warning, filter: kube-system/coredns)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-adapter)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-operator-kube-state-metrics)

BootstrapServicesDegraded

  Severity: Warning
  Summary: The bootstrap services are degraded
  Parent: CoreServicesDegraded
  Sub alerts:
    • KubePodNotReady (Warning, filter: kube-system/repositories-<bootstrapname>)
    • KubePodNotReady (Warning, filter: kube-system/salt-master-<bootstrapname>)
    • KubeDeploymentReplicasMismatch (Warning, filter: kube-system/storage-operator)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-ui/metalk8s-ui)

Note

The name of the bootstrap node depends on how MetalK8s is deployed. We would need to automatically configure this alert during deployment. We may want to use a more deterministic filter to find the repository and salt-master pods.
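For instance, instead of matching pods by name, a rule could join on pod labels exposed by kube-state-metrics; the sketch below assumes the salt-master pod carries an app=salt-master label and that the kube_pod_labels series is available, both of which would need to be verified.

    # Hypothetical sketch: detect a salt-master pod stuck in a non-running
    # phase by label rather than by its bootstrap-dependent name.
    - alert: SaltMasterPodNotReady
      expr: |
        sum by (namespace, pod) (
          kube_pod_status_phase{namespace="kube-system", phase=~"Pending|Unknown|Failed"}
          * on (namespace, pod) group_left()
          kube_pod_labels{label_app="salt-master"}
        ) > 0
      for: 15m
      labels:
        severity: warning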

Observability

ObservabilityServicesAtRisk

  Severity: Critical
  Summary: The observability services are at risk
  Parent: PlatformServicesAtRisk
  Sub alerts:
    • MonitoringServiceAtRisk (Critical)
    • AlertingServiceAtRisk (Critical)
    • LoggingServiceAtRisk (Critical)

ObservabilityServicesDegraded

  Severity: Warning
  Summary: The observability services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • MonitoringServiceDegraded (Warning)
    • AlertingServiceDegraded (Warning)
    • DashboardingServiceDegraded (Warning)
    • LoggingServiceDegraded (Warning)

MonitoringServiceAtRisk

  Severity: Critical
  Summary: The monitoring service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • PrometheusRuleFailures (Critical)
    • PrometheusRemoteWriteBehind (Critical)
    • PrometheusRemoteStorageFailures (Critical)
    • PrometheusErrorSendingAlertsToAnyAlertmanager (Critical)
    • PrometheusBadConfig (Critical)

MonitoringServiceDegraded

  Severity: Warning
  Summary: The monitoring service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning, filter: app.kubernetes.io/name=prometheus-operator-prometheus)
    • VolumeAtRisk (Critical, filter: app.kubernetes.io/name=prometheus-operator-prometheus)
    • TargetDown (Warning, filter: to be defined)
    • PrometheusTargetLimitHit (Warning)
    • PrometheusTSDBReloadsFailing (Warning)
    • PrometheusTSDBCompactionsFailing (Warning)
    • PrometheusRemoteWriteDesiredShards (Warning)
    • PrometheusOutOfOrderTimestamps (Warning)
    • PrometheusNotificationQueueRunningFull (Warning)
    • PrometheusNotIngestingSamples (Warning)
    • PrometheusNotConnectedToAlertmanagers (Warning)
    • PrometheusMissingRuleEvaluations (Warning)
    • PrometheusErrorSendingAlertsToSomeAlertmanagers (Warning)
    • PrometheusDuplicateTimestamps (Warning)
    • PrometheusOperatorWatchErrors (Warning)
    • PrometheusOperatorSyncFailed (Warning)
    • PrometheusOperatorRejectedResources (Warning)
    • PrometheusOperatorReconcileErrors (Warning)
    • PrometheusOperatorNotReady (Warning)
    • PrometheusOperatorNodeLookupErrors (Warning)
    • PrometheusOperatorListErrors (Warning)
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-prometheus-operator-prometheus)
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-operator-operator)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-monitoring/prometheus-operator-prometheus-node-exporter)

LoggingServiceAtRisk

  Severity: Critical
  Summary: The logging service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

LoggingServiceDegraded

  Severity: Warning
  Summary: The logging service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning, filter: app.kubernetes.io/name=loki)
    • VolumeAtRisk (Critical, filter: app.kubernetes.io/name=loki)
    • TargetDown (Warning, filter: to be defined)
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-logging/loki)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-logging/fluentbit)

AlertingServiceAtRisk

  Severity: Critical
  Summary: The alerting service is at risk
  Parent: ObservabilityServicesAtRisk
  Sub alerts:
    • AlertmanagerConfigInconsistent (Critical)
    • AlertmanagerMembersInconsistent (Critical)
    • AlertmanagerFailedReload (Critical)

AlertingServiceDegraded

  Severity: Warning
  Summary: The alerting service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • VolumeDegraded (Warning, filter: app.kubernetes.io/name=prometheus-operator-alertmanager)
    • VolumeAtRisk (Critical, filter: app.kubernetes.io/name=prometheus-operator-alertmanager)
    • TargetDown (Warning, filter: to be defined)
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-monitoring/alertmanager-prometheus-operator-alertmanager)
    • AlertmanagerFailedReload (Warning)

DashboardingServiceDegraded

  Severity: Warning
  Summary: The dashboarding service is degraded
  Parent: ObservabilityServicesDegraded
  Sub alerts:
    • KubeStatefulSetReplicasMismatch (Warning, filter: metalk8s-monitoring/prometheus-operator-grafana)
    • TargetDown (Warning, filter: to be defined)

Network

ControlPlaneNetworkDegraded

  Severity: Warning
  Summary: The Control Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning, filter: need to filter on the proper cp interface)
    • NodeHighNumberConntrackEntriesUsed (Warning, filter: need to filter on the proper cp interface)
    • NodeNetworkTransmitErrs (Warning, filter: need to filter on the proper cp interface)
    • NodeNetworkInterfaceFlapping (Warning, filter: need to filter on the proper cp interface)

WorkloadPlaneNetworkDegraded

  Severity: Warning
  Summary: The Workload Plane Network is degraded
  Parent: PlatformDegraded
  Sub alerts:
    • NodeNetworkReceiveErrs (Warning, filter: need to filter on the proper wp interface)
    • NodeHighNumberConntrackEntriesUsed (Warning, filter: need to filter on the proper wp interface)
    • NodeNetworkTransmitErrs (Warning, filter: need to filter on the proper wp interface)
    • NodeNetworkInterfaceFlapping (Warning, filter: need to filter on the proper wp interface)

Note

The name of the interface used by the Workload Plane and/or Control Plane is not known in advance. As such, we should find a way to automatically configure the network alerts based on the network configuration (see the sketch below).
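As a sketch, the node-exporter network alerts carry a device label, so a grouping rule could be rendered at deployment time with the detected interface; the interface name below is a placeholder that the deployment tooling would substitute from the network configuration.

    # Hypothetical rendered rule: only aggregate network alerts firing on the
    # Control Plane interface ("eth0" is a deployment-time placeholder).
    - alert: ControlPlaneNetworkDegraded
      expr: |
        sum(
          ALERTS{alertstate="firing", device="eth0",
                 alertname=~"NodeNetworkReceiveErrs|NodeNetworkTransmitErrs|NodeNetworkInterfaceFlapping"}
        ) > 0
      labels:
        severity: warning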

Note

Currently we don't have any alerts for the Virtual Plane, which is provided by kube-proxy, calico-kube-controllers and calico-node. It is not even part of the MetalK8s dashboard page. We may want to introduce it.

Access

AccessServicesDegraded

  Severity: Warning
  Summary: The Access services are degraded
  Parent: PlatformServicesDegraded
  Sub alerts:
    • IngressControllerDegraded (Warning)
    • AuthenticationDegraded (Warning)

IngressControllerDegraded

  Severity: Warning
  Summary: The Ingress Controllers for CP and WP are degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-ingress/ingress-nginx-defaultbackend)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-system/ingress-nginx-controller)
    • KubeDaemonSetNotScheduled (Warning, filter: metalk8s-system/ingress-nginx-control-plane-controller)

AuthenticationDegraded

  Severity: Warning
  Summary: The Authentication service for the K8s API is degraded
  Parent: AccessServicesDegraded
  Sub alerts:
    • KubeDeploymentReplicasMismatch (Warning, filter: metalk8s-auth/dex)