Alerts

The MetalK8s monitoring stack can generate alerts from observed statistics and a set of rules. These alerts are then persisted to provide a historical view in the MetalK8s UI, and can be routed to custom receivers if needed (see Alertmanager Configuration Customization for how to do this).
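
As an illustration, the alerts currently firing can be inspected directly in Prometheus through its built-in ALERTS metric (this is standard Prometheus behaviour, independent of how MetalK8s persists or displays alerts):

# List every alert currently firing, with its name and severity labels.
ALERTS{alertstate="firing"}

# Count firing alerts per severity level.
count by (severity) (ALERTS{alertstate="firing"})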

Predefined Alerting Rules

There are two categories of alerting rules deployed with MetalK8s: simple and composite.

Simple alerting rules define an expression using only “standard” Prometheus metrics, while composite rules rely on the special ALERTS metric to generate alerts from the state of other alerts.

Composite rules are used to build a hierarchy of alerts, encoding the relationship between low-level components and their simple alerting rules into higher level services.
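
For example, such a composite rule boils down to a plain PromQL expression over the ALERTS metric. The sketch below is illustrative only: the child alert names are placeholders, not the actual children used by the MetalK8s rules.

# Minimal sketch of a composite expression: fire when at least one of the
# listed child alerts is currently firing. Alert names are placeholders.
sum(ALERTS{alertstate="firing", alertname=~"ChildAlertA|ChildAlertB"}) >= 1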

Hierarchy

Each alert in this hierarchy has a severity level attached, either WARNING or CRITICAL. We can represent the full hierarchy by describing the parent-child relationships of each composite alert, building a tree from each of the two root-level alerts: ClusterDegraded and ClusterAtRisk.

Composite Rules

AccessServicesDegraded

Severity

WARNING

Message

The Access services are degraded.

Relationship

ANY

Children

ClusterAtRisk

Severity

CRITICAL

Message

The cluster is at risk.

Relationship

ANY

Children

CoreServicesAtRisk

Severity

CRITICAL

Message

The Core services are at risk.

Relationship

ANY

Children

CoreServicesDegraded

Severity

WARNING

Message

The Core services are degraded.

Relationship

ANY

Children

KubernetesControlPlaneDegraded

Severity

WARNING

Message

The Kubernetes control plane is degraded.

Relationship

ANY

Children

MonitoringServiceDegraded

Severity

WARNING

Message

The monitoring service is degraded.

Relationship

ANY

Children

NodeAtRisk

Severity

CRITICAL

Message

The node {{ $labels.instance }} is at risk.

Relationship

ANY

Children

ObservabilityServicesAtRisk

Severity

CRITICAL

Message

The observability services are at risk.

Relationship

ANY

Children

PlatformServicesAtRisk

Severity

CRITICAL

Message

The Platform services are at risk.

Relationship

ANY

Children

PlatformServicesDegraded

Severity

WARNING

Message

The Platform services are degraded.

Relationship

ANY

Children

VolumeAtRisk

Severity

CRITICAL

Message

The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is at risk.

Relationship

ANY

Children

VolumeDegraded

Severity

WARNING

Message

The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is degraded.

Relationship

ANY

Children

Simple Rules

AlertmanagerClusterCrashlooping

Severity

CRITICAL

Message

Half or more of the Alertmanager instances within the same cluster are crashlooping.

Query
(count by(namespace, service) (changes(process_start_time_seconds{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[10m]) > 4) / count by(namespace, service) (up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) >= 0.5

AlertmanagerClusterDown

Severity

CRITICAL

Message

Half or more of the Alertmanager instances within the same cluster are down.

Query
(count by(namespace, service) (avg_over_time(up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) < 0.5) / count by(namespace, service) (up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) >= 0.5

AlertmanagerClusterFailedToSendAlerts

Severity

CRITICAL

Message

All Alertmanager instances in a cluster failed to send notifications to a critical integration.

Query
min by(namespace, service, integration) (rate(alertmanager_notifications_failed_total{integration=~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) / rate(alertmanager_notifications_total{integration=~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m])) > 0.01

AlertmanagerClusterFailedToSendAlerts

Severity

WARNING

Message

All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.

Query
min by(namespace, service, integration) (rate(alertmanager_notifications_failed_total{integration!~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) / rate(alertmanager_notifications_total{integration!~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m])) > 0.01

AlertmanagerConfigInconsistent

Severity

CRITICAL

Message

Alertmanager instances within the same cluster have different configurations.

Query
count by(namespace, service) (count_values by(namespace, service) ("config_hash", alertmanager_config_hash{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) != 1

AlertmanagerFailedReload

Severity

CRITICAL

Message

Reloading an Alertmanager configuration has failed.

Query
max_over_time(alertmanager_config_last_reload_successful{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) == 0

AlertmanagerFailedToSendAlerts

Severity

WARNING

Message

An Alertmanager instance failed to send notifications.

Query
(rate(alertmanager_notifications_failed_total{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) / rate(alertmanager_notifications_total{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m])) > 0.01

AlertmanagerMembersInconsistent

Severity

CRITICAL

Message

A member of an Alertmanager cluster has not found all other cluster members.

Query
max_over_time(alertmanager_cluster_members{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) < on(namespace, service) group_left() count by(namespace, service) (max_over_time(alertmanager_cluster_members{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]))

ConfigReloaderSidecarErrors

Severity

WARNING

Message

config-reloader sidecar has not had a successful reload for 10m.

Query
max_over_time(reloader_last_reload_successful{namespace=~".+"}[5m]) == 0

CPUThrottlingHigh

Severity

INFO

Message

Processes experience elevated CPU throttling.

Query
sum by(container, pod, namespace) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by(container, pod, namespace) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100)

etcdGRPCRequestsSlow

Severity

CRITICAL

Message

etcd cluster “{{ $labels.job }}”: gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}.

Query
histogram_quantile(0.99, sum by(job, instance, grpc_service, grpc_method, le) (rate(grpc_server_handling_seconds_bucket{grpc_type="unary",job=~".*etcd.*"}[5m]))) > 0.15

etcdHighCommitDurations

Severity

WARNING

Message

etcd cluster “{{ $labels.job }}”: 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.

Query
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25

etcdHighFsyncDurations

Severity

WARNING

Message

etcd cluster “{{ $labels.job }}”: 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.

Query
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5

etcdHighNumberOfFailedGRPCRequests

Severity

CRITICAL

Message

etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.

Query
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

etcdHighNumberOfFailedGRPCRequests

Severity

WARNING

Message

etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.

Query
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1

etcdHighNumberOfFailedHTTPRequests

Severity

CRITICAL

Message

{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}.

Query
sum by(method) (rate(etcd_http_failed_total{code!="404",job=~".*etcd.*"}[5m])) / sum by(method) (rate(etcd_http_received_total{job=~".*etcd.*"}[5m])) > 0.05

etcdHighNumberOfFailedHTTPRequests

Severity

WARNING

Message

{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}.

Query
sum by(method) (rate(etcd_http_failed_total{code!="404",job=~".*etcd.*"}[5m])) / sum by(method) (rate(etcd_http_received_total{job=~".*etcd.*"}[5m])) > 0.01

etcdHighNumberOfFailedProposals

Severity

WARNING

Message

etcd cluster “{{ $labels.job }}”: {{ $value }} proposal failures within the last hour on etcd instance {{ $labels.instance }}.

Query
rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5

etcdHighNumberOfLeaderChanges

Severity

WARNING

Message

etcd cluster “{{ $labels.job }}”: instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour.

Query
rate(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) > 3

etcdHTTPRequestsSlow

Severity

WARNING

Message

etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow.

Query
histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15

etcdInsufficientMembers

Severity

CRITICAL

Message

etcd cluster “{{ $labels.job }}”: insufficient members ({{ $value }}).

Query
sum by(job) (up{job=~".*etcd.*"} == bool 1) < ((count by(job) (up{job=~".*etcd.*"}) + 1) / 2)

etcdMemberCommunicationSlow

Severity

WARNING

Message

etcd cluster “{{ $labels.job }}”: member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.

Query
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.15

etcdNoLeader

Severity

CRITICAL

Message

etcd cluster “{{ $labels.job }}”: member {{ $labels.instance }} has no leader.

Query
etcd_server_has_leader{job=~".*etcd.*"} == 0

KubeAggregatedAPIDown

Severity

WARNING

Message

Kubernetes aggregated API is down.

Query
(1 - max by(name, namespace) (avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85

KubeAggregatedAPIErrors

Severity

WARNING

Message

Kubernetes aggregated API has reported errors.

Query
sum by(name, namespace) (increase(aggregator_unavailable_apiservice_total[10m])) > 4

KubeAPIDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="apiserver"} == 1)

KubeAPIErrorBudgetBurn

Severity

CRITICAL

Message

The API server is burning too much error budget.

Query
sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01)
Additional query
sum(apiserver_request:burnrate6h) > (6 * 0.01) and sum(apiserver_request:burnrate30m) > (6 * 0.01)

KubeAPIErrorBudgetBurn

Severity

WARNING

Message

The API server is burning too much error budget.

Query
sum(apiserver_request:burnrate1d) > (3 * 0.01) and sum(apiserver_request:burnrate2h) > (3 * 0.01)
Additional query
sum(apiserver_request:burnrate3d) > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01)

KubeAPITerminatedRequests

Severity

WARNING

Message

The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.

Query
sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) / (sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))) > 0.2

KubeClientCertificateExpiration

Severity

CRITICAL

Message

Client certificate is about to expire.

Query
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400

KubeClientCertificateExpiration

Severity

WARNING

Message

Client certificate is about to expire.

Query
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800

KubeClientErrors

Severity

WARNING

Message

Kubernetes API server client is experiencing errors.

Query
(sum by(cluster, instance, job, namespace) (rate(rest_client_requests_total{code=~"5.."}[5m])) / sum by(cluster, instance, job, namespace) (rate(rest_client_requests_total[5m]))) > 0.01

KubeContainerWaiting

Severity

WARNING

Message

Pod container waiting longer than 1 hour.

Query
sum by(namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*"}) > 0

KubeControllerManagerDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="kube-controller-manager"} == 1)

KubeCPUOvercommit

Severity

WARNING

Message

Cluster has overcommitted CPU resource requests.

Query
sum(namespace_cpu:kube_pod_container_resource_requests:sum) - (sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"})) > 0 and (sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"})) > 0

KubeCPUQuotaOvercommit

Severity

WARNING

Message

Cluster has overcommitted CPU resource requests.

Query
sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) > 1.5

KubeDaemonSetMisScheduled

Severity

WARNING

Message

DaemonSet pods are misscheduled.

Query
kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} > 0

KubeDaemonSetNotScheduled

Severity

WARNING

Message

DaemonSet pods are not scheduled.

Query
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} > 0

KubeDaemonSetRolloutStuck

Severity

WARNING

Message

DaemonSet rollout is stuck.

Query
((kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} != 0) or (kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_available{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"}[5m]) == 0)

KubeDeploymentGenerationMismatch

Severity

WARNING

Message

Deployment generation mismatch due to possible roll-back.

Query
kube_deployment_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_deployment_metadata_generation{job="kube-state-metrics",namespace=~".*"}

KubeDeploymentReplicasMismatch

Severity

WARNING

Message

Deployment has not matched the expected number of replicas.

Query
(kube_deployment_spec_replicas{job="kube-state-metrics",namespace=~".*"} > kube_deployment_status_replicas_available{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[10m]) == 0)

KubeHpaMaxedOut

Severity

WARNING

Message

HPA is running at max replicas.

Query
kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} == kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}

KubeHpaReplicasMismatch

Severity

WARNING

Message

HPA has not matched the desired number of replicas.

Query
(kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics",namespace=~".*"} != kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} > kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} < kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}) and changes(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"}[15m]) == 0

KubeJobFailed

Severity

WARNING

Message

Job failed to complete.

Query
kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0

KubeJobNotCompleted

Severity

WARNING

Message

Job did not complete in time.

Query
time() - max by(namespace, job_name) (kube_job_status_start_time{job="kube-state-metrics",namespace=~".*"} and kube_job_status_active{job="kube-state-metrics",namespace=~".*"} > 0) > 43200

KubeletClientCertificateExpiration

Severity

CRITICAL

Message

Kubelet client certificate is about to expire.

Query
kubelet_certificate_manager_client_ttl_seconds < 86400

KubeletClientCertificateExpiration

Severity

WARNING

Message

Kubelet client certificate is about to expire.

Query
kubelet_certificate_manager_client_ttl_seconds < 604800

KubeletClientCertificateRenewalErrors

Severity

WARNING

Message

Kubelet has failed to renew its client certificate.

Query
increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0

KubeletDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="kubelet",metrics_path="/metrics"} == 1)

KubeletPlegDurationHigh

Severity

WARNING

Message

Kubelet Pod Lifecycle Event Generator is taking too long to relist.

Query
node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10

KubeletPodStartUpLatencyHigh

Severity

WARNING

Message

Kubelet Pod startup latency is too high.

Query
histogram_quantile(0.99, sum by(cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",metrics_path="/metrics"}[5m]))) * on(cluster, instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"} > 60

KubeletServerCertificateExpiration

Severity

CRITICAL

Message

Kubelet server certificate is about to expire.

Query
kubelet_certificate_manager_server_ttl_seconds < 86400

KubeletServerCertificateExpiration

Severity

WARNING

Message

Kubelet server certificate is about to expire.

Query
kubelet_certificate_manager_server_ttl_seconds < 604800

KubeletServerCertificateRenewalErrors

Severity

WARNING

Message

Kubelet has failed to renew its server certificate.

Query
increase(kubelet_server_expiration_renew_errors[5m]) > 0

KubeletTooManyPods

Severity

INFO

Message

Kubelet is running at capacity.

Query
count by(cluster, node) ((kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on(instance, pod, namespace, cluster) group_left(node) topk by(instance, pod, namespace, cluster) (1, kube_pod_info{job="kube-state-metrics"})) / max by(cluster, node) (kube_node_status_capacity{job="kube-state-metrics",resource="pods"} != 1) > 0.95

KubeMemoryOvercommit

Severity

WARNING

Message

Cluster has overcommitted memory resource requests.

Query
sum(namespace_memory:kube_pod_container_resource_requests:sum) - (sum(kube_node_status_allocatable{resource="memory"}) - max(kube_node_status_allocatable{resource="memory"})) > 0 and (sum(kube_node_status_allocatable{resource="memory"}) - max(kube_node_status_allocatable{resource="memory"})) > 0

KubeMemoryQuotaOvercommit

Severity

WARNING

Message

Cluster has overcommitted memory resource requests.

Query
sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5

KubeNodeNotReady

Severity

WARNING

Message

Node is not ready.

Query
kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0

KubeNodeReadinessFlapping

Severity

WARNING

Message

Node readiness status is flapping.

Query
sum by(cluster, node) (changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) > 2

KubeNodeUnreachable

Severity

WARNING

Message

Node is unreachable.

Query
(kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless ignoring(key, value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1

KubePersistentVolumeErrors

Severity

CRITICAL

Message

PersistentVolume is having issues with provisioning.

Query
kube_persistentvolume_status_phase{job="kube-state-metrics",phase=~"Failed|Pending"} > 0

KubePersistentVolumeFillingUp

Severity

CRITICAL

Message

PersistentVolume is filling up.

Query
(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeFillingUp

Severity

WARNING

Message

PersistentVolume is filling up.

Query
(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and kubelet_volume_stats_used_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

Severity

CRITICAL

Message

PersistentVolumeInodes are filling up.

Query
(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_inodes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.03 and kubelet_volume_stats_inodes_used{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

Severity

WARNING

Message

PersistentVolumeInodes are filling up.

Query
(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_inodes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and kubelet_volume_stats_inodes_used{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 and predict_linear(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePodCrashLooping

Severity

WARNING

Message

Pod is crash looping.

Query
max_over_time(kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*",reason="CrashLoopBackOff"}[5m]) >= 1

KubePodNotReady

Severity

WARNING

Message

Pod has been in a non-ready state for more than 15 minutes.

Query
sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0

KubeProxyDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="kube-proxy"} == 1)

KubeQuotaAlmostFull

Severity

INFO

Message

Namespace quota is going to be full.

Query
kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 0.9 < 1

KubeQuotaExceeded

Severity

WARNING

Message

Namespace quota has exceeded the limits.

Query
kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 1

KubeQuotaFullyUsed

Severity

INFO

Message

Namespace quota is fully used.

Query
kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) == 1

KubeSchedulerDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="kube-scheduler"} == 1)

KubeStatefulSetGenerationMismatch

Severity

WARNING

Message

StatefulSet generation mismatch due to possible roll-back.

Query
kube_statefulset_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_metadata_generation{job="kube-state-metrics",namespace=~".*"}

KubeStatefulSetReplicasMismatch

Severity

WARNING

Message

StatefulSet has not matched the expected number of replicas.

Query
(kube_statefulset_status_replicas_ready{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[10m]) == 0)

KubeStatefulSetUpdateNotRolledOut

Severity

WARNING

Message

StatefulSet update has not been rolled out.

Query
(max without(revision) (kube_statefulset_status_current_revision{job="kube-state-metrics",namespace=~".*"} unless kube_statefulset_status_update_revision{job="kube-state-metrics",namespace=~".*"}) * (kube_statefulset_replicas{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0)

KubeStateMetricsListErrors

Severity

CRITICAL

Message

kube-state-metrics is experiencing errors in list operations.

Query
(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01

KubeStateMetricsShardingMismatch

Severity

CRITICAL

Message

kube-state-metrics sharding is misconfigured.

Query
stdvar(kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0

KubeStateMetricsShardsMissing

Severity

CRITICAL

Message

kube-state-metrics shards are missing.

Query
2 ^ max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1 - sum(2 ^ max by(shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"})) != 0

KubeStateMetricsWatchErrors

Severity

CRITICAL

Message

kube-state-metrics is experiencing errors in watch operations.

Query
(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01

KubeVersionMismatch

Severity

WARNING

Message

Different semantic versions of Kubernetes components running.

Query
count(count by(git_version) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"}, "git_version", "$1", "git_version", "(v[0-9]*.[0-9]*).*"))) > 1

NodeClockNotSynchronising

Severity

WARNING

Message

Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.

Query
min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16

NodeClockSkewDetected

Severity

WARNING

Message

Clock on {{ $labels.instance }} is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host.

Query
(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)

NodeFileDescriptorLimit

Severity

CRITICAL

Message

Kernel is predicted to exhaust file descriptors limit soon.

Query
(node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90)

NodeFileDescriptorLimit

Severity

WARNING

Message

Kernel is predicted to exhaust file descriptors limit soon.

Query
(node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70)

NodeFilesystemAlmostOutOfFiles

Severity

CRITICAL

Message

Filesystem has less than 8% inodes left.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 8 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemAlmostOutOfFiles

Severity

WARNING

Message

Filesystem has less than 15% inodes left.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 15 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemAlmostOutOfSpace

Severity

CRITICAL

Message

Filesystem has less than 12% space left.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 12 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemAlmostOutOfSpace

Severity

WARNING

Message

Filesystem has less than 20% space left.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 20 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemFilesFillingUp

Severity

CRITICAL

Message

Filesystem is predicted to run out of inodes within the next 4 hours.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemFilesFillingUp

Severity

WARNING

Message

Filesystem is predicted to run out of inodes within the next 24 hours.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemSpaceFillingUp

Severity

CRITICAL

Message

Filesystem is predicted to run out of space within the next 4 hours.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemSpaceFillingUp

Severity

WARNING

Message

Filesystem is predicted to run out of space within the next 24 hours.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeHighNumberConntrackEntriesUsed

Severity

WARNING

Message

Number of conntrack entries is getting close to the limit.

Query
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75

NodeNetworkInterfaceFlapping

Severity

WARNING

Message

Network interface is often changing its status.

Query
changes(node_network_up{device!~"veth.+",job="node-exporter"}[2m]) > 2

NodeNetworkReceiveErrs

Severity

WARNING

Message

Network interface is reporting many receive errors.

Query
increase(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01

NodeNetworkTransmitErrs

Severity

WARNING

Message

Network interface is reporting many transmit errors.

Query
increase(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01

NodeRAIDDegraded

Severity

CRITICAL

Message

RAID array is degraded.

Query
node_md_disks_required - ignoring(state) (node_md_disks{state="active"}) >= 1

NodeRAIDDiskFailure

Severity

WARNING

Message

Failed device in RAID array.

Query
node_md_disks{state="failed"} >= 1

NodeTextFileCollectorScrapeError

Severity

WARNING

Message

Node Exporter text file collector failed to scrape.

Query
node_textfile_scrape_error{job="node-exporter"} == 1

PrometheusBadConfig

Severity

CRITICAL

Message

Failed Prometheus configuration reload.

Query
max_over_time(prometheus_config_last_reload_successful{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) == 0

PrometheusDuplicateTimestamps

Severity

WARNING

Message

Prometheus is dropping samples with duplicate timestamps.

Query
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusErrorSendingAlertsToAnyAlertmanager

Severity

CRITICAL

Message

Prometheus encounters more than 3% errors sending alerts to any Alertmanager.

Query
min without(alertmanager) (rate(prometheus_notifications_errors_total{alertmanager!~"",job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) / rate(prometheus_notifications_sent_total{alertmanager!~"",job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) * 100 > 3

PrometheusErrorSendingAlertsToSomeAlertmanagers

Severity

WARNING

Message

Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.

Query
(rate(prometheus_notifications_errors_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) * 100 > 1

PrometheusLabelLimitHit

Severity

WARNING

Message

Prometheus has dropped targets because some scrape configs have exceeded the labels limit.

Query
increase(prometheus_target_scrape_pool_exceeded_label_limits_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusMissingRuleEvaluations

Severity

WARNING

Message

Prometheus is missing rule evaluations due to slow rule group evaluation.

Query
increase(prometheus_rule_group_iterations_missed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusNotConnectedToAlertmanagers

Severity

WARNING

Message

Prometheus is not connected to any Alertmanagers.

Query
max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) < 1

PrometheusNotificationQueueRunningFull

Severity

WARNING

Message

Prometheus alert notification queue predicted to run full in less than 30m.

Query
(predict_linear(prometheus_notifications_queue_length{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]))

PrometheusNotIngestingSamples

Severity

WARNING

Message

Prometheus is not ingesting samples.

Query
(rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) <= 0 and (sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}) > 0 or sum without(rule_group) (prometheus_rule_group_rules{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}) > 0))

PrometheusOperatorListErrors

Severity

WARNING

Message

Errors while performing list operations in controller.

Query
(sum by(controller, namespace) (rate(prometheus_operator_list_operations_failed_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[10m])) / sum by(controller, namespace) (rate(prometheus_operator_list_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[10m]))) > 0.4

PrometheusOperatorNodeLookupErrors

Severity

WARNING

Message

Errors while reconciling Prometheus.

Query
rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]) > 0.1

PrometheusOperatorNotReady

Severity

WARNING

Message

Prometheus operator is not ready.

Query
min by(namespace, controller) (max_over_time(prometheus_operator_ready{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]) == 0)

PrometheusOperatorReconcileErrors

Severity

WARNING

Message

Errors while reconciling controller.

Query
(sum by(controller, namespace) (rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) / (sum by(controller, namespace) (rate(prometheus_operator_reconcile_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) > 0.1

PrometheusOperatorRejectedResources

Severity

WARNING

Message

Resources rejected by Prometheus operator.

Query
min_over_time(prometheus_operator_managed_resources{job="prometheus-operator-operator",namespace="metalk8s-monitoring",state="rejected"}[5m]) > 0

PrometheusOperatorSyncFailed

Severity

WARNING

Message

Last controller reconciliation failed.

Query
min_over_time(prometheus_operator_syncs{job="prometheus-operator-operator",namespace="metalk8s-monitoring",status="failed"}[5m]) > 0

PrometheusOperatorWatchErrors

Severity

WARNING

Message

Errors while performing watch operations in controller.

Query
(sum by(controller, namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m])) / sum by(controller, namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) > 0.4

PrometheusOutOfOrderTimestamps

Severity

WARNING

Message

Prometheus is dropping samples with out-of-order timestamps.

Query
rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusRemoteStorageFailures

Severity

CRITICAL

Message

Prometheus fails to send samples to remote storage.

Query
((rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) / ((rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) + (rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])))) * 100 > 1

PrometheusRemoteWriteBehind

Severity

CRITICAL

Message

Prometheus remote write is behind.

Query
(max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) - ignoring(remote_name, url) group_right() max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) > 120

PrometheusRemoteWriteDesiredShards

Severity

WARNING

Message

Prometheus remote write desired shards calculation wants to run more than configured max shards.

Query
(max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]))

PrometheusRuleFailures

Severity

CRITICAL

Message

Prometheus is failing rule evaluations.

Query
increase(prometheus_rule_evaluation_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusScrapeBodySizeLimitHit

Severity

WARNING

Message

Prometheus has dropped some targets that exceeded body size limit.

Query
increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusScrapeSampleLimitHit

Severity

WARNING

Message

Prometheus has failed scrapes that have exceeded the configured sample limit.

Query
increase(prometheus_target_scrapes_exceeded_sample_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusTargetLimitHit

Severity

WARNING

Message

Prometheus has dropped targets because some scrape configs have exceeded the targets limit.

Query
increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusTargetSyncFailure

Severity

CRITICAL

Message

Prometheus has failed to sync targets.

Query
increase(prometheus_target_sync_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[30m]) > 0

PrometheusTSDBCompactionsFailing

Severity

WARNING

Message

Prometheus has issues compacting blocks.

Query
increase(prometheus_tsdb_compactions_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[3h]) > 0

PrometheusTSDBReloadsFailing

Severity

WARNING

Message

Prometheus has issues reloading blocks from disk.

Query
increase(prometheus_tsdb_reloads_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[3h]) > 0

TargetDown

Severity

WARNING

Message

One or more targets are unreachable.

Query
100 * (count by(job, namespace, service) (up == 0) / count by(job, namespace, service) (up)) > 10

Watchdog

Severity

NONE

Message

An alert that should always be firing to certify that Alertmanager is working properly.

Query
vector(1)