Alerts

The MetalK8s monitoring stack can generate alerts from observed statistics and a set of rules. These alerts are then persisted to provide a historical view in the MetalK8s UI, and can be routed to custom receivers if needed (see Alertmanager Configuration Customization for how to do this).

Predefined Alerting Rules

There are two categories of alerting rules deployed with MetalK8s: simple and composite.

Simple alerting rules define an expression only from “standard” Prometheus metrics, while composite rules rely on the special ALERTS metric to generate alerts from the state of other alerts.

Composite rules are used to build a hierarchy of alerts, encoding the relationship between low-level components and their simple alerting rules into higher-level services.
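
As a rough illustration of the ANY relationship listed for the composite rules on this page, the Python sketch below treats a composite alert as firing as soon as one of its children fires. The parent/child links in `CHILDREN` are hypothetical and chosen only for the example (this page does not list the children), and the helper `is_firing` is not part of MetalK8s, which evaluates composite rules through Prometheus expressions over the ALERTS metric.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy: parent alert -> child alerts.
# The alert names appear on this page, but these links are assumptions
# made up for the example, not the real MetalK8s hierarchy.
CHILDREN: Dict[str, List[str]] = {
    "ClusterAtRisk": ["NodeAtRisk", "CoreServicesAtRisk"],
    "CoreServicesAtRisk": ["KubeAPIDown", "etcdNoLeader"],
    "NodeAtRisk": ["KubeletDown"],
}


def is_firing(alert: str, firing_simple: Set[str]) -> bool:
    """Return True if `alert` fires, using an ANY relationship for composites."""
    children = CHILDREN.get(alert)
    if children is None:
        # Leaf of the tree: a simple rule, firing only if it was observed.
        return alert in firing_simple
    # Composite rule: fires as soon as any child fires.
    return any(is_firing(child, firing_simple) for child in children)
```

With this sketch, `is_firing("ClusterAtRisk", {"etcdNoLeader"})` is true because the firing simple alert propagates up through `CoreServicesAtRisk`, while an empty set of firing alerts leaves every composite alert inactive.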

Hierarchy

Each alert in the hierarchy has a severity level attached, either WARNING or CRITICAL. We can represent the full hierarchy by describing the parent-child relationships of each composite alert, building a tree from one of the root-level alerts:

ClusterDegraded

ClusterAtRisk

Composite Rules

AccessServicesDegraded

Severity

WARNING

Message

The Access services are degraded.

Relationship

ANY

Children

ClusterAtRisk

Severity

CRITICAL

Message

The cluster is at risk.

Relationship

ANY

Children

CoreServicesAtRisk

Severity

CRITICAL

Message

The Core services are at risk.

Relationship

ANY

Children

CoreServicesDegraded

Severity

WARNING

Message

The Core services are degraded.

Relationship

ANY

Children

KubernetesControlPlaneDegraded

Severity

WARNING

Message

The Kubernetes control plane is degraded.

Relationship

ANY

Children

LoggingServiceAtRisk

Severity

CRITICAL

Message

The logging service is at risk.

Relationship

ANY

Children

MonitoringServiceDegraded

Severity

WARNING

Message

The monitoring service is degraded.

Relationship

ANY

Children

NodeAtRisk

Severity

CRITICAL

Message

The node {{ $labels.instance }} is at risk.

Relationship

ANY

Children

ObservabilityServicesAtRisk

Severity

CRITICAL

Message

The observability services are at risk.

Relationship

ANY

Children

PlatformServicesAtRisk

Severity

CRITICAL

Message

The Platform services are at risk.

Relationship

ANY

Children

PlatformServicesDegraded

Severity

WARNING

Message

The Platform services are degraded.

Relationship

ANY

Children

VolumeAtRisk

Severity

CRITICAL

Message

The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is at risk.

Relationship

ANY

Children

VolumeDegraded

Severity

WARNING

Message

The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is degraded.

Relationship

ANY

Children

Simple Rules

AlertmanagerClusterCrashlooping

Severity

CRITICAL

Message

Half or more of the Alertmanager instances within the same cluster are crashlooping.

Query
(count by (namespace, service, cluster) (changes(process_start_time_seconds{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[10m]) > 4) / count by (namespace, service, cluster) (up{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) >= 0.5

AlertmanagerClusterDown

Severity

CRITICAL

Message

Half or more of the Alertmanager instances within the same cluster are down.

Query
(count by (namespace, service, cluster) (avg_over_time(up{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) < 0.5) / count by (namespace, service, cluster) (up{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) >= 0.5

AlertmanagerClusterFailedToSendAlerts

Severity

CRITICAL

Message

All Alertmanager instances in a cluster failed to send notifications to a critical integration.

Query
min by (namespace, service, integration) (rate(alertmanager_notifications_failed_total{container="alertmanager",integration=~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[15m]) / ignoring (reason) group_left () rate(alertmanager_notifications_total{container="alertmanager",integration=~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[15m]) > 0) > 0.01

AlertmanagerClusterFailedToSendAlerts

Severity

WARNING

Message

All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.

Query
min by (namespace, service, integration) (rate(alertmanager_notifications_failed_total{container="alertmanager",integration!~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[15m]) / ignoring (reason) group_left () rate(alertmanager_notifications_total{container="alertmanager",integration!~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[15m]) > 0) > 0.01

AlertmanagerConfigInconsistent

Severity

CRITICAL

Message

Alertmanager instances within the same cluster have different configurations.

Query
count by (namespace, service, cluster) (count_values by (namespace, service, cluster) ("config_hash", alertmanager_config_hash{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) != 1

AlertmanagerFailedReload

Severity

CRITICAL

Message

Reloading an Alertmanager configuration has failed.

Query
max_over_time(alertmanager_config_last_reload_successful{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) == 0

AlertmanagerFailedToSendAlerts

Severity

WARNING

Message

An Alertmanager instance failed to send notifications.

Query
(rate(alertmanager_notifications_failed_total{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[15m]) / ignoring (reason) group_left () rate(alertmanager_notifications_total{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[15m])) > 0.01

AlertmanagerMembersInconsistent

Severity

CRITICAL

Message

A member of an Alertmanager cluster has not found all other cluster members.

Query
max_over_time(alertmanager_cluster_members{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) < on (namespace, service, cluster) group_left () count by (namespace, service, cluster) (max_over_time(alertmanager_cluster_members{container="alertmanager",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]))

ConfigReloaderSidecarErrors

Severity

WARNING

Message

config-reloader sidecar has not had a successful reload for 10m

Query
max_over_time(reloader_last_reload_successful{namespace=~".+"}[5m]) == 0

CPUThrottlingHigh

Severity

INFO

Message

Processes experience elevated CPU throttling.

Query
sum without (id, metrics_path, name, image, endpoint, job, node) (topk by (cluster, namespace, pod, container, instance) (1, increase(container_cpu_cfs_throttled_periods_total{container!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m]))) / on (cluster, namespace, pod, container, instance) group_left () sum without (id, metrics_path, name, image, endpoint, job, node) (topk by (cluster, namespace, pod, container, instance) (1, increase(container_cpu_cfs_periods_total{job="kubelet",metrics_path="/metrics/cadvisor"}[5m]))) > (25 / 100)

etcdDatabaseQuotaLowSpace

Severity

CRITICAL

Message

etcd cluster database is running full.

Query
(last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m]) / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m])) * 100 > 95

etcdExcessiveDatabaseGrowth

Severity

WARNING

Message

etcd cluster database growing very fast.

Query
predict_linear(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[4h], 4 * 60 * 60) > etcd_server_quota_backend_bytes{job=~".*etcd.*"}
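
The query above extrapolates the database size four hours ahead with `predict_linear`, which fits a simple linear regression over the samples in the range. The following Python sketch shows the same idea under simplified assumptions: samples are `(timestamp, value)` pairs, and the function name and arguments are illustrative rather than a Prometheus API.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]],
                   eval_time: float,
                   horizon: float) -> float:
    """Least-squares extrapolation of `samples` to `eval_time + horizon`."""
    n = len(samples)
    # Express timestamps relative to the evaluation time, so that the
    # intercept is the fitted value "now".
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon


def excessive_growth(samples, eval_time, quota_bytes) -> bool:
    """Mirror the alert condition: predicted size in 4 hours exceeds the quota."""
    return predict_linear(samples, eval_time, 4 * 60 * 60) > quota_bytes
```

For a database growing steadily by 1000 bytes per second, the prediction four hours ahead is the current size plus 14,400,000 bytes, and the alert condition holds once that value exceeds the backend quota.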

etcdGRPCRequestsSlow

Severity

CRITICAL

Message

etcd grpc requests are slow

Query
histogram_quantile(0.99, sum without (grpc_type) (rate(grpc_server_handling_seconds_bucket{grpc_method!="Defragment",grpc_type="unary",job=~".*etcd.*"}[5m]))) > 0.15

etcdHighCommitDurations

Severity

WARNING

Message

etcd cluster 99th percentile commit durations are too high.

Query
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25

etcdHighFsyncDurations

Severity

CRITICAL

Message

etcd cluster 99th percentile fsync durations are too high.

Query
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 1

etcdHighFsyncDurations

Severity

WARNING

Message

etcd cluster 99th percentile fsync durations are too high.

Query
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5

etcdHighNumberOfFailedGRPCRequests

Severity

CRITICAL

Message

etcd cluster has high number of failed grpc requests.

Query
100 * sum without (grpc_type, grpc_code) (rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job=~".*etcd.*"}[5m])) / sum without (grpc_type, grpc_code) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

etcdHighNumberOfFailedGRPCRequests

Severity

WARNING

Message

etcd cluster has high number of failed grpc requests.

Query
100 * sum without (grpc_type, grpc_code) (rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",job=~".*etcd.*"}[5m])) / sum without (grpc_type, grpc_code) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1

etcdHighNumberOfFailedProposals

Severity

WARNING

Message

etcd cluster has high number of proposal failures.

Query
rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5

etcdHighNumberOfLeaderChanges

Severity

WARNING

Message

etcd cluster has high number of leader changes.

Query
increase((max without (instance, pod) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0 * absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4

etcdInsufficientMembers

Severity

CRITICAL

Message

etcd cluster has insufficient number of members.

Query
sum without (instance, pod) (up{job=~".*etcd.*"} == bool 1) < ((count without (instance, pod) (up{job=~".*etcd.*"}) + 1) / 2)
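
The comparison in this query is a quorum check: the alert condition holds when the number of members that are up is lower than `(members + 1) / 2`. A minimal Python sketch of that arithmetic, with a hypothetical helper name:

```python
def insufficient_members(up_members: int, total_members: int) -> bool:
    """Mirror `sum(up == bool 1) < (count(up) + 1) / 2` from the query."""
    return up_members < (total_members + 1) / 2
```

For a three-member cluster the threshold is 2, so the condition holds with a single member up. For four members the threshold is 2.5, so two members up are still considered insufficient.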

etcdMemberCommunicationSlow

Severity

WARNING

Message

etcd cluster member communication is slow.

Query
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.15

etcdMembersDown

Severity

WARNING

Message

etcd cluster members are down.

Query
max without (endpoint) (sum without (instance, pod) (up{job=~".*etcd.*"} == bool 0) or count without (To) (sum without (instance, pod) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[2m])) > 0.01)) > 0

etcdNoLeader

Severity

CRITICAL

Message

etcd cluster has no leader.

Query
etcd_server_has_leader{job=~".*etcd.*"} == 0

FluentBitBackPressure

Severity

WARNING

Message

Fluent Bit pod {{ $labels.pod }} input {{ $labels.name }} has been paused due to back pressure for more than 5 minutes. Log ingestion is halted until the output pipeline catches up.

Query
fluentbit_input_ingestion_paused == 1 or fluentbit_input_storage_overlimit == 1

FluentBitOutputRetryLimit

Severity

CRITICAL

Message

Fluent Bit pod {{ $labels.pod }} output {{ $labels.name }} has been failing retries and dropping log records for more than 15 minutes. Check the destination and network connectivity.

Query
rate(fluentbit_output_retries_failed_total[5m]) > 0

KubeAggregatedAPIDown

Severity

WARNING

Message

Kubernetes aggregated API is down.

Query
(1 - max by (name, namespace, cluster) (avg_over_time(aggregator_unavailable_apiservice{job="apiserver"}[10m]))) * 100 < 85

KubeAggregatedAPIErrors

Severity

WARNING

Message

Kubernetes aggregated API has reported errors.

Query
sum by (cluster, instance, name, reason) (increase(aggregator_unavailable_apiservice_total{job="apiserver"}[1m])) > 0

KubeAPIDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="apiserver"})

KubeAPIErrorBudgetBurn

Severity

CRITICAL

Message

The API server is burning too much error budget.

Query
sum by (cluster) (apiserver_request:burnrate1h) > (14.4 * 0.01) and on (cluster) sum by (cluster) (apiserver_request:burnrate5m) > (14.4 * 0.01)
Additional query
sum by (cluster) (apiserver_request:burnrate6h) > (6 * 0.01) and on (cluster) sum by (cluster) (apiserver_request:burnrate30m) > (6 * 0.01)

KubeAPIErrorBudgetBurn

Severity

WARNING

Message

The API server is burning too much error budget.

Query
sum by (cluster) (apiserver_request:burnrate1d) > (3 * 0.01) and on (cluster) sum by (cluster) (apiserver_request:burnrate2h) > (3 * 0.01)
Additional query
sum by (cluster) (apiserver_request:burnrate3d) > (1 * 0.01) and on (cluster) sum by (cluster) (apiserver_request:burnrate6h) > (1 * 0.01)
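
Both KubeAPIErrorBudgetBurn entries pair a long window with a short one and compare each burn rate against `factor * 0.01`. The Python sketch below restates those thresholds. Reading `0.01` as the error budget of a 99% availability target is an assumption about how the `apiserver_request:burnrate*` recording rules are defined, and the names `WINDOWS` and `fires` are illustrative.

```python
# Assumed error budget (1 - availability target), matching the `* 0.01`
# factor that appears in the queries.
ERROR_BUDGET = 0.01

# (long window, short window, burn-rate factor, severity), read from the queries.
WINDOWS = [
    ("1h", "5m", 14.4, "CRITICAL"),
    ("6h", "30m", 6, "CRITICAL"),
    ("1d", "2h", 3, "WARNING"),
    ("3d", "6h", 1, "WARNING"),
]


def fires(long_rate: float, short_rate: float, factor: float) -> bool:
    """Both windows must exceed the threshold for the condition to hold."""
    threshold = factor * ERROR_BUDGET
    return long_rate > threshold and short_rate > threshold
```

Requiring the short window as well means the condition stops holding soon after the error rate recovers, even if the long window is still above its threshold.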

KubeAPITerminatedRequests

Severity

WARNING

Message

The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.

Query
sum by (cluster) (rate(apiserver_request_terminations_total{job="apiserver"}[10m])) / (sum by (cluster) (rate(apiserver_request_total{job="apiserver"}[10m])) + sum by (cluster) (rate(apiserver_request_terminations_total{job="apiserver"}[10m]))) > 0.2

KubeClientCertificateExpiration

Severity

CRITICAL

Message

Client certificate is about to expire.

Query
histogram_quantile(0.01, sum without (namespace, service, endpoint) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400 and on (job, cluster, instance) apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0

KubeClientCertificateExpiration

Severity

WARNING

Message

Client certificate is about to expire.

Query
histogram_quantile(0.01, sum without (namespace, service, endpoint) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 and on (job, cluster, instance) apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0

KubeClientErrors

Severity

WARNING

Message

Kubernetes API server client is experiencing errors.

Query
(sum by (cluster, instance, job, namespace) (rate(rest_client_requests_total{code=~"5..",job="apiserver"}[5m])) / sum by (cluster, instance, job, namespace) (rate(rest_client_requests_total{job="apiserver"}[5m]))) > 0.01

KubeContainerWaiting

Severity

WARNING

Message

Pod container waiting longer than 1 hour

Query
kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*",reason!="CrashLoopBackOff"} > 0

KubeControllerManagerDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="kube-controller-manager"})

KubeCPUOvercommit

Severity

WARNING

Message

Cluster has overcommitted CPU resource requests.

Query
((sum by (cluster) (namespace_cpu:kube_pod_container_resource_requests:sum) - sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) > 0) and count by (cluster) (max by (cluster, node) (kube_node_role{job="kube-state-metrics",role="control-plane"})) < 3) or (sum by (cluster) (namespace_cpu:kube_pod_container_resource_requests:sum) - ((sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})) > 0) > 0)

KubeCPUQuotaOvercommit

Severity

WARNING

Message

Cluster has overcommitted CPU resource requests.

Query
sum by (cluster) (min without (resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(cpu|requests.cpu)",type="hard"})) / sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) > 1.5

KubeCronJobOwnedJobFailed

Severity

WARNING

Message

Job owned by CronJob failed to complete.

Query
kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0 and on (job_name, namespace) (topk by (owner_name, namespace) (1, kube_job_created{job="kube-state-metrics"} * on (job_name, namespace) group_left (owner_name, owner_kind) kube_job_owner{job="kube-state-metrics",owner_kind="CronJob"}))

KubeDaemonSetMisScheduled

Severity

WARNING

Message

DaemonSet pods are misscheduled.

Query
kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} > 0

KubeDaemonSetNotScheduled

Severity

WARNING

Message

DaemonSet pods are not scheduled.

Query
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} > 0

KubeDaemonSetRolloutStuck

Severity

WARNING

Message

DaemonSet rollout is stuck.

Query
((kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} != 0) or (kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_available{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"}[5m]) == 0)

KubeDeploymentGenerationMismatch

Severity

WARNING

Message

Deployment generation mismatch due to possible roll-back

Query
kube_deployment_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_deployment_metadata_generation{job="kube-state-metrics",namespace=~".*"}

KubeDeploymentReplicasMismatch

Severity

WARNING

Message

Deployment has not matched the expected number of replicas.

Query
(kube_deployment_spec_replicas{job="kube-state-metrics",namespace=~".*"} > kube_deployment_status_replicas_available{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[10m]) == 0)

KubeDeploymentRolloutStuck

Severity

WARNING

Message

Deployment rollout is not progressing.

Query
kube_deployment_status_condition{condition="Progressing",job="kube-state-metrics",namespace=~".*",status="false"} != 0

KubeHpaMaxedOut

Severity

WARNING

Message

HPA is running at max replicas

Query
(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} == kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}) and on (namespace, horizontalpodautoscaler) (kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics",namespace=~".*"} != kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics",namespace=~".*"})

KubeHpaReplicasMismatch

Severity

WARNING

Message

HPA has not matched desired number of replicas.

Query
(kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics",namespace=~".*"} != kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} > kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} < kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}) and changes(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"}[15m]) == 0

KubeJobFailed

Severity

WARNING

Message

Job failed to complete.

Query
kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0 unless on (job_name, namespace) kube_job_owner{job="kube-state-metrics",owner_kind="CronJob"}

KubeJobNotCompleted

Severity

WARNING

Message

Job did not complete in time

Query
time() - max by (namespace, job_name, cluster) (kube_job_status_start_time{job="kube-state-metrics",namespace=~".*"} and kube_job_status_active{job="kube-state-metrics",namespace=~".*"} > 0) > (24 * 60 * 60)

KubeletClientCertificateExpiration

Severity

CRITICAL

Message

Kubelet client certificate is about to expire.

Query
kubelet_certificate_manager_client_ttl_seconds < 86400

KubeletClientCertificateExpiration

Severity

WARNING

Message

Kubelet client certificate is about to expire.

Query
kubelet_certificate_manager_client_ttl_seconds < 604800

KubeletClientCertificateRenewalErrors

Severity

WARNING

Message

Kubelet has failed to renew its client certificate.

Query
increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0

KubeletDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
count by (cluster) (kube_node_info{job="kube-state-metrics"}) unless on (cluster) count by (cluster) (up{job="kubelet",metrics_path="/metrics"} == 1)

KubeletPlegDurationHigh

Severity

WARNING

Message

Kubelet Pod Lifecycle Event Generator is taking too long to relist.

Query
node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10

KubeletPodStartUpLatencyHigh

Severity

WARNING

Message

Kubelet Pod startup latency is too high.

Query
histogram_quantile(0.99, sum by (cluster, instance, le) (topk by (cluster, instance, le, operation_type) (1, rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",metrics_path="/metrics"}[5m])))) * on (cluster, instance) group_left (node) topk by (cluster, instance, node) (1, kubelet_node_name{job="kubelet",metrics_path="/metrics"}) > 60

KubeletServerCertificateExpiration

Severity

CRITICAL

Message

Kubelet server certificate is about to expire.

Query
kubelet_certificate_manager_server_ttl_seconds < 86400

KubeletServerCertificateExpiration

Severity

WARNING

Message

Kubelet server certificate is about to expire.

Query
kubelet_certificate_manager_server_ttl_seconds < 604800

KubeletServerCertificateRenewalErrors

Severity

WARNING

Message

Kubelet has failed to renew its server certificate.

Query
increase(kubelet_server_expiration_renew_errors[5m]) > 0

KubeletTooManyPods

Severity

INFO

Message

Kubelet is running at capacity.

Query
(max by (cluster, instance) (kubelet_running_pods{job="kubelet",metrics_path="/metrics"} > 1) * on (cluster, instance) group_left (node) max by (cluster, instance, node) (kubelet_node_name{job="kubelet",metrics_path="/metrics"})) / on (cluster, node) group_left () max by (cluster, node) (kube_node_status_capacity{job="kube-state-metrics",resource="pods"} != 1) > 0.95

KubeMemoryOvercommit

Severity

WARNING

Message

Cluster has overcommitted memory resource requests.

Query
((sum by (cluster) (namespace_memory:kube_pod_container_resource_requests:sum) - sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 0) and count by (cluster) (max by (cluster, node) (kube_node_role{job="kube-state-metrics",role="control-plane"})) < 3) or (sum by (cluster) (namespace_memory:kube_pod_container_resource_requests:sum) - ((sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="memory"})) > 0) > 0)

KubeMemoryQuotaOvercommit

Severity

WARNING

Message

Cluster has overcommitted memory resource requests.

Query
sum by (cluster) (min without (resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(memory|requests.memory)",type="hard"})) / sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5

KubeNodeEviction

Severity

INFO

Message

Node is evicting pods.

Query
sum by (cluster, eviction_signal, instance) (rate(kubelet_evictions{job="kubelet",metrics_path="/metrics"}[15m])) * on (cluster, instance) group_left (node) max by (cluster, instance, node) (kubelet_node_name{job="kubelet",metrics_path="/metrics"}) > 0

KubeNodeNotReady

Severity

WARNING

Message

Node is not ready.

Query
kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0 and on (cluster, node) kube_node_spec_unschedulable{job="kube-state-metrics"} == 0

KubeNodePressure

Severity

INFO

Message

Node has an active Condition.

Query
kube_node_status_condition{condition=~"(MemoryPressure|DiskPressure|PIDPressure)",job="kube-state-metrics",status="true"} == 1 and on (cluster, node) kube_node_spec_unschedulable{job="kube-state-metrics"} == 0

KubeNodeReadinessFlapping

Severity

WARNING

Message

Node readiness status is flapping.

Query
sum by (cluster, node) (changes(kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"}[15m])) > 2 and on (cluster, node) kube_node_spec_unschedulable{job="kube-state-metrics"} == 0

KubeNodeUnreachable

Severity

WARNING

Message

Node is unreachable.

Query
(kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless ignoring (key, value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1

KubePdbNotEnoughHealthyPods

Severity

WARNING

Message

PDB does not have enough healthy pods.

Query
(kube_poddisruptionbudget_status_desired_healthy{job="kube-state-metrics",namespace=~".*"} - kube_poddisruptionbudget_status_current_healthy{job="kube-state-metrics",namespace=~".*"}) > 0

KubePersistentVolumeErrors

Severity

CRITICAL

Message

PersistentVolume is having issues with provisioning.

Query
kube_persistentvolume_status_phase{job="kube-state-metrics",phase=~"Failed|Pending"} > 0

KubePersistentVolumeFillingUp

Severity

CRITICAL

Message

PersistentVolume is filling up.

Query
(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeFillingUp

Severity

WARNING

Message

PersistentVolume is filling up.

Query
(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and kubelet_volume_stats_used_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

Severity

CRITICAL

Message

PersistentVolumeInodes are filling up.

Query
(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_inodes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.03 and kubelet_volume_stats_inodes_used{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePersistentVolumeInodesFillingUp

Severity

WARNING

Message

PersistentVolumeInodes are filling up.

Query
(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_inodes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and kubelet_volume_stats_inodes_used{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 and predict_linear(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on (cluster, namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1

KubePodCrashLooping

Severity

WARNING

Message

Pod is crash looping.

Query
max_over_time(kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*",reason="CrashLoopBackOff"}[5m]) >= 1

KubePodNotReady

Severity

WARNING

Message

Pod has been in a non-ready state for more than 15 minutes.

Query
sum by (namespace, pod, job, cluster) (max by (namespace, pod, job, cluster) (kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}) * on (namespace, pod, cluster) group_left (owner_kind) topk by (namespace, pod, cluster) (1, max by (namespace, pod, owner_kind, cluster) (kube_pod_owner{owner_kind!="Job"}))) > 0

KubeProxyDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="kube-proxy"})

KubeQuotaAlmostFull

Severity

INFO

Message

Namespace quota is going to be full.

Query
max without (instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="used"}) / on (cluster, namespace, resource, resourcequota) group_left () (max without (instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"}) > 0) > 0.9 < 1

KubeQuotaExceeded

Severity

WARNING

Message

Namespace quota has exceeded the limits.

Query
max without (instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="used"}) / on (cluster, namespace, resource, resourcequota) group_left () (max without (instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"}) > 0) > 1

KubeQuotaFullyUsed

Severity

INFO

Message

Namespace quota is fully used.

Query
max without (instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="used"}) / on (cluster, namespace, resource, resourcequota) group_left () (max without (instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"}) > 0) == 1
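
The three quota alerts (KubeQuotaAlmostFull, KubeQuotaExceeded, KubeQuotaFullyUsed) share the same `used / hard` ratio and differ only in the final comparison. A small Python sketch of that classification, using a hypothetical helper name:

```python
from typing import Optional


def classify_quota(used: float, hard: float) -> Optional[str]:
    """Return the name of the quota alert whose condition holds, if any."""
    if hard <= 0:
        # The queries only keep quotas with a hard limit above zero.
        return None
    ratio = used / hard
    if ratio > 1:
        return "KubeQuotaExceeded"
    if ratio == 1:
        return "KubeQuotaFullyUsed"
    if ratio > 0.9:
        return "KubeQuotaAlmostFull"
    return None
```

The conditions are mutually exclusive, so a given quota matches at most one of the three alerts at a time.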

KubeSchedulerDown

Severity

CRITICAL

Message

Target disappeared from Prometheus target discovery.

Query
absent(up{job="kube-scheduler"})

KubeStatefulSetGenerationMismatch

Severity

WARNING

Message

StatefulSet generation mismatch due to possible roll-back

Query
kube_statefulset_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_metadata_generation{job="kube-state-metrics",namespace=~".*"}

KubeStatefulSetReplicasMismatch

Severity

WARNING

Message

StatefulSet has not matched the expected number of replicas.

Query
(kube_statefulset_status_replicas_ready{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_replicas{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[10m]) == 0)

KubeStatefulSetUpdateNotRolledOut

Severity

WARNING

Message

StatefulSet update has not been rolled out.

Query
(max by (namespace, statefulset, job, cluster) (kube_statefulset_status_current_revision{job="kube-state-metrics",namespace=~".*"} unless kube_statefulset_status_update_revision{job="kube-state-metrics",namespace=~".*"}) * on (namespace, statefulset, job, cluster) (kube_statefulset_replicas{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"})) and on (namespace, statefulset, job, cluster) (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0)

KubeStateMetricsListErrors

Severity

CRITICAL

Message

kube-state-metrics is experiencing errors in list operations.

Query
(sum by (cluster) (rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum by (cluster) (rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01

KubeStateMetricsShardingMismatch

Severity

CRITICAL

Message

kube-state-metrics sharding is misconfigured.

Query
stdvar by (cluster) (kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0

KubeStateMetricsShardsMissing

Severity

CRITICAL

Message

kube-state-metrics shards are missing.

Query
2 ^ max by (cluster) (kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1 - sum by (cluster) (2 ^ max by (cluster, shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"})) != 0
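
The query above treats the set of shards as a bitmask: with N shards in total, the expected mask is 2^N - 1, and each reporting shard ordinal contributes 2^ordinal. Any non-zero difference means at least one shard is not reporting. The following is a minimal Python sketch of that arithmetic, not the Prometheus implementation; the function names are hypothetical and it assumes every reporting ordinal lies in the range 0 to N - 1.

```python
def shards_missing_value(total_shards, reporting_ordinals):
    """Mirror the arithmetic of the alert expression for a single cluster."""
    expected_mask = 2 ** total_shards - 1
    # set() plays the role of "max by (cluster, shard_ordinal)": each ordinal counts once.
    reporting_mask = sum(2 ** ordinal for ordinal in set(reporting_ordinals))
    return expected_mask - reporting_mask


def decode_missing(value, total_shards):
    """List the shard ordinals whose bit is still set in the difference."""
    return [ordinal for ordinal in range(total_shards) if value & (1 << ordinal)]


if __name__ == "__main__":
    value = shards_missing_value(4, [0, 1, 3])
    print(value, decode_missing(value, 4))
```

When all shards report, the difference is 0 and the alert expression returns nothing.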

KubeStateMetricsWatchErrors

Severity

CRITICAL

Message

kube-state-metrics is experiencing errors in watch operations.

Query
(sum by (cluster) (rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum by (cluster) (rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01

KubeVersionMismatch

Severity

WARNING

Message

Different semantic versions of Kubernetes components running.

Query
count by (cluster) (count by (git_version, cluster) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"}, "git_version", "$1", "git_version", "(v[0-9]*.[0-9]*).*"))) > 1
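
The label_replace call in the query above truncates each git_version label to its major.minor prefix before counting distinct values, so patch-level differences do not trigger the alert. The sketch below reproduces that normalization in Python for illustration only. It assumes Python's re.fullmatch behaves like Prometheus' fully anchored regular expressions for these inputs, and the helper names are hypothetical.

```python
import re

# Same pattern as in the alert expression; note the unescaped "." matches any character.
GIT_VERSION_RE = re.compile(r"(v[0-9]*.[0-9]*).*")


def major_minor(git_version):
    """Return the normalized version, or the input unchanged when the regex does not match."""
    match = GIT_VERSION_RE.fullmatch(git_version)
    return match.group(1) if match else git_version


def version_mismatch(git_versions):
    """True when more than one distinct major.minor version is present."""
    return len({major_minor(version) for version in git_versions}) > 1


if __name__ == "__main__":
    print(version_mismatch(["v1.27.3", "v1.27.8"]))
    print(version_mismatch(["v1.27.3", "v1.28.0"]))
```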

NodeBondingDegraded

Severity

WARNING

Message

Bonding interface is degraded.

Query
(node_bonding_slaves{job="node-exporter"} - node_bonding_active{job="node-exporter"}) != 0

NodeClockNotSynchronising

Severity

WARNING

Message

Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.

Query
min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16

NodeClockSkewDetected

Severity

WARNING

Message

Clock on {{ $labels.instance }} is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host.

Query
(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)

NodeCPUHighUsage

Severity

INFO

Message

High CPU usage.

Query
sum without (mode) (avg without (cpu) (rate(node_cpu_seconds_total{job="node-exporter",mode!~"idle|iowait"}[2m]))) * 100 > 90

NodeDiskIOSaturation

Severity

WARNING

Message

Disk IO queue is high.

Query
rate(node_disk_io_time_weighted_seconds_total{device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)",job="node-exporter"}[5m]) > 10

NodeFileDescriptorLimit

Severity

CRITICAL

Message

Kernel is predicted to exhaust file descriptors limit soon.

Query
(node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90)

NodeFileDescriptorLimit

Severity

WARNING

Message

Kernel is predicted to exhaust file descriptors limit soon.

Query
(node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70)

NodeFilesystemAlmostOutOfFiles

Severity

CRITICAL

Message

Filesystem has less than 8% inodes left.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 8 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemAlmostOutOfFiles

Severity

WARNING

Message

Filesystem has less than 15% inodes left.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 15 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemAlmostOutOfSpace

Severity

CRITICAL

Message

Filesystem has less than 12% space left.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 12 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemAlmostOutOfSpace

Severity

WARNING

Message

Filesystem has less than 20% space left.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 20 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemFilesFillingUp

Severity

CRITICAL

Message

Filesystem is predicted to run out of inodes within the next 4 hours.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemFilesFillingUp

Severity

WARNING

Message

Filesystem is predicted to run out of inodes within the next 24 hours.

Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemSpaceFillingUp

Severity

CRITICAL

Message

Filesystem is predicted to run out of space within the next 4 hours.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)

NodeFilesystemSpaceFillingUp

Severity

WARNING

Message

Filesystem is predicted to run out of space within the next 24 hours.

Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
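
The FillingUp rules above rely on predict_linear, which fits a straight line to the samples in the range (here the last 6 hours) and extrapolates it a given number of seconds ahead. A predicted value below zero means the resource is expected to be exhausted within that horizon. The following is a simplified Python sketch of that idea using ordinary least squares. It is an illustration under stated assumptions, not Prometheus' code, and the sample data is invented.

```python
def predict_linear(samples, now, horizon_seconds):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    samples: list of (timestamp_seconds, value) pairs inside the range window.
    now: evaluation timestamp in seconds.
    horizon_seconds: how far past `now` to predict.
    """
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / variance
    value_now = mean_v + slope * (now - mean_t)
    return value_now + slope * horizon_seconds


if __name__ == "__main__":
    hour = 3600
    # Hypothetical filesystem losing 1 GiB of free space per hour, starting at 10 GiB.
    samples = [(i * hour, 10 - i) for i in range(7)]
    print(predict_linear(samples, 6 * hour, 24 * hour) < 0)
```

With this invented data, 4 GiB remain at evaluation time, so the 24-hour prediction is negative while the 4-hour prediction only just reaches zero.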

NodeHighNumberConntrackEntriesUsed

Severity

WARNING

Message

Number of conntrack entries is getting close to the limit.

Query
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75

NodeMemoryHighUtilization

Severity

WARNING

Message

Host is running out of memory.

Query
100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"} * 100) > 90

NodeMemoryMajorPagesFaults

Severity

WARNING

Message

Memory major page faults are occurring at very high rate.

Query
rate(node_vmstat_pgmajfault{job="node-exporter"}[5m]) > 500

NodeNetworkInterfaceFlapping

Severity

WARNING

Message

Network interface is often changing its status

Query
changes(node_network_up{device!~"veth.+",job="node-exporter"}[2m]) > 2

NodeNetworkReceiveErrs

Severity

WARNING

Message

Network interface is reporting many receive errors.

Query
increase(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01

NodeNetworkTransmitErrs

Severity

WARNING

Message

Network interface is reporting many transmit errors.

Query
increase(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01

NodeRAIDDegraded

Severity

CRITICAL

Message

RAID Array is degraded

Query
node_md_disks_required - ignoring (state) (node_md_disks{state="active"}) >= 1

NodeRAIDDiskFailure

Severity

WARNING

Message

Failed device in RAID array

Query
node_md_disks{state="failed"} >= 1

NodeSystemdServiceCrashlooping

Severity

WARNING

Message

Systemd service keeps restarting, possibly crash looping.

Query
increase(node_systemd_service_restart_total{job="node-exporter"}[5m]) > 2

NodeSystemdServiceFailed

Severity

WARNING

Message

Systemd service has entered failed state.

Query
node_systemd_unit_state{job="node-exporter",state="failed"} == 1

NodeSystemSaturation

Severity

WARNING

Message

System saturated, load per core is very high.

Query
node_load1{job="node-exporter"} / count without (cpu, mode) (node_cpu_seconds_total{job="node-exporter",mode="idle"}) > 2

NodeTextFileCollectorScrapeError

Severity

WARNING

Message

Node Exporter text file collector failed to scrape.

Query
node_textfile_scrape_error{job="node-exporter"} == 1

PrometheusBadConfig

Severity

CRITICAL

Message

Failed Prometheus configuration reload.

Query
max_over_time(prometheus_config_last_reload_successful{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) == 0

PrometheusDuplicateTimestamps

Severity

WARNING

Message

Prometheus is dropping samples with duplicate timestamps.

Query
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusErrorSendingAlertsToAnyAlertmanager

Severity

CRITICAL

Message

Prometheus encounters more than 3% errors sending alerts to any Alertmanager.

Query
min without (alertmanager) (rate(prometheus_notifications_errors_total{alertmanager!~"",job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) / rate(prometheus_notifications_sent_total{alertmanager!~"",job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) * 100 > 3

PrometheusErrorSendingAlertsToSomeAlertmanagers

Severity

WARNING

Message

More than 1% of alerts sent by Prometheus to a specific Alertmanager were affected by errors.

Query
(rate(prometheus_notifications_errors_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) * 100 > 1

PrometheusHighQueryLoad

Severity

WARNING

Message

Prometheus is reaching its maximum capacity serving concurrent requests.

Query
avg_over_time(prometheus_engine_queries{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) / max_over_time(prometheus_engine_queries_concurrent_max{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0.8

PrometheusKubernetesListWatchFailures

Severity

WARNING

Message

Requests in Kubernetes SD are failing.

Query
increase(prometheus_sd_kubernetes_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusLabelLimitHit

Severity

WARNING

Message

Prometheus has dropped targets because some scrape configs have exceeded the labels limit.

Query
increase(prometheus_target_scrape_pool_exceeded_label_limits_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusMissingRuleEvaluations

Severity

WARNING

Message

Prometheus is missing rule evaluations due to slow rule group evaluation.

Query
increase(prometheus_rule_group_iterations_missed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusNotConnectedToAlertmanagers

Severity

WARNING

Message

Prometheus is not connected to any Alertmanagers.

Query
max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) < 1

PrometheusNotificationQueueRunningFull

Severity

WARNING

Message

Prometheus alert notification queue predicted to run full in less than 30m.

Query
(predict_linear(prometheus_notifications_queue_length{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]))

PrometheusNotIngestingSamples

Severity

WARNING

Message

Prometheus is not ingesting samples.

Query
(sum without (type) (rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) <= 0 and (sum without (scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}) > 0 or sum without (rule_group) (prometheus_rule_group_rules{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}) > 0))

PrometheusOperatorListErrors

Severity

WARNING

Message

Errors while performing list operations in controller.

Query
(sum by (cluster, controller, namespace) (rate(prometheus_operator_list_operations_failed_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[10m])) / sum by (cluster, controller, namespace) (rate(prometheus_operator_list_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[10m]))) > 0.4

PrometheusOperatorNodeLookupErrors

Severity

WARNING

Message

Errors while reconciling Prometheus.

Query
rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]) > 0.1

PrometheusOperatorNotReady

Severity

WARNING

Message

Prometheus operator not ready

Query
min by (cluster, controller, namespace) (max_over_time(prometheus_operator_ready{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]) == 0)

PrometheusOperatorReconcileErrors

Severity

WARNING

Message

Errors while reconciling objects.

Query
(sum by (cluster, controller, namespace) (rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) / (sum by (cluster, controller, namespace) (rate(prometheus_operator_reconcile_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) > 0.1

PrometheusOperatorRejectedResources

Severity

WARNING

Message

Resources rejected by Prometheus operator

Query
min_over_time(prometheus_operator_managed_resources{job="prometheus-operator-operator",namespace="metalk8s-monitoring",state="rejected"}[5m]) > 0

PrometheusOperatorStatusUpdateErrors

Severity

WARNING

Message

Errors while updating objects status.

Query
(sum by (cluster, controller, namespace) (rate(prometheus_operator_status_update_errors_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) / (sum by (cluster, controller, namespace) (rate(prometheus_operator_status_update_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) > 0.1

PrometheusOperatorSyncFailed

Severity

WARNING

Message

Last controller reconciliation failed

Query
min_over_time(prometheus_operator_syncs{job="prometheus-operator-operator",namespace="metalk8s-monitoring",status="failed"}[5m]) > 0

PrometheusOperatorWatchErrors

Severity

WARNING

Message

Errors while performing watch operations in controller.

Query
(sum by (cluster, controller, namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m])) / sum by (cluster, controller, namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) > 0.4

PrometheusOutOfOrderTimestamps

Severity

WARNING

Message

Prometheus drops samples with out-of-order timestamps.

Query
rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusRemoteStorageFailures

Severity

CRITICAL

Message

Prometheus fails to send samples to remote storage.

Query
((rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) / ((rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) + (rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])))) * 100 > 1

PrometheusRemoteWriteBehind

Severity

CRITICAL

Message

Prometheus remote write is behind.

Query
(max_over_time(prometheus_remote_storage_queue_highest_timestamp_seconds{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) - max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) > 120

PrometheusRemoteWriteDesiredShards

Severity

WARNING

Message

Prometheus remote write desired shards calculation wants to run more than configured max shards.

Query
(max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]))

PrometheusRuleFailures

Severity

CRITICAL

Message

Prometheus is failing rule evaluations.

Query
increase(prometheus_rule_evaluation_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusScrapeBodySizeLimitHit

Severity

WARNING

Message

Prometheus has dropped some targets that exceeded body size limit.

Query
increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusScrapeSampleLimitHit

Severity

WARNING

Message

Prometheus has failed scrapes that have exceeded the configured sample limit.

Query
increase(prometheus_target_scrapes_exceeded_sample_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusSDRefreshFailure

Severity

WARNING

Message

Failed Prometheus SD refresh.

Query
increase(prometheus_sd_refresh_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[10m]) > 0

PrometheusTargetLimitHit

Severity

WARNING

Message

Prometheus has dropped targets because some scrape configs have exceeded the targets limit.

Query
increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0

PrometheusTargetSyncFailure

Severity

CRITICAL

Message

Prometheus has failed to sync targets.

Query
increase(prometheus_target_sync_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[30m]) > 0

PrometheusTSDBCompactionsFailing

Severity

WARNING

Message

Prometheus has issues compacting blocks.

Query
increase(prometheus_tsdb_compactions_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[3h]) > 0

PrometheusTSDBReloadsFailing

Severity

WARNING

Message

Prometheus has issues reloading blocks from disk.

Query
increase(prometheus_tsdb_reloads_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[3h]) > 0

TargetDown

Severity

WARNING

Message

One or more targets are unreachable.

Query
100 * (count by (cluster, job, namespace, service) (up == 0) / count by (cluster, job, namespace, service) (up)) > 10

Watchdog

Severity

NONE

Message

An alert that should always be firing to certify that Alertmanager is working properly.

Query
vector(1)

Excluding PersistentVolumeClaims from Storage Alerts

MetalK8s monitors PersistentVolumeClaim (PVC) storage usage and generates alerts when volumes are filling up. To exclude specific PVCs from these storage-related alerts (such as KubePersistentVolumeFillingUp and KubePersistentVolumeInodesFillingUp), add the excluded-from-alerts label with the value "true" to the PVC.

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
  namespace: <namespace-name>
  labels:
    excluded-from-alerts: "true"
spec:
  # ... PVC specification ...

To add this label to an existing PVC:

root@bootstrap $ kubectl --kubeconfig=/etc/kubernetes/admin.conf \
                   label pvc <pvc-name> -n <namespace-name> \
                   excluded-from-alerts=true
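
As an alternative to kubectl label, the same change can be expressed as a JSON merge patch, where setting a label to null removes it. The Python sketch below only builds the patch documents. The helper name is hypothetical, and applying the result (for instance with kubectl patch pvc <pvc-name> -n <namespace-name> --type merge -p '<json>') is an assumption about your workflow rather than a documented MetalK8s procedure.

```python
import json

LABEL = "excluded-from-alerts"


def exclusion_patch(excluded):
    """Build a merge patch that sets the label to "true" or removes it (null)."""
    value = "true" if excluded else None
    return {"metadata": {"labels": {LABEL: value}}}


if __name__ == "__main__":
    # Patch that excludes the PVC from storage alerts.
    print(json.dumps(exclusion_patch(True)))
    # Patch that removes the label so the PVC is monitored again.
    print(json.dumps(exclusion_patch(False)))
```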

Note

This feature is particularly useful for data disks that are expected to be nearly full or for volumes where storage alerts are managed by application-specific monitoring.