Alerts¶
The MetalK8s monitoring stack can generate alerts from observed statistics and a set of rules. These alerts are persisted to provide a historical view in the MetalK8s UI, and can be routed to custom receivers if needed (see Alertmanager Configuration Customization for how to do this).
Predefined Alerting Rules¶
There are two categories of alerting rules deployed with MetalK8s: simple and composite.
Simple alerting rules define an expression only from "standard" Prometheus metrics, while composite rules rely on the special ALERTS metric to generate alerts from the state of other alerts.
Composite rules are used to build a hierarchy of alerts, encoding how the simple alerting rules of low-level components roll up into the health of higher-level services.
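As a purely illustrative sketch (not the exact expression shipped with MetalK8s), a composite rule for a hypothetical parent alert could rely on a PromQL expression of the following form, firing as soon as any of its hypothetical children ChildAlertA or ChildAlertB is firing:
sum(ALERTS{alertname=~"ChildAlertA|ChildAlertB", alertstate="firing", severity="warning"}) >= 1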
Hierarchy¶
The alerts have a severity level attached, either WARNING or CRITICAL. We can represent the full hierarchy by describing the parent-child relationships of each composite alert, building one tree per root-level alert:
ClusterDegraded (WARNING)
ClusterAtRisk (CRITICAL)
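Since composite alerts are regular Prometheus alerts, the state of the two root-level alerts can be inspected directly through the ALERTS metric, for example with the following illustrative query:
ALERTS{alertname=~"ClusterDegraded|ClusterAtRisk", alertstate="firing"}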
ClusterDegraded¶
ClusterDegraded (WARNING)
NetworkDegraded (WARNING)
NodeHighNumberConntrackEntriesUsed (WARNING)
NodeNetworkInterfaceFlapping (WARNING)
NodeNetworkReceiveErrs (WARNING)
NodeNetworkTransmitErrs (WARNING)
NodeDegraded (WARNING)
KubeNodeNotReady (WARNING)
KubeNodeReadinessFlapping (WARNING)
KubeNodeUnreachable (WARNING)
KubeletClientCertificateExpiration (WARNING)
KubeletClientCertificateRenewalErrors (WARNING)
KubeletPlegDurationHigh (WARNING)
KubeletPodStartUpLatencyHigh (WARNING)
KubeletServerCertificateExpiration (WARNING)
KubeletServerCertificateRenewalErrors (WARNING)
NodeClockNotSynchronising (WARNING)
NodeClockSkewDetected (WARNING)
NodeRAIDDiskFailure (WARNING)
NodeTextFileCollectorScrapeError (WARNING)
SystemPartitionDegraded (WARNING)
NodeFileDescriptorLimit (WARNING)
NodeFilesystemAlmostOutOfFiles (WARNING)
NodeFilesystemAlmostOutOfSpace (WARNING)
NodeFilesystemFilesFillingUp (WARNING)
NodeFilesystemSpaceFillingUp (WARNING)
PlatformServicesDegraded (WARNING)
AccessServicesDegraded (WARNING)
AuthenticationServiceDegraded (WARNING)
KubeDeploymentGenerationMismatch (WARNING)
Namespace: metalk8s-auth, Deployment: dex
KubeDeploymentReplicasMismatch (WARNING)
Namespace: metalk8s-auth, Deployment: dex
IngressControllerServicesDegraded (WARNING)
KubeDaemonSetMisScheduled (WARNING)
Namespace: metalk8s-ingress, Daemonset: ingress-nginx-control-plane-controller
KubeDaemonSetMisScheduled (WARNING)
Namespace: metalk8s-ingress, Daemonset: ingress-nginx-controller
KubeDaemonSetNotScheduled (WARNING)
Namespace: metalk8s-ingress, Daemonset: ingress-nginx-control-plane-controller
KubeDaemonSetNotScheduled (WARNING)
Namespace: metalk8s-ingress, Daemonset: ingress-nginx-controller
KubeDaemonSetRolloutStuck (WARNING)
Namespace: metalk8s-ingress, Daemonset: ingress-nginx-control-plane-controller
KubeDaemonSetRolloutStuck (WARNING)
Namespace: metalk8s-ingress, Daemonset: ingress-nginx-controller
KubeDeploymentGenerationMismatch (WARNING)
Namespace: metalk8s-ingress, Deployment: ingress-nginx-defaultbackend
KubeDeploymentReplicasMismatch (WARNING)
Namespace: metalk8s-ingress, Deployment: ingress-nginx-defaultbackend
CoreServicesDegraded (WARNING)
BootstrapServicesDegraded (WARNING)
KubeDeploymentGenerationMismatch (WARNING)
Namespace: metalk8s-ui, Deployment: metalk8s-ui
KubeDeploymentGenerationMismatch (WARNING)
Namespace: kube-system, Deployment: storage-operator
KubeDeploymentReplicasMismatch (WARNING)
Namespace: metalk8s-ui, Deployment: metalk8s-ui
KubeDeploymentReplicasMismatch (WARNING)
Namespace: kube-system, Deployment: storage-operator
KubePodCrashLooping (WARNING)
Namespace: kube-system, Pod: repositories-.*
KubePodCrashLooping (WARNING)
Namespace: kube-system, Pod: salt-master-.*
KubePodNotReady (WARNING)
Namespace: kube-system, Pod: repositories-.*
KubePodNotReady (WARNING)
Namespace: kube-system, Pod: salt-master-.*
KubernetesControlPlaneDegraded (WARNING)
KubeAPIErrorBudgetBurn (WARNING)
KubeAPITerminatedRequests (WARNING)
KubeCPUOvercommit (WARNING)
KubeCPUQuotaOvercommit (WARNING)
KubeClientCertificateExpiration (WARNING)
KubeClientErrors (WARNING)
KubeDeploymentGenerationMismatch (WARNING)
Namespace: kube-system, Deployment: coredns
KubeDeploymentGenerationMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-adapter
KubeDeploymentGenerationMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-operator-kube-state-metrics
KubeDeploymentReplicasMismatch (WARNING)
Namespace: kube-system, Deployment: coredns
KubeDeploymentReplicasMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-adapter
KubeDeploymentReplicasMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-operator-kube-state-metrics
KubeMemoryOvercommit (WARNING)
KubeMemoryQuotaOvercommit (WARNING)
KubeVersionMismatch (WARNING)
etcdHTTPRequestsSlow (WARNING)
etcdHighCommitDurations (WARNING)
etcdHighFsyncDurations (WARNING)
etcdHighNumberOfFailedGRPCRequests (WARNING)
etcdHighNumberOfFailedHTTPRequests (WARNING)
etcdHighNumberOfFailedProposals (WARNING)
etcdHighNumberOfLeaderChanges (WARNING)
etcdMemberCommunicationSlow (WARNING)
ObservabilityServicesDegraded (WARNING)
AlertingServiceDegraded (WARNING)
AlertmanagerClusterFailedToSendAlerts (WARNING)
AlertmanagerFailedToSendAlerts (WARNING)
KubeStatefulSetGenerationMismatch (WARNING)
Namespace: metalk8s-monitoring, Statefulset: alertmanager-prometheus-operator-alertmanager
KubeStatefulSetReplicasMismatch (WARNING)
Namespace: metalk8s-monitoring, Statefulset: alertmanager-prometheus-operator-alertmanager
KubeStatefulSetUpdateNotRolledOut (WARNING)
Namespace: metalk8s-monitoring, Statefulset: alertmanager-prometheus-operator-alertmanager
DashboardingServiceDegraded (WARNING)
KubeDeploymentGenerationMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-operator-grafana
KubeDeploymentReplicasMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-operator-grafana
LoggingServiceDegraded (WARNING)
KubeDaemonSetMisScheduled (WARNING)
Namespace: metalk8s-logging, Daemonset: fluentbit
KubeDaemonSetNotScheduled (WARNING)
Namespace: metalk8s-logging, Daemonset: fluentbit
KubeDaemonSetRolloutStuck (WARNING)
Namespace: metalk8s-logging, Daemonset: fluentbit
KubeStatefulSetGenerationMismatch (WARNING)
Namespace: metalk8s-logging, Statefulset: loki
KubeStatefulSetReplicasMismatch (WARNING)
Namespace: metalk8s-logging, Statefulset: loki
KubeStatefulSetUpdateNotRolledOut (WARNING)
Namespace: metalk8s-logging, Statefulset: loki
MonitoringServiceDegraded (WARNING)
KubeDaemonSetMisScheduled (WARNING)
Namespace: metalk8s-monitoring, Daemonset: prometheus-operator-prometheus-node-exporter
KubeDaemonSetNotScheduled (WARNING)
Namespace: metalk8s-monitoring, Daemonset: prometheus-operator-prometheus-node-exporter
KubeDaemonSetRolloutStuck (WARNING)
Namespace: metalk8s-monitoring, Daemonset: prometheus-operator-prometheus-node-exporter
KubeDeploymentGenerationMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-operator-operator
KubeDeploymentReplicasMismatch (WARNING)
Namespace: metalk8s-monitoring, Deployment: prometheus-operator-operator
KubeStatefulSetGenerationMismatch (WARNING)
Namespace: metalk8s-monitoring, Statefulset: prometheus-prometheus-operator-prometheus
KubeStatefulSetReplicasMismatch (WARNING)
Namespace: metalk8s-monitoring, Statefulset: prometheus-prometheus-operator-prometheus
KubeStatefulSetUpdateNotRolledOut (WARNING)
Namespace: metalk8s-monitoring, Statefulset: prometheus-prometheus-operator-prometheus
PrometheusDuplicateTimestamps (WARNING)
PrometheusLabelLimitHit (WARNING)
PrometheusMissingRuleEvaluations (WARNING)
PrometheusNotConnectedToAlertmanagers (WARNING)
PrometheusNotIngestingSamples (WARNING)
PrometheusNotificationQueueRunningFull (WARNING)
PrometheusOperatorListErrors (WARNING)
PrometheusOperatorNodeLookupErrors (WARNING)
PrometheusOperatorNotReady (WARNING)
PrometheusOperatorReconcileErrors (WARNING)
PrometheusOperatorRejectedResources (WARNING)
PrometheusOperatorSyncFailed (WARNING)
PrometheusOperatorWatchErrors (WARNING)
PrometheusOutOfOrderTimestamps (WARNING)
PrometheusRemoteWriteDesiredShards (WARNING)
PrometheusTSDBCompactionsFailing (WARNING)
PrometheusTSDBReloadsFailing (WARNING)
PrometheusTargetLimitHit (WARNING)
VolumeDegraded (WARNING)
KubePersistentVolumeFillingUp (WARNING)
ClusterAtRisk¶
ClusterAtRisk (CRITICAL)
NodeAtRisk (CRITICAL)
KubeletClientCertificateExpiration (CRITICAL)
NodeRAIDDegraded (CRITICAL)
SystemPartitionAtRisk (CRITICAL)
NodeFileDescriptorLimit (CRITICAL)
NodeFilesystemAlmostOutOfFiles (CRITICAL)
NodeFilesystemAlmostOutOfSpace (CRITICAL)
NodeFilesystemFilesFillingUp (CRITICAL)
NodeFilesystemSpaceFillingUp (CRITICAL)
PlatformServicesAtRisk (CRITICAL)
CoreServicesAtRisk (CRITICAL)
KubernetesControlPlaneAtRisk (CRITICAL)
KubeAPIDown (CRITICAL)
KubeAPIErrorBudgetBurn (CRITICAL)
KubeClientCertificateExpiration (CRITICAL)
KubeControllerManagerDown (CRITICAL)
KubeSchedulerDown (CRITICAL)
KubeStateMetricsListErrors (CRITICAL)
KubeStateMetricsWatchErrors (CRITICAL)
KubeletDown (CRITICAL)
etcdGRPCRequestsSlow (CRITICAL)
etcdHighNumberOfFailedGRPCRequests (CRITICAL)
etcdHighNumberOfFailedHTTPRequests (CRITICAL)
etcdInsufficientMembers (CRITICAL)
etcdNoLeader (CRITICAL)
ObservabilityServicesAtRisk (CRITICAL)
AlertingServiceAtRisk (CRITICAL)
AlertmanagerClusterCrashlooping (CRITICAL)
AlertmanagerClusterDown (CRITICAL)
AlertmanagerClusterFailedToSendAlerts (CRITICAL)
AlertmanagerConfigInconsistent (CRITICAL)
AlertmanagerFailedReload (CRITICAL)
AlertmanagerMembersInconsistent (CRITICAL)
MonitoringServiceAtRisk (CRITICAL)
KubeStateMetricsShardingMismatch (CRITICAL)
KubeStateMetricsShardsMissing (CRITICAL)
PrometheusBadConfig (CRITICAL)
PrometheusRemoteStorageFailures (CRITICAL)
PrometheusRemoteWriteBehind (CRITICAL)
PrometheusRuleFailures (CRITICAL)
PrometheusTargetSyncFailure (CRITICAL)
VolumeAtRisk (CRITICAL)
KubePersistentVolumeErrors (CRITICAL)
KubePersistentVolumeFillingUp (CRITICAL)
Composite Rules¶
AccessServicesDegraded¶
- Severity
WARNING
- Message
The Access services are degraded.
- Relationship
ANY
- Children
AlertingServiceAtRisk¶
- Severity
CRITICAL
- Message
The alerting service is at risk.
- Relationship
ANY
- Children
AlertingServiceDegraded¶
- Severity
WARNING
- Message
The alerting service is degraded.
- Relationship
ANY
- Children
AuthenticationServiceDegraded¶
- Severity
WARNING
- Message
The Authentication service for K8S API is degraded.
- Relationship
ANY
- Children
BootstrapServicesDegraded¶
- Severity
WARNING
- Message
The MetalK8s Bootstrap services are degraded.
- Relationship
ANY
- Children
KubePodNotReady{namespace=~'kube-system', pod=~'repositories-.*', severity='warning'}
KubePodCrashLooping{namespace=~'kube-system', pod=~'repositories-.*', severity='warning'}
KubePodNotReady{namespace=~'kube-system', pod=~'salt-master-.*', severity='warning'}
KubePodCrashLooping{namespace=~'kube-system', pod=~'salt-master-.*', severity='warning'}
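Given the ANY relationship, this composite alert conceptually fires as soon as at least one of the children listed above is firing; a simplified sketch of such an expression (not the exact rule definition) is:
sum(ALERTS{alertname=~"KubePodNotReady|KubePodCrashLooping", namespace=~"kube-system", pod=~"repositories-.*|salt-master-.*", severity="warning", alertstate="firing"}) >= 1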
ClusterAtRisk¶
- Severity
CRITICAL
- Message
The cluster is at risk.
- Relationship
ANY
- Children
ClusterDegraded¶
- Severity
WARNING
- Message
The cluster is degraded.
- Relationship
ANY
- Children
CoreServicesAtRisk¶
- Severity
CRITICAL
- Message
The Core services are at risk.
- Relationship
ANY
- Children
CoreServicesDegraded¶
- Severity
WARNING
- Message
The Core services are degraded.
- Relationship
ANY
- Children
DashboardingServiceDegraded¶
- Severity
WARNING
- Message
The dashboarding service is degraded.
- Relationship
ANY
- Children
IngressControllerServicesDegraded¶
- Severity
WARNING
- Message
The Ingress Controllers for control plane and workload plane are degraded.
- Relationship
ANY
- Children
KubernetesControlPlaneAtRisk¶
- Severity
CRITICAL
- Message
The Kubernetes control plane is at risk.
- Relationship
ANY
- Children
KubernetesControlPlaneDegraded¶
- Severity
WARNING
- Message
The Kubernetes control plane is degraded.
- Relationship
ANY
- Children
LoggingServiceDegraded¶
- Severity
WARNING
- Message
The logging service is degraded.
- Relationship
ANY
- Children
KubeDaemonSetNotScheduled{daemonset=~'fluentbit', namespace=~'metalk8s-logging', severity='warning'}
KubeDaemonSetMisScheduled{daemonset=~'fluentbit', namespace=~'metalk8s-logging', severity='warning'}
KubeDaemonSetRolloutStuck{daemonset=~'fluentbit', namespace=~'metalk8s-logging', severity='warning'}
MonitoringServiceAtRisk¶
- Severity
CRITICAL
- Message
The monitoring service is at risk.
- Relationship
ANY
- Children
MonitoringServiceDegraded¶
- Severity
WARNING
- Message
The monitoring service is degraded.
- Relationship
ANY
- Children
NetworkDegraded¶
- Severity
WARNING
- Message
The network is degraded.
- Relationship
ANY
- Children
NodeAtRisk¶
- Severity
CRITICAL
- Message
The node {{ $labels.instance }} is at risk.
- Relationship
ANY
- Children
NodeDegraded¶
- Severity
WARNING
- Message
The node {{ $labels.instance }} is degraded.
- Relationship
ANY
- Children
ObservabilityServicesAtRisk¶
- Severity
CRITICAL
- Message
The observability services are at risk.
- Relationship
ANY
- Children
ObservabilityServicesDegraded¶
- Severity
WARNING
- Message
The observability services are degraded.
- Relationship
ANY
- Children
PlatformServicesAtRisk¶
- Severity
CRITICAL
- Message
The Platform services are at risk.
- Relationship
ANY
- Children
PlatformServicesDegraded¶
- Severity
WARNING
- Message
The Platform services are degraded.
- Relationship
ANY
- Children
SystemPartitionAtRisk¶
- Severity
CRITICAL
- Message
The system partition {{ $labels.mountpoint }} on node {{ $labels.instance }} is at risk.
- Relationship
ANY
- Children
SystemPartitionDegraded¶
- Severity
WARNING
- Message
The system partition {{ $labels.mountpoint }} on node {{ $labels.instance }} is degraded.
- Relationship
ANY
- Children
VolumeAtRisk¶
- Severity
CRITICAL
- Message
The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is at risk.
- Relationship
ANY
- Children
VolumeDegraded¶
- Severity
WARNING
- Message
The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is degraded.
- Relationship
ANY
- Children
Simple Rules¶
AlertmanagerClusterCrashlooping¶
- Severity
CRITICAL
- Message
Half or more of the Alertmanager instances within the same cluster are crashlooping.
- Query
(count by(namespace, service) (changes(process_start_time_seconds{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[10m]) > 4) / count by(namespace, service) (up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) >= 0.5
AlertmanagerClusterDown¶
- Severity
CRITICAL
- Message
Half or more of the Alertmanager instances within the same cluster are down.
- Query
(count by(namespace, service) (avg_over_time(up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) < 0.5) / count by(namespace, service) (up{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) >= 0.5
AlertmanagerClusterFailedToSendAlerts¶
- Severity
CRITICAL
- Message
All Alertmanager instances in a cluster failed to send notifications to a critical integration.
- Query
min by(namespace, service, integration) (rate(alertmanager_notifications_failed_total{integration=~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) / rate(alertmanager_notifications_total{integration=~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m])) > 0.01
AlertmanagerClusterFailedToSendAlerts¶
- Severity
WARNING
- Message
All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
- Query
min by(namespace, service, integration) (rate(alertmanager_notifications_failed_total{integration!~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) / rate(alertmanager_notifications_total{integration!~".*",job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m])) > 0.01
AlertmanagerConfigInconsistent¶
- Severity
CRITICAL
- Message
Alertmanager instances within the same cluster have different configurations.
- Query
count by(namespace, service) (count_values by(namespace, service) ("config_hash", alertmanager_config_hash{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"})) != 1
AlertmanagerFailedReload¶
- Severity
CRITICAL
- Message
Reloading an Alertmanager configuration has failed.
- Query
max_over_time(alertmanager_config_last_reload_successful{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) == 0
AlertmanagerFailedToSendAlerts¶
- Severity
WARNING
- Message
An Alertmanager instance failed to send notifications.
- Query
(rate(alertmanager_notifications_failed_total{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) / rate(alertmanager_notifications_total{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m])) > 0.01
AlertmanagerMembersInconsistent¶
- Severity
CRITICAL
- Message
A member of an Alertmanager cluster has not found all other cluster members.
- Query
max_over_time(alertmanager_cluster_members{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]) < on(namespace, service) group_left() count by(namespace, service) (max_over_time(alertmanager_cluster_members{job="prometheus-operator-alertmanager",namespace="metalk8s-monitoring"}[5m]))
ConfigReloaderSidecarErrors¶
- Severity
WARNING
- Message
config-reloader sidecar has not had a successful reload for 10m
- Query
max_over_time(reloader_last_reload_successful{namespace=~".+"}[5m]) == 0
CPUThrottlingHigh¶
- Severity
INFO
- Message
Processes experience elevated CPU throttling.
- Query
sum by(container, pod, namespace) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by(container, pod, namespace) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100)
etcdGRPCRequestsSlow¶
- Severity
CRITICAL
- Message
etcd cluster “{{ $labels.job }}”: gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}.
- Query
histogram_quantile(0.99, sum by(job, instance, grpc_service, grpc_method, le) (rate(grpc_server_handling_seconds_bucket{grpc_type="unary",job=~".*etcd.*"}[5m]))) > 0.15
etcdHighCommitDurations¶
- Severity
WARNING
- Message
etcd cluster “{{ $labels.job }}”: 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.
- Query
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25
etcdHighFsyncDurations¶
- Severity
WARNING
- Message
etcd cluster “{{ $labels.job }}”: 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.
- Query
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5
etcdHighNumberOfFailedGRPCRequests¶
- Severity
CRITICAL
- Message
etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.
- Query
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5
etcdHighNumberOfFailedGRPCRequests¶
- Severity
WARNING
- Message
etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.
- Query
100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 1
etcdHighNumberOfFailedHTTPRequests¶
- Severity
CRITICAL
- Message
{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}.
- Query
sum by(method) (rate(etcd_http_failed_total{code!="404",job=~".*etcd.*"}[5m])) / sum by(method) (rate(etcd_http_received_total{job=~".*etcd.*"}[5m])) > 0.05
etcdHighNumberOfFailedHTTPRequests¶
- Severity
WARNING
- Message
{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}
- Query
sum by(method) (rate(etcd_http_failed_total{code!="404",job=~".*etcd.*"}[5m])) / sum by(method) (rate(etcd_http_received_total{job=~".*etcd.*"}[5m])) > 0.01
etcdHighNumberOfFailedProposals¶
- Severity
WARNING
- Message
etcd cluster “{{ $labels.job }}”: {{ $value }} proposal failures within the last hour on etcd instance {{ $labels.instance }}.
- Query
rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
etcdHighNumberOfLeaderChanges¶
- Severity
WARNING
- Message
etcd cluster “{{ $labels.job }}”: instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour.
- Query
rate(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) > 3
etcdHTTPRequestsSlow¶
- Severity
WARNING
- Message
etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow.
- Query
histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
etcdInsufficientMembers¶
- Severity
CRITICAL
- Message
etcd cluster “{{ $labels.job }}”: insufficient members ({{ $value }}).
- Query
sum by(job) (up{job=~".*etcd.*"} == bool 1) < ((count by(job) (up{job=~".*etcd.*"}) + 1) / 2)
etcdMemberCommunicationSlow¶
- Severity
WARNING
- Message
etcd cluster “{{ $labels.job }}”: member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.
- Query
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.15
etcdNoLeader¶
- Severity
CRITICAL
- Message
etcd cluster “{{ $labels.job }}”: member {{ $labels.instance }} has no leader.
- Query
etcd_server_has_leader{job=~".*etcd.*"} == 0
KubeAggregatedAPIDown¶
- Severity
WARNING
- Message
Kubernetes aggregated API is down.
- Query
(1 - max by(name, namespace) (avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85
KubeAggregatedAPIErrors¶
- Severity
WARNING
- Message
Kubernetes aggregated API has reported errors.
- Query
sum by(name, namespace) (increase(aggregator_unavailable_apiservice_total[10m])) > 4
KubeAPIDown¶
- Severity
CRITICAL
- Message
Target disappeared from Prometheus target discovery.
- Query
absent(up{job="apiserver"} == 1)
KubeAPIErrorBudgetBurn¶
- Severity
CRITICAL
- Message
The API server is burning too much error budget.
- Query
sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01)
- Additional query
sum(apiserver_request:burnrate6h) > (6 * 0.01) and sum(apiserver_request:burnrate30m) > (6 * 0.01)
KubeAPIErrorBudgetBurn¶
- Severity
WARNING
- Message
The API server is burning too much error budget.
- Query
sum(apiserver_request:burnrate1d) > (3 * 0.01) and sum(apiserver_request:burnrate2h) > (3 * 0.01)
- Additional query
sum(apiserver_request:burnrate3d) > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01)
KubeAPITerminatedRequests¶
- Severity
WARNING
- Message
The kubernetes apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
- Query
sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m])) / (sum(rate(apiserver_request_total{job="apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="apiserver"}[10m]))) > 0.2
KubeClientCertificateExpiration¶
- Severity
CRITICAL
- Message
Client certificate is about to expire.
- Query
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
KubeClientCertificateExpiration¶
- Severity
WARNING
- Message
Client certificate is about to expire.
- Query
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
KubeClientErrors¶
- Severity
WARNING
- Message
Kubernetes API server client is experiencing errors.
- Query
(sum by(cluster, instance, job, namespace) (rate(rest_client_requests_total{code=~"5.."}[5m])) / sum by(cluster, instance, job, namespace) (rate(rest_client_requests_total[5m]))) > 0.01
KubeContainerWaiting¶
- Severity
WARNING
- Message
Pod container waiting longer than 1 hour
- Query
sum by(namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*"}) > 0
KubeControllerManagerDown¶
- Severity
CRITICAL
- Message
Target disappeared from Prometheus target discovery.
- Query
absent(up{job="kube-controller-manager"} == 1)
KubeCPUOvercommit¶
- Severity
WARNING
- Message
Cluster has overcommitted CPU resource requests.
- Query
sum(namespace_cpu:kube_pod_container_resource_requests:sum) - (sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"})) > 0 and (sum(kube_node_status_allocatable{resource="cpu"}) - max(kube_node_status_allocatable{resource="cpu"})) > 0
KubeCPUQuotaOvercommit¶
- Severity
WARNING
- Message
Cluster has overcommitted CPU resource requests.
- Query
sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(cpu|requests.cpu)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) > 1.5
KubeDaemonSetMisScheduled¶
- Severity
WARNING
- Message
DaemonSet pods are misscheduled.
- Query
kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} > 0
KubeDaemonSetNotScheduled¶
- Severity
WARNING
- Message
DaemonSet pods are not scheduled.
- Query
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} > 0
KubeDaemonSetRolloutStuck¶
- Severity
WARNING
- Message
DaemonSet rollout is stuck.
- Query
((kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} != 0) or (kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_available{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"}[5m]) == 0)
KubeDeploymentGenerationMismatch¶
- Severity
WARNING
- Message
Deployment generation mismatch due to possible roll-back
- Query
kube_deployment_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_deployment_metadata_generation{job="kube-state-metrics",namespace=~".*"}
KubeDeploymentReplicasMismatch¶
- Severity
WARNING
- Message
Deployment has not matched the expected number of replicas.
- Query
(kube_deployment_spec_replicas{job="kube-state-metrics",namespace=~".*"} > kube_deployment_status_replicas_available{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[10m]) == 0)
KubeHpaMaxedOut¶
- Severity
WARNING
- Message
HPA is running at max replicas
- Query
kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} == kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}
KubeHpaReplicasMismatch¶
- Severity
WARNING
- Message
HPA has not matched desired number of replicas.
- Query
(kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics",namespace=~".*"} != kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} > kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"} < kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}) and changes(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics",namespace=~".*"}[15m]) == 0
KubeJobFailed¶
- Severity
WARNING
- Message
Job failed to complete.
- Query
kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0
KubeJobNotCompleted¶
- Severity
WARNING
- Message
Job did not complete in time
- Query
time() - max by(namespace, job_name) (kube_job_status_start_time{job="kube-state-metrics",namespace=~".*"} and kube_job_status_active{job="kube-state-metrics",namespace=~".*"} > 0) > 43200
KubeletClientCertificateExpiration¶
- Severity
CRITICAL
- Message
Kubelet client certificate is about to expire.
- Query
kubelet_certificate_manager_client_ttl_seconds < 86400
KubeletClientCertificateExpiration¶
- Severity
WARNING
- Message
Kubelet client certificate is about to expire.
- Query
kubelet_certificate_manager_client_ttl_seconds < 604800
KubeletClientCertificateRenewalErrors¶
- Severity
WARNING
- Message
Kubelet has failed to renew its client certificate.
- Query
increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
KubeletDown¶
- Severity
CRITICAL
- Message
Target disappeared from Prometheus target discovery.
- Query
absent(up{job="kubelet",metrics_path="/metrics"} == 1)
KubeletPlegDurationHigh¶
- Severity
WARNING
- Message
Kubelet Pod Lifecycle Event Generator is taking too long to relist.
- Query
node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10
KubeletPodStartUpLatencyHigh¶
- Severity
WARNING
- Message
Kubelet Pod startup latency is too high.
- Query
histogram_quantile(0.99, sum by(cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",metrics_path="/metrics"}[5m]))) * on(cluster, instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"} > 60
KubeletServerCertificateExpiration¶
- Severity
CRITICAL
- Message
Kubelet server certificate is about to expire.
- Query
kubelet_certificate_manager_server_ttl_seconds < 86400
KubeletServerCertificateExpiration¶
- Severity
WARNING
- Message
Kubelet server certificate is about to expire.
- Query
kubelet_certificate_manager_server_ttl_seconds < 604800
KubeletServerCertificateRenewalErrors¶
- Severity
WARNING
- Message
Kubelet has failed to renew its server certificate.
- Query
increase(kubelet_server_expiration_renew_errors[5m]) > 0
KubeletTooManyPods¶
- Severity
INFO
- Message
Kubelet is running at capacity.
- Query
count by(cluster, node) ((kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on(instance, pod, namespace, cluster) group_left(node) topk by(instance, pod, namespace, cluster) (1, kube_pod_info{job="kube-state-metrics"})) / max by(cluster, node) (kube_node_status_capacity{job="kube-state-metrics",resource="pods"} != 1) > 0.95
KubeMemoryOvercommit¶
- Severity
WARNING
- Message
Cluster has overcommitted memory resource requests.
- Query
sum(namespace_memory:kube_pod_container_resource_requests:sum) - (sum(kube_node_status_allocatable{resource="memory"}) - max(kube_node_status_allocatable{resource="memory"})) > 0 and (sum(kube_node_status_allocatable{resource="memory"}) - max(kube_node_status_allocatable{resource="memory"})) > 0
KubeMemoryQuotaOvercommit¶
- Severity
WARNING
- Message
Cluster has overcommitted memory resource requests.
- Query
sum(min without(resource) (kube_resourcequota{job="kube-state-metrics",resource=~"(memory|requests.memory)",type="hard"})) / sum(kube_node_status_allocatable{job="kube-state-metrics",resource="memory"}) > 1.5
KubeNodeNotReady¶
- Severity
WARNING
- Message
Node is not ready.
- Query
kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0
KubeNodeReadinessFlapping¶
- Severity
WARNING
- Message
Node readiness status is flapping.
- Query
sum by(cluster, node) (changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) > 2
KubeNodeUnreachable¶
- Severity
WARNING
- Message
Node is unreachable.
- Query
(kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless ignoring(key, value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1
KubePersistentVolumeErrors¶
- Severity
CRITICAL
- Message
PersistentVolume is having issues with provisioning.
- Query
kube_persistentvolume_status_phase{job="kube-state-metrics",phase=~"Failed|Pending"} > 0
KubePersistentVolumeFillingUp¶
- Severity
CRITICAL
- Message
PersistentVolume is filling up.
- Query
(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
KubePersistentVolumeFillingUp¶
- Severity
WARNING
- Message
PersistentVolume is filling up.
- Query
(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and kubelet_volume_stats_used_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
KubePersistentVolumeInodesFillingUp¶
- Severity
CRITICAL
- Message
PersistentVolumeInodes are filling up.
- Query
(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_inodes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.03 and kubelet_volume_stats_inodes_used{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
KubePersistentVolumeInodesFillingUp¶
- Severity
WARNING
- Message
PersistentVolumeInodes are filling up.
- Query
(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_inodes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and kubelet_volume_stats_inodes_used{job="kubelet",metrics_path="/metrics",namespace=~".*"} > 0 and predict_linear(kubelet_volume_stats_inodes_free{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_access_mode{access_mode="ReadOnlyMany"} == 1 unless on(namespace, persistentvolumeclaim) kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
KubePodCrashLooping¶
- Severity
WARNING
- Message
Pod is crash looping.
- Query
max_over_time(kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*",reason="CrashLoopBackOff"}[5m]) >= 1
KubePodNotReady¶
- Severity
WARNING
- Message
Pod has been in a non-ready state for more than 15 minutes.
- Query
sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0
KubeProxyDown¶
- Severity
CRITICAL
- Message
Target disappeared from Prometheus target discovery.
- Query
absent(up{job="kube-proxy"} == 1)
KubeQuotaAlmostFull¶
- Severity
INFO
- Message
Namespace quota is going to be full.
- Query
kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 0.9 < 1
KubeQuotaExceeded¶
- Severity
WARNING
- Message
Namespace quota has exceeded the limits.
- Query
kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 1
KubeQuotaFullyUsed¶
- Severity
INFO
- Message
Namespace quota is fully used.
- Query
kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) == 1
KubeSchedulerDown¶
- Severity
CRITICAL
- Message
Target disappeared from Prometheus target discovery.
- Query
absent(up{job="kube-scheduler"} == 1)
KubeStatefulSetGenerationMismatch¶
- Severity
WARNING
- Message
StatefulSet generation mismatch due to possible roll-back
- Query
kube_statefulset_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_metadata_generation{job="kube-state-metrics",namespace=~".*"}
KubeStatefulSetReplicasMismatch¶
- Severity
WARNING
- Message
StatefulSet has not matched the expected number of replicas.
- Query
(kube_statefulset_status_replicas_ready{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[10m]) == 0)
KubeStatefulSetUpdateNotRolledOut¶
- Severity
WARNING
- Message
StatefulSet update has not been rolled out.
- Query
(max without(revision) (kube_statefulset_status_current_revision{job="kube-state-metrics",namespace=~".*"} unless kube_statefulset_status_update_revision{job="kube-state-metrics",namespace=~".*"}) * (kube_statefulset_replicas{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0)
KubeStateMetricsListErrors¶
- Severity
CRITICAL
- Message
kube-state-metrics is experiencing errors in list operations.
- Query
(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01
KubeStateMetricsShardingMismatch¶
- Severity
CRITICAL
- Message
kube-state-metrics sharding is misconfigured.
- Query
stdvar(kube_state_metrics_total_shards{job="kube-state-metrics"}) != 0
KubeStateMetricsShardsMissing¶
- Severity
CRITICAL
- Message
kube-state-metrics shards are missing.
- Query
2 ^ max(kube_state_metrics_total_shards{job="kube-state-metrics"}) - 1 - sum(2 ^ max by(shard_ordinal) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"})) != 0
KubeStateMetricsWatchErrors¶
- Severity
CRITICAL
- Message
kube-state-metrics is experiencing errors in watch operations.
- Query
(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01
KubeVersionMismatch¶
- Severity
WARNING
- Message
Different semantic versions of Kubernetes components running.
- Query
count(count by(git_version) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"}, "git_version", "$1", "git_version", "(v[0-9]*.[0-9]*).*"))) > 1
NodeClockNotSynchronising¶
- Severity
WARNING
- Message
Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.
- Query
min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16
NodeClockSkewDetected¶
- Severity
WARNING
- Message
Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host.
- Query
(node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
NodeFileDescriptorLimit¶
- Severity
CRITICAL
- Message
Kernel is predicted to exhaust file descriptors limit soon.
- Query
(node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90)
NodeFileDescriptorLimit¶
- Severity
WARNING
- Message
Kernel is predicted to exhaust file descriptors limit soon.
- Query
(node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70)
NodeFilesystemAlmostOutOfFiles¶
- Severity
CRITICAL
- Message
Filesystem has less than 8% inodes left.
- Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 8 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeFilesystemAlmostOutOfFiles¶
- Severity
WARNING
- Message
Filesystem has less than 15% inodes left.
- Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 15 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeFilesystemAlmostOutOfSpace¶
- Severity
CRITICAL
- Message
Filesystem has less than 12% space left.
- Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 12 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeFilesystemAlmostOutOfSpace¶
- Severity
WARNING
- Message
Filesystem has less than 20% space left.
- Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 20 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeFilesystemFilesFillingUp¶
- Severity
CRITICAL
- Message
Filesystem is predicted to run out of inodes within the next 4 hours.
- Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeFilesystemFilesFillingUp¶
- Severity
WARNING
- Message
Filesystem is predicted to run out of inodes within the next 24 hours.
- Query
(node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeFilesystemSpaceFillingUp¶
- Severity
CRITICAL
- Message
Filesystem is predicted to run out of space within the next 4 hours.
- Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeFilesystemSpaceFillingUp¶
- Severity
WARNING
- Message
Filesystem is predicted to run out of space within the next 24 hours.
- Query
(node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0)
NodeHighNumberConntrackEntriesUsed¶
- Severity
WARNING
- Message
Number of conntrack entries is getting close to the limit.
- Query
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
NodeNetworkInterfaceFlapping¶
- Severity
WARNING
- Message
Network interface is often changing its status
- Query
changes(node_network_up{device!~"veth.+",job="node-exporter"}[2m]) > 2
NodeNetworkReceiveErrs¶
- Severity
WARNING
- Message
Network interface is reporting many receive errors.
- Query
increase(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
NodeNetworkTransmitErrs¶
- Severity
WARNING
- Message
Network interface is reporting many transmit errors.
- Query
increase(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
NodeRAIDDegraded¶
- Severity
CRITICAL
- Message
RAID Array is degraded
- Query
node_md_disks_required - ignoring(state) (node_md_disks{state="active"}) >= 1
NodeRAIDDiskFailure¶
- Severity
WARNING
- Message
Failed device in RAID array
- Query
node_md_disks{state="failed"} >= 1
NodeTextFileCollectorScrapeError¶
- Severity
WARNING
- Message
Node Exporter text file collector failed to scrape.
- Query
node_textfile_scrape_error{job="node-exporter"} == 1
PrometheusBadConfig¶
- Severity
CRITICAL
- Message
Failed Prometheus configuration reload.
- Query
max_over_time(prometheus_config_last_reload_successful{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) == 0
PrometheusDuplicateTimestamps¶
- Severity
WARNING
- Message
Prometheus is dropping samples with duplicate timestamps.
- Query
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusErrorSendingAlertsToAnyAlertmanager¶
- Severity
CRITICAL
- Message
Prometheus encounters more than 3% errors sending alerts to any Alertmanager.
- Query
min without(alertmanager) (rate(prometheus_notifications_errors_total{alertmanager!~"",job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) / rate(prometheus_notifications_sent_total{alertmanager!~"",job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) * 100 > 3
PrometheusErrorSendingAlertsToSomeAlertmanagers¶
- Severity
WARNING
- Message
Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.
- Query
(rate(prometheus_notifications_errors_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) * 100 > 1
PrometheusLabelLimitHit¶
- Severity
WARNING
- Message
Prometheus has dropped targets because some scrape configs have exceeded the labels limit.
- Query
increase(prometheus_target_scrape_pool_exceeded_label_limits_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusMissingRuleEvaluations¶
- Severity
WARNING
- Message
Prometheus is missing rule evaluations due to slow rule group evaluation.
- Query
increase(prometheus_rule_group_iterations_missed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusNotConnectedToAlertmanagers¶
- Severity
WARNING
- Message
Prometheus is not connected to any Alertmanagers.
- Query
max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) < 1
PrometheusNotificationQueueRunningFull¶
- Severity
WARNING
- Message
Prometheus alert notification queue predicted to run full in less than 30m.
- Query
(predict_linear(prometheus_notifications_queue_length{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]))
PrometheusNotIngestingSamples¶
- Severity
WARNING
- Message
Prometheus is not ingesting samples.
- Query
(rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) <= 0 and (sum without(scrape_job) (prometheus_target_metadata_cache_entries{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}) > 0 or sum without(rule_group) (prometheus_rule_group_rules{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}) > 0))
PrometheusOperatorListErrors¶
- Severity
WARNING
- Message
Errors while performing list operations in controller.
- Query
(sum by(controller, namespace) (rate(prometheus_operator_list_operations_failed_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[10m])) / sum by(controller, namespace) (rate(prometheus_operator_list_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[10m]))) > 0.4
PrometheusOperatorNodeLookupErrors¶
- Severity
WARNING
- Message
Errors while reconciling Prometheus.
- Query
rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]) > 0.1
PrometheusOperatorNotReady¶
- Severity
WARNING
- Message
Prometheus operator not ready
- Query
min by(namespace, controller) (max_over_time(prometheus_operator_ready{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]) == 0)
PrometheusOperatorReconcileErrors¶
- Severity
WARNING
- Message
Errors while reconciling controller.
- Query
(sum by(controller, namespace) (rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) / (sum by(controller, namespace) (rate(prometheus_operator_reconcile_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) > 0.1
PrometheusOperatorRejectedResources¶
- Severity
WARNING
- Message
Resources rejected by Prometheus operator
- Query
min_over_time(prometheus_operator_managed_resources{job="prometheus-operator-operator",namespace="metalk8s-monitoring",state="rejected"}[5m]) > 0
PrometheusOperatorSyncFailed¶
- Severity
WARNING
- Message
Last controller reconciliation failed
- Query
min_over_time(prometheus_operator_syncs{job="prometheus-operator-operator",namespace="metalk8s-monitoring",status="failed"}[5m]) > 0
PrometheusOperatorWatchErrors¶
- Severity
WARNING
- Message
Errors while performing watch operations in controller.
- Query
(sum by(controller, namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m])) / sum by(controller, namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator-operator",namespace="metalk8s-monitoring"}[5m]))) > 0.4
PrometheusOutOfOrderTimestamps¶
- Severity
WARNING
- Message
Prometheus drops samples with out-of-order timestamps.
- Query
rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusRemoteStorageFailures¶
- Severity
CRITICAL
- Message
Prometheus fails to send samples to remote storage.
- Query
((rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) / ((rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) + (rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) or rate(prometheus_remote_storage_samples_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])))) * 100 > 1
PrometheusRemoteWriteBehind¶
- Severity
CRITICAL
- Message
Prometheus remote write is behind.
- Query
(max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) - ignoring(remote_name, url) group_right() max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m])) > 120
PrometheusRemoteWriteDesiredShards¶
- Severity
WARNING
- Message
Prometheus remote write desired shards calculation wants to run more than configured max shards.
- Query
(max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]))
PrometheusRuleFailures¶
- Severity
CRITICAL
- Message
Prometheus is failing rule evaluations.
- Query
increase(prometheus_rule_evaluation_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusScrapeBodySizeLimitHit¶
- Severity
WARNING
- Message
Prometheus has dropped some targets that exceeded body size limit.
- Query
increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusScrapeSampleLimitHit¶
- Severity
WARNING
- Message
Prometheus has failed scrapes that have exceeded the configured sample limit.
- Query
increase(prometheus_target_scrapes_exceeded_sample_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusTargetLimitHit¶
- Severity
WARNING
- Message
Prometheus has dropped targets because some scrape configs have exceeded the targets limit.
- Query
increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[5m]) > 0
PrometheusTargetSyncFailure¶
- Severity
CRITICAL
- Message
Prometheus has failed to sync targets.
- Query
increase(prometheus_target_sync_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[30m]) > 0
PrometheusTSDBCompactionsFailing¶
- Severity
WARNING
- Message
Prometheus has issues compacting blocks.
- Query
increase(prometheus_tsdb_compactions_failed_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[3h]) > 0
PrometheusTSDBReloadsFailing¶
- Severity
WARNING
- Message
Prometheus has issues reloading blocks from disk.
- Query
increase(prometheus_tsdb_reloads_failures_total{job="prometheus-operator-prometheus",namespace="metalk8s-monitoring"}[3h]) > 0
TargetDown¶
- Severity
WARNING
- Message
One or more targets are unreachable.
- Query
100 * (count by(job, namespace, service) (up == 0) / count by(job, namespace, service) (up)) > 10
Watchdog¶
- Severity
NONE
- Message
An alert that should always be firing to certify that Alertmanager is working properly.
- Query
vector(1)
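Since the Watchdog alert is expected to fire at all times, it can be used to check that the whole alerting pipeline is operational; for example, the following illustrative query should always return a series:
ALERTS{alertname="Watchdog", alertstate="firing"}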