Prometheus

In a MetalK8s cluster, the Prometheus service records real-time metrics in a time series database. Prometheus queries a list of data sources, called “exporters”, at a specific polling frequency, and aggregates this data across the various sources.

Prometheus uses a special language, Prometheus Query Language (PromQL), to write alerting and recording rules.
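
For example, assuming the Prometheus API has been made reachable on the local host (the namespace and service name below are assumptions and may differ on your cluster), a simple PromQL query such as up can be evaluated from a shell:

root@host # kubectl --kubeconfig /etc/kubernetes/admin.conf \
    -n metalk8s-monitoring port-forward svc/prometheus-operator-prometheus 9090:9090
root@host # curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=up{job="node-exporter"}'

The second command, run from another shell on the same host, returns the up metric for every node-exporter target, that is, whether each exporter is currently reachable.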

Default Alert Rules

Alert rules enable a user to specify a condition that must occur before an external system like Slack is notified. For example, a MetalK8s administrator might want to raise an alert for any node that is unreachable for more than one minute.

Out of the box, MetalK8s ships with preconfigured alert rules, written as PromQL queries. The table below outlines all the preconfigured alert rules exposed by a newly deployed MetalK8s cluster.

To customize predefined alert rules, refer to Prometheus Configuration Customization.
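
On a running cluster, these predefined rules are deployed as PrometheusRule objects and can be listed directly; a minimal sketch, assuming the default metalk8s-monitoring namespace and the usual admin kubeconfig path:

root@host # kubectl --kubeconfig /etc/kubernetes/admin.conf \
    -n metalk8s-monitoring get prometheusrules

Each object groups several of the alerts below, together with their PromQL expression, evaluation duration, and severity label.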

Default Prometheus Alerting rules

Name | Severity | Description
AlertingServiceAtRisk | critical | The alerting service is at risk.
ClusterAtRisk | critical | The cluster is at risk.
CoreServicesAtRisk | critical | The Core services are at risk.
KubernetesControlPlaneAtRisk | critical | The Kubernetes control plane is at risk.
MonitoringServiceAtRisk | critical | The monitoring service is at risk.
NodeAtRisk | critical | The node {{ $labels.instance }} is at risk.
ObservabilityServicesAtRisk | critical | The observability services are at risk.
PlatformServicesAtRisk | critical | The Platform services are at risk.
SystemPartitionAtRisk | critical | The system partition {{ $labels.mountpoint }} on node {{ $labels.instance }} is at risk.
VolumeAtRisk | critical | The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is at risk.
AccessServicesDegraded | warning | The Access services are degraded.
AlertingServiceDegraded | warning | The alerting service is degraded.
AuthenticationServiceDegraded | warning | The Authentication service for K8S API is degraded.
BootstrapServicesDegraded | warning | The MetalK8s Bootstrap services are degraded.
ClusterDegraded | warning | The cluster is degraded.
CoreServicesDegraded | warning | The Core services are degraded.
DashboardingServiceDegraded | warning | The dashboarding service is degraded.
IngressControllerServicesDegraded | warning | The Ingress Controllers for control plane and workload plane are degraded.
KubernetesControlPlaneDegraded | warning | The Kubernetes control plane is degraded.
LoggingServiceDegraded | warning | The logging service is degraded.
MonitoringServiceDegraded | warning | The monitoring service is degraded.
NetworkDegraded | warning | The network is degraded.
NodeDegraded | warning | The node {{ $labels.instance }} is degraded.
ObservabilityServicesDegraded | warning | The observability services are degraded.
PlatformServicesDegraded | warning | The Platform services are degraded.
SystemPartitionDegraded | warning | The system partition {{ $labels.mountpoint }} on node {{ $labels.instance }} is degraded.
VolumeDegraded | warning | The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is degraded.
AlertmanagerClusterCrashlooping | critical | Half or more of the Alertmanager instances within the same cluster are crashlooping.
AlertmanagerClusterDown | critical | Half or more of the Alertmanager instances within the same cluster are down.
AlertmanagerClusterFailedToSendAlerts | critical | All Alertmanager instances in a cluster failed to send notifications to a critical integration.
AlertmanagerClusterFailedToSendAlerts | warning | All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
AlertmanagerConfigInconsistent | critical | Alertmanager instances within the same cluster have different configurations.
AlertmanagerFailedReload | critical | Reloading an Alertmanager configuration has failed.
AlertmanagerFailedToSendAlerts | warning | An Alertmanager instance failed to send notifications.
AlertmanagerMembersInconsistent | critical | A member of an Alertmanager cluster has not found all other cluster members.
etcdGRPCRequestsSlow | critical | etcd cluster "{{ $labels.job }}": gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}.
etcdHTTPRequestsSlow | warning | etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow.
etcdHighCommitDurations | warning | etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.
etcdHighFsyncDurations | warning | etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.
etcdHighNumberOfFailedGRPCRequests | critical | etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.
etcdHighNumberOfFailedGRPCRequests | warning | etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.
etcdHighNumberOfFailedHTTPRequests | critical | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}.
etcdHighNumberOfFailedHTTPRequests | warning | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}.
etcdHighNumberOfFailedProposals | warning | etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures within the last hour on etcd instance {{ $labels.instance }}.
etcdHighNumberOfLeaderChanges | warning | etcd cluster "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour.
etcdInsufficientMembers | critical | etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).
etcdMemberCommunicationSlow | warning | etcd cluster "{{ $labels.job }}": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.
etcdNoLeader | critical | etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.
TargetDown | warning | One or more targets are unreachable.
Watchdog | none | An alert that should always be firing to certify that Alertmanager is working properly.
KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget.
KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget.
KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget.
KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget.
KubeStateMetricsListErrors | critical | kube-state-metrics is experiencing errors in list operations.
KubeStateMetricsShardingMismatch | critical | kube-state-metrics sharding is misconfigured.
KubeStateMetricsShardsMissing | critical | kube-state-metrics shards are missing.
KubeStateMetricsWatchErrors | critical | kube-state-metrics is experiencing errors in watch operations.
KubeContainerWaiting | warning | Pod container waiting longer than 1 hour.
KubeDaemonSetMisScheduled | warning | DaemonSet pods are misscheduled.
KubeDaemonSetNotScheduled | warning | DaemonSet pods are not scheduled.
KubeDaemonSetRolloutStuck | warning | DaemonSet rollout is stuck.
KubeDeploymentGenerationMismatch | warning | Deployment generation mismatch due to possible roll-back.
KubeDeploymentReplicasMismatch | warning | Deployment has not matched the expected number of replicas.
KubeHpaMaxedOut | warning | HPA is running at max replicas.
KubeHpaReplicasMismatch | warning | HPA has not matched the desired number of replicas.
KubeJobCompletion | warning | Job did not complete in time.
KubeJobFailed | warning | Job failed to complete.
KubePodCrashLooping | warning | Pod is crash looping.
KubePodNotReady | warning | Pod has been in a non-ready state for more than 15 minutes.
KubeStatefulSetGenerationMismatch | warning | StatefulSet generation mismatch due to possible roll-back.
KubeStatefulSetReplicasMismatch | warning | StatefulSet has not matched the expected number of replicas.
KubeStatefulSetUpdateNotRolledOut | warning | StatefulSet update has not been rolled out.
CPUThrottlingHigh | info | Processes experience elevated CPU throttling.
KubeCPUOvercommit | warning | Cluster has overcommitted CPU resource requests.
KubeCPUQuotaOvercommit | warning | Cluster has overcommitted CPU resource requests.
KubeMemoryOvercommit | warning | Cluster has overcommitted memory resource requests.
KubeMemoryQuotaOvercommit | warning | Cluster has overcommitted memory resource requests.
KubeQuotaAlmostFull | info | Namespace quota is going to be full.
KubeQuotaExceeded | warning | Namespace quota has exceeded the limits.
KubeQuotaFullyUsed | info | Namespace quota is fully used.
KubePersistentVolumeErrors | critical | PersistentVolume is having issues with provisioning.
KubePersistentVolumeFillingUp | critical | PersistentVolume is filling up.
KubePersistentVolumeFillingUp | warning | PersistentVolume is filling up.
KubeClientErrors | warning | Kubernetes API server client is experiencing errors.
KubeVersionMismatch | warning | Different semantic versions of Kubernetes components running.
AggregatedAPIDown | warning | An aggregated API is down.
AggregatedAPIErrors | warning | An aggregated API has reported errors.
KubeAPIDown | critical | Target disappeared from Prometheus target discovery.
KubeAPITerminatedRequests | warning | The apiserver has terminated {{ $value | humanizePercentage }} of its incoming requests.
KubeClientCertificateExpiration | critical | Client certificate is about to expire.
KubeClientCertificateExpiration | warning | Client certificate is about to expire.
KubeControllerManagerDown | critical | Target disappeared from Prometheus target discovery.
KubeNodeNotReady | warning | Node is not ready.
KubeNodeReadinessFlapping | warning | Node readiness status is flapping.
KubeNodeUnreachable | warning | Node is unreachable.
KubeletClientCertificateExpiration | critical | Kubelet client certificate is about to expire.
KubeletClientCertificateExpiration | warning | Kubelet client certificate is about to expire.
KubeletClientCertificateRenewalErrors | warning | Kubelet has failed to renew its client certificate.
KubeletDown | critical | Target disappeared from Prometheus target discovery.
KubeletPlegDurationHigh | warning | Kubelet Pod Lifecycle Event Generator is taking too long to relist.
KubeletPodStartUpLatencyHigh | warning | Kubelet Pod startup latency is too high.
KubeletServerCertificateExpiration | critical | Kubelet server certificate is about to expire.
KubeletServerCertificateExpiration | warning | Kubelet server certificate is about to expire.
KubeletServerCertificateRenewalErrors | warning | Kubelet has failed to renew its server certificate.
KubeletTooManyPods | warning | Kubelet is running at capacity.
KubeSchedulerDown | critical | Target disappeared from Prometheus target discovery.
NodeClockNotSynchronising | warning | Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.
NodeClockSkewDetected | warning | Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host.
NodeFileDescriptorLimit | critical | Kernel is predicted to exhaust file descriptors limit soon.
NodeFileDescriptorLimit | warning | Kernel is predicted to exhaust file descriptors limit soon.
NodeFilesystemAlmostOutOfFiles | critical | Filesystem has less than 8% inodes left.
NodeFilesystemAlmostOutOfFiles | warning | Filesystem has less than 15% inodes left.
NodeFilesystemAlmostOutOfSpace | critical | Filesystem has less than 12% space left.
NodeFilesystemAlmostOutOfSpace | warning | Filesystem has less than 20% space left.
NodeFilesystemFilesFillingUp | critical | Filesystem is predicted to run out of inodes within the next 4 hours.
NodeFilesystemFilesFillingUp | warning | Filesystem is predicted to run out of inodes within the next 24 hours.
NodeFilesystemSpaceFillingUp | critical | Filesystem is predicted to run out of space within the next 4 hours.
NodeFilesystemSpaceFillingUp | warning | Filesystem is predicted to run out of space within the next 24 hours.
NodeHighNumberConntrackEntriesUsed | warning | Number of conntrack entries is getting close to the limit.
NodeNetworkReceiveErrs | warning | Network interface is reporting many receive errors.
NodeNetworkTransmitErrs | warning | Network interface is reporting many transmit errors.
NodeRAIDDegraded | critical | RAID array is degraded.
NodeRAIDDiskFailure | warning | Failed device in RAID array.
NodeTextFileCollectorScrapeError | warning | Node Exporter text file collector failed to scrape.
NodeNetworkInterfaceFlapping | warning | Network interface is often changing its status.
PrometheusBadConfig | critical | Failed Prometheus configuration reload.
PrometheusDuplicateTimestamps | warning | Prometheus is dropping samples with duplicate timestamps.
PrometheusErrorSendingAlertsToAnyAlertmanager | critical | Prometheus encounters more than 3% errors sending alerts to any Alertmanager.
PrometheusErrorSendingAlertsToSomeAlertmanagers | warning | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.
PrometheusLabelLimitHit | warning | Prometheus has dropped targets because some scrape configs have exceeded the labels limit.
PrometheusMissingRuleEvaluations | warning | Prometheus is missing rule evaluations due to slow rule group evaluation.
PrometheusNotConnectedToAlertmanagers | warning | Prometheus is not connected to any Alertmanagers.
PrometheusNotIngestingSamples | warning | Prometheus is not ingesting samples.
PrometheusNotificationQueueRunningFull | warning | Prometheus alert notification queue predicted to run full in less than 30m.
PrometheusOutOfOrderTimestamps | warning | Prometheus drops samples with out-of-order timestamps.
PrometheusRemoteStorageFailures | critical | Prometheus fails to send samples to remote storage.
PrometheusRemoteWriteBehind | critical | Prometheus remote write is behind.
PrometheusRemoteWriteDesiredShards | warning | Prometheus remote write desired shards calculation wants to run more than configured max shards.
PrometheusRuleFailures | critical | Prometheus is failing rule evaluations.
PrometheusTSDBCompactionsFailing | warning | Prometheus has issues compacting blocks.
PrometheusTSDBReloadsFailing | warning | Prometheus has issues reloading blocks from disk.
PrometheusTargetLimitHit | warning | Prometheus has dropped targets because some scrape configs have exceeded the targets limit.
PrometheusTargetSyncFailure | critical | Prometheus has failed to sync targets.
PrometheusOperatorListErrors | warning | Errors while performing list operations in controller.
PrometheusOperatorNodeLookupErrors | warning | Errors while reconciling Prometheus.
PrometheusOperatorNotReady | warning | Prometheus operator is not ready.
PrometheusOperatorReconcileErrors | warning | Errors while reconciling controller.
PrometheusOperatorRejectedResources | warning | Resources rejected by Prometheus operator.
PrometheusOperatorSyncFailed | warning | Last controller reconciliation failed.
PrometheusOperatorWatchErrors | warning | Errors while performing watch operations in controller.

Snapshot Prometheus Database

To snapshot the database, you must first enable the Prometheus admin API.
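
Once enabled, the admin API exposes a snapshot endpoint. As an illustration only, assuming the API is reachable on the local host (for example through a port-forward as shown earlier), a snapshot can be triggered manually:

root@host # curl -s -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

The response contains the snapshot name, and the data is written under the snapshots/ directory of the Prometheus data volume. In a MetalK8s cluster, the sosreport utility described below automates this collection.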

To generate a snapshot, use the sosreport utility with the following options:

root@host # sosreport --batch --build -o metalk8s -kmetalk8s.prometheus-snapshot=True

The name of the generated archive is printed in the console output, and the Prometheus snapshot can be found under the prometheus_snapshot directory.

Warning

You must ensure that you have sufficient disk space (at least the size of the Prometheus volume) under /var/tmp, or change the archive destination with the --tmp-dir=<new_dest> option.
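
For example, to write the archive to a filesystem with more free space (the destination path below is only a placeholder):

root@host # sosreport --batch --build -o metalk8s -kmetalk8s.prometheus-snapshot=True --tmp-dir=/mnt/bigdisk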