Prometheus
In a MetalK8s cluster, the Prometheus service records real-time metrics in a time series database. Prometheus queries a list of data sources called “exporters” at a configurable polling frequency, and aggregates this data across the various sources.
Prometheus uses a dedicated language, the Prometheus Query Language (PromQL), to write alerting and recording rules.
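For example, the following query aggregates per-node CPU usage over the last five minutes across all cores. This is a minimal sketch, assuming the standard node-exporter metric node_cpu_seconds_total and a Prometheus API reachable on localhost:9090 (for instance through a port-forward); adjust the URL to your environment.

```shell
# Run an instant PromQL query through the Prometheus HTTP API:
# per-node CPU usage rate over the last 5 minutes, summed across cores.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode \
  'query=sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))'
```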
Default Alert Rules
Alert rules enable a user to specify a condition that must occur before an external system like Slack is notified. For example, a MetalK8s administrator might want to raise an alert for any node that is unreachable for more than one minute.
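As an illustration only (this is not one of the shipped MetalK8s rules), such a condition could be expressed with the standard up metric, which is 0 when Prometheus fails to scrape a target; the job label and severity below are assumptions:

```shell
# Sketch of an alerting rule for the "node unreachable for more than
# one minute" example; validate it with promtool (shipped with Prometheus).
cat > node-unreachable.yaml <<'EOF'
groups:
  - name: example
    rules:
      - alert: NodeUnreachable
        # "up" is 0 when Prometheus cannot scrape the target.
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} unreachable for 1 minute."
EOF
promtool check rules node-unreachable.yaml
```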
Out of the box, MetalK8s ships with preconfigured alert rules, written as PromQL queries. The table below outlines all the alert rules exposed by a newly deployed MetalK8s cluster.
To customize predefined alert rules, refer to Prometheus Configuration Customization.
| Name | Severity | Description |
|---|---|---|
| AlertingServiceAtRisk | critical | The alerting service is at risk. |
| ClusterAtRisk | critical | The cluster is at risk. |
| CoreServicesAtRisk | critical | The Core services are at risk. |
| KubernetesControlPlaneAtRisk | critical | The Kubernetes control plane is at risk. |
| MonitoringServiceAtRisk | critical | The monitoring service is at risk. |
| NodeAtRisk | critical | The node {{ $labels.instance }} is at risk. |
| ObservabilityServicesAtRisk | critical | The observability services are at risk. |
| PlatformServicesAtRisk | critical | The Platform services are at risk. |
| SystemPartitionAtRisk | critical | The system partition {{ $labels.mountpoint }} on node {{ $labels.instance }} is at risk. |
| VolumeAtRisk | critical | The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is at risk. |
| AccessServicesDegraded | warning | The Access services are degraded. |
| AlertingServiceDegraded | warning | The alerting service is degraded. |
| AuthenticationServiceDegraded | warning | The Authentication service for the K8S API is degraded. |
| BootstrapServicesDegraded | warning | The MetalK8s Bootstrap services are degraded. |
| ClusterDegraded | warning | The cluster is degraded. |
| CoreServicesDegraded | warning | The Core services are degraded. |
| DashboardingServiceDegraded | warning | The dashboarding service is degraded. |
| IngressControllerServicesDegraded | warning | The Ingress Controllers for control plane and workload plane are degraded. |
| KubernetesControlPlaneDegraded | warning | The Kubernetes control plane is degraded. |
| LoggingServiceDegraded | warning | The logging service is degraded. |
| MonitoringServiceDegraded | warning | The monitoring service is degraded. |
| NetworkDegraded | warning | The network is degraded. |
| NodeDegraded | warning | The node {{ $labels.instance }} is degraded. |
| ObservabilityServicesDegraded | warning | The observability services are degraded. |
| PlatformServicesDegraded | warning | The Platform services are degraded. |
| SystemPartitionDegraded | warning | The system partition {{ $labels.mountpoint }} on node {{ $labels.instance }} is degraded. |
| VolumeDegraded | warning | The volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} on node {{ $labels.instance }} is degraded. |
| AlertmanagerClusterCrashlooping | critical | Half or more of the Alertmanager instances within the same cluster are crashlooping. |
| AlertmanagerClusterDown | critical | Half or more of the Alertmanager instances within the same cluster are down. |
| AlertmanagerClusterFailedToSendAlerts | critical | All Alertmanager instances in a cluster failed to send notifications to a critical integration. |
| AlertmanagerClusterFailedToSendAlerts | warning | All Alertmanager instances in a cluster failed to send notifications to a non-critical integration. |
| AlertmanagerConfigInconsistent | critical | Alertmanager instances within the same cluster have different configurations. |
| AlertmanagerFailedReload | critical | Reloading an Alertmanager configuration has failed. |
| AlertmanagerFailedToSendAlerts | warning | An Alertmanager instance failed to send notifications. |
| AlertmanagerMembersInconsistent | critical | A member of an Alertmanager cluster has not found all other cluster members. |
| etcdGRPCRequestsSlow | critical | etcd cluster “{{ $labels.job }}”: gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHTTPRequestsSlow | warning | etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow. |
| etcdHighCommitDurations | warning | etcd cluster “{{ $labels.job }}”: 99th percentile commit durations are {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighFsyncDurations | warning | etcd cluster “{{ $labels.job }}”: 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedGRPCRequests | critical | etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedGRPCRequests | warning | etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedHTTPRequests | critical | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedHTTPRequests | warning | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedProposals | warning | etcd cluster “{{ $labels.job }}”: {{ $value }} proposal failures within the last hour on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfLeaderChanges | warning | etcd cluster “{{ $labels.job }}”: instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour. |
| etcdInsufficientMembers | critical | etcd cluster “{{ $labels.job }}”: insufficient members ({{ $value }}). |
| etcdMemberCommunicationSlow | warning | etcd cluster “{{ $labels.job }}”: member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdNoLeader | critical | etcd cluster “{{ $labels.job }}”: member {{ $labels.instance }} has no leader. |
| TargetDown | warning | One or more targets are unreachable. |
| Watchdog | none | An alert that should always be firing to certify that Alertmanager is working properly. |
| KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget. |
| KubeStateMetricsListErrors | critical | kube-state-metrics is experiencing errors in list operations. |
| KubeStateMetricsShardingMismatch | critical | kube-state-metrics sharding is misconfigured. |
| KubeStateMetricsShardsMissing | critical | kube-state-metrics shards are missing. |
| KubeStateMetricsWatchErrors | critical | kube-state-metrics is experiencing errors in watch operations. |
| KubeContainerWaiting | warning | Pod container waiting longer than 1 hour. |
| KubeDaemonSetMisScheduled | warning | DaemonSet pods are misscheduled. |
| KubeDaemonSetNotScheduled | warning | DaemonSet pods are not scheduled. |
| KubeDaemonSetRolloutStuck | warning | DaemonSet rollout is stuck. |
| KubeDeploymentGenerationMismatch | warning | Deployment generation mismatch due to possible roll-back. |
| KubeDeploymentReplicasMismatch | warning | Deployment has not matched the expected number of replicas. |
| KubeHpaMaxedOut | warning | HPA is running at max replicas. |
| KubeHpaReplicasMismatch | warning | HPA has not matched the desired number of replicas. |
| KubeJobCompletion | warning | Job did not complete in time. |
| KubeJobFailed | warning | Job failed to complete. |
| KubePodCrashLooping | warning | Pod is crash looping. |
| KubePodNotReady | warning | Pod has been in a non-ready state for more than 15 minutes. |
| KubeStatefulSetGenerationMismatch | warning | StatefulSet generation mismatch due to possible roll-back. |
| KubeStatefulSetReplicasMismatch | warning | StatefulSet has not matched the expected number of replicas. |
| KubeStatefulSetUpdateNotRolledOut | warning | StatefulSet update has not been rolled out. |
| CPUThrottlingHigh | info | Processes experience elevated CPU throttling. |
| KubeCPUOvercommit | warning | Cluster has overcommitted CPU resource requests. |
| KubeCPUQuotaOvercommit | warning | Cluster has overcommitted CPU resource requests. |
| KubeMemoryOvercommit | warning | Cluster has overcommitted memory resource requests. |
| KubeMemoryQuotaOvercommit | warning | Cluster has overcommitted memory resource requests. |
| KubeQuotaAlmostFull | info | Namespace quota is going to be full. |
| KubeQuotaExceeded | warning | Namespace quota has exceeded the limits. |
| KubeQuotaFullyUsed | info | Namespace quota is fully used. |
| KubePersistentVolumeErrors | critical | PersistentVolume is having issues with provisioning. |
| KubePersistentVolumeFillingUp | critical | PersistentVolume is filling up. |
| KubePersistentVolumeFillingUp | warning | PersistentVolume is filling up. |
| KubeClientErrors | warning | Kubernetes API server client is experiencing errors. |
| KubeVersionMismatch | warning | Different semantic versions of Kubernetes components running. |
| AggregatedAPIDown | warning | An aggregated API is down. |
| AggregatedAPIErrors | warning | An aggregated API has reported errors. |
| KubeAPIDown | critical | Target disappeared from Prometheus target discovery. |
| KubeAPITerminatedRequests | warning | The apiserver has terminated {{ $value \| humanizePercentage }} of its incoming requests. |
| KubeClientCertificateExpiration | critical | Client certificate is about to expire. |
| KubeClientCertificateExpiration | warning | Client certificate is about to expire. |
| KubeControllerManagerDown | critical | Target disappeared from Prometheus target discovery. |
| KubeNodeNotReady | warning | Node is not ready. |
| KubeNodeReadinessFlapping | warning | Node readiness status is flapping. |
| KubeNodeUnreachable | warning | Node is unreachable. |
| KubeletClientCertificateExpiration | critical | Kubelet client certificate is about to expire. |
| KubeletClientCertificateExpiration | warning | Kubelet client certificate is about to expire. |
| KubeletClientCertificateRenewalErrors | warning | Kubelet has failed to renew its client certificate. |
| KubeletDown | critical | Target disappeared from Prometheus target discovery. |
| KubeletPlegDurationHigh | warning | Kubelet Pod Lifecycle Event Generator is taking too long to relist. |
| KubeletPodStartUpLatencyHigh | warning | Kubelet Pod startup latency is too high. |
| KubeletServerCertificateExpiration | critical | Kubelet server certificate is about to expire. |
| KubeletServerCertificateExpiration | warning | Kubelet server certificate is about to expire. |
| KubeletServerCertificateRenewalErrors | warning | Kubelet has failed to renew its server certificate. |
| KubeletTooManyPods | warning | Kubelet is running at capacity. |
| KubeSchedulerDown | critical | Target disappeared from Prometheus target discovery. |
| NodeClockNotSynchronising | warning | Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host. |
| NodeClockSkewDetected | warning | Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host. |
| NodeFileDescriptorLimit | critical | Kernel is predicted to exhaust file descriptors limit soon. |
| NodeFileDescriptorLimit | warning | Kernel is predicted to exhaust file descriptors limit soon. |
| NodeFilesystemAlmostOutOfFiles | critical | Filesystem has less than 8% inodes left. |
| NodeFilesystemAlmostOutOfFiles | warning | Filesystem has less than 15% inodes left. |
| NodeFilesystemAlmostOutOfSpace | critical | Filesystem has less than 12% space left. |
| NodeFilesystemAlmostOutOfSpace | warning | Filesystem has less than 20% space left. |
| NodeFilesystemFilesFillingUp | critical | Filesystem is predicted to run out of inodes within the next 4 hours. |
| NodeFilesystemFilesFillingUp | warning | Filesystem is predicted to run out of inodes within the next 24 hours. |
| NodeFilesystemSpaceFillingUp | critical | Filesystem is predicted to run out of space within the next 4 hours. |
| NodeFilesystemSpaceFillingUp | warning | Filesystem is predicted to run out of space within the next 24 hours. |
| NodeHighNumberConntrackEntriesUsed | warning | Number of conntrack entries is getting close to the limit. |
| NodeNetworkReceiveErrs | warning | Network interface is reporting many receive errors. |
| NodeNetworkTransmitErrs | warning | Network interface is reporting many transmit errors. |
| NodeRAIDDegraded | critical | RAID array is degraded. |
| NodeRAIDDiskFailure | warning | Failed device in RAID array. |
| NodeTextFileCollectorScrapeError | warning | Node Exporter text file collector failed to scrape. |
| NodeNetworkInterfaceFlapping | warning | Network interface is often changing its status. |
| PrometheusBadConfig | critical | Failed Prometheus configuration reload. |
| PrometheusDuplicateTimestamps | warning | Prometheus is dropping samples with duplicate timestamps. |
| PrometheusErrorSendingAlertsToAnyAlertmanager | critical | Prometheus encounters more than 3% errors sending alerts to any Alertmanager. |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | warning | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. |
| PrometheusLabelLimitHit | warning | Prometheus has dropped targets because some scrape configs have exceeded the labels limit. |
| PrometheusMissingRuleEvaluations | warning | Prometheus is missing rule evaluations due to slow rule group evaluation. |
| PrometheusNotConnectedToAlertmanagers | warning | Prometheus is not connected to any Alertmanagers. |
| PrometheusNotIngestingSamples | warning | Prometheus is not ingesting samples. |
| PrometheusNotificationQueueRunningFull | warning | Prometheus alert notification queue predicted to run full in less than 30m. |
| PrometheusOutOfOrderTimestamps | warning | Prometheus drops samples with out-of-order timestamps. |
| PrometheusRemoteStorageFailures | critical | Prometheus fails to send samples to remote storage. |
| PrometheusRemoteWriteBehind | critical | Prometheus remote write is behind. |
| PrometheusRemoteWriteDesiredShards | warning | Prometheus remote write desired shards calculation wants to run more than configured max shards. |
| PrometheusRuleFailures | critical | Prometheus is failing rule evaluations. |
| PrometheusTSDBCompactionsFailing | warning | Prometheus has issues compacting blocks. |
| PrometheusTSDBReloadsFailing | warning | Prometheus has issues reloading blocks from disk. |
| PrometheusTargetLimitHit | warning | Prometheus has dropped targets because some scrape configs have exceeded the targets limit. |
| PrometheusTargetSyncFailure | critical | Prometheus has failed to sync targets. |
| PrometheusOperatorListErrors | warning | Errors while performing list operations in controller. |
| PrometheusOperatorNodeLookupErrors | warning | Errors while reconciling Prometheus. |
| PrometheusOperatorNotReady | warning | Prometheus operator not ready. |
| PrometheusOperatorReconcileErrors | warning | Errors while reconciling controller. |
| PrometheusOperatorRejectedResources | warning | Resources rejected by Prometheus operator. |
| PrometheusOperatorSyncFailed | warning | Last controller reconciliation failed. |
| PrometheusOperatorWatchErrors | warning | Errors while performing watch operations in controller. |
Snapshot Prometheus Database
To snapshot the database, you must first enable the Prometheus admin API.
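Once the admin API is enabled, a snapshot can also be triggered manually through the TSDB admin endpoint (the sosreport plugin below automates this step). A minimal sketch, assuming the Prometheus API is reachable on localhost:9090, for example through a port-forward:

```shell
# Ask Prometheus to snapshot its TSDB; the response contains the
# snapshot name, created under <data-dir>/snapshots/ on the volume.
curl -XPOST 'http://localhost:9090/api/v1/admin/tsdb/snapshot'
```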
To generate a snapshot, use the sosreport utility with the following options:
root@host # sosreport --batch --build -o metalk8s -kmetalk8s.prometheus-snapshot=True
The name of the generated archive is printed on the console output, and the Prometheus snapshot can be found under the prometheus_snapshot directory.
Warning
You must ensure you have sufficient disk space (at least the size of the Prometheus volume) under /var/tmp, or change the archive destination with the --tmp-dir=<new_dest> option.
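To check both sides of this requirement, you can compare the free space under /var/tmp with the size of the Prometheus volume claim. The metalk8s-monitoring namespace below is an assumption about where the monitoring stack runs; adjust it to your deployment:

```shell
# Free space where sosreport builds its archive by default.
df -h /var/tmp

# Size of the Prometheus volume claim (namespace assumed).
kubectl --kubeconfig /etc/kubernetes/admin.conf \
  -n metalk8s-monitoring get pvc
```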