Cluster Monitoring¶
This section describes the MetalK8s monitoring and alerting stack, the metric resources that are automatically monitored once MetalK8s is deployed, and the alerting and recording rules that come pre-configured.
Monitoring stack¶
MetalK8s ships with a monitoring stack that provides a cluster-wide view of cluster health, pod status, node status, network traffic and more. These views are typically rendered as charts, counts and graphs. For a closer look, access the Grafana Service to explore the monitoring dashboards deployed with MetalK8s.
The MetalK8s monitoring stack consists of the following main components:

- Alertmanager
- Grafana
- kube-state-metrics
- node-exporter
- Prometheus
- Prometheus Operator
Todo
For each of the components listed above, provide a detailed description of its role within the Monitoring stack.
Default alerting and recording rules are available as a JSON file; this file should be used to generate the corresponding rst table below.
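As an illustration only, the sketch below probes the HTTP health endpoints exposed by three of these components (Prometheus and Alertmanager expose /-/healthy, Grafana exposes /api/health). The localhost ports are assumptions: they suppose each service has been made reachable locally, for instance through kubectl port-forward against the monitoring namespace; adjust the URLs to match your own access method.

```python
# Illustrative only: check the health endpoints of the main monitoring
# components. The localhost ports below are assumptions and suppose each
# service has been port-forwarded locally.
import urllib.request

ENDPOINTS = {
    "Prometheus": "http://localhost:9090/-/healthy",    # assumed local port
    "Alertmanager": "http://localhost:9093/-/healthy",  # assumed local port
    "Grafana": "http://localhost:3000/api/health",      # assumed local port
}

for name, url in ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")
```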
Prometheus¶
In a MetalK8s cluster, the Prometheus service records real-time metrics in a time series database. Prometheus scrapes a set of data sources, called exporters, at a configured polling frequency and aggregates the collected data across those sources. Alerting and recording rules are written in the Prometheus Query Language (PromQL), as shown in the rules listed below.
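For instance, the PromQL expression sum by (job) (up) counts, per scrape job, the targets Prometheus currently sees as up. The sketch below runs such an instant query through the Prometheus HTTP API (GET /api/v1/query); the localhost URL is an assumption and supposes Prometheus has been made reachable locally, for example through a port-forward.

```python
# Illustrative only: run a PromQL instant query against the Prometheus
# HTTP API (GET /api/v1/query).
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed local access point

def instant_query(expr):
    """Return the result list of a PromQL instant query."""
    params = urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{params}") as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

# Count scrape targets that are currently up, grouped by job.
for sample in instant_query("sum by (job) (up)"):
    print(sample["metric"].get("job", "<none>"), sample["value"][1])
```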
Default Alert Rules¶
Alert rules enable a user to specify a condition that must occur before an external system such as Slack is notified. For example, a MetalK8s administrator may want to raise an alert for any node that has been unreachable for more than 1 minute.

Out of the box, MetalK8s ships with preconfigured alert rules, whose conditions are written as PromQL expressions. The table below outlines all the preconfigured alert rules exposed by a newly deployed MetalK8s cluster.

To customize the predefined alert rules, see Prometheus Configuration Customization.
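The rules listed in the table below can also be read from a running cluster: Prometheus exposes the rules it has loaded through its HTTP API (GET /api/v1/rules). The sketch below prints the name and severity of every loaded alerting rule; the localhost URL is an assumption, as above.

```python
# Illustrative only: list the alerting rules loaded by Prometheus through
# its HTTP API (GET /api/v1/rules).
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed local access point

with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/rules") as resp:
    groups = json.load(resp)["data"]["groups"]

for group in groups:
    for rule in group["rules"]:
        if rule["type"] == "alerting":
            severity = rule.get("labels", {}).get("severity", "none")
            print(f"{rule['name']:<50} {severity}")
```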
| Name | Severity | Description |
| --- | --- | --- |
| AlertmanagerConfigInconsistent | critical | The configuration of the instances of the Alertmanager cluster {{$labels.service}} are out of sync. |
| AlertmanagerFailedReload | warning | Reloading Alertmanager’s configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}. |
| AlertmanagerMembersInconsistent | critical | Alertmanager has not found all other members of the cluster. |
| etcdInsufficientMembers | critical | etcd cluster “{{ $labels.job }}”: insufficient members ({{ $value }}). |
| etcdNoLeader | critical | etcd cluster “{{ $labels.job }}”: member {{ $labels.instance }} has no leader. |
| etcdHighNumberOfLeaderChanges | warning | etcd cluster “{{ $labels.job }}”: instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour. |
| etcdHighNumberOfFailedGRPCRequests | warning | etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedGRPCRequests | critical | etcd cluster “{{ $labels.job }}”: {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}. |
| etcdGRPCRequestsSlow | critical | etcd cluster “{{ $labels.job }}”: gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdMemberCommunicationSlow | warning | etcd cluster “{{ $labels.job }}”: member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedProposals | warning | etcd cluster “{{ $labels.job }}”: {{ $value }} proposal failures within the last hour on etcd instance {{ $labels.instance }}. |
| etcdHighFsyncDurations | warning | etcd cluster “{{ $labels.job }}”: 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighCommitDurations | warning | etcd cluster “{{ $labels.job }}”: 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedHTTPRequests | warning | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedHTTPRequests | critical | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHTTPRequestsSlow | warning | etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow. |
| TargetDown | warning | {{ printf “%.4g” $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down. |
| Watchdog | none | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the “DeadMansSnitch” integration in PagerDuty. |
| KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget. |
| KubeStateMetricsListErrors | critical | kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. |
| KubeStateMetricsWatchErrors | critical | kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. |
| KubePodCrashLooping | critical | Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf “%.2f” $value }} times / 5 minutes. |
| KubePodNotReady | critical | Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes. |
| KubeDeploymentGenerationMismatch | critical | Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back. |
| KubeDeploymentReplicasMismatch | critical | Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes. |
| KubeStatefulSetReplicasMismatch | critical | StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes. |
| KubeStatefulSetGenerationMismatch | critical | StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back. |
| KubeStatefulSetUpdateNotRolledOut | critical | StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. |
| KubeDaemonSetRolloutStuck | critical | Only {{ $value \| humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready. |
| KubeContainerWaiting | warning | Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour. |
| KubeDaemonSetNotScheduled | warning | {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled. |
| KubeDaemonSetMisScheduled | warning | {{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run. |
| KubeCronJobRunning | warning | CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete. |
| KubeJobCompletion | warning | Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete. |
| KubeJobFailed | warning | Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. |
| KubeHpaReplicasMismatch | warning | HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes. |
| KubeHpaMaxedOut | warning | HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes. |
| KubeCPUOvercommit | warning | Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. |
| KubeMemoryOvercommit | warning | Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure. |
| KubeCPUQuotaOvercommit | warning | Cluster has overcommitted CPU resource requests for Namespaces. |
| KubeMemoryQuotaOvercommit | warning | Cluster has overcommitted memory resource requests for Namespaces. |
| KubeQuotaExceeded | warning | Namespace {{ $labels.namespace }} is using {{ $value \| humanizePercentage }} of its {{ $labels.resource }} quota. |
| CPUThrottlingHigh | warning | {{ $value \| humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}. |
| KubePersistentVolumeFillingUp | critical | The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value \| humanizePercentage }} free. |
| KubePersistentVolumeFillingUp | warning | Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value \| humanizePercentage }} is available. |
| KubePersistentVolumeErrors | critical | The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}. |
| KubeAPILatencyHigh | warning | The API server has an abnormal latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}. |
| KubeAPIErrorsHigh | warning | API server is returning errors for {{ $value \| humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. |
| KubeClientCertificateExpiration | warning | A client certificate used to authenticate to the apiserver is expiring in less than 7.0 days. |
| KubeClientCertificateExpiration | critical | A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours. |
| AggregatedAPIErrors | warning | An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors have increased for it in the past five minutes. High values indicate that the availability of the service changes too often. |
| AggregatedAPIDown | warning | An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down. It has not been available at least for the past five minutes. |
| KubeAPIDown | critical | KubeAPI has disappeared from Prometheus target discovery. |
| KubeControllerManagerDown | critical | KubeControllerManager has disappeared from Prometheus target discovery. |
| KubeNodeNotReady | warning | {{ $labels.node }} has been unready for more than 15 minutes. |
| KubeNodeUnreachable | warning | {{ $labels.node }} is unreachable and some workloads may be rescheduled. |
| KubeletTooManyPods | warning | Kubelet ‘{{ $labels.node }}’ is running at {{ $value \| humanizePercentage }} of its Pod capacity. |
| KubeNodeReadinessFlapping | warning | The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes. |
| KubeletPlegDurationHigh | warning | The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}. |
| KubeletPodStartUpLatencyHigh | warning | Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}. |
| KubeletDown | critical | Kubelet has disappeared from Prometheus target discovery. |
| KubeSchedulerDown | critical | KubeScheduler has disappeared from Prometheus target discovery. |
| KubeVersionMismatch | warning | There are {{ $value }} different semantic versions of Kubernetes components running. |
| KubeClientErrors | warning | Kubernetes API server client ‘{{ $labels.job }}/{{ $labels.instance }}’ is experiencing {{ $value \| humanizePercentage }} errors. |
| NodeFilesystemSpaceFillingUp | warning | Filesystem is predicted to run out of space within the next 24 hours. |
| NodeFilesystemSpaceFillingUp | critical | Filesystem is predicted to run out of space within the next 4 hours. |
| NodeFilesystemAlmostOutOfSpace | warning | Filesystem has less than 5% space left. |
| NodeFilesystemAlmostOutOfSpace | critical | Filesystem has less than 3% space left. |
| NodeFilesystemFilesFillingUp | warning | Filesystem is predicted to run out of inodes within the next 24 hours. |
| NodeFilesystemFilesFillingUp | critical | Filesystem is predicted to run out of inodes within the next 4 hours. |
| NodeFilesystemAlmostOutOfFiles | warning | Filesystem has less than 5% inodes left. |
| NodeFilesystemAlmostOutOfFiles | critical | Filesystem has less than 3% inodes left. |
| NodeNetworkReceiveErrs | warning | Network interface is reporting many receive errors. |
| NodeNetworkTransmitErrs | warning | Network interface is reporting many transmit errors. |
| NodeHighNumberConntrackEntriesUsed | warning | Number of conntrack entries are getting close to the limit. |
| NodeClockSkewDetected | warning | Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host. |
| NodeClockNotSynchronising | warning | Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host. |
| NodeNetworkInterfaceFlapping | warning | Network interface “{{ $labels.device }}” is changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}. |
| PrometheusOperatorReconcileErrors | warning | Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace }} Namespace. |
| PrometheusOperatorNodeLookupErrors | warning | Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace. |
| PrometheusBadConfig | critical | Failed Prometheus configuration reload. |
| PrometheusNotificationQueueRunningFull | warning | Prometheus alert notification queue predicted to run full in less than 30m. |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | warning | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. |
| PrometheusErrorSendingAlertsToAnyAlertmanager | critical | Prometheus encounters more than 3% errors sending alerts to any Alertmanager. |
| PrometheusNotConnectedToAlertmanagers | warning | Prometheus is not connected to any Alertmanagers. |
| PrometheusTSDBReloadsFailing | warning | Prometheus has issues reloading blocks from disk. |
| PrometheusTSDBCompactionsFailing | warning | Prometheus has issues compacting blocks. |
| PrometheusNotIngestingSamples | warning | Prometheus is not ingesting samples. |
| PrometheusDuplicateTimestamps | warning | Prometheus is dropping samples with duplicate timestamps. |
| PrometheusOutOfOrderTimestamps | warning | Prometheus drops samples with out-of-order timestamps. |
| PrometheusRemoteStorageFailures | critical | Prometheus fails to send samples to remote storage. |
| PrometheusRemoteWriteBehind | critical | Prometheus remote write is behind. |
| PrometheusRemoteWriteDesiredShards | warning | Prometheus remote write desired shards calculation wants to run more than configured max shards. |
| PrometheusRuleFailures | critical | Prometheus is failing rule evaluations. |
| PrometheusMissingRuleEvaluations | warning | Prometheus is missing rule evaluations due to slow rule group evaluation. |
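The alerts that are currently pending or firing can be retrieved at any time through the Prometheus HTTP API (GET /api/v1/alerts); as noted in the table above, the Watchdog alert is expected to always show up as firing. The sketch below prints each active alert with its state and severity; the localhost URL is an assumption and supposes Prometheus has been made reachable locally, for example through a port-forward.

```python
# Illustrative only: list pending and firing alerts through the Prometheus
# HTTP API (GET /api/v1/alerts).
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumed local access point

with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/alerts") as resp:
    alerts = json.load(resp)["data"]["alerts"]

for alert in alerts:
    name = alert["labels"].get("alertname", "<unknown>")
    severity = alert["labels"].get("severity", "none")
    print(f"{alert['state']:<8} {severity:<10} {name}")
```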
Monitoring Configuration¶
To configure monitoring, see the Cluster and Services Configurations page.