Cluster Monitoring
This section describes the MetalK8s monitoring and alerting stack, the metric resources that are automatically monitored once MetalK8s is deployed, and the list of pre-configured alerting and recording rules.
Monitoring stack
MetalK8s ships with a monitoring stack that provides a cluster-wide view of cluster health, pod status, node status, network traffic, and more. These views are typically rendered as charts, counts, and graphs. For a closer look, access the Grafana service to explore the monitoring dashboards provided once MetalK8s is deployed.
The MetalK8s monitoring stack consists of the following main components:
Prometheus
In a MetalK8s cluster, the Prometheus service records real-time metrics in a time series database. Prometheus queries a list of data sources, called exporters, at a configured polling frequency, then aggregates this data across the various sources. Prometheus uses a dedicated language, the Prometheus Query Language (PromQL), to write the alerting and recording rules described below.
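For illustration, the sketch below shows how a recording rule can be expressed with PromQL and deployed through the Prometheus Operator PrometheusRule resource. The rule and group names are hypothetical, the namespace is assumed to be the MetalK8s monitoring namespace, and the expression relies on the standard node-exporter node_cpu_seconds_total metric; none of this is part of the rules shipped with MetalK8s.

```yaml
# Illustrative recording rule (not shipped with MetalK8s): precompute the
# per-node CPU utilisation percentage from node-exporter counters.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-recording-rules      # hypothetical name
  namespace: metalk8s-monitoring     # assumed monitoring namespace
spec:
  groups:
    - name: example.recording.rules
      rules:
        # Share of non-idle CPU time per instance over the last 5 minutes,
        # stored under a new time series name for cheap dashboard queries.
        - record: instance:node_cpu_utilisation:rate5m
          expr: |
            100 * (1 - avg by (instance) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            ))
```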
Default Alert Rules
Alert rules enable a user to specify a condition that must occur before an external system, such as Slack, is notified. For example, MetalK8s administrators may want to raise an alert for any node that has been unreachable for more than one minute, as sketched below.
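The following is a minimal sketch of such a rule, assuming the Prometheus Operator PrometheusRule resource and the kube_node_spec_taint metric exposed by kube-state-metrics. The rule name, threshold, and annotation are hypothetical; the shipped KubeNodeUnreachable rule listed in the table below uses its own expression and duration.

```yaml
# Hypothetical alert rule (not one of the shipped MetalK8s rules): fire when
# a node has carried the "unreachable" taint for more than one minute.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert-rules          # hypothetical name
  namespace: metalk8s-monitoring     # assumed monitoring namespace
spec:
  groups:
    - name: example.alerting.rules
      rules:
        - alert: NodeUnreachableExample
          # kube-state-metrics exposes the taint set by the node controller
          # when a node stops reporting its status.
          expr: kube_node_spec_taint{key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: 'Node {{ $labels.node }} has been unreachable for more than 1 minute.'
```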
Out of the box, MetalK8s ships with preconfigured alert rules, written as PromQL expressions. The table below outlines all the preconfigured alert rules exposed by a newly deployed MetalK8s cluster.
To customize the predefined alert rules, see Prometheus Configuration Customization.
| Name | Severity | Description |
|---|---|---|
| AlertmanagerConfigInconsistent | critical | The configuration of the instances of the Alertmanager cluster {{ $labels.namespace }}/{{ $labels.service }} are out of sync. {{ range printf "alertmanager_config_hash{namespace=\"%s\",service=\"%s\"}" $labels.namespace $labels.service \| query }} Configuration hash for pod {{ .Labels.pod }} is "{{ printf "%.f" .Value }}" {{ end }} |
| AlertmanagerFailedReload | warning | Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod }}. |
| AlertmanagerMembersInconsistent | critical | Alertmanager has not found all other members of the cluster. |
| etcdMembersDown | critical | etcd cluster "{{ $labels.job }}": members are down ({{ $value }}). |
| etcdInsufficientMembers | critical | etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}). |
| etcdNoLeader | critical | etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader. |
| etcdHighNumberOfLeaderChanges | warning | etcd cluster "{{ $labels.job }}": {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated. |
| etcdHighNumberOfFailedGRPCRequests | warning | etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedGRPCRequests | critical | etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}. |
| etcdGRPCRequestsSlow | critical | etcd cluster "{{ $labels.job }}": gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdMemberCommunicationSlow | warning | etcd cluster "{{ $labels.job }}": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedProposals | warning | etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures within the last 30 minutes on etcd instance {{ $labels.instance }}. |
| etcdHighFsyncDurations | warning | etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighCommitDurations | warning | etcd cluster "{{ $labels.job }}": 99th percentile commit durations are {{ $value }}s on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedHTTPRequests | warning | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHighNumberOfFailedHTTPRequests | critical | {{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}. |
| etcdHTTPRequestsSlow | warning | etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow. |
| TargetDown | warning | {{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in the {{ $labels.namespace }} namespace are down. |
| Watchdog | none | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example, the "DeadMansSnitch" integration in PagerDuty. |
| KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | critical | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget. |
| KubeAPIErrorBudgetBurn | warning | The API server is burning too much error budget. |
| KubeStateMetricsListErrors | critical | kube-state-metrics is experiencing errors in list operations. |
| KubeStateMetricsWatchErrors | critical | kube-state-metrics is experiencing errors in watch operations. |
| KubePodCrashLooping | warning | Pod is crash looping. |
| KubePodNotReady | warning | Pod has been in a non-ready state for more than 15 minutes. |
| KubeDeploymentGenerationMismatch | warning | Deployment generation mismatch due to possible roll-back. |
| KubeDeploymentReplicasMismatch | warning | Deployment has not matched the expected number of replicas. |
| KubeStatefulSetReplicasMismatch | warning | StatefulSet has not matched the expected number of replicas. |
| KubeStatefulSetGenerationMismatch | warning | StatefulSet generation mismatch due to possible roll-back. |
| KubeStatefulSetUpdateNotRolledOut | warning | StatefulSet update has not been rolled out. |
| KubeDaemonSetRolloutStuck | warning | DaemonSet rollout is stuck. |
| KubeContainerWaiting | warning | Pod container waiting longer than 1 hour. |
| KubeDaemonSetNotScheduled | warning | DaemonSet pods are not scheduled. |
| KubeDaemonSetMisScheduled | warning | DaemonSet pods are misscheduled. |
| KubeJobCompletion | warning | Job did not complete in time. |
| KubeJobFailed | warning | Job failed to complete. |
| KubeHpaReplicasMismatch | warning | HPA has not matched the desired number of replicas. |
| KubeHpaMaxedOut | warning | HPA is running at max replicas. |
| KubeCPUOvercommit | warning | Cluster has overcommitted CPU resource requests. |
| KubeMemoryOvercommit | warning | Cluster has overcommitted memory resource requests. |
| KubeCPUQuotaOvercommit | warning | Cluster has overcommitted CPU resource requests. |
| KubeMemoryQuotaOvercommit | warning | Cluster has overcommitted memory resource requests. |
| KubeQuotaAlmostFull | info | Namespace quota is going to be full. |
| KubeQuotaFullyUsed | info | Namespace quota is fully used. |
| KubeQuotaExceeded | warning | Namespace quota has exceeded the limits. |
| CPUThrottlingHigh | info | Processes experience elevated CPU throttling. |
| KubePersistentVolumeFillingUp | critical | PersistentVolume is filling up. |
| KubePersistentVolumeFillingUp | warning | PersistentVolume is filling up. |
| KubePersistentVolumeErrors | critical | PersistentVolume is having issues with provisioning. |
| KubeClientCertificateExpiration | warning | Client certificate is about to expire. |
| KubeClientCertificateExpiration | critical | Client certificate is about to expire. |
| AggregatedAPIErrors | warning | An aggregated API has reported errors. |
| AggregatedAPIDown | warning | An aggregated API is down. |
| KubeAPIDown | critical | Target disappeared from Prometheus target discovery. |
| KubeControllerManagerDown | critical | Target disappeared from Prometheus target discovery. |
| KubeNodeNotReady | warning | Node is not ready. |
| KubeNodeUnreachable | warning | Node is unreachable. |
| KubeletTooManyPods | warning | Kubelet is running at capacity. |
| KubeNodeReadinessFlapping | warning | Node readiness status is flapping. |
| KubeletPlegDurationHigh | warning | Kubelet Pod Lifecycle Event Generator is taking too long to relist. |
| KubeletPodStartUpLatencyHigh | warning | Kubelet Pod startup latency is too high. |
| KubeletClientCertificateExpiration | warning | Kubelet client certificate is about to expire. |
| KubeletClientCertificateExpiration | critical | Kubelet client certificate is about to expire. |
| KubeletServerCertificateExpiration | warning | Kubelet server certificate is about to expire. |
| KubeletServerCertificateExpiration | critical | Kubelet server certificate is about to expire. |
| KubeletClientCertificateRenewalErrors | warning | Kubelet has failed to renew its client certificate. |
| KubeletServerCertificateRenewalErrors | warning | Kubelet has failed to renew its server certificate. |
| KubeletDown | critical | Target disappeared from Prometheus target discovery. |
| KubeSchedulerDown | critical | Target disappeared from Prometheus target discovery. |
| KubeVersionMismatch | warning | Different semantic versions of Kubernetes components running. |
| KubeClientErrors | warning | Kubernetes API server client is experiencing errors. |
| NodeFilesystemSpaceFillingUp | warning | Filesystem is predicted to run out of space within the next 24 hours. |
| NodeFilesystemSpaceFillingUp | critical | Filesystem is predicted to run out of space within the next 4 hours. |
| NodeFilesystemAlmostOutOfSpace | warning | Filesystem has less than 5% space left. |
| NodeFilesystemAlmostOutOfSpace | critical | Filesystem has less than 3% space left. |
| NodeFilesystemFilesFillingUp | warning | Filesystem is predicted to run out of inodes within the next 24 hours. |
| NodeFilesystemFilesFillingUp | critical | Filesystem is predicted to run out of inodes within the next 4 hours. |
| NodeFilesystemAlmostOutOfFiles | warning | Filesystem has less than 5% inodes left. |
| NodeFilesystemAlmostOutOfFiles | critical | Filesystem has less than 3% inodes left. |
| NodeNetworkReceiveErrs | warning | Network interface is reporting many receive errors. |
| NodeNetworkTransmitErrs | warning | Network interface is reporting many transmit errors. |
| NodeHighNumberConntrackEntriesUsed | warning | Number of conntrack entries is getting close to the limit. |
| NodeClockSkewDetected | warning | Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host. |
| NodeClockNotSynchronising | warning | Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host. |
| NodeTextFileCollectorScrapeError | warning | Node Exporter text file collector failed to scrape. |
| NodeRAIDDegraded | critical | RAID array is degraded. |
| NodeRAIDDiskFailure | warning | Failed device in RAID array. |
| NodeNetworkInterfaceFlapping | warning | Network interface "{{ $labels.device }}" is changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}. |
| PrometheusOperatorListErrors | warning | Errors while performing list operations in controller. |
| PrometheusOperatorWatchErrors | warning | Errors while performing watch operations in controller. |
| PrometheusOperatorSyncFailed | warning | Last controller reconciliation failed. |
| PrometheusOperatorReconcileErrors | warning | Errors while reconciling controller. |
| PrometheusOperatorNodeLookupErrors | warning | Errors while reconciling Prometheus. |
| PrometheusOperatorNotReady | warning | Prometheus operator not ready. |
| PrometheusOperatorRejectedResources | warning | Resources rejected by Prometheus operator. |
| PrometheusBadConfig | critical | Failed Prometheus configuration reload. |
| PrometheusNotificationQueueRunningFull | warning | Prometheus alert notification queue predicted to run full in less than 30m. |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | warning | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. |
| PrometheusErrorSendingAlertsToAnyAlertmanager | critical | Prometheus encounters more than 3% errors sending alerts to any Alertmanager. |
| PrometheusNotConnectedToAlertmanagers | warning | Prometheus is not connected to any Alertmanagers. |
| PrometheusTSDBReloadsFailing | warning | Prometheus has issues reloading blocks from disk. |
| PrometheusTSDBCompactionsFailing | warning | Prometheus has issues compacting blocks. |
| PrometheusNotIngestingSamples | warning | Prometheus is not ingesting samples. |
| PrometheusDuplicateTimestamps | warning | Prometheus is dropping samples with duplicate timestamps. |
| PrometheusOutOfOrderTimestamps | warning | Prometheus drops samples with out-of-order timestamps. |
| PrometheusRemoteStorageFailures | critical | Prometheus fails to send samples to remote storage. |
| PrometheusRemoteWriteBehind | critical | Prometheus remote write is behind. |
| PrometheusRemoteWriteDesiredShards | warning | Prometheus remote write desired shards calculation wants to run more than configured max shards. |
| PrometheusRuleFailures | critical | Prometheus is failing rule evaluations. |
| PrometheusMissingRuleEvaluations | warning | Prometheus is missing rule evaluations due to slow rule group evaluation. |
| PrometheusTargetLimitHit | warning | Prometheus has dropped targets because some scrape configs have exceeded the targets limit. |
Monitoring Configuration
To configure monitoring, see the Cluster and Services Configurations page.