MetalK8s aims to provide, out of the box, services that ease the monitoring and operation of the platform as well as of the workloads running on top of it.
It currently provides a monitoring service, powered by Prometheus and Alertmanager, which exposes metrics and alerts for both control-plane and workload-plane components as well as for the hardware and OS of the cluster nodes.
The logs generated by the platform and the workloads constitute an essential piece of information when it comes to understanding the root cause of a failure or a performance degradation. Because of the distributed nature of MetalK8s and of the workloads running on top of it, administrators need tooling to ease the analysis of near real-time and past logs from a central endpoint. As such, these logs should be stored on the platform for a configurable period. Logs can be browsed through an API and a UI. The UI should ease correlation between logs and health/performance KPIs as well as alerts.
For organisations having their own log centralization system (like Splunk or Elasticsearch), MetalK8s should provide documentation to guide customers in deploying and configuring their own log collection agent.
The following requirements focus on application logs. Audit logs are not part of the requirements.
A lightweight tool to store and expose logs is required in order to minimize the HW footprints (CPU, RAM, Disks):
limited history: Storing the logs for a very long period (e.g. 3 years) is not something MetalK8s needs to provide as a feature. This can be achieved using an external log centralization system.
stable ingestion: It is important to guarantee stable ingestion of the logs and less important to guarantee stable performance when browsing/searching the logs. However, peak loads related to complex log queries should not impact the application workloads; deploying the log storage and search service on infra nodes might help achieve this isolation.
stream indexing: It is not required to have automatic indexing of logs content. Instead, the log centralization service should offer basic features to group/filter logs per tag/metadata defining the log stream.
Accessible from a central UI/API
Platform Admin or Storage Admin can visualize logs from all containers in all namespaces, as well as journal logs (kubelet, containerd, salt-minion, kernel, initrd, services, etc.), in Grafana. One can correlate logs and metrics or alerts in one single Grafana Dashboard. Browsing logs can be achieved through a documented API in order to expose logs in MetalK8s UIs or other workloads UIs if needed.
Logs should be stored on persistent storage. The Platform Administrator can configure a max retention period. An automatic purging mechanism is triggered when logs are older than the retention period or when the persistent storage is about to reach its capacity limit. Purging jobs are logged. The default, and typical, retention period is 2 weeks. A formula can be used by solution developers in order to properly size the persistent storage for log centralization (cf. documentation requirements).
Horizontally scalable (capacity and performance)
The Platform Administrator can scale the service in order to ingest, query and store more logs, either because more workloads are running on the platform or because a longer history of logs needs to be kept.
Log collection, ingestion, storage and query services can be replicated in order to ensure that we can lose at least one server in the cluster without impacting availability and reliability of the service.
Get all logs for a given period, node(s), pod regex, limited list of predefined labels and free-text keywords. Typical Zenko use case: collecting all logs, across several components, related to a unique S3 request. Typical predefined labels are severity and namespace.
Log statistics (nice to have)
The Log centralization service also offers the ability to consume statistics about the logs like the number of occurrences of one type of log during a certain period of time.
Monitorable/Observable service (health, performances and alerts)
The Platform Administrator can monitor capacity usage, ingestion rate, IOPS, latency and bandwidth of the Log centralization service. He can also monitor the health of the service (i.e. if some active alerts exist). He is notified through an alert notification when the service is degraded or unavailable. It can be because the persistent storage is full or unhealthy or because the service does not manage to ingest logs at the requested pace.
Here are a few examples of situations we would like to detect through those KPIs:
a workload generating an excessive amount of logs
a burst of ingested logs
the log persistent storage getting full
very slow api responses (impacting usability in Grafana dashboards)
the ingestion of logs working too slowly
Typical workloads can generate around 1000 logs per second per node.
As a Platform Administrator, I want to browse all MetalK8s containers logs (from all servers) from a unique endpoint, in order to ease distributed K8s and MetalK8s services error investigation.
As a Platform Administrator, I want to browse non container (kubelet, containerd, salt-minion, cron) logs (from all servers) from a unique endpoint, in order to ease System error investigation.
As a Storage Administrator, I want to browse all Solution instance containers logs (from all servers) from a unique endpoint, in order to ease Solution instance error investigation.
As a Platform Administrator, I want to push all container logs to an external log centralization system, in order to archive them or aggregate them with other application logs.
(Some other user stories extracted from the Loki design document)
After receiving an alert on my service and drilling into the query associated with said alert, I want to quickly see the logs associated with the jobs which produced those timeseries at the time of the alert.
After a pod or node disappears, I want to be able to retrieve logs from just before it died, so I can diagnose why it died.
After discovering an ongoing issue with my service, I want to extract a metric from some logs and combine it with my existing time series data.
I have a legacy job which does not expose metrics about errors - it only logs them. I want to build an alert based on the rate of occurrences of errors in the log.
Deployment & Configuration
The log centralization storage service is scheduled on infra nodes. A platform Administrator can operate the service as follows:
add persistent storage
configure max retention period
adjust the number of replicas
configure the system so that logs are pushed to an external log centralization service
configure log service alerts (IOPS or ingestion rate, latency, bandwidth, capacity usage) i.e. adjust the thresholds, silence some alerts or configure notifications.
Those operations are accessible from any host able to access the control plane network and are exposed through the centralised CLI framework.
When installing or upgrading MetalK8s, the log centralization service is automatically scheduled (as soon as a persistent volume is provisioned) on one infra node.
All configurations of the log centralization service are part of the MetalK8s backup and remain unchanged when performing an upgrade.
During future MetalK8s upgrades, the service stays available (when replicated).
An alert rule is fired when the log centralization service is not healthy
The log centralization service is not healthy when log storage is getting full or when service is not able to ingest logs at the right pace.
IOPS, bandwidth, latency, capacity usage KPIs are available in Prometheus
Logs can be seen in Grafana. Log centralization monitoring information is displayed in the MetalK8s UI overview page. A Grafana dashboard gathering health/performance KPIs, as well as alerts for the log centralization service, is available when deploying/upgrading MetalK8s.
Our Log Centralization system can be split into several components as follows:
The collector is responsible for processing logs from all the sources (files, journal, containers, …), enriching the logs with metadata (labels) coming from parsing/filtering them (e.g. drop record on regexp match), from the context (e.g. file path, host) or by querying external sources such as APIs, then finally forwarding these logs to one or multiple distributors.
The distributor is the component that receives incoming streams from the collectors. It validates them (e.g. labels format, timestamp), then forwards them to the ingester. The distributor can also parse/filter the streams to enrich them with metadata (labels), route them to a specific ingester or even drop them. It can also do some buffering, either to avoid overloading the ingester with requests or to wait a bit in case the ingester is unresponsive for a moment.
The ingester serves as a buffer between distributor and storage, because writing large chunks of data is more efficient than writing each event individually as it arrives. As such, a querier may need to ask an ingester about what is in the buffer.
The querier interprets the queries it receives from clients and asks the ingesters for the corresponding data, falling back on the storage backend if the data is not present in memory, then returns it to the clients. It also takes care of deduplicating the data, which is needed because of the replication.
To choose which solution best fits our needs, we benchmarked the shortlisted collectors to compare their performance, on an architecture of 4 K8s nodes (3 infra + 1 workload), with Loki as backend.
Our choice for the final design has been greatly motivated by these numbers, which represent the global resource consumption of the whole log centralization stack (Loki included).
With 10k events/sec, composed of 10 distinct streams:
[Table: CPU avg (in m) and RAM avg (in MiB) per collector setup, including fluent-bit + fluentd; values not reproduced here.]
With 10k events/sec, composed of 1000 distinct streams:
[Table: CPU avg (in m) and RAM avg (in MiB) per collector setup, including fluent-bit + fluentd; values not reproduced here.]
We can see that the Fluent Bit + Loki couple has the smallest impact on resources, but also that the Fluent Bit + Fluentd + Loki architecture seems to scale better, with the possibility of putting less pressure on workload nodes (Fluent Bit) and relying more on dedicated infra nodes (Fluentd).
define sizing rules for CPU & RAM based on log streams and rate
Fluent Bit + Loki
We chose Fluent Bit as the collector because it can scrape all the logs we need (journal & containers), enriches them using the Kubernetes API and has a very low resource footprint.
Moreover, it supports multiple backends, such as Loki, ES, Splunk, etc., at the same time, which is a very important point if a user also wants to forward the logs to an external log centralization system (e.g. for long-term archiving).
Check here for the official list of supported outputs.
Loki has been chosen as the distributor, ingester & querier because, like Fluent Bit, it has a really small impact on resources and is very cost effective regarding storage needs.
It also uses the LogQL syntax for queries, which is pretty close to what we already have with Prometheus and PromQL, so it eases the integration in our tools.
Rejected Design Choices
Promtail + Loki
Promtail has been rejected because, even if it consumes very few resources, it can only integrate with Loki and we need something more versatile, with the ability to interact with different distributors.
Fluentd + Loki
This architecture has been rejected because it means we need 1 Fluentd instance per node, which greatly increases resource consumption compared to other solutions.
Fluent Bit + Fluentd + Loki
We have considered this architecture as the Fluent Bit + Fluentd couple seems to be a standard in the industry, but we didn’t find any reason to keep Fluentd, apart from its large panel of plugins, which we don’t really need. Fluent Bit alone seems to be sufficient for what we want to achieve, and adding an extra Fluentd means more resource consumption and unnecessary complexity in the log centralization stack.
Logging Operator seemed to be a good candidate for the implementation we chose, offering the ability to deploy and configure Fluent Bit and Fluentd, but Fluentd is not optional and seems to have a central place, as most of the parsing/filtering is done by it, which means a bigger footprint on the hardware resources.
Logstash + Elasticsearch
This architecture is probably the most common one, but it has not been taken into consideration because we want to focus on minimal resource consumption and these components can really hog RAM & CPU. Beats could be used as the log collector to reduce the impact of Logstash, but Elasticsearch still consumes a lot of resources. Even if this solution offers a lot of powerful functionalities (e.g. distributed storage, full-text indexing), we don’t really need them and want to focus on the smallest hardware footprint.
All the components will be deployed using Kubernetes manifests through Salt
Fluent Bit will be deployed as a DaemonSet, because we need one collector on each node to be able to collect logs from the Kubernetes platform, from the applications that run on top of it and from all the system daemons running alongside (e.g. Salt minion, Kubelet).
This DaemonSet and Fluent Bit configuration will be handled by the Fluent Bit Operator. For this, we need to deploy the manifests found here, using our Salt Kubernetes renderer.
We will then also need to automatically deploy CRs with the default Fluent Bit configuration; examples can be found here.
Loki has 2 deployment modes: either as microservices (with distributor, ingester and querier in distinct pods) or as a monolith. We must use the monolithic mode, because we use the filesystem as the storage backend and the microservices mode does not support it.
Loki will be deployed as a StatefulSet as it needs persistent storage to write the logs. It will be running only on infra nodes and we need at least 2 replicas of it (except on a single node platform), to ensure its high availability.
As we already did for other components (e.g. Prometheus Operator), the manifests for Loki will be generated statically by rendering the loki helm chart:
    helm repo add loki https://grafana.github.io/loki/charts
    helm repo update
    helm fetch -d charts --untar loki/loki
    ./charts/render.py loki --namespace metalk8s-logging \
      charts/loki.yaml charts/loki/ \
      > salt/metalk8s/addons/loki/deployed/charts.sls
Since we’re using the filesystem to store Loki’s data, there are basically 2 ways to have multiple Loki instances running at the same time with the same set of data.
Either we do like Prometheus and store everything on every instance of Loki, which raises the storage, RAM and CPU needs for each additional instance, or we use a new experimental feature from Loki 1.5.0 where Loki ingesters use a hash ring and talk to each other to route the queries to the right one. The latter approach needs an external KV store, such as etcd or Consul, to work.
We chose the first approach, as the second is not production ready and, since we plan to only do short-term retention on Loki, the impact on storage will not be that significant. Moreover, it eases deployment and maintenance since there is no extra component.
define loki storage sizing depending on retention, ingestion rate, average size of logs, number of streams, …
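As a starting point for this sizing rule, here is a minimal sketch of a per-instance volume estimate. It only assumes that the stored volume grows linearly with ingestion rate, average log size and retention. The function name, the `compression_ratio` and `headroom` parameters and their defaults are hypothetical and are not taken from Loki or MetalK8s. Only the 1000 logs per second per node and the 2-week retention used in the example come from this document.

```python
def loki_volume_size_bytes(
    logs_per_second_per_node: float,
    node_count: int,
    avg_log_size_bytes: float,
    retention_days: float,
    compression_ratio: float = 1.0,  # assumption: 1.0 means no compression gain
    headroom: float = 0.2,           # assumption: 20% safety margin
) -> float:
    """Estimate the volume size needed by ONE Loki instance.

    In the chosen design every instance stores the full data set, so the
    result applies to each replica's volume, not to the cluster as a whole.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    retention_seconds = retention_days * 24 * 3600
    raw_bytes = (
        logs_per_second_per_node * node_count * avg_log_size_bytes * retention_seconds
    )
    return raw_bytes / compression_ratio * (1 + headroom)


if __name__ == "__main__":
    # 4 nodes at 1000 logs/s each, 2 weeks of retention, and an assumed
    # average log size of 200 bytes.
    size = loki_volume_size_bytes(1000, 4, 200, 14)
    print(f"{size / 1024**3:.1f} GiB per Loki instance")
```

The number of streams is deliberately left out of this sketch; its impact on index size would need to be measured before being added to the rule.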
It is possible to set the hash ring store to memberlist, so that Loki relies on it to discover and communicate with the other Loki instances. This allows any instance to replicate every log it receives to some or all of the other instances (depending on the replication factor). Members will be automatically discovered at runtime, by using the Loki headless service as the only member.
Note that, if Loki cannot write enough replicas, it will be stuck (blocking the whole logging stack), waiting to be able to do so. It means we need to configure the number of replicas depending on how many nodes we want to be able to lose without impacting the logging services and ensure the logging data resiliency. For example, on a 5 nodes cluster, if we configure a replica on each node, we can lose 2 nodes ((number_of_nodes - 1) / 2); on a 3 nodes cluster, we can lose 1 node.
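The rule above can be written as a small helper. This is only an illustration of the (number_of_nodes - 1) / 2 formula given in the text, with one replica per node. Rounding down for even node counts is an assumption, as the text only gives odd examples.

```python
def tolerated_node_losses(number_of_nodes: int) -> int:
    """Number of nodes that can be lost, with one Loki replica per node."""
    if number_of_nodes < 1:
        raise ValueError("number_of_nodes must be at least 1")
    # Integer division: rounding down for even counts is an assumption.
    return (number_of_nodes - 1) // 2


if __name__ == "__main__":
    for nodes in (1, 3, 5):
        print(nodes, "nodes ->", tolerated_node_losses(nodes), "tolerated losses")
```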
This mode is required to be able to handle an instance failure. If we configured fluent-bit to talk to all Loki instances directly and one of them failed, fluent-bit would be stuck, waiting to be able to send its data. Instead, we use the service to talk to one instance, which then replicates the data to the other instances: if a Loki instance fails, fluent-bit will not know about it and will keep running.
Since we use the filesystem as Loki storage backend for the logs, we cannot ensure we will have the data if at some point an instance is down or loses its volume. It means that if we query for this period of time and hit this very instance, we will get no data. Note that, if the instance is only down for a few minutes (depending on the quantity of logs on the platform), it will be able to catch up on what is still in the other ingesters’ memory and not yet written to the storage backend.
To solve this problem, we need either to share this storage between all the instances (e.g. NFS), which is probably a bad idea, or to add an extra component in front of the queriers, which takes care of querying each instance, deduplicating the entries and returning them. We have already considered the Loki Query Frontend, but its goal is only to split queries, dispatch the load on the multiple instances and do some caching.
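To illustrate what such a front component would have to do, here is a minimal sketch that merges the results returned by several instances and drops duplicated entries. The `Entry` type and the function name are hypothetical. The sketch assumes that each instance returns its entries sorted by timestamp and that an entry is fully identified by its timestamp and log line. It does not reflect how Loki itself deduplicates.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical entry shape: (timestamp in nanoseconds, log line).
Entry = Tuple[int, str]


def merge_deduplicated(per_instance: Iterable[List[Entry]]) -> Iterator[Entry]:
    """Merge time-sorted results from several instances, dropping duplicates.

    Identical entries end up adjacent in the merged, sorted output, so
    comparing each entry with the previous one is enough.
    """
    previous = None
    for entry in heapq.merge(*per_instance):
        if entry != previous:
            yield entry
            previous = entry


if __name__ == "__main__":
    instance_a = [(1, "started"), (2, "ready")]
    instance_b = [(2, "ready"), (3, "stopped")]  # this instance missed the first entry
    print(list(merge_deduplicated([instance_a, instance_b])))
```

A side effect of this simple approach is that two genuinely distinct entries with the same timestamp and content would be collapsed into one.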
Fluent Bit needs to be configured to properly scrape and handle journal and container logs by default.
For containers logs, we want to add the following labels:
node: the node it comes from
namespace: the namespace the pod is running in
instance: the name of the pod
container: the name of the container
For journal we want these labels:
node: the node it comes from
unit: the name of the unit generating these logs
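With these labels in place, a log stream can be selected with a LogQL stream selector, optionally followed by a line filter. The sketch below builds such a query and the matching URL for Loki's `query_range` HTTP endpoint. The helper names and the `http://loki:3100` base URL are hypothetical, and escaping of special characters in label values is deliberately left out.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_logql(
    labels: Dict[str, str],
    regex_labels: Optional[Dict[str, str]] = None,
    keyword: str = "",
) -> str:
    """Build a LogQL query from exact-match labels, regex labels and a keyword.

    Values are inserted verbatim: escaping quotes and backslashes is out of
    scope for this sketch.
    """
    matchers = [f'{name}="{value}"' for name, value in sorted(labels.items())]
    matchers += [
        f'{name}=~"{value}"'
        for name, value in sorted((regex_labels or {}).items())
    ]
    query = "{" + ", ".join(matchers) + "}"
    if keyword:
        query += f' |= "{keyword}"'  # line filter: keep lines containing keyword
    return query


def query_range_url(
    base_url: str, query: str, start_ns: int, end_ns: int, limit: int = 100
) -> str:
    """Build the URL of a range query, with start/end as nanosecond epochs."""
    params = urlencode(
        {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    logql = build_logql(
        {"namespace": "kube-system", "node": "node-1"},
        regex_labels={"instance": "etcd-.*"},
        keyword="error",
    )
    print(logql)
    print(query_range_url("http://loki:3100", logql, 0, 10**18))
```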
This configuration will also be customizable by the user to be able to add new routes (Output) to push the log streams to.
This configuration will be done through various CRs provided by the Fluent Bit Operator:
FluentBit: Defines Fluent Bit instances and their associated config
FluentBitConfig: Selects input/filter/output plugins and generates the final config into a Secret
Input: Defines input config sections
Filter: Defines filter config sections
Output: Defines output config sections
For example, if a user wants to forward all Kubernetes logs to an external log centralization system (e.g. Elasticsearch), he will need to define an Output CR as follows:

    kind: Output
    metadata:
      name: my-output-to-external-es
      namespace: my-namespace
    spec:
      match: kube.*
      es:
        host: 10.0.0.1
        port: 9200
Loki’s configuration is stored as a Secret.
We need to expose a few parameters to the user for customization.
Since we do not have an Operator and CRs for Loki, we will use the CSC mechanism to provide the interface for customization, with a ConfigMap metalk8s-loki-config in the metalk8s-logging Namespace.
CSC is not as powerful as an Operator with CRs (no watch and reconciliation on resources, and Salt states need to be run manually), but the Loki configuration will not change much: probably during deployment, and to tune a few parameters afterwards. It is therefore not worth investing in an Operator.
The ConfigMap will look like the following:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: metalk8s-loki-config
      namespace: metalk8s-logging
    data:
      config.yaml: |-
        apiVersion: addons.metalk8s.scality.com
        kind: LokiConfig
        spec:
          deployment:
            replicas: 1
          config:
            auth_enabled: false
            chunk_store_config:
              max_look_back_period: 168h
            ingester:
              chunk_block_size: 262144
              chunk_idle_period: 3m
              chunk_retain_period: 1m
            [...]
Default values are fetched from a YAML file, as is already done for Dex, Alertmanager and Prometheus.
To monitor every service in our log centralization stack, we will need to deploy a ServiceMonitor object and expose the /metrics route of all of them. It will allow Prometheus Operator to configure Prometheus and automatically start scraping these services.
For Loki, this can be achieved by adding the following configuration in its Helm chart values:

    serviceMonitor:
      enabled: true
      additionalLabels:
        release: "prometheus-operator"
For Fluent Bit, we will need to define a ServiceMonitor object:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        app: fluent-bit
        metalk8s.scality.com/monitor: ''
      name: fluent-bit
      namespace: metalk8s-logging
    spec:
      endpoints:
        - path: /api/v1/metrics/prometheus
          port: http-metrics
      namespaceSelector:
        matchNames:
          - metalk8s-logging
      selector:
        matchLabels:
          app: fluent-bit
We also need to define alert rules based on metrics exposed by these services. This is done by deploying a new PrometheusRule object in the metalk8s-logging Namespace:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        metalk8s.scality.com/monitor: ''
      name: loki.rules
      namespace: metalk8s-logging
    spec:
      groups:
        - name: loki.rules
          rules: <RULES DEFINITION>
There are some recording and alert rules defined in the Loki repository that could be used as a base; we could then enrich these rules later, once we have better operational knowledge.
To be able to query logs from Loki, we need to add a Grafana datasource. This is done by adding a ConfigMap loki-grafana-datasource in the metalk8s-monitoring Namespace as follows:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: loki-grafana-datasource
      namespace: metalk8s-monitoring
      labels:
        grafana_datasource: "1"
    data:
      loki-datasource.yaml: |-
        apiVersion: 1
        datasources:
          - name: Loki
            type: loki
            access: proxy
            url: http://loki.loki.svc.cluster.local:3100
            version: 1
To display the logs we also need a dashboard, which is added through a ConfigMap loki-logs-dashboard in Namespace metalk8s-monitoring:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: loki-logs-dashboard
      namespace: metalk8s-monitoring
      labels:
        grafana_dashboard: "1"
    data:
      loki-logs.json: <DASHBOARD DEFINITION>
For the dashboard we will use a view with a simple log panel and variables representing the labels to filter on. Since journal and Kubernetes logs will not have the same labels, we could either have 2 distinct dashboards or 2 log panels in the same dashboard. An example of what we want is the Loki dashboard.
The grafana_datasource: "1" and grafana_dashboard: "1" labels are what is used by the Prometheus Operator to retrieve datasources and dashboards.
Resources may be deployed in any namespace so long as they carry the above labels.
The metalk8s.scality.com/grafana-folder-name annotation on dashboard resources provides control over the folder in which the dashboard is placed in Grafana.
Loki Volume Purge
Even with a max retention period, the logs could grow faster than expected and fill up the volume. Since there is no size-based retention in Loki yet, we need to add some specific monitoring (with prediction of volume usage) and alerting to ensure that an administrator is warned if this happens. The alert message should be clear and provide a URL to a runbook to help the administrator resolve the issue.
To fix this issue, the administrator should purge the oldest log chunk files from the Loki volume, which can be achieved by connecting to the pod and manually removing them. If the growth of logs is not transient, the administrator should also be advised to lower the retention period or replace the Loki volume with a bigger one.
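The prediction of volume usage mentioned above could rely on a simple linear extrapolation of recent usage samples, in the spirit of what Prometheus offers with `predict_linear`. The sketch below only illustrates the arithmetic. The function name and the shape of the samples are hypothetical, and the actual alert would more likely be expressed as a Prometheus rule.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(
    samples: Sequence[Tuple[float, float]], capacity_bytes: float
) -> Optional[float]:
    """Extrapolate when the volume will be full.

    `samples` are (unix_time, used_bytes) pairs. Returns the number of
    seconds between the last sample and the predicted fill-up time, or
    None when usage is not growing.
    """
    count = len(samples)
    if count < 2:
        raise ValueError("at least two samples are required")
    mean_t = sum(t for t, _ in samples) / count
    mean_u = sum(u for _, u in samples) / count
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    # Least-squares slope (bytes per second) and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    time_full = (capacity_bytes - intercept) / slope
    return max(0.0, time_full - samples[-1][0])


if __name__ == "__main__":
    usage = [(0, 0), (3600, 1_000_000), (7200, 2_000_000)]
    print(seconds_until_full(usage, capacity_bytes=10_000_000))
```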
Define what should be done for the purge: is it automatic (side-car pod), or is it only alerts with manual operations? If we go with an automatic purge, we should at least expose a few parameters for the user to be able to customize this mechanism:
whether this mechanism is activated or not (defaulted to yes or no?)
% of space used for the purge to be triggered (90%?)
minimum days of retention (even if we are above the threshold) to ensure we are not removing everything in case of massive amount of logs
We also need to trigger alerts every time this purge is activated, to be sure that the operator is aware that the Loki volume is undersized.
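If the automatic option is retained, the purge logic could look like the following sketch. It is not a decided design. The function names, the defaults (90% threshold, 1 day of minimum retention) and the assumption that chunks are plain files in a single directory whose modification time reflects their age are all hypothetical.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List, Optional


def volume_used_ratio(path: Path) -> float:
    """Fraction of the filesystem holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def purge_oldest_chunks(
    chunk_dir: Path,
    threshold: float = 0.9,
    min_retention_days: float = 1.0,
    used_ratio: Callable[[Path], float] = volume_used_ratio,
    now: Optional[float] = None,
) -> List[Path]:
    """Delete the oldest files until usage drops below `threshold`.

    Files newer than `min_retention_days` are never deleted, even if the
    volume stays above the threshold.
    """
    now = time.time() if now is None else now
    cutoff = now - min_retention_days * 86400
    removed: List[Path] = []
    files = sorted(
        (p for p in chunk_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    for path in files:
        if used_ratio(chunk_dir) < threshold:
            break  # enough space has been reclaimed
        if path.stat().st_mtime > cutoff:
            break  # every remaining file is within the protected window
        path.unlink()
        removed.append(path)
    return removed
```

The returned list of removed files gives the caller what it needs to log the purge and raise the alert described above.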
The goal is to have a working log centralization system, with logs accessible from Grafana:
Deploy Fluent Bit and Loki
Customization of Loki with CSC mechanisms
Document customization of Loki through CSC
Deploy Grafana datasource & dashboard
Document the log centralization system (sizing, configuration, …)
Simple pre-merge test to ensure the default log pipeline is working
Deployment of Fluent Bit with Fluent Bit Operator
Document customization of Fluent Bit through CRs
Define Prometheus record and alert rules
Deploy Loki volumes purge mechanism (TBD)
Display log centralization system status on the MetalK8s UI
Post-merge tests to ensure customization is working (replicas, custom parser/filter rules, …)
The sizing section in the Introduction page is updated to include the impact of the log centralization service. The sizing rule takes into account the retention period, the expected log rate of the workloads and the workloads’ predefined indices. This rule is to be known by solution developers to properly size the service based on the workload properties.
The Post Installation page is updated to indicate that persistent storage is needed for log centralization service.
A new page should be added to explain how to operate the service and how to forward logs to an external log centralization system.
The Cluster Monitoring page is updated to describe the log centralization service.
The log centralization system will be deployed by default with MetalK8s, so its deployment will automatically be tested during pre-merge integration tests. However, we still need to develop specific pytest-bdd test scenarios to ensure that the default logging pipeline is fully functional, and run them during these pre-merge tests. We will also add more complex tests in post-merge, such as configuring specific parsers/filters, scaling the system, etc.