Volume Management

Abstract

To be able to run stateful services (such as Prometheus, Zenko or Hyperdrive), MetalK8s needs the ability to provide and manage persistent storage resources.

To do so we introduce the concept of MetalK8s Volume, using a Custom Resource Definition (CRD), built on top of the existing concept of Persistent Volume from Kubernetes. Those Custom Resources (CR) will be managed by a dedicated Kubernetes operator which will be responsible for the storage preparation (using Salt states) and lifetime management of the backing Persistent Volume.

Volume management will be available from the Platform UI (through a dedicated tab under the Node page). There, users will be able to create, monitor and delete MetalK8s volumes.

Scope

Goals

  • support two kinds of Volume:

    • sparseLoopDevice (backed by a sparse file)

    • rawBlockDevice (using whole disk)

  • add support for volume creation (one by one) in the Platform UI

  • add support for volume deletion (one by one) in the Platform UI

  • add support for volume listing/monitoring (show status, size, …) in the Platform UI

  • expose raw block device (unformatted) as Volume

  • document how to create a volume

  • document how to create a StorageClass object

  • automated tests on volume workflow (creation, deletion, …)

Non-Goals

  • RAID support

  • LVM support

  • use an Admission Controller for semantic validation

  • auto-discovery of the disks

  • batch provisioning from the Platform UI

Proposal

To implement this feature we need to:

  • define and deploy a new CRD describing a MetalK8s Volume

  • develop and deploy a new Kubernetes operator to manage the MetalK8s volumes

  • develop new Salt states to prepare and clean up underlying storage on the nodes

  • update the Platform UI to allow volume management
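To make the first item more concrete, here is a minimal sketch of what a Volume object could look like, built as a Python dictionary. The API group/version and every field name (`nodeName`, `storageClassName`, `size`, `devicePath`) are assumptions made for illustration and are not defined by this document; only the two volume kinds (sparseLoopDevice and rawBlockDevice) come from the goals above.

```python
import json

# Hypothetical API group/version, for illustration only.
API_VERSION = "storage.example.com/v1alpha1"


def build_volume(name, node, storage_class, size=None, device_path=None):
    """Build a Volume manifest; field names are illustrative assumptions."""
    # A volume is either a sparse loop device (sized) or a raw block
    # device (backed by an existing device), never both.
    if (size is None) == (device_path is None):
        raise ValueError("specify exactly one of size or device_path")
    spec = {"nodeName": node, "storageClassName": storage_class}
    if size is not None:
        spec["sparseLoopDevice"] = {"size": size}
    else:
        spec["rawBlockDevice"] = {"devicePath": device_path}
    return {
        "apiVersion": API_VERSION,
        "kind": "Volume",
        "metadata": {"name": name},
        "spec": spec,
    }


print(json.dumps(build_volume("vol1", "node1", "example-sc", size="10Gi"), indent=2))
```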

User Stories

Volume Creation

As a user I need to be able to create a MetalK8s volume from the Platform UI.

At creation time I can specify the type of volume I want, and then either its size (for sparseLoopDevice) or the backing device (for rawBlockDevice).

I should be able to monitor the progress of the volume creation from the Platform UI and see when the volume is ready to use (or if an error occurred).

Volume Monitoring

As a user I should be able to see all the volumes existing on a specified node as well as their states.

Volume Deletion

As a user I need to be able to delete a MetalK8s volume from the Platform UI when I no longer need it.

The Platform UI should prevent me from deleting Volumes in use.

I should be able to monitor the progress of the volume deletion from the Platform UI.

Component Interactions

Users will create MetalK8s volumes through the Platform UI.

The Platform UI will create and delete Volume CRs through the API server.

The operator will watch events related to Volume CRs and to the PersistentVolume objects owned by a Volume, and react in order to update the state of the cluster to meet the desired state (prepare storage when a new Volume CR is created, clean up resources when a Volume CR is deleted). It will also be responsible for updating the states of the volumes.

To do its job, the operator will rely on Salt states that will be called asynchronously (to avoid blocking the reconciliation loop and keep a reactive system) through the Salt API. Authentication to the Salt API will be done through a dedicated Salt account (with limited privileges) using credentials from a dedicated cluster Service Account.
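The call-then-poll pattern described above can be sketched as follows. This is an illustration only, not the operator's code: `FakeSaltClient`, `submit`, `poll` and `reconcile_once` are hypothetical names, and the fake job simply completes after a fixed number of polls. The point is that each reconciliation pass returns immediately instead of waiting for the Salt job.

```python
import itertools


class FakeSaltClient:
    """Hypothetical stand-in for a Salt API client (not the real API)."""

    def __init__(self, polls_needed=2):
        self.polls_needed = polls_needed
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, state, volume_name):
        # Returns at once with a job ID; the work runs on the minion.
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = self.polls_needed
        return job_id

    def poll(self, job_id):
        # None means "still running", True means "succeeded".
        remaining = self._jobs[job_id]
        if remaining > 0:
            self._jobs[job_id] = remaining - 1
            return None
        return True


def reconcile_once(client, volume):
    """One non-blocking reconciliation pass over a volume (a dict)."""
    if volume.get("job") is None:
        volume["job"] = client.submit("PrepareVolume", volume["name"])
        volume["phase"] = "Pending"
        return "requeue"
    if client.poll(volume["job"]) is None:
        return "requeue-later"
    volume["phase"] = "Available"
    return "done"
```

Calling `reconcile_once` repeatedly on the same volume first spawns the job, then polls it on later passes until it completes.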

@startuml

title Volume Creation\n(nominal case)
skinparam titleBorderRoundCorner 15
skinparam titleBorderThickness 2
skinparam titleBorderColor red
skinparam titleBackgroundColor Aqua-CadetBlue

skinparam defaultTextAlignment center

actor User
participant "Platform UI" as UI
participant "API Server" as API
participant "Storage Operator" as Operator
participant "Salt API" as Salt
participant "Cluster Node" as Node

User->UI: Create a volume
UI->API: Create a **Volume** CR
API->UI: 200 OK
note left: The **Volume** now appears as **Unknown** in the UI

API->Operator: Notify: New **Volume** CR
Operator->Salt: Call **PrepareVolume**
Operator->API: Set **Volume** status to **Pending**
note left: The **Volume** now appears as **Pending** in the UI

Salt->Node: Send order to Salt minion
loop
  Operator->Salt: Poll Salt job status
  Salt->Operator: Job still in progress…

  ... ...
end
Node->Salt: Storage ready
Operator->Salt: Poll Salt job status
Salt->Operator: Job done
Operator->API: Create backing **PersistentVolume**
deactivate API
Operator->API: Set **Volume** status to **Available**
note left: The **Volume** now appears as **Available** in the UI

@enduml

@startuml

title Volume Deletion\n(nominal case)
skinparam titleBorderRoundCorner 15
skinparam titleBorderThickness 2
skinparam titleBorderColor red
skinparam titleBackgroundColor Aqua-CadetBlue

skinparam defaultTextAlignment center

actor User
participant "Platform UI" as UI
participant "API Server" as API
participant "Storage Operator" as Operator
participant "Salt API" as Salt
participant "Cluster Node" as Node

User->UI: Delete a volume
UI->API: Delete a **Volume** CR
API->UI: 200 OK
note left: The **Volume** is now marked for deletion

API->Operator: Notify: Delete **Volume** CR
Operator->API: Delete backing **PersistentVolume**
note left: The **PersistentVolume** is now marked for deletion

Operator->Salt: Call **UnprepareVolume**
Operator->API: Set **Volume** status to **Terminating**
note left: The **Volume** now appears as **Terminating** in the UI

Salt->Node: Send order to Salt minion
loop
  Operator->Salt: Poll Salt job status
  Salt->Operator: Job still in progress…

  ... ...
end
Node->Salt: Storage cleaned up
Operator->Salt: Poll Salt job status
Salt->Operator: Job done

Operator->API: Remove **PersistentVolume** finalizer
note left: The **PersistentVolume** object is really deleted

Operator->API: Remove **Volume** finalizer
note left: The **Volume** object is really deleted

@enduml

Implementation Details

Volume Status

A PersistentVolume from Kubernetes has the following states:

  • Pending: used for a PersistentVolume that is not available

  • Available: a free resource that is not yet bound to a claim

  • Bound: the volume is bound to a claim

  • Released: the claim has been deleted, but the resource is not yet reclaimed by the cluster

  • Failed: the volume has failed its automatic reclamation

Similarly, our Volume object will have the following states:

  • Available: the backing storage is ready and the associated PersistentVolume was created

  • Pending: preparation of the backing storage in progress (e.g. an asynchronous Salt call is still running)

  • Failed: something is wrong with the volume (Salt state execution failed, invalid value in the CR, …)

  • Terminating: cleanup of the backing storage in progress (e.g. an asynchronous Salt call is still running)

Persistent block device naming

In order to have a reliable automount through kubelet, we need to create the underlying PersistentVolume using a persistent name for the backing storage device. We use different strategies according to the Volume type:

  • sparseLoopDevice and rawBlockDevice with a filesystem: during formatting, we set the filesystem UUID to the Volume UUID and use /dev/disk/by-uuid/<volume-uuid> as device path.

  • sparseLoopDevice without filesystem: we create a GUID Partition Table on the sparse file and create a single partition encompassing the whole device, setting the GUID of the partition to the Volume UUID. We can then use /dev/disk/by-partuuid/<volume-uuid> as device path.

  • rawBlockDevice without filesystem:

    • the rawBlockDevice is a disk (e.g. /dev/sdb): we use the same strategy as above.

    • the rawBlockDevice is a partition (e.g. /dev/sdb1): we change the partition GUID using the Volume UUID and use /dev/disk/by-partuuid/<volume-uuid> as device path.

    • the rawBlockDevice is an LVM volume: we use the existing LVM UUID and use /dev/disk/by-id/dm-uuid-LVM-<lvm-uuid> as device path.
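The naming strategies above boil down to a small decision function. The sketch below is only an illustration: the function name, its parameters and the `device_kind` values are hypothetical, and it only computes the expected path (it does not format or partition anything).

```python
def persistent_device_path(volume_type, has_filesystem, volume_uuid,
                           device_kind="disk", lvm_uuid=None):
    """Return the persistent device path expected for a volume.

    volume_type: "sparseLoopDevice" or "rawBlockDevice"
    device_kind: "disk", "partition" or "lvm" (rawBlockDevice only)
    """
    if has_filesystem:
        # The filesystem UUID is set to the Volume UUID at format time.
        return f"/dev/disk/by-uuid/{volume_uuid}"
    if volume_type == "rawBlockDevice" and device_kind == "lvm":
        # LVM volumes keep their existing UUID.
        if lvm_uuid is None:
            raise ValueError("lvm_uuid is required for LVM volumes")
        return f"/dev/disk/by-id/dm-uuid-LVM-{lvm_uuid}"
    # Sparse files, whole disks and partitions all end up with a
    # partition whose GUID is the Volume UUID.
    return f"/dev/disk/by-partuuid/{volume_uuid}"
```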

Operator Reconciliation Loop

Reconciliation Loop (Top Level)

When the operator receives a request, the first thing it does is fetch the targeted Volume. If it doesn’t exist, which happens when a volume is Terminating and has no finalizer, then there is nothing more to do.

If the volume does exist, the operator has to check its semantic validity.

Once pre-checks are done, there are four cases:

  1. the volume is marked for deletion: the operator will try to delete the volume (more details in Volume Finalization).

  2. the volume is stuck in an error state that cannot be recovered automatically: the operator can’t do anything here, the request is considered done and won’t be rescheduled.

  3. the volume doesn’t have a backing PersistentVolume (e.g. newly created volume): the operator will deploy the volume (more details in Volume Deployment).

  4. the backing PersistentVolume exists: the operator will check its status to update the volume’s status accordingly.
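The pre-checks and the four cases above can be summarized by a pure decision function. This is a sketch under simplifying assumptions: the Volume and its PersistentVolume are plain dictionaries, and the keys (`valid`, `deleting`, `phase`, `healthy`) as well as the returned action names are hypothetical.

```python
from typing import Optional


def next_action(volume: Optional[dict], pv: Optional[dict]) -> str:
    """Return the branch taken by one reconciliation pass."""
    if volume is None:
        return "done"          # the Volume has already been deleted
    if not volume.get("valid", True):
        return "mark-failed"   # semantic validation failed
    if volume.get("deleting"):
        return "finalize"      # case 1: marked for deletion
    if volume.get("phase") == "Failed":
        return "done"          # case 2: nothing can be done automatically
    if pv is None:
        return "deploy"        # case 3: no backing PersistentVolume yet
    # case 4: mirror the health of the backing PersistentVolume
    return "done" if pv.get("healthy") else "mark-failed"
```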

@startuml

title Reconciliation loop (top level)
skinparam titleBorderRoundCorner 15
skinparam titleBorderThickness 2
skinparam titleBorderColor red
skinparam titleBackgroundColor Aqua-CadetBlue

skinparam wrapWidth 75
skinparam defaultTextAlignment center

start

partition Reconciliate {
  if (**Volume** exists?) then (yes)
    if (**Volume** is valid?) then (yes)
      if (**Volume** is marked for deletion?) then (yes)
        #F000F0:Finalize **Volume**|
        stop
      else (no)
        if (**Volume** is **Failed**?) then (yes)
          #00FF00:Don't reschedule;
          note right: Nothing we can do here
          stop
        else (no)
          if (**Volume** is backed by **PersistentVolume**?) then (yes)
            if (**PersistentVolume** is healthy?) then (yes)
              #00FF00:Don't reschedule;
              stop
            else (no)
              #FF0000:Set **Volume** status to **Failed**;
              #00F0F0:Reschedule now;
              stop
            endif
          else (no)
            #F000F0:Deploy **Volume**|
            stop
          endif
        endif
      endif
    else (no)
      #FF0000:Set **Volume** status to **Failed**;
      #00F0F0:Reschedule now;
      stop
    endif
  else (no)
    #00FF00:Don't reschedule;
    note right: **Volume** has been deleted
    stop
  endif
}

@enduml

Volume Deployment

To deploy a volume, the operator needs to prepare its storage (using Salt) and create a backing PersistentVolume.

If the Volume object has no value in its Job field, the deployment hasn’t started. The operator then sets a finalizer on the Volume object and starts the preparation of the storage using an asynchronous Salt call (which returns a job ID), before rescheduling the request to monitor the progress of the job.

If we do have a job ID, then something is in progress and we monitor it until it’s over. If it has ended with an error, we move the volume into a failed state.

Otherwise we make another asynchronous Salt call to get information (size, persistent path, …) on the backing storage device (the polling is done exactly as described above).

If the storage device information is successfully retrieved, the operator proceeds with the PersistentVolume creation, taking care to put a finalizer on the PersistentVolume (so that its lifetime is tied to that of the Volume) and to set the Volume as the owner of the PersistentVolume.

Once the PersistentVolume is successfully created, the operator will move the Volume to the Available state and reschedule the request (the next iteration will check the health of the PersistentVolume just created).
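The deployment steps form a small state machine driven by the Job field. The sketch below is a simplified illustration: it omits the second Salt call that retrieves device information, the volume is a plain dictionary, and `spawn`, `poll`, the status strings and the finalizer name are hypothetical. The function returns the delay (in seconds) before the next reconciliation.

```python
from typing import Callable

DONE = "DONE"


def deploy_step(volume: dict,
                spawn: Callable[[], str],
                poll: Callable[[str], str]) -> int:
    """Run one deployment step and return the requeue delay in seconds."""
    job = volume.get("job")
    if job is None:
        # Deployment not started: protect the object, then launch the job.
        volume["finalizers"] = ["example.com/volume-protection"]
        volume["job"] = spawn()
        volume["phase"] = "Pending"
        return 0
    if job == DONE:
        # Storage is ready: create the PersistentVolume (elided here).
        volume["pv_created"] = True
        volume["phase"] = "Available"
        return 0
    status = poll(job)
    if status == "failed":
        volume["phase"] = "Failed"
        return 0
    if status == "not_found":
        volume["job"] = None   # the next pass relaunches the job
        return 0
    if status == "succeeded":
        volume["job"] = DONE
        return 0
    return 10                  # still running: check again later
```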

@startuml

title Volume Deployment
skinparam titleBorderRoundCorner 15
skinparam titleBorderThickness 2
skinparam titleBorderColor red
skinparam titleBackgroundColor Aqua-CadetBlue

skinparam wrapWidth 75
skinparam defaultTextAlignment center

partition DeployVolume {
  start

  :Check value of the field **Job**;
  split
    -> No value;
    :Add finalizer on **Volume**;
    :Spawn Salt job **PrepareVolume**;
    #FFA500:Set **Volume** status to **Pending**;
    #00F0F0:Reschedule now;
  split again
    -> Job ID;
    :Poll the status of the Salt job;
    split
      -> Job failed;
      #FF0000:Set **Volume** status to **Failed**;
      #00F0F0:Reschedule now;
    split again
      -> Job not found;
      :Unset the **Volume** field **Job**;
      note right: This will relaunch the job
      #00F0F0:Reschedule now;
    split again
      -> Job succeed;
      :Set the **Volume** field **Job** to "DONE";
      #00F0F0:Reschedule now;
    split again
      -> Job still in progress;
      #00F0F0:Reschedule in 10s;
    end split;
  split again
    -> "DONE";
    :Create the backing **PersistentVolume**;
    #00FF00:Set **Volume** status to **Available**;
    #00F0F0:Reschedule now;
  end split;

  stop
}

@enduml

Steady state

Once the volume is deployed, we update, with a synchronous Salt call, the deviceName status field at each reconciliation loop iteration. This field contains the name of the underlying block device (as found under /dev).

Volume Finalization

A Volume in state Pending cannot be deleted (because the operator doesn’t know where it is in the creation process). In such cases, the operator will reschedule the request until the volume becomes either Failed or Available.

For volumes with no backing PersistentVolume, the operator will directly reclaim the storage on the node (using an asynchronous Salt job) and upon completion it will remove the Volume finalizer to let Kubernetes delete the object.

If there is a backing PersistentVolume, the operator will delete it (if it’s not already in a terminating state) and watch for the moment when it becomes unused (this is done by rescheduling). Once the backing PersistentVolume becomes unused, the operator will reclaim its storage and remove the finalizers to let the object be deleted.
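The finalization rules above can be expressed as a decision function returning the action to take and the requeue delay. This is a sketch only: the PersistentVolume is modelled as a dictionary with hypothetical `terminating` and `used` keys, and the action names are illustrative.

```python
from typing import Optional, Tuple


def finalize_action(volume_phase: str, pv: Optional[dict]) -> Tuple[str, int]:
    """Return (action, requeue delay in seconds) for a Volume being deleted."""
    if volume_phase == "Pending":
        return ("requeue", 10)      # wait for the creation to finish
    if pv is None:
        return ("reclaim", 0)       # no PersistentVolume: reclaim storage directly
    if not pv["terminating"]:
        return ("delete-pv", 0)     # the PersistentVolume will become Terminating
    if pv["used"]:
        return ("requeue", 10)      # wait until it is no longer in use
    return ("reclaim", 0)
```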

@startuml

title Volume Finalization
skinparam titleBorderRoundCorner 15
skinparam titleBorderThickness 2
skinparam titleBorderColor red
skinparam titleBackgroundColor Aqua-CadetBlue

skinparam wrapWidth 75
skinparam defaultTextAlignment center

partition FinalizeVolume {
  start

  if (**Volume** is **Pending**?) then (YES)
     #00F0F0:Reschedule in 10s;
     note right: Wait for the creation to terminate
     stop
  else (NO)
    if (**Volume** is backed by **PersistentVolume**?) then (YES)
      if (**PersistentVolume** is **Terminating**?) then (YES)
        if (**PersistentVolume** is unused?) then (YES)
          #F000F0:**ReclaimStorage**|
          stop
        else (NO)
          #00F0F0:Reschedule in 10s;
          stop
        endif
      else (NO)
        :Delete **PersistentVolume**;
        note right: Will go in **Terminating** state
        #00F0F0:Reschedule now;
        stop
      endif
    else (NO)
      #F000F0:**ReclaimStorage**|
      stop
    endif
  endif
}

partition ReclaimStorage {
 start

 :Check value of the **Volume** field **Job**;
 split
   -> No value;
   :Spawn Salt job **UnprepareVolume**;
   #FFA500:Set **Volume** status to **Terminating**;
   #00F0F0:Reschedule now;
 split again
   -> Job ID;
   :Poll the status of the Salt job;
   split
     -> Job failed;
     #FF0000:Set **Volume** status to **Failed**;
     #00F0F0:Reschedule now;
   split again
     -> Job not found;
     :Unset the **Volume** field **Job**;
     note right: This will relaunch the job
     #00F0F0:Reschedule now;
   split again
     -> Job succeed;
     :Set the **Volume** field **Job** to "DONE";
     #00F0F0:Reschedule now;
   split again
     -> Job still in progress;
     #00F0F0:Reschedule in 10s;
   end split;
 split again
   -> "DONE";
   :Remove finalizer on the backing **PersistentVolume**;
   :Remove finalizer on the **Volume**;
   #00FF00:Do not reschedule;
   note right: The **Volume** object will be deleted by Kubernetes
 end split;

 stop
}
@enduml

Volume Deletion Criteria

A volume should be deletable from the UI only when deleting it makes sense from a user’s point of view, i.e. when deleting the object will trigger an “immediate” deletion (the object won’t be retained). Deleting the object directly through the API is always possible.

Here are the few rules that are followed to decide if a Volume can be deleted or not:

  • Pending states are left untouched: we wait for the completion of the pending action before deciding which action to take.

  • The lack of status information is a transient state (can happen between the Volume creation and the first iteration of the reconciliation loop) and thus we make no decision while the status is unset.

  • Volume objects whose PersistentVolume is bound cannot be deleted.

  • Volume objects in Terminating state cannot be deleted because their deletion is already in progress!

In the end, a Volume can be deleted in two cases:

  • it has no backing PersistentVolume

  • the backing PersistentVolume is not bound (Available, Released or Failed)
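These rules fit in a short predicate, which the UI could use to enable or disable its delete action. The function name and signature are hypothetical; `None` stands for an unset status (for the Volume) or a missing PersistentVolume.

```python
from typing import Optional


def can_delete(volume_phase: Optional[str],
               pv_phase: Optional[str] = None) -> bool:
    """Tell whether a Volume can be deleted from the UI."""
    # Unset, Pending and Terminating volumes are left untouched.
    if volume_phase not in ("Failed", "Available"):
        return False
    # No backing PersistentVolume: nothing can retain the object.
    if pv_phase is None:
        return True
    # Otherwise the PersistentVolume must not be Bound (nor Pending).
    return pv_phase in ("Available", "Released", "Failed")
```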

@startuml

title Volume deletion decision tree
skinparam titleBorderRoundCorner 15
skinparam titleBorderThickness 2
skinparam titleBorderColor red
skinparam titleBackgroundColor Aqua-CadetBlue

skinparam wrapWidth 75
skinparam defaultTextAlignment center

start

:**Volume** Status;
split
  -> **Unknown**;
  #FF0000:**CANNOT DELETE**]
  stop
split again
  -> **Pending**;
  #FF0000:**CANNOT DELETE**]
  stop
split again
  -> **Failed**/**Available**;
  if (**Volume** has **PersistentVolume**?) then (YES)
    :**PersistentVolume** Status;
    split
      -> **Pending**;
      #FF0000:**CANNOT DELETE**]
      stop
    split again
      -> **Available**;
      #00FF00:**CAN DELETE**]
      stop
    split again
      -> **Bound**;
      #FF0000:**CANNOT DELETE**]
      stop
    split again
      -> **Released**;
      #00FF00:**CAN DELETE**]
      stop
    split again
      -> **Failed**;
      #00FF00:**CAN DELETE**]
      stop
    end split;
  else (NO)
    #00FF00:**CAN DELETE**]
    stop
  endif
split again
  -> **Terminating**;
  #FF0000:**CANNOT DELETE**]
  stop
end split;

@enduml

Documentation

In the Operational Guide:

  • document how to create a volume from the CLI

  • document how to delete a volume from the CLI

  • document how to create a volume from the UI

  • document how to delete a volume from the UI

  • document how to create a StorageClass from the CLI (and mention that we should set VolumeBindingMode to WaitForFirstConsumer)
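As a starting point for the StorageClass documentation, the sketch below builds such a manifest as a Python dictionary. The `volumeBindingMode` value comes from the item above; the object name and the use of the `kubernetes.io/no-provisioner` provisioner (the usual choice for statically provisioned local volumes) are assumptions made for illustration.

```python
import json

# Illustrative StorageClass manifest; the name is a placeholder.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "example-storage-class"},
    # Assumption: volumes are created by the operator, not by a
    # dynamic provisioner.
    "provisioner": "kubernetes.io/no-provisioner",
    # Delay binding until a Pod using the claim is scheduled.
    "volumeBindingMode": "WaitForFirstConsumer",
}

print(json.dumps(storage_class, indent=2))
```

With WaitForFirstConsumer, a claim is bound only once a Pod using it is scheduled, so the scheduler can take into account the node on which the volume lives.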

In the Developer Documentation:

  • document how to run the operator locally

  • document this design

Test Plan

We should have automated end-to-end tests of the feature (creation and deletion), from the CLI and maybe on the UI part as well.