Monitoring etcd backups

As part of a previous post, I described how I am running etcd in memory. This setup requires that regular backups of etcd are taken, and these backups must be monitored.

When an etcd backup is taken, it writes an etcd backup file to a known location, and it also writes out the name of the container image used by etcd. The latter is important for restore, to make sure that exactly the same version of etcd is used for restoring the backup, avoiding possible compatibility issues. The code that does the backup and restore is listed here. The solution requires some monitoring to make sure that everything is going as intended; in particular:

  • the backup must be recent. Backups are taken every 15 minutes, so if a backup is too old, there could be a problem;
  • the backup must have a minimal size. This detects errors during the backup that result in, for instance, an empty backup file or a suspiciously small one;
  • disk space for the /var/lib/etcd directory, which is a ramdisk (tmpfs), must be monitored to ensure that etcd can continue to run.

The backup creates two files that are monitored, namely:

  • etcd-snapshot-latest.db: The latest backup of etcd
  • etcdimage: The full name of the container image that is used to run etcd.

Prometheus

Monitoring is done using Prometheus. Prometheus works by scraping metrics endpoints: it invokes an HTTP(S) endpoint which returns metrics and their values in a simple text format. Prometheus can then provide alerting when certain conditions occur, using its query language (PromQL).

Prometheus and Grafana are installed using Helm, which provides a Prometheus installation based on custom resources. This installation does not automatically find scraping endpoints by inspecting annotations, but instead requires a ServiceMonitor resource to define Prometheus jobs.

Monitoring file size and modification time

Metrics are provided by a Prometheus exporter that collects:

  • file_time_seconds: the file modification time in seconds since 1 Jan 1970, or 0 if the file is not found.
  • file_size: the file size in bytes, or 0 when the file does not exist.

As far as I know, no existing Prometheus exporter provides these metrics, so we will write our own. To simplify usage of the metrics, each metric includes both a path and a type label, where each path uniquely maps to a type. For instance:

file_size{app="controllerbackupmonitoring", 
          container="exporter", 
          endpoint="http", 
          instance="10.200.229.30:8080", 
          job="controllerbackupexporter", 
          namespace="monitoring", 
          path="/backup/etcd-snapshot-latest.db", 
          pod="controllerbackupmonitoring-77cf9d9675-vdlmz", 
          service="controllerbackupexporter", 
          type="backup"}

The metric above contains a number of labels added automatically by Prometheus, such as the job, and it contains the path and type of the monitored file. The type is simply a shorthand for the file: it does not change when the file path changes, and it is much shorter and easier to use when configuring alerts.
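As a hypothetical illustration of how the type label simplifies alerting, an alert expression can refer to the stable type instead of the full path (the 20-minute threshold here is illustrative, not taken from the actual alerting rules):

```promql
# Fire when the latest backup is older than 20 minutes.
# type="backup" keeps working even if the backup path changes.
time() - file_time_seconds{type="backup"} > 20 * 60
```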

As a consequence, the Prometheus exporter must be configured with the paths and types. For the current use case, the exporter is started as follows:

python3 -u exporter.py \
        backup:/backup/etcd-snapshot-latest.db \
        image:/backup/etcdimage

Here the -u flag is used to disable buffering of the output so we get to see the output immediately when it is running from a kubernetes pod. The exporter.py script is the python service that exposes the scraping endpoint. Finally, the files to be monitored are listed with their symbolic (short) name and full path. This way of configuring the exporter makes it reusable since it can be used to monitor and arbitrary number of files for different use cases.
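To make the idea concrete, here is a minimal sketch of what such an exporter could look like. This is not the actual exporter.py from the repository; it is an illustrative stand-in that parses the same type:path arguments and serves the two metrics described above using only the Python standard library:

```python
# Minimal sketch of a file-metrics exporter (illustrative, not the
# actual exporter.py): parses "type:/path" arguments and serves
# file_time_seconds and file_size on /metrics.
import os
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_targets(args):
    """Parse "type:/path" arguments into a {type: path} mapping."""
    targets = {}
    for arg in args:
        ftype, _, path = arg.partition(":")
        targets[ftype] = path
    return targets


def render_metrics(targets):
    """Render the metrics in the Prometheus text format.

    Missing files report 0 for both metrics, matching the metric
    definitions described above.
    """
    lines = []
    for ftype, path in targets.items():
        try:
            st = os.stat(path)
            mtime, size = st.st_mtime, st.st_size
        except OSError:
            mtime, size = 0, 0
        labels = f'{{path="{path}", type="{ftype}"}}'
        lines.append(f"file_time_seconds{labels} {mtime}")
        lines.append(f"file_size{labels} {size}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    targets = {}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.targets).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    MetricsHandler.targets = parse_targets(sys.argv[1:])
    HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

Started with `python3 -u sketch.py backup:/backup/etcd-snapshot-latest.db image:/backup/etcdimage`, this would serve both metrics for both files on port 8080.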

The code can be found here (NOTE: docker repo name is redacted). To explain the setup,
a look at the deployment.yaml is interesting:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: controllerbackupmonitoring
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: controllerbackupmonitoring
  template:
    metadata:
      labels:
        app: controllerbackupmonitoring                   # A
    spec:
      terminationGracePeriodSeconds: 0
      tolerations:                                        # B
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
      affinity:
        nodeAffinity:                                     # C
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
               - matchExpressions:
                   - key: node-role.kubernetes.io/control-plane
                     operator: Exists
      containers:
        - name: exporter
          image: docker.example.com/filemonitor:1.0
          args:
            - python3
            - -u
            - /exporter.py 
            # the backup and the image file to monitor
            - backup:/backup/etcd-snapshot-latest.db
            - image:/backup/etcdimage
          ports:
            - containerPort: 8080
              protocol: TCP
              name: http
          volumeMounts:
            - name: controllerbackup
              mountPath: /backup
              readOnly: true
      volumes:
        - name: controllerbackup                           # D
          hostPath:
            path: /var/lib/wamblee/etcd
  • # A: the label by which the service monitor is configured to find the deployment to monitor
  • # B: tolerations to allow it to run on the controller node where the backup is taken
  • # C: selecting the controller node to run on. In combination with B this forces the exporter to run on all controller nodes.
  • # D: access to the directory on the controller containing the backup files.

A ServiceMonitor resource is required to tell prometheus to scrape the new exporter:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: controllerbackupmonitoring
  namespace: monitoring
spec: 
  endpoints:
  - honorLabels: true
    path: /metrics
    port: http
    scheme: http
    scrapeTimeout: 30s
  selector:
    matchLabels:
      app: controllerbackupmonitoring
  targetLabels:
    - app                                       # A
  • # A: the app label from the Kubernetes deployment is transferred to the Prometheus metric.

Alerting

Disk space monitoring is done using the standard node_filesystem_free_bytes and node_filesystem_size_bytes metrics that are provided by the node exporter. The alerts for the backup use the new metrics. See the alerting rules for the details.
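As a sketch of what such rules could look like with the custom-resource-based installation, here is a hypothetical PrometheusRule covering the three conditions from the list at the top of this post. The rule names, thresholds, and label values are illustrative assumptions, not the actual alerting rules from the repository:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: controllerbackupalerts          # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: etcd-backup
      rules:
        - alert: EtcdBackupTooOld
          # backups run every 15 minutes; allow one missed run
          expr: time() - file_time_seconds{type="backup"} > 30 * 60
          for: 5m
        - alert: EtcdBackupTooSmall
          # an empty or truncated snapshot indicates a failed backup;
          # the 1 MB threshold is an assumption
          expr: file_size{type="backup"} < 1000000
          for: 5m
        - alert: EtcdRamdiskAlmostFull
          # less than 20% free on the tmpfs backing /var/lib/etcd
          expr: >
            node_filesystem_free_bytes{mountpoint="/var/lib/etcd"}
            / node_filesystem_size_bytes{mountpoint="/var/lib/etcd"} < 0.2
          for: 5m
```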

Deployment

See the README file for more details.
