{"id":2572,"date":"2023-01-10T21:13:41","date_gmt":"2023-01-10T21:13:41","guid":{"rendered":"https:\/\/brakkee.org\/site\/?p=2572"},"modified":"2023-02-17T18:53:35","modified_gmt":"2023-02-17T18:53:35","slug":"monitoring-logs-on-k8s-with-loki-and-grafana","status":"publish","type":"post","link":"https:\/\/brakkee.org\/site\/2023\/01\/10\/monitoring-logs-on-k8s-with-loki-and-grafana\/","title":{"rendered":"Monitoring logs on k8s with loki and grafana"},"content":{"rendered":"<p>This post describes how to monitor logs in kubernetes with <a href=\"https:\/\/grafana.com\/\">grafana<\/a> and <a href=\"https:\/\/grafana.com\/oss\/loki\/\">loki<\/a>. This covers the use case of logging for troubleshooting purposes. That is, it allows analysing human readable logs coming from multiple systems in one aggregated log. Human readable logs are required for troubleshooting and optimization. It is the bare minimum of logging that is required.<br \/>\n<!--more--><\/p>\n<p>Another use case is event logging where log entries are parsed and put into a structured form before being stored. The latter type of logs can be used for analytics. Examples of that use case are:<\/p>\n<ul>\n<li>analyzing incoming external requests and plotting their (estimated) geographical locations on a map, perhaps even filtered on user agent and\/or URL.<\/li>\n<li>analyzing the actions taken by end users of a system to capture their use of a system, enhanced with non-functionals such as start time, end time, and allocated memory.<\/li>\n<li>analyzing the execution of data processing pipelines on a system over time.<\/li>\n<\/ul>\n<p>So this post is about the less-adventurous case of human readable logging. 
A follow-up post will discuss the analytic logging use case.<\/p>\n<p>As for tool choices, this post will focus on best-of-breed and most widely used tools for this use case on kubernetes, meaning <a href=\"https:\/\/fluentbit.io\/\">fluentbit<\/a> (instead of <a href=\"https:\/\/www.elastic.co\/beats\/filebeat\">filebeat<\/a>\/<a href=\"https:\/\/www.elastic.co\/logstash\/\">logstash<\/a>, <a href=\"https:\/\/grafana.com\/docs\/loki\/latest\/clients\/promtail\/\">promtail<\/a>, <a href=\"https:\/\/github.com\/grafana\/agent\">grafana agent<\/a>), loki (instead of <a href=\"https:\/\/www.elastic.co\/\">elasticsearch<\/a>, <a href=\"https:\/\/opensearch.org\/\">opensearch<\/a>, or <a href=\"https:\/\/imply.io\/what-is-druid\/\">druid<\/a>), and grafana (instead of <a href=\"https:\/\/www.elastic.co\/kibana\/\">kibana<\/a>). So in a way it is just the ELK stack but without elastic, logstash, and kibana. Then there are also commercial tools such as <a href=\"https:\/\/www.datadoghq.com\/\">datadog<\/a>, <a href=\"https:\/\/www.dynatrace.com\/\">dynatrace<\/a>, and <a href=\"https:\/\/www.splunk.com\/\">splunk<\/a> that are out of scope for my home environment.<\/p>\n<p>Fluentbit&#8217;s great advantages are its simplicity of configuration, its integration with kubernetes, and its support for various output destinations. Loki&#8217;s advantages are its efficiency and small footprint. Loki is more basic in functionality than analytics tools since it only allows querying logs based on time and a limited, predefined set of labels; drill-down is then done using full text search. So it does not require indexing of each log record and is thus more efficient than for instance elasticsearch. Grafana&#8217;s advantage is its easy-to-understand UI and tight integration with Loki (the latter is understandable since Loki is also a grafana labs product). 
When we introduced grafana at work, we immediately saw a productivity boost compared to kibana, which has a more difficult UI. In addition, grafana is also used in combination with prometheus, which is the de facto standard for monitoring on kubernetes.<\/p>\n<p>As can be seen, the whole landscape of observability tooling is extremely large, with the explanation above focusing on just a subset of open source and commercial tools. Also, some tools have overlapping functionality, and some companies, like grafana labs, are trying to expand beyond a visualization tool towards a full observability stack. The commercial tools also focus on security, but for that, open source tools such as <a href=\"https:\/\/wazuh.com\">wazuh<\/a> can be used in combination with grafana.<\/p>\n<h2>Deployment<\/h2>\n<p>The deployment of this pipeline on kubernetes is as follows:<\/p>\n<p><img src=http:\/\/www.plantuml.com\/plantuml\/img\/RP9BRiGW38RtEGNAFgNBKHVTTDqJL5Nb04cP48XupBItBp0M0qmtaVtxymCx0oXbhb-x_95w7Hm39SZpqsdubbmG9YrJOk3WOGOdo4WlOAaowqZ0aMX7yo4gPyeDsuIbp6WFKWlhCb_ScENgD1jqAahF9bWDAiFkrtcnyupU6f69VyIfkD2VqsjNhF5Qi2vKzKmcNIqie24tHFjYr0EMSkDdanbCHgBCV1RCap7vBsz6Bt9zXwCWB2k5MbGGsi_ITsZHEtuxkKd1BDIqmt74s_GjKgpBmITq9Ah8ChCVHiUMrKZj83NIFBeE_BEveLGWSUtuSpHQUxCcL5-f2eGOwZVKXUj5f2b9dO8AJnoUxQO4L30Hdq1GlVifi_epYXxkEfRu0m00 alt=\"PlantUML Syntax:<br \/>\nallow_mixing<br \/>\nscale 0.8<br \/>\nhide circle<\/p>\n<p>database &#8220;\/var\/log&#8221; as varlog<br \/>\ndatabase &#8220;\/var\/lib\/docker\/containers&#8221; as containerlogs<br \/>\ndatabase &#8220;\/run\/log\/journal&#8221; as systemdlogs<br \/>\nvarlog -d[hidden]-&gt; containerlogs<br \/>\ncontainerlogs -d[hidden]-&gt; systemdlogs<\/p>\n<p>database &#8220;\/data\/grafana&#8221; as grafanavol<br \/>\ndatabase &#8220;\/data\/logs\/loki&#8221; as lokivol<\/p>\n<p>object &#8220;loki:ConfigMap&#8221; as lokiconfig<br \/>\nobject &#8220;fluentbit:ConfigMap&#8221; as fluentbitconfig<br \/>\nobject 
&#8220;k8s-label-mapping:ConfigMap&#8221; as labelconfig<br \/>\nobject &#8220;grafana:ConfigMap&#8221; as grafanaconfig<\/p>\n<p>component &#8220;fluentbit&#8221; as fluentbit<br \/>\ncomponent &#8220;loki&#8221; as loki<br \/>\ncomponent &#8220;grafana&#8221; as grafana<\/p>\n<p>fluentbit -&gt; varlog<br \/>\nfluentbit -&gt; containerlogs<br \/>\nfluentbit -&gt; systemdlogs<br \/>\nfluentbit -u-&gt; fluentbitconfig<br \/>\nfluentbit -u-&gt; labelconfig<\/p>\n<p>loki -&gt; fluentbit<br \/>\nloki -d-&gt; lokivol<br \/>\nloki -u-&gt; lokiconfig<\/p>\n<p>grafana -&gt; loki<br \/>\ngrafana -d-&gt; grafanavol<br \/>\ngrafana -u-&gt; grafanaconfig<\/p>\n<p>\" usemap=\"#plantuml_map\"><\/p>\n<p>The deployment is quite standard: the dependency structure between the components is clear, and each component requires some configuration in the form of configmaps. Fluentbit is deployed as a <em>DaemonSet<\/em> and mounts several directories of the node on which it is running. It collects the logs from there and uses the kubernetes API server to obtain metadata for the log files it is tailing. Fluentbit follows the standard output of all running containers in the cluster.<\/p>\n<p>Each component is explained in more detail below.<\/p>\n<h2>Fluentbit<\/h2>\n<p>Fluentbit runs in kubernetes as a <em>DaemonSet<\/em> and mounts several standard log directories such as <em>\/var\/log<\/em>. It then tails the logs found there based on its configuration and uses the API server to get the metadata for the collected log files. Each entry in a given log file is then enhanced with the metadata found. Most importantly, fluentbit adds labels to each log entry which can be used as search keys. As mentioned earlier, loki is not an analytics engine: querying is limited to time intervals and the assigned labels, with full text search used after that. In this way, the role of fluentbit is most similar to filebeat (part of the ELK stack), which also adds metadata to log entries. 
The result of this is a JSON document that contains the log entry as well as metadata and labels.<\/p>\n<p>For instance, the log line<\/p>\n<pre>11.22.33.44, 10.100.208.16 10.200.208.8 - user1 [09\/Jan\/2023:20:26:00 +0000] \"PROPFIND \/svn\/example\/!svn\/bc\/89\/trunk HTTP\/1.1\" 207 510 example.com\r\n<\/pre>\n<p>is transformed by fluentbit to the following and sent to loki:<\/p>\n<pre>{\r\n  \"date\": \"2023-01-09T20:26:00.927027Z\",\r\n  \"time\": \"2023-01-09T21:26:00.927027683+01:00\",\r\n  \"stream\": \"stdout\",\r\n  \"_p\": \"F\",\r\n  \"log\": \"11.22.33.44, 10.100.208.16 10.200.208.8 - user1  [09\/Jan\/2023:20:26:00 +0000] \\\"PROPFIND \/svn\/example\/!svn\/bc\/89\/trunk HTTP\/1.1\\\" 207 510 example.com\",\r\n  \"kubernetes\": {\r\n    \"pod_name\": \"httpd-vcs-0\",\r\n    \"namespace_name\": \"example-com\",\r\n    \"pod_id\": \"b78bbe7e-34d9-4266-9849-98b7e2981836\",\r\n    \"labels\": {\r\n      \"app\": \"httpd-vcs\",\r\n      \"controller-revision-hash\": \"httpd-vcs-69f4b5c7d5\",\r\n      \"statefulset.kubernetes.io\/pod-name\": \"httpd-vcs-0\"\r\n    },\r\n    \"annotations\": {\r\n      \"cni.projectcalico.org\/containerID\": \"ff97877298a26b9c378958fbfd7e8f82713e4c820f7fb476f9900baea1d7e9e4\",\r\n      \"cni.projectcalico.org\/podIP\": \"10.200.24.50\/32\",\r\n      \"cni.projectcalico.org\/podIPs\": \"10.200.24.50\/32\"\r\n    },\r\n    \"host\": \"baboon\",\r\n    \"container_name\": \"httpd\",\r\n    \"docker_id\": \"46a325b06e3b68c39e0a2388635ce8713fddfe128d4f8230615ecb8d4c6177e8\",\r\n    \"container_hash\": \"registry.example.com\/vcs@sha256:1107edf31f26b1188244b51c223e63a1afc5b9bc4284f85c755dae68db9c034c\",\r\n    \"container_image\": \"registry.example.com\/vcs:latest\"\r\n  }\r\n}\r\n<\/pre>\n<p>As can be seen, a lot of metadata is added by fluentbit. 
In particular, the kubernetes section is added automatically by fluentbit based on metadata from the API server.<\/p>\n<h3>Installation<\/h3>\n<p>Fluentbit is installed using helm as follows:<\/p>\n<pre># add the repo and inspect the chart\r\nhelm repo add fluent https:\/\/fluent.github.io\/helm-charts\r\nhelm search repo fluent-bit\r\nhelm show values fluent\/fluent-bit &gt; fluent-values.yaml\r\n\r\n# install it\r\nhelm upgrade --install --version 0.21.6 --namespace monitoring \\\r\n     fluent-bit fluent\/fluent-bit --values fluent-values.yaml\r\n<\/pre>\n<p>The above command can be used for the initial install as well as for upgrades.<\/p>\n<p>The following values are used for configuration:<\/p>\n<pre>tolerations:                                             # A \r\n- key: node-role.kubernetes.io\/master\r\n  operator: Exists\r\n\r\nconfig:       \r\n  outputs: |\r\n    [OUTPUT]                                             # B \r\n        Name stdout\r\n        Match * \r\n        Format json_lines \r\n        json_date_format iso8601\r\n\r\n    [OUTPUT]                                             # C\r\n        Name loki\r\n        Match kube.* \r\n        host loki.monitoring.svc.cluster.local\r\n        #auto_kubernetes_labels on \r\n        #http_user loki\r\n        #http_passwd loki123\r\n        label_map_path \/etc\/k8s_label_mapping.yaml\r\n\r\n    [OUTPUT]                                             # D\r\n        Name loki\r\n        Match host.* \r\n        host loki.monitoring.svc.cluster.local\r\n        #http_user loki\r\n        #http_passwd loki123\r\n        labels host=$_HOSTNAME, service=$_SYSTEMD_UNIT\r\n\r\nvolumeMounts:                                            \r\n- name: config\r\n  mountPath: \/fluent-bit\/etc\/fluent-bit.conf\r\n  subPath: fluent-bit.conf\r\n- name: config\r\n  mountPath: \/fluent-bit\/etc\/custom_parsers.conf\r\n  subPath: custom_parsers.conf\r\n- name: k8s-label-mapping                                # E\r\n  mountPath: 
\/etc\/k8s_label_mapping.yaml\r\n  subPath: k8s_label_mapping.yaml\r\n\r\ndaemonSetVolumes:                                        \r\n- name: varlog\r\n  hostPath:\r\n    path: \/var\/log\r\n- name: runlogjournal                                    # F\r\n  hostPath:\r\n    path: \/run\/log\/journal\r\n- name: varlibdockercontainers                           # G\r\n  hostPath:\r\n    path: \/var\/lib\/docker\/containers\r\n- name: etcmachineid\r\n  hostPath:\r\n    path: \/etc\/machine-id\r\n    type: File\r\n- name: k8s-label-mapping                                # E\r\n  configMap:\r\n    name: k8s-label-mapping\r\n\r\ndaemonSetVolumeMounts:\r\n- name: varlog\r\n  mountPath: \/var\/log\r\n- name: runlogjournal                                    # F\r\n  mountPath: \/run\/log\/journal\r\n- name: varlibdockercontainers                           # G                           \r\n  mountPath: \/var\/lib\/docker\/containers\r\n  readOnly: true\r\n- name: etcmachineid\r\n  mountPath: \/etc\/machine-id\r\n  readOnly: true\r\n<\/pre>\n<ul>\n<li><em># A<\/em>: These tolerations make sure that <em>DaemonSet<\/em> pods also run on the controller nodes, so we also get logs from pods running there.<\/li>\n<li><em># B<\/em>: The standard output plugin can be used for debugging. Just after deploying fluentbit, it should be the first and only output plugin, so you can check that fluentbit is actually finding the logs.<\/li>\n<li><em># C<\/em>: The output used by kubernetes logs (<em>kube.*<\/em>). Note that I am not using the auto_kubernetes_labels option since the labels it adds are not convenient and are missing some essential details. Instead, I am using the <em>label_map_path<\/em> option, which points to a file that defines the labels to extract for kubernetes log entries. 
The label mapping JSON is provided by a <em>ConfigMap<\/em> and is defined as follows:\n<pre>{ \r\n  \"kubernetes\": {\r\n    \"pod_name\": \"pod\",\r\n    \"namespace_name\": \"namespace\",\r\n    \"pod_id\": \"pod_id\",\r\n    \"labels\": {\r\n      \"app.kubernetes.io\/instance\": \"instance\",\r\n      \"app.kubernetes.io\/name\": \"app\",\r\n      \"app\": \"app\"      \r\n    },\r\n    \"annotations\": {\r\n      \"cni.projectcalico.org\/containerID\": \"container_id\",\r\n      \"cni.projectcalico.org\/podIP\": \"pod_ip\"\r\n    },\r\n    \"host\": \"host\",\r\n    \"container_name\": \"container\",\r\n    \"container_image\": \"image\"\r\n  }\r\n}\r\n<\/pre>\n<p>The paths in the mapping file match those in the JSON output shown before, and the values are the names of the labels. So this mapping extracts basic information such as pod, namespace, pod id, app label (from two different labels), container_id, pod_ip, container name, and image.<\/li>\n<li><em># D<\/em>: The output for host-level logs. By default these are the systemd logs of the kubelet service running on each node. It is possible to collect logs of other services as well.<\/li>\n<li><em># C\/ # D<\/em>: The host is <em>loki.monitoring.svc.cluster.local<\/em> which points to a loki service in the monitoring namespace. Note that I am not using a user name and password to connect to loki. Instead I will use network policies to limit access. It is a good idea to configure authentication as soon as logs from outside the cluster are collected as well.<\/li>\n<li><em># E<\/em>: The k8s label mapping volume mount is for mounting the label mapping config file described above.<\/li>\n<li><em># F<\/em>: The <em>\/run\/log\/journal<\/em> host path is required on RHEL\/centos\/rocky distributions since the standard log location \/var\/log is not used to store journald logs on those systems. This mount is not present in the standard fluentbit helm chart.<\/li>\n<li><em># G<\/em>: The container logs. 
On my system, I am not using docker but containerd from a docker repository, so the logs are at a docker path.<\/li>\n<\/ul>\n<h2>Loki<\/h2>\n<p>Loki supports three deployment modes: monolithic, simple scalable, and microservices. Since there is not much traffic on my cluster and I want to keep things simple, I am using the monolithic approach. For production the simple scalable approach would probably be a good choice. Note however that when running on kubernetes in the cloud, the cloud provider will already have logging facilities, so it would be even better to use those instead of loki.<\/p>\n<p>Loki is installed using helm as follows:<\/p>\n<pre>helm repo add grafana https:\/\/grafana.github.io\/helm-charts\r\nhelm show values grafana\/loki &gt; loki-values.yaml\r\nhelm upgrade --install --version 3.8.0 --namespace monitoring loki \\\r\n  grafana\/loki --values loki-values.yaml\r\n<\/pre>\n<p>The values used for loki are as follows:<\/p>\n<pre>loki:\r\n  auth_enabled: false\r\n\r\n  commonConfig:\r\n    replication_factor: 1\r\n\r\n  storage:\r\n    type: filesystem\r\n  compactor:\r\n    retention_enabled: true\r\n\r\ntest:\r\n  enabled: false\r\n  \r\nmonitoring:\r\n  selfMonitoring:\r\n    enabled: false\r\n\r\n    grafanaAgent:\r\n      installOperator: false\r\n\r\nsingleBinary:\r\n  persistence:\r\n    selector:\r\n      matchLabels:\r\n        disktype: lokidata\r\n<\/pre>\n<p>These values are largely self-explanatory and their definitions can be found in the output of helm show values. The persistent volume claim used by loki has a <em>disktype<\/em> selector label so I am able to bind it to a specific persistent volume that I use for loki. 
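<\/p>\n<p>For illustration, a persistent volume that binds this way could look like the sketch below; the volume name and capacity are assumptions, while the <em>disktype<\/em> label matches the helm values above and the path matches the loki volume from the deployment diagram:<\/p>\n<pre>apiVersion: v1\r\nkind: PersistentVolume\r\nmetadata:\r\n  name: lokidata\r\n  labels:\r\n    disktype: lokidata     # matched by the PVC selector from the helm values\r\nspec:\r\n  capacity:\r\n    storage: 10Gi\r\n  accessModes:\r\n  - ReadWriteOnce\r\n  hostPath:\r\n    path: \/data\/logs\/loki\r\n<\/pre>\n<p>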
In addition, the persistent volume I use also has a <em>claimRef<\/em> field to bind it to the PVC generated by the loki helm chart.<\/p>\n<h2>Grafana<\/h2>\n<p>Grafana installation is the simplest of all; I followed the <a href=\"https:\/\/grafana.com\/docs\/grafana\/latest\/setup-grafana\/installation\/kubernetes\/\">instructions<\/a>, which led to the following yaml file:<\/p>\n<pre>apiVersion: apps\/v1\r\nkind: StatefulSet\r\nmetadata:\r\n  labels:\r\n    app: grafana\r\n  name: grafana\r\n  namespace: monitoring\r\nspec:\r\n  serviceName: grafana\r\n  selector:\r\n    matchLabels:\r\n      app: grafana\r\n  template:\r\n    metadata:\r\n      labels:\r\n        app: grafana\r\n    spec:\r\n      securityContext:\r\n        fsGroup: 472\r\n        supplementalGroups:\r\n          - 0\r\n      containers:\r\n        - name: grafana\r\n          image: grafana\/grafana:9.1.0\r\n          imagePullPolicy: IfNotPresent\r\n          ports:\r\n            - containerPort: 3000\r\n              name: http-grafana\r\n              protocol: TCP\r\n          readinessProbe:\r\n            failureThreshold: 3\r\n            httpGet:\r\n              path: \/robots.txt\r\n              port: 3000\r\n              scheme: HTTP\r\n            initialDelaySeconds: 10\r\n            periodSeconds: 30\r\n            successThreshold: 1\r\n            timeoutSeconds: 2\r\n          livenessProbe:\r\n            failureThreshold: 3\r\n            initialDelaySeconds: 30\r\n            periodSeconds: 10\r\n            successThreshold: 1\r\n            tcpSocket:\r\n              port: 3000\r\n            timeoutSeconds: 1\r\n          resources:\r\n            requests:\r\n              cpu: 250m\r\n              memory: 750Mi\r\n          volumeMounts:\r\n            - mountPath: \/var\/lib\/grafana\r\n              name: grafana-pvc\r\n            - mountPath: \/etc\/grafana\/grafana.ini\r\n              name: grafana\r\n              subPath: grafana.ini\r\n      
volumes:\r\n        - name: grafana\r\n          configMap:\r\n            name: grafana\r\n  volumeClaimTemplates:\r\n    - metadata:\r\n        name: grafana-pvc\r\n      spec:\r\n        volumeName: grafana\r\n        accessModes:\r\n          - ReadWriteOnce\r\n        resources:\r\n          requests:\r\n            storage: 10Gi\r\n---\r\napiVersion: v1\r\nkind: Service\r\nmetadata:\r\n  name: grafana\r\n  namespace: monitoring\r\nspec:\r\n  type: ClusterIP\r\n  ports:\r\n    - port: 3000\r\n      targetPort: 3000\r\n  selector:\r\n    app: grafana\r\n<\/pre>\n<p>The only special thing in the setup is that I extracted the <em>grafana.ini<\/em> file from the container and put it in a configmap so I could modify the <em>smtp<\/em> settings and the <em>root_url<\/em> parameter to match the hostname under which grafana is exposed externally. The SMTP settings are:<\/p>\n<pre>[smtp]\r\nenabled = true\r\nhost = mail.exposure:25\r\n;user =\r\n# If the password contains # or ; you have to wrap it with triple quotes. Ex \"\"\"#password;\"\"\"\r\n;password =\r\n;cert_file =\r\n;key_file =\r\n;skip_verify = false\r\nfrom_address = grafana@example.com\r\nfrom_name = grafana@example\r\n# EHLO identity in SMTP dialog (defaults to instance_name)\r\n;ehlo_identity = dashboard.example.com\r\n# SMTP startTLS policy (defaults to 'OpportunisticStartTLS')\r\nstartTLS_policy = NoStartTLS\r\n<\/pre>\n<p>In my setup, I allow mail to be sent from within the cluster using the mail service running in the exposure namespace. An essential setting was <em>startTLS_policy<\/em>, which disables TLS negotiation. If you use an external mail server, it might be a good idea to configure security in a stronger way.<\/p>\n<h2>Viewing logs in grafana<\/h2>\n<p>Configure the data source using Settings\/Configuration\/Data Sources and then add a Loki data source. 
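<\/p>\n<p>As an aside, the same data source can also be provisioned from a file instead of through the UI; grafana reads such files from <em>\/etc\/grafana\/provisioning\/datasources<\/em> at startup. A minimal sketch (not part of my setup, which uses the UI):<\/p>\n<pre>apiVersion: 1\r\ndatasources:\r\n- name: Loki\r\n  type: loki\r\n  access: proxy\r\n  url: http:\/\/loki:3100\r\n<\/pre>\n<p>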
The only setting that is needed is the URL <em>http:\/\/loki:3100<\/em>, since loki is running as the service loki in the same namespace as grafana.<\/p>\n<p>Next, explore the dataset using the Explore menu, select a label and a value, and click on run query. By default grafana will show the raw log records in JSON format with all the metadata. However, using a few filters the log line can be formatted in a more standard way:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignleft wp-image-2620 size-full\" src=\"https:\/\/brakkee.org\/site\/wp-content\/uploads\/2023\/01\/lokiexplore.png\" alt=\"\" width=\"1606\" height=\"925\" srcset=\"https:\/\/brakkee.org\/site\/wp-content\/uploads\/2023\/01\/lokiexplore.png 1606w, https:\/\/brakkee.org\/site\/wp-content\/uploads\/2023\/01\/lokiexplore-300x173.png 300w, https:\/\/brakkee.org\/site\/wp-content\/uploads\/2023\/01\/lokiexplore-1024x590.png 1024w, https:\/\/brakkee.org\/site\/wp-content\/uploads\/2023\/01\/lokiexplore-768x442.png 768w, https:\/\/brakkee.org\/site\/wp-content\/uploads\/2023\/01\/lokiexplore-1536x885.png 1536w\" sizes=\"(max-width: 1606px) 100vw, 1606px\" \/><\/p>\n<p>Above, I am using the JSON filter and line formatting operations, the details of which you can find in the <a href=\"https:\/\/grafana.com\/docs\/loki\/latest\/logql\/log_queries\/\">docs<\/a>.<\/p>\n<p>The JSON formatter flattens the JSON into a flat list of key\/value pairs, and the line formatter can then use the flattened JSON to format the output. In the example above I am using the format <em>{{.namespace}}:{{.pod}} {{.log}}<\/em> to show each log line prefixed with namespace and pod. Live tailing of logs is possible using the top-right Live button in the UI (see below for setup).<\/p>\n<h2>Live log tailing with loki<\/h2>\n<p>One interesting feature of loki and grafana is live log tailing. 
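<\/p>\n<p>Written out as a plain LogQL query, the formatted view from the previous section looks like this; the label names come from the mapping file shown earlier, and the namespace value is just an example:<\/p>\n<pre>{namespace=\"example-com\"} | json | line_format \"{{.namespace}}:{{.pod}} {{.log}}\"\r\n<\/pre>\n<p>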
Using this feature requires a websocket connection from the web browser to grafana and from grafana to loki. The latter is easy since communication from grafana to loki is cluster internal and does not require any proxies to be configured.<\/p>\n<p>For the exposure of grafana I am using my <a href=\"https:\/\/brakkee.org\/site\/2022\/05\/28\/hosting-services-on-google-kubernetes-engine\/\">standard setup<\/a> with SSL termination using ingress, which proxies to one apache HTTP server per domain, which in turn demultiplexes traffic to the appropriate services using name-based virtual hosts.<\/p>\n<p>Here I ran into a number of issues: websocket connections worked through neither ingress nor apache. For nginx, I was using the <a href=\"https:\/\/docs.nginx.com\/nginx-ingress-controller\/\">nginx-stable ingress controller<\/a> which apparently did not support websockets out of the box. This problem was easily solved by using the <a href=\"https:\/\/github.com\/kubernetes\/ingress-nginx\">nginx ingress controller from the kubernetes project<\/a> instead, which supports websockets without any additional configuration.<\/p>\n<p>For apache, I had to modify my configuration following the <a href=\"https:\/\/httpd.apache.org\/docs\/2.4\/mod\/mod_proxy_wstunnel.html\">docs<\/a> to rewrite websocket upgrade requests to the websocket URL. 
This leads to a configuration in apache like this:<\/p>\n<pre>&lt;VirtualHost *:80&gt;\r\n  ServerName grafana.example.com\r\n\r\n  ProxyRequests off\r\n  ProxyPreserveHost on\r\n  #AllowEncodedSlashes on\r\n\r\n  ProxyPass \/ http:\/\/grafana.monitoring.svc.cluster.local:3000\/ disablereuse=On\r\n  ProxyPassReverse \/ http:\/\/grafana.monitoring.svc.cluster.local:3000\/\r\n\r\n  RewriteEngine on\r\n  RewriteCond %{HTTP:Upgrade} websocket [NC]\r\n  RewriteCond %{HTTP:Connection} upgrade [NC]\r\n  RewriteRule ^\/?(.*) \"ws:\/\/grafana.monitoring.svc.cluster.local:3000\/$1\" [P,L]\r\n&lt;\/VirtualHost&gt;\r\n<\/pre>\n<p>To test websockets, I used a websocket echo server <a href=\"https:\/\/github.com\/jmalloc\/echo-server\">docker image<\/a> and deployed it directly behind the ingress controller using a separate domain name, eliminating apache by using an ingress rule. After that, I deployed it behind apache to test access through both nginx and apache:<\/p>\n<pre>apiVersion: apps\/v1\r\nkind: Deployment\r\nmetadata:\r\n  name: websocket-echo-server\r\n  namespace: exposure\r\nspec:\r\n  selector:\r\n    matchLabels:\r\n      app: websocket-echo-server\r\n  template:\r\n    metadata:\r\n      labels:\r\n        app: websocket-echo-server\r\n      name: websocket-echo-server\r\n    spec:\r\n      containers:\r\n      - image: jmalloc\/echo-server\r\n        name: websocket-echo-server\r\n        env:\r\n          - name: PORT\r\n            value: \"8080\"\r\n          - name: LOG_HTTP_BODY\r\n            value: \"yes\"\r\n          #- name: SEND_SERVER_HOSTNAME\r\n          #  value: \"false\"\r\n        ports:\r\n          - containerPort: 8080\r\n            name: websocket\r\n---\r\napiVersion: v1\r\nkind: Service\r\nmetadata:\r\n  labels:\r\n    app: websocket-echo-server\r\n  name: websocket-echo-server\r\n  namespace: exposure\r\nspec:\r\n  type: ClusterIP\r\n  ports:\r\n  - port: 8080\r\n    protocol: TCP\r\n    targetPort: 8080\r\n  selector:\r\n    app: 
websocket-echo-server\r\n<\/pre>\n<p>Testing is done by opening the websocket service in your browser using just the domain name <em>https:\/\/echotest.example.com<\/em> for basic HTTP and <em>https:\/\/echotest.example.com\/.ws<\/em> for the websocket connection.<\/p>\n<h2>Network policies<\/h2>\n<p>To obtain microsegmentation I again used network policies to allow precisely the traffic that is required. Since listing all network policy yaml definitions here would be tedious, I am describing the high level network policies I deployed in the monitoring namespace:<\/p>\n<ul>\n<li>default allow-nothing rule, as always<\/li>\n<li>allow DNS (port 53 for both UDP and TCP) for all pods in this namespace (egress)<\/li>\n<li>allow grafana to connect to loki on port 3100 (egress) and allow loki to accept connections from grafana (ingress)<\/li>\n<li>allow grafana to access grafana.com for plugin downloads (egress)<\/li>\n<li>allow grafana access from the apache server on port 3000 (ingress) and allow apache to connect to grafana (egress)<\/li>\n<li>allow grafana to connect to the mail server running in the cluster at port 25 (egress)<\/li>\n<li>allow fluentbit to connect to loki on port 3100 (egress) and allow loki to accept connections from fluentbit (ingress)<\/li>\n<li>allow fluentbit to connect to the API server (egress)<\/li>\n<\/ul>\n<h2>Final thoughts<\/h2>\n<p>The main issue in this setup was initially to understand the data that fluentbit provides and how it accesses the logs, and, related to that, how to extract intuitive kubernetes labels. In particular, I had to add a new hostpath mount for <em>\/run\/log\/journal<\/em> to the <em>DaemonSet<\/em> to be able to monitor systemd services such as the kubelet. Additionally, I spent some hours getting websockets to work. 
This was challenging since the traffic had to pass through two proxies: the ingress controller and apache.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post describes how to monitor logs in kubernetes with grafana and loki. This covers the use case of logging for troubleshooting purposes. That is, it allows analysing human readable logs coming from multiple systems in one aggregated log. Human &hellip; <a href=\"https:\/\/brakkee.org\/site\/2023\/01\/10\/monitoring-logs-on-k8s-with-loki-and-grafana\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[10],"tags":[],"_links":{"self":[{"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/posts\/2572"}],"collection":[{"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/comments?post=2572"}],"version-history":[{"count":72,"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/posts\/2572\/revisions"}],"predecessor-version":[{"id":2685,"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/posts\/2572\/revisions\/2685"}],"wp:attachment":[{"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/media?parent=2572"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/categories?post=2572"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/brakkee.org\/site\/wp-json\/wp\/v2\/tags?post=2572"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}