Commits
20 commits
2249efc
Refactor ops-traefik - Remove custom docker compose overwrites, add c…
mrnicegyu11 May 12, 2025
df6482b
remove unused env-vars
mrnicegyu11 May 12, 2025
a35cea8
rename log-level env-vars
mrnicegyu11 May 12, 2025
fb5b138
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 May 13, 2025
5234228
only bind ports when needed
mrnicegyu11 May 13, 2025
70c8908
Add env_file back to traefik compose spec
mrnicegyu11 May 13, 2025
1cf605d
Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd
mrnicegyu11 May 14, 2025
bcc67d4
wip
mrnicegyu11 May 14, 2025
4fa4f26
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 May 20, 2025
858c82d
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 May 21, 2025
57947e3
Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd
mrnicegyu11 May 21, 2025
88e4ed5
Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd
mrnicegyu11 May 28, 2025
1eba82c
Merge branch '2025/add/fluentd' into 2025/refactor/traefik
mrnicegyu11 May 28, 2025
236dda9
Merge branch '2025/add/traefikOpenTelemetry' into 2025/refactor/traefik
mrnicegyu11 Jul 3, 2025
de1e17e
Kubernetes: add local storage (#1100)
YuryHrytsuk Jul 3, 2025
87875e0
Merge branch '2025/add/traefikOpenTelemetry' into 2025/refactor/traefik
mrnicegyu11 Jul 3, 2025
b0f1710
Fix deploy_ops CD step - monitoring
mrnicegyu11 Jul 4, 2025
1970766
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 Jul 4, 2025
221c930
wip
mrnicegyu11 Jul 8, 2025
0969ffe
wip
mrnicegyu11 Jul 8, 2025
29 changes: 28 additions & 1 deletion charts/longhorn/README.md
@@ -27,7 +27,15 @@ Source:

### How to configure disks for LH

As of now, we follow the same approach we use for `/docker` folder (via ansible playbook) but we use `/longhorn` folder name
Manual configuration performed (to be moved to ansible; consolidated into a shell sketch below):
1. Create a partition on the disk
* e.g. using `fdisk`: https://phoenixnap.com/kb/linux-create-partition
2. Format the partition as XFS
* `sudo mkfs.xfs -f /dev/sda1`
3. Mount the partition: `sudo mount -t xfs /dev/sda1 /longhorn`
4. Persist the mount in `/etc/fstab` by adding the line
* `UUID=<partition's uuid> /longhorn xfs pquota 0 0`
* The UUID can be retrieved via `lsblk -f`
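A hedged consolidation of the steps above. It assumes `/dev/sda` is the empty target disk and `/longhorn` the mount point, uses non-interactive `parted` in place of the interactive `fdisk` from the list, and mounts with the same `pquota` option the fstab line uses. Verify the device name with `lsblk` before running anything.

```bash
# Sketch only: double-check the target disk before partitioning.
sudo parted /dev/sda --script mklabel gpt mkpart primary xfs 0% 100%
sudo mkfs.xfs -f /dev/sda1
sudo mkdir -p /longhorn
sudo mount -t xfs -o pquota /dev/sda1 /longhorn
# Persist the mount; `lsblk -no UUID` prints just the partition's UUID
uuid=$(lsblk -no UUID /dev/sda1)
echo "UUID=${uuid} /longhorn xfs pquota 0 0" | sudo tee -a /etc/fstab
```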

Issue asking LH to clearly document requirements: https://github.com/longhorn/longhorn/issues/11125

@@ -54,3 +62,22 @@ Insights into LH's performance:

Resource requirements:
* https://github.com/longhorn/longhorn/issues/1691

### (Kubernetes) Node maintenance

https://longhorn.io/docs/1.8.1/maintenance/maintenance/

Note: you can use the Longhorn GUI to perform some of these operations
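A minimal sketch of the generic cordon/drain flow, assuming a hypothetical node name `worker-1`; the Longhorn maintenance doc linked above is authoritative for Longhorn-specific flags and caveats.

```bash
kubectl cordon worker-1        # stop scheduling new pods onto the node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# ... perform the maintenance, then re-enable scheduling:
kubectl uncordon worker-1
```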

### Zero-downtime update of Longhorn disks (procedure)
Notes:
* Update one node at a time so that the other nodes can still serve data
* A CLI sketch of the scheduling/eviction steps follows the list

1. Go to the LH GUI and select a Node
   1. Disable scheduling
   2. Request eviction
2. Remove the disk from the node
   * If the remove icon is disabled, disable eviction on the disk to enable the remove button
3. Perform the disk updates on the node
4. Make sure LH didn't pick up a wrongly configured disk in the meantime; remove the wrong disk if it did
5. Wait until LH automatically adds the disk to the Node
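The scheduling and eviction steps can also be done by patching the Longhorn Node CR (`nodes.longhorn.io`). A hedged sketch, assuming the default `longhorn-system` namespace and a hypothetical node name; verify the field names against the deployed Longhorn version's CRD.

```bash
NODE=worker-1
# Disable scheduling on the node
kubectl -n longhorn-system patch nodes.longhorn.io "${NODE}" --type merge \
  -p '{"spec":{"allowScheduling":false}}'
# Request eviction of the replicas hosted on the node
kubectl -n longhorn-system patch nodes.longhorn.io "${NODE}" --type merge \
  -p '{"spec":{"evictionRequested":true}}'
```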
43 changes: 43 additions & 0 deletions charts/topolvm/README.md
@@ -0,0 +1,43 @@
## topolvm components and architecture
See diagram https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/design.md

## Prerequisites
`topolvm` does not automatically create Volume Groups (specified in device-classes); they need to be created separately (e.g. manually, via ansible, ...)

Manual example (Ubuntu 22.04; see the consolidated sketch below the Source link):
1. Create a partition to use later (`sudo fdisk /dev/sda`)
2. Create a PV (`sudo pvcreate /dev/sda2`)
* Prerequisite: `sudo apt install lvm2`
3. Create a Volume Group (`sudo vgcreate topovg-sdd /dev/sda2`)
* Note: the Volume Group's name must match the `volume-group` setting inside `lvmd.deviceClasses`
4. Check the Volume Group (`sudo vgdisplay`)

Source: https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/getting-started.md#prerequisites
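The same steps as a hedged shell sketch; `/dev/sda2` and the VG name `topovg-sdd` mirror the example above and the chart values below.

```bash
sudo apt install -y lvm2             # provides pvcreate/vgcreate
sudo fdisk /dev/sda                  # interactively create e.g. /dev/sda2
sudo pvcreate /dev/sda2              # register the partition as an LVM PV
sudo vgcreate topovg-sdd /dev/sda2   # must match lvmd.deviceClasses[].volume-group
sudo vgdisplay                       # verify the Volume Group exists
```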

## Deleting PV(C)s with `retain` reclaim policy
1. Delete the release (e.g. `helm uninstall -n test test`)
2. Find the LogicalVolume CR (`kubectl get logicalvolumes.topolvm.io`)
3. Delete the LogicalVolume CR (`kubectl delete logicalvolumes.topolvm.io <lv-name>`)
4. Delete the PV (`kubectl delete pv <pv-name>`)

## Backup / Snapshotting
1. Only possible when using thin provisioning
2. We use thick (non-thin-provisioned) volumes --> no snapshot support

Track this feature request for changes: https://github.com/topolvm/topolvm/issues/1070

Note: there might be alternative, undocumented ways (e.g. via Velero)

## Resizing PVs
1. Update the storage capacity in the configuration
2. Deploy the changes

Note: the storage size can only be increased; decreasing it yields a `Forbidden: field can not be less than previous value` error (a hedged example follows)
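For illustration, the equivalent direct PVC patch, possible because the storage class below sets `allowVolumeExpansion: true`; namespace, PVC name, and size are hypothetical.

```bash
# Grow a PVC to 20Gi; shrinking is rejected with the error quoted above.
kubectl -n my-namespace patch pvc my-data --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```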

## Node maintenance

Read https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/node-maintenance.md

## Using topolvm. Notes
* `topolvm` may not work with pods that define `spec.nodeName`. Use node affinity instead (see the sketch below):
https://github.com/topolvm/topolvm/blob/main/docs/faq.md#the-pod-does-not-start-when-nodename-is-specified-in-the-pod-spec
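A hedged sketch of the node-affinity alternative; pod name, image, and node name are hypothetical.

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: topolvm-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - worker-1   # hypothetical node name
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
EOF
```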
106 changes: 106 additions & 0 deletions charts/topolvm/values.yaml.gotmpl
@@ -0,0 +1,106 @@
lvmd:
# set up lvmd service with DaemonSet
managed: true

# device classes (VGs) need to be created outside of topolvm (e.g. manually, via ansible, ...)
deviceClasses:
- name: ssd
volume-group: topovg-sdd
default: true
spare-gb: 5

storageClasses:
- name: {{ .Values.topolvmStorageClassName }}
storageClass:
# Want to use non-default device class?
# See configuration example in
# https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/snapshot-and-restore.md#set-up-a-storage-class

fsType: xfs
isDefaultClass: false
# volumeBindingMode can be either WaitForFirstConsumer or Immediate. WaitForFirstConsumer is recommended because TopoLVM cannot schedule pods wisely if volumeBindingMode is Immediate.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# NOTE: removal requires manual clean-up of PVs, LVM volumes,
# and Logical Volumes (CR logicalvolumes.topolvm.io).
# Removing the Logical Volume (CR) cleans up the LVM volume on the node,
# but the PV still has to be removed manually.
# Read more: https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/advanced-setup.md#storageclass
reclaimPolicy: Retain

resources:
topolvm_node:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 500Mi
cpu: 500m

topolvm_controller:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

lvmd:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 500Mi
cpu: 500m

csi_registrar:
requests:
cpu: 25m
memory: 10Mi
limits:
cpu: 200m
memory: 200Mi

csi_provisioner:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

csi_resizer:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

csi_snapshotter:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

liveness_probe:
requests:
cpu: 25m
memory: 10Mi
limits:
cpu: 200m
memory: 200Mi

# https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/topolvm-scheduler.md
scheduler:
# start simple
enabled: false

cert-manager:
# start simple
enabled: false

snapshot:
enabled: true
2 changes: 1 addition & 1 deletion scripts/deployments/deploy_everything_locally.bash
@@ -243,7 +243,7 @@ if [ "$start_opsstack" -eq 0 ]; then
call_make "." up-"$stack_target";
popd

# -------------------------------- GRAYLOG -------------------------------
# -------------------------------- Graylog -------------------------------
log_info "starting graylog..."
service_dir="${repo_basedir}"/services/graylog
pushd "${service_dir}"
Binary file removed services/graylog/GraylogWorkflow.png
2 changes: 1 addition & 1 deletion services/graylog/docker-compose.aws.yml
@@ -1,4 +1,4 @@
version: '3.7'
version: '3.8'
services:
mongodb:
deploy:
1 change: 0 additions & 1 deletion services/graylog/docker-compose.dalco.yml
@@ -1,4 +1,3 @@
version: "3.7"
services:
mongodb:
deploy:
2 changes: 1 addition & 1 deletion services/graylog/docker-compose.letsencrypt.dns.yml
@@ -1,4 +1,4 @@
version: '3.7'
version: '3.8'
services:
graylog:
deploy:
2 changes: 1 addition & 1 deletion services/graylog/docker-compose.letsencrypt.http.yml
@@ -1,4 +1,4 @@
version: '3.7'
version: '3.8'
services:
graylog:
deploy:
1 change: 0 additions & 1 deletion services/graylog/docker-compose.local.yml
@@ -1,4 +1,3 @@
version: "3.7"
services:
mongodb:
deploy:
1 change: 0 additions & 1 deletion services/graylog/docker-compose.master.yml
@@ -1,4 +1,3 @@
version: "3.7"
services:
mongodb:
deploy:
97 changes: 90 additions & 7 deletions services/graylog/docker-compose.yml.j2
@@ -1,4 +1,3 @@
version: "3.7"
services:
# MongoDB: https://hub.docker.com/_/mongo/
mongodb:
@@ -19,7 +18,7 @@ services:
memory: 300M
cpus: "0.1"
networks:
default:
graylog:
aliases:
- mongo # needed because of graylog configuration

@@ -46,6 +45,8 @@ services:
reservations:
memory: 1G
cpus: "0.1"
networks:
graylog:
# Graylog: https://hub.docker.com/r/graylog/graylog/
graylog:
image: graylog/graylog:6.0.5
@@ -69,8 +70,11 @@ services:
- GRAYLOG_HTTP_EXTERNAL_URI=${GRAYLOG_HTTP_EXTERNAL_URI}
- GRAYLOG_ELASTICSEARCH_HOSTS=http://elasticsearch:9200,
networks:
- public
- default
public:
monitoring:
graylog:
aliases:
- graylog
ports:
- 12201:12201/udp
- 12202:12202/udp
@@ -85,10 +89,9 @@ services:
reservations:
cpus: "0.1"
memory: 1G

labels:
- traefik.enable=true
- traefik.swarm.network=${PUBLIC_NETWORK}
- traefik.docker.network=${PUBLIC_NETWORK}
# direct access through port
- traefik.http.services.graylog.loadbalancer.server.port=9000
- traefik.http.routers.graylog.rule=Host(`${MONITORING_DOMAIN}`) && PathPrefix(`/graylog`)
@@ -97,18 +100,98 @@ services:
- traefik.http.middlewares.graylog_replace_regex.replacepathregex.regex=^/graylog/?(.*)$$
- traefik.http.middlewares.graylog_replace_regex.replacepathregex.replacement=/$${1}
- traefik.http.routers.graylog.middlewares=ops_whitelist_ips@swarm, ops_gzip@swarm, graylog_replace_regex
fluentd:
image: itisfoundation/fluentd:v1.16.8-1.0
configs:
- source: fluentd_config
target: /fluentd/etc/fluent.conf
environment:
- GRAYLOG_HOST=graylog
- GRAYLOG_PORT=12201
- LOKI_URL=http://loki:3100
- FLUENTD_HOSTNAME={% raw %}{{.Node.Hostname}}{% endraw %}
ports:
- "24224:24224/tcp"
deploy:
#mode: global # Run on all nodes
restart_policy:
condition: on-failure
resources:
limits:
cpus: '1.0'
memory: 1G
reservations:
cpus: '0.5'
memory: 512M
update_config:
parallelism: 1
delay: 10s
order: start-first
networks:
- monitoring
- graylog
healthcheck:
test: ["CMD", "curl", "-f", "http://0.0.0.0:24220/api/plugins"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s

loki:
image: grafana/loki:3.5.0
configs:
- source: loki_config
target: /etc/loki/loki.yaml
command: -config.file=/etc/loki/loki.yaml
deploy:
placement:
constraints: []
replicas: 1
restart_policy:
condition: any
delay: 5s
resources:
limits:
cpus: '1.0'
memory: 2G
reservations:
cpus: '0.5'
memory: 1G
update_config:
parallelism: 1
delay: 10s
order: start-first
networks:
- monitoring
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://0.0.0.0:3100/ready"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s


volumes:
loki-data:
mongo_data:
elasticsearch_data:
graylog_journal:

networks:
graylog:
public:
external: true
name: ${PUBLIC_NETWORK}

monitoring:
external: true
name: ${MONITORED_NETWORK}
configs:
graylog_config:
name: ${STACK_NAME}_graylog_config_{{ "./data/contentpacks/osparc-custom-content-pack-v2.json" | sha256file | substring(0,10) }}
file: ./data/contentpacks/osparc-custom-content-pack-v2.json
fluentd_config:
name: ${STACK_NAME}_fluentd_config_{{ "./fluentd/fluent.conf" | sha256file | substring(0,10) }}
file: ./fluentd/fluent.conf
loki_config:
name: ${STACK_NAME}_loki_config_{{ "./loki.yaml" | sha256file | substring(0,10) }}
file: ./loki.yaml