Commits
20 commits
2249efc
Refactor ops-traefik - Remove custom docker compose overwrites, add c…
mrnicegyu11 May 12, 2025
df6482b
remove unused env-vars
mrnicegyu11 May 12, 2025
a35cea8
rename log-level env-vars
mrnicegyu11 May 12, 2025
fb5b138
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 May 13, 2025
5234228
only bind ports when needed
mrnicegyu11 May 13, 2025
70c8908
Add env_file back to traefik compose spec
mrnicegyu11 May 13, 2025
1cf605d
Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd
mrnicegyu11 May 14, 2025
bcc67d4
wip
mrnicegyu11 May 14, 2025
4fa4f26
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 May 20, 2025
858c82d
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 May 21, 2025
57947e3
Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd
mrnicegyu11 May 21, 2025
88e4ed5
Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd
mrnicegyu11 May 28, 2025
1eba82c
Merge branch '2025/add/fluentd' into 2025/refactor/traefik
mrnicegyu11 May 28, 2025
236dda9
Merge branch '2025/add/traefikOpenTelemetry' into 2025/refactor/traefik
mrnicegyu11 Jul 3, 2025
de1e17e
Kubernetes: add local storage (#1100)
YuryHrytsuk Jul 3, 2025
87875e0
Merge branch '2025/add/traefikOpenTelemetry' into 2025/refactor/traefik
mrnicegyu11 Jul 3, 2025
b0f1710
Fix deploy_ops CD step - monitoring
mrnicegyu11 Jul 4, 2025
1970766
Merge remote-tracking branch 'upstream/main' into 2025/refactor/traefik
mrnicegyu11 Jul 4, 2025
221c930
wip
mrnicegyu11 Jul 8, 2025
0969ffe
wip
mrnicegyu11 Jul 8, 2025
29 changes: 28 additions & 1 deletion charts/longhorn/README.md
@@ -27,7 +27,15 @@ Source:

### How to configure disks for LH

As of now, we follow the same approach we use for `/docker` folder (via ansible playbook) but we use `/longhorn` folder name
Manual configuration performed (to be moved to ansible; consolidated into a shell sketch below):
1. Create a partition on the disk
* e.g. using `fdisk`: https://phoenixnap.com/kb/linux-create-partition
2. Format the partition as XFS
* `sudo mkfs.xfs -f /dev/sda1`
3. Mount the partition: `sudo mount -t xfs /dev/sda1 /longhorn`
4. Persist the mount in `/etc/fstab` by adding the line
* `UUID=<partition's uuid> /longhorn xfs pquota 0 0`
* The UUID can be retrieved via `lsblk -f`
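A hedged consolidation of the steps above. It assumes `/dev/sda` is the empty target disk and `/longhorn` the mount point, uses non-interactive `parted` in place of the interactive `fdisk` from the list, and mounts with the same `pquota` option the fstab line uses. Verify the device name with `lsblk` before running anything.

```bash
# Sketch only: double-check the target disk before partitioning.
sudo parted /dev/sda --script mklabel gpt mkpart primary xfs 0% 100%
sudo mkfs.xfs -f /dev/sda1
sudo mkdir -p /longhorn
sudo mount -t xfs -o pquota /dev/sda1 /longhorn
# Persist the mount; `lsblk -no UUID` prints just the partition's UUID
uuid=$(lsblk -no UUID /dev/sda1)
echo "UUID=${uuid} /longhorn xfs pquota 0 0" | sudo tee -a /etc/fstab
```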

Issue asking LH to clearly document requirements: https://github.com/longhorn/longhorn/issues/11125

@@ -54,3 +62,22 @@ Insights into LH's performance:

Resource requirements:
* https://github.com/longhorn/longhorn/issues/1691

### (Kubernetes) Node maintenance

https://longhorn.io/docs/1.8.1/maintenance/maintenance/

Note: you can use the Longhorn GUI to perform some of these operations
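A minimal sketch of the generic cordon/drain flow, assuming a hypothetical node name `worker-1`; the Longhorn maintenance doc linked above is authoritative for Longhorn-specific flags and caveats.

```bash
kubectl cordon worker-1        # stop scheduling new pods onto the node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# ... perform the maintenance, then re-enable scheduling:
kubectl uncordon worker-1
```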

### Zero-downtime update of Longhorn disks (procedure)
Notes:
* Update one node at a time so that the other nodes can still serve data
* A CLI sketch of the scheduling/eviction steps follows the list

1. Go to the LH GUI and select a Node
   1. Disable scheduling
   2. Request eviction
2. Remove the disk from the node
   * If the remove icon is disabled, disable eviction on the disk to enable the remove button
3. Perform the disk updates on the node
4. Make sure LH didn't pick up a wrongly configured disk in the meantime; remove the wrong disk if it did
5. Wait until LH automatically adds the disk to the Node
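The scheduling and eviction steps can also be done by patching the Longhorn Node CR (`nodes.longhorn.io`). A hedged sketch, assuming the default `longhorn-system` namespace and a hypothetical node name; verify the field names against the deployed Longhorn version's CRD.

```bash
NODE=worker-1
# Disable scheduling on the node
kubectl -n longhorn-system patch nodes.longhorn.io "${NODE}" --type merge \
  -p '{"spec":{"allowScheduling":false}}'
# Request eviction of the replicas hosted on the node
kubectl -n longhorn-system patch nodes.longhorn.io "${NODE}" --type merge \
  -p '{"spec":{"evictionRequested":true}}'
```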
43 changes: 43 additions & 0 deletions charts/topolvm/README.md
@@ -0,0 +1,43 @@
## topolvm components and architecture
See diagram https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/design.md

## Prerequisites
`topolvm` does not automatically create Volume Groups (specified in device-classes); they need to be created separately (e.g. manually, via ansible, ...)

Manual example (Ubuntu 22.04; see the consolidated sketch below the Source link):
1. Create a partition to use later (`sudo fdisk /dev/sda`)
2. Create a PV (`sudo pvcreate /dev/sda2`)
* Prerequisite: `sudo apt install lvm2`
3. Create a Volume Group (`sudo vgcreate topovg-sdd /dev/sda2`)
* Note: the Volume Group's name must match the `volume-group` setting inside `lvmd.deviceClasses`
4. Check the Volume Group (`sudo vgdisplay`)

Source: https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/getting-started.md#prerequisites
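The same steps as a hedged shell sketch; `/dev/sda2` and the VG name `topovg-sdd` mirror the example above and the chart values below.

```bash
sudo apt install -y lvm2             # provides pvcreate/vgcreate
sudo fdisk /dev/sda                  # interactively create e.g. /dev/sda2
sudo pvcreate /dev/sda2              # register the partition as an LVM PV
sudo vgcreate topovg-sdd /dev/sda2   # must match lvmd.deviceClasses[].volume-group
sudo vgdisplay                       # verify the Volume Group exists
```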

## Deleting PV(C)s with `retain` reclaim policy
1. Delete the release (e.g. `helm uninstall -n test test`)
2. Find the LogicalVolume CR (`kubectl get logicalvolumes.topolvm.io`)
3. Delete the LogicalVolume CR (`kubectl delete logicalvolumes.topolvm.io <lv-name>`)
4. Delete the PV (`kubectl delete pv <pv-name>`)

## Backup / Snapshotting
1. Only possible when using thin provisioning
2. We use thick (non-thin-provisioned) volumes --> no snapshot support

Track this feature request for changes: https://github.com/topolvm/topolvm/issues/1070

Note: there might be alternative, undocumented ways (e.g. via Velero)

## Resizing PVs
1. Update the storage capacity in the configuration
2. Deploy the changes

Note: the storage size can only be increased; decreasing it yields a `Forbidden: field can not be less than previous value` error (a hedged example follows)
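For illustration, the equivalent direct PVC patch, possible because the storage class below sets `allowVolumeExpansion: true`; namespace, PVC name, and size are hypothetical.

```bash
# Grow a PVC to 20Gi; shrinking is rejected with the error quoted above.
kubectl -n my-namespace patch pvc my-data --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```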

## Node maintenance

Read https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/node-maintenance.md

## Using topolvm. Notes
* `topolvm` may not work with pods that define `spec.nodeName`. Use node affinity instead (see the sketch below):
https://github.com/topolvm/topolvm/blob/main/docs/faq.md#the-pod-does-not-start-when-nodename-is-specified-in-the-pod-spec
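A hedged sketch of the node-affinity alternative; pod name, image, and node name are hypothetical.

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: topolvm-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - worker-1   # hypothetical node name
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
EOF
```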
106 changes: 106 additions & 0 deletions charts/topolvm/values.yaml.gotmpl
@@ -0,0 +1,106 @@
lvmd:
# set up lvmd service with DaemonSet
managed: true

# device classes (VGs) need to be created outside of topolvm (e.g. manually, via ansible, ...)
deviceClasses:
- name: ssd
volume-group: topovg-sdd
default: true
spare-gb: 5

storageClasses:
- name: {{ .Values.topolvmStorageClassName }}
storageClass:
# Want to use non-default device class?
# See configuration example in
# https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/snapshot-and-restore.md#set-up-a-storage-class

fsType: xfs
isDefaultClass: false
# volumeBindingMode can be either WaitForFirstConsumer or Immediate. WaitForFirstConsumer is recommended because TopoLVM cannot schedule pods wisely if volumeBindingMode is Immediate.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# NOTE: removal requires manual clean-up of PVs, LVM volumes,
# and Logical Volumes (CR logicalvolumes.topolvm.io).
# Removing the Logical Volume (CR) cleans up the LVM volume on the node,
# but the PV still has to be removed manually.
# Read more: https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/advanced-setup.md#storageclass
reclaimPolicy: Retain

resources:
topolvm_node:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 500Mi
cpu: 500m

topolvm_controller:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

lvmd:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 500Mi
cpu: 500m

csi_registrar:
requests:
cpu: 25m
memory: 10Mi
limits:
cpu: 200m
memory: 200Mi

csi_provisioner:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

csi_resizer:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

csi_snapshotter:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 200Mi
cpu: 200m

liveness_probe:
requests:
cpu: 25m
memory: 10Mi
limits:
cpu: 200m
memory: 200Mi

# https://github.com/topolvm/topolvm/blob/topolvm-chart-v15.5.5/docs/topolvm-scheduler.md
scheduler:
# start simple
enabled: false

cert-manager:
# start simple
enabled: false

snapshot:
enabled: true
2 changes: 1 addition & 1 deletion scripts/deployments/deploy_everything_locally.bash
@@ -243,7 +243,7 @@ if [ "$start_opsstack" -eq 0 ]; then
call_make "." up-"$stack_target";
popd

# -------------------------------- GRAYLOG -------------------------------
# -------------------------------- Graylog -------------------------------
log_info "starting graylog..."
service_dir="${repo_basedir}"/services/graylog
pushd "${service_dir}"
Binary file removed services/graylog/GraylogWorkflow.png
2 changes: 1 addition & 1 deletion services/graylog/docker-compose.aws.yml
@@ -1,4 +1,4 @@
version: '3.7'
version: '3.8'
services:
mongodb:
deploy:
1 change: 0 additions & 1 deletion services/graylog/docker-compose.dalco.yml
@@ -1,4 +1,3 @@
version: "3.7"
services:
mongodb:
deploy:
2 changes: 1 addition & 1 deletion services/graylog/docker-compose.letsencrypt.dns.yml
@@ -1,4 +1,4 @@
version: '3.7'
version: '3.8'
services:
graylog:
deploy:
2 changes: 1 addition & 1 deletion services/graylog/docker-compose.letsencrypt.http.yml
@@ -1,4 +1,4 @@
version: '3.7'
version: '3.8'
services:
graylog:
deploy:
1 change: 0 additions & 1 deletion services/graylog/docker-compose.local.yml
@@ -1,4 +1,3 @@
version: "3.7"
services:
mongodb:
deploy:
1 change: 0 additions & 1 deletion services/graylog/docker-compose.master.yml
@@ -1,4 +1,3 @@
version: "3.7"
services:
mongodb:
deploy:
97 changes: 90 additions & 7 deletions services/graylog/docker-compose.yml.j2
@@ -1,4 +1,3 @@
version: "3.7"
services:
# MongoDB: https://hub.docker.com/_/mongo/
mongodb:
@@ -19,7 +18,7 @@ services:
memory: 300M
cpus: "0.1"
networks:
default:
graylog:
aliases:
- mongo # needed because of graylog configuration

@@ -46,6 +45,8 @@ services:
reservations:
memory: 1G
cpus: "0.1"
networks:
graylog:
# Graylog: https://hub.docker.com/r/graylog/graylog/
graylog:
image: graylog/graylog:6.0.5
@@ -69,8 +70,11 @@ services:
- GRAYLOG_HTTP_EXTERNAL_URI=${GRAYLOG_HTTP_EXTERNAL_URI}
- GRAYLOG_ELASTICSEARCH_HOSTS=http://elasticsearch:9200,
networks:
- public
- default
public:
monitoring:
graylog:
aliases:
- graylog
ports:
- 12201:12201/udp
- 12202:12202/udp
@@ -85,10 +89,9 @@ services:
reservations:
cpus: "0.1"
memory: 1G

labels:
- traefik.enable=true
- traefik.swarm.network=${PUBLIC_NETWORK}
- traefik.docker.network=${PUBLIC_NETWORK}
# direct access through port
- traefik.http.services.graylog.loadbalancer.server.port=9000
- traefik.http.routers.graylog.rule=Host(`${MONITORING_DOMAIN}`) && PathPrefix(`/graylog`)
@@ -97,18 +100,98 @@ services:
- traefik.http.middlewares.graylog_replace_regex.replacepathregex.regex=^/graylog/?(.*)$$
- traefik.http.middlewares.graylog_replace_regex.replacepathregex.replacement=/$${1}
- traefik.http.routers.graylog.middlewares=ops_whitelist_ips@swarm, ops_gzip@swarm, graylog_replace_regex
fluentd:
image: itisfoundation/fluentd:v1.16.8-1.0
configs:
- source: fluentd_config
target: /fluentd/etc/fluent.conf
environment:
- GRAYLOG_HOST=graylog
- GRAYLOG_PORT=12201
- LOKI_URL=http://loki:3100
- FLUENTD_HOSTNAME={% raw %}{{.Node.Hostname}}{% endraw %}
ports:
- "24224:24224/tcp"
deploy:
#mode: global # Run on all nodes
restart_policy:
condition: on-failure
resources:
limits:
cpus: '1.0'
memory: 1G
reservations:
cpus: '0.5'
memory: 512M
update_config:
parallelism: 1
delay: 10s
order: start-first
networks:
- monitoring
- graylog
healthcheck:
test: ["CMD", "curl", "-f", "http://0.0.0.0:24220/api/plugins"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s

loki:
image: grafana/loki:3.5.0
configs:
- source: loki_config
target: /etc/loki/loki.yaml
command: -config.file=/etc/loki/loki.yaml
deploy:
placement:
constraints: []
replicas: 1
restart_policy:
condition: any
delay: 5s
resources:
limits:
cpus: '1.0'
memory: 2G
reservations:
cpus: '0.5'
memory: 1G
update_config:
parallelism: 1
delay: 10s
order: start-first
networks:
- monitoring
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://0.0.0.0:3100/ready"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s


volumes:
loki-data:
mongo_data:
elasticsearch_data:
graylog_journal:

networks:
graylog:
public:
external: true
name: ${PUBLIC_NETWORK}

monitoring:
external: true
name: ${MONITORED_NETWORK}
configs:
graylog_config:
name: ${STACK_NAME}_graylog_config_{{ "./data/contentpacks/osparc-custom-content-pack-v2.json" | sha256file | substring(0,10) }}
file: ./data/contentpacks/osparc-custom-content-pack-v2.json
fluentd_config:
name: ${STACK_NAME}_fluentd_config_{{ "./fluentd/fluent.conf" | sha256file | substring(0,10) }}
file: ./fluentd/fluent.conf
loki_config:
name: ${STACK_NAME}_loki_config_{{ "./loki.yaml" | sha256file | substring(0,10) }}
file: ./loki.yaml