Introduce Logging Stack: Add fluentd, add loki (🚧🚧 DEVOPS 🚧🚧) #1058

mrnicegyu11 · 2025-05-21T09:57:20Z

What do these changes do?

⚠️ ⚠️ ⚠️ Invasive PR - merge at convenient time ⚠️ ⚠️ ⚠️

Logs are now sent from the docker daemon to fluentd, which runs as a docker swarm service inside the logging stack, via a host-exposed TCP port. Fluentd forwards the logs to graylog and to the newly introduced grafana loki.

Syslogs from the Linux Kernel will for now only be available in graylog.

The impact of this PR on computational clusters has not been fully evaluated. Please comment on this quickly, @sanderegg. If it doesn't impact them at all, even better.

This unblocks grafana alerting.

🚧🚧 DevOps 🚧🚧:

This PR requires https://git.speag.com/oSparc/osparc-infra/-/merge_requests/305 to be merged, and new AMIs based on that code to be built, made available, and entered into the repo.config.template.
Any old machines (persistent or not) will stop sending logs to graylog (and to loki, for that matter) once this PR is rolled out. Warm and hot buffers must be re-created. On persistent machines, /etc/docker/daemon.json must be updated and the docker daemon restarted (short downtime).
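
For illustration, here is a minimal sketch of how an updated /etc/docker/daemon.json could be generated. It assumes the docker daemon uses Docker's built-in fluentd logging driver. The host, the port 24224, the tag template and the helper names `build_daemon_config` / `merge_daemon_config` are assumptions made for this example; the values actually used are not shown in this PR.

```python
import json


def build_daemon_config(fluentd_host: str = "127.0.0.1", fluentd_port: int = 24224) -> dict:
    """Return logging settings for daemon.json (hypothetical values, not from this PR)."""
    return {
        "log-driver": "fluentd",
        "log-opts": {
            # Host-exposed TCP port of the fluentd swarm service (assumed address).
            "fluentd-address": f"tcp://{fluentd_host}:{fluentd_port}",
            # Connect in the background so containers can start while fluentd is down.
            "fluentd-async": "true",
            # Tag each record with the container name (Docker log tag template).
            "tag": "{{.Name}}",
        },
    }


def merge_daemon_config(existing: dict, logging_config: dict) -> dict:
    """Overlay the logging settings on an existing daemon.json, keeping other keys."""
    merged = dict(existing)
    merged.update(logging_config)
    return merged


if __name__ == "__main__":
    # Print the resulting JSON; writing the file and restarting docker is left to the operator.
    print(json.dumps(merge_daemon_config({}, build_daemon_config()), indent=2))
```

All values under `log-opts` are strings, which is what daemon.json expects. The docker daemon only picks up a changed log driver after a restart, and only for newly created containers.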

Related issue/s

Related PR/s

Checklist

I tested and it works

Service has resource limits and reservations
Service has placement constraints or is global
Service is restartable
Service restart is zero-downtime
Service is monitored (via prometheus and grafana)
Service is not bound to one specific node (e.g. via files or volumes)
Relevant OPS E2E Test are added
Service's Public URL is included in maintenance mode
Service's Public URL is included in testing mode

Merge remote-tracking branch 'upstream/main'

…oundation#979)
* Introduce longhorn chart
* Further longhorn configuration
* Longhorn: further settings configuration
* Fix longhorn configuration bugs Extra: introduce longhorn pv vales for portainer
* Add comment for deletion longhorn
* Further longhorn configuration
* Add README.md for Longhorn wit FAQ
* Update Longhorn readme
* Update readme
* Futher LH configuration
* Update LH's Readme
* Update Longhorn Readme
* Improve LH's Readme
* LH: Reduce reserved default disk space to 5% Since we use a dedicated disk for LH, we can go ahead with 5%
* Use values to set Longhorn storage class
* Update LH's Readme
* LH Readme: add requirements reference
* PR Review: bring back portainer s3 pv
* LH: decrease portinaer volume size

bisgaard-itis

Thanks a lot for the effort.

services/logging/docker-compose.yml.j2

YuryHrytsuk

Thanks!

services/logging/fluentd/Dockerfile

services/logging/fluentd/fluent.conf

services/logging/docker-compose.yml.j2

matusdrobuliak66

Sorry, I haven't gone through it thoroughly; I approve to unblock you.

mrnicegyu11 · 2025-08-11T07:40:18Z

~~All good, I checked with SAN and he had some very reasonable feedback, i will do one more round of tests~~
Misplaced comment, it was meant for another PR. This PR will be merged after the upcoming prod release.

sanderegg · 2025-08-11T09:05:31Z

> Computational clusters and the impact of this PR on them have not been fully evaluated, please comment on this quickly @sanderegg . If it doesn't impact them at all, even better.

@mrnicegyu11 not entirely sure here, but from the computational clusters' end what matters is:

- A new AMI that can be configured to send docker logs to fluentd, which I think you have prepared in https://git.speag.com/oSparc/osparc-infra/-/merge_requests/305 -> the AMI must be created.
- The configuration of both autoscaling and clusters-keeper must then be updated per deployment accordingly, which seems to be covered by https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/1511
- It then depends on the order in which this gets deployed, but it is important that the AMI and the configuration are in sync.

sanderegg

commented

YuryHrytsuk

What is the state of the unchecked items? Could you add a note next to each one describing its state?

Also, do I understand correctly that once this is merged and deployed to master, all logging workflows will be broken until the machines are reprovisioned? The same goes for the AMIs, I presume. So, once merged, DevOps must probably be on top of it.

All the infra changes will need to be applied on PROD, so we need clear, easy-to-follow (hard to get wrong) instructions to ensure the staging / prod releases go smoothly (if it gets to this stage ✊ ) @mrnicegyu11

Thanks for all the effort!

services/logging/docker-compose.yml.j2

services/logging/fluentd/README.md

YuryHrytsuk · 2025-08-12T13:27:43Z

@mrnicegyu11 it is safer if it only gets to master and does not block future staging / prod releases (these stages will not be affected). Thanks 🙏

mrnicegyu11 · 2025-08-13T08:28:47Z

The host port leads to non-zero-downtime restarts. Is that avoidable? It is not, as the docker daemon cannot send logs "directly into the swarm / into containers".
The update policy is something to think about.
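
To observe the short gap while the fluentd service is being replaced, one could probe the host-published port from a node. The following is a minimal sketch, assuming fluentd accepts plain TCP connections on that port. The function name `fluentd_port_reachable`, the host and the port are placeholders for this example and are not taken from this PR.

```python
import socket


def fluentd_port_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: adjust to the host-exposed fluentd port of the deployment.
    print(fluentd_port_reachable("127.0.0.1", 24224))
```

A successful connection only shows that the port is open. It does not prove that fluentd is processing or forwarding records.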

mrnicegyu11 · 2025-08-15T07:23:50Z

Re "Or to put it in other words, can it be only deployed in master or as long as we merge this PR, it will to all deployments with next staging / prod release?": Sadly, it would propagate. We can revert it, but then the changes to this repo, ops-config and osparc-infra also need reverting, as do the AMIs. So it is not impossible, but also not easy, to change back @YuryHrytsuk

mrnicegyu11 · 2025-08-15T07:29:20Z

Re "But loki does need volumes. Am I missing something?": If loki must never lose any logs, it needs a persistent volume. For now, I still consider graylog our main logging platform, and loki is under test. It is not yet clear whether loki can fully replace graylog or whether it is only helpful in the context of grafana's combined telemetry (logging, tracing, metrics).

mrnicegyu11 · 2025-08-15T09:51:09Z

Required fix: bf6719e

mrnicegyu11 and others added 27 commits October 15, 2024 16:18

wip

f0d8cf0

Merge remote-tracking branch 'upstream/main' into main

e906b41

Merge remote-tracking branch 'upstream/main' into main

14c751d

Add csi-s3 and have portainer use it

293f63c

Change request @Hrytsuk 1GB max portainer volume size

f7f72ec

t push

94cfb76

Merge remote-tracking branch 'upstream/main'

Merge remote-tracking branch 'upstream/main'

509c717

Merge remote-tracking branch 'upstream/main'

1a65ecf

Merge remote-tracking branch 'upstream/main'

77ee45e

Arch Linux Certificates Customization

c9c70d6

Merge remote-tracking branch 'upstream/main'

7b8be53

Merge remote-tracking branch 'upstream/main'

bcd61cd

Merge remote-tracking branch 'upstream/main'

58e1030

Merge remote-tracking branch 'upstream/main'

ed8d479

Merge remote-tracking branch 'upstream/main'

dda6e01

Merge remote-tracking branch 'upstream/main'

f6f4f36

Merge remote-tracking branch 'upstream/main'

5dca5c3

Merge remote-tracking branch 'upstream/main'

4a653ef

Merge remote-tracking branch 'upstream/main'

3a21f0f

Fix pgsql exporter failure

48fbbca

Merge remote-tracking branch 'upstream/main'

08c57db

Experimental: Try to add tracing to simcore-traefik on master

3ea41b5

Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd

1cf605d

wip

bcc67d4

Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd

57947e3

Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd

88e4ed5

mrnicegyu11 self-assigned this Jul 7, 2025

mrnicegyu11 added observability alerting/monitoring EPIC labels Jul 7, 2025

mrnicegyu11 requested a review from YuryHrytsuk as a code owner August 5, 2025 09:55

Remove accidental commit

7944b1d

bisgaard-itis approved these changes Aug 7, 2025

services/logging/docker-compose.yml.j2 (outdated, resolved)

YuryHrytsuk reviewed Aug 7, 2025

Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd

9100654

mrnicegyu11 requested review from YuryHrytsuk and sanderegg August 8, 2025 08:34

mrnicegyu11 added 2 commits August 8, 2025 10:35

Merge remote-tracking branch 'upstream/main' into 2025/add/fluentd

9b99c28

Add README.md, remove loki docker volume

aeab4bd

matusdrobuliak66 approved these changes Aug 8, 2025

Merge branch 'main' into 2025/add/fluentd

f974fdd

sanderegg reviewed Aug 11, 2025

YuryHrytsuk approved these changes Aug 12, 2025

services/logging/docker-compose.yml.j2 (3 resolved threads)

services/logging/fluentd/README.md (outdated, resolved)

mrnicegyu11 modified the milestones: Engage, Voyager Aug 13, 2025

mrnicegyu11 added 3 commits August 13, 2025 12:53

Merge branch 'main' into 2025/add/fluentd

8188b04

Merge branch 'main' into 2025/add/fluentd

98aac4a

Merge branch 'main' into 2025/add/fluentd

d5e4f3c

Update README.md

057a3f8

mrnicegyu11 merged commit 1c70781 into ITISFoundation:main Aug 15, 2025
3 checks passed

This was referenced Aug 25, 2025

docker UDP log issues - conntrack #1180

Open

FluentD configuration: remove / in front of container names #1188

Merged

sanderegg mentioned this pull request Aug 26, 2025

ensure graylog source is correct #1189

Merged

1 task
