@mrnicegyu11 mrnicegyu11 commented May 21, 2025

What do these changes do?

⚠️ ⚠️ ⚠️ Invasive PR - merge at convenient time ⚠️ ⚠️ ⚠️

Logs are now sent from the docker daemon to fluentd, which runs as a docker swarm service inside the logging stack and is reached via a host-exposed TCP port. Fluentd forwards the logs to graylog and to the newly introduced grafana loki.

Syslogs from the Linux Kernel will for now only be available in graylog.

Computational clusters and the impact of this PR on them have not been fully evaluated, please comment on this quickly @sanderegg. If it doesn't impact them at all, even better.

This unblocks grafana alerting.

🚧🚧 DevOps 🚧🚧:

  • This PR requires https://git.speag.com/oSparc/osparc-infra/-/merge_requests/305 to be merged, and new AMIs based on that code to be built, made available and entered into the repo.config.template
  • Any old machines (persistent or not) will stop sending logs to graylog (and loki for that matter) once this PR is rolled out. Warm and hot buffers must be re-created. On persistent machines, /etc/docker/daemon.json must be updated and the docker daemon restarted (short downtime).
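
As a rough illustration of the daemon.json update mentioned above, here is a minimal Python sketch. It assumes the daemon ships logs through Docker's `fluentd` logging driver; the PR may use a different driver. The function name `with_fluentd_logging` and the address `localhost:24224` are hypothetical placeholders, not values taken from this PR.

```python
import json

# Assumption: placeholder address of the host-exposed fluentd port.
# 24224 is fluentd's default forward port; the real value may differ.
FLUENTD_ADDRESS = "localhost:24224"


def with_fluentd_logging(daemon_config: dict, address: str = FLUENTD_ADDRESS) -> dict:
    """Return a copy of a daemon.json dict that routes container logs to fluentd.

    Existing "log-opts" are replaced rather than merged, because options that
    belong to another logging driver are not valid for the fluentd driver.
    """
    updated = dict(daemon_config)
    updated["log-driver"] = "fluentd"
    updated["log-opts"] = {
        # log-opts values in daemon.json must be strings
        "fluentd-address": address,
        # connect in the background so containers can start while fluentd is down
        "fluentd-async": "true",
    }
    return updated


if __name__ == "__main__":
    existing = {"live-restore": True}
    print(json.dumps(with_fluentd_logging(existing), indent=2))
```

The daemon only picks up a changed daemon.json after a restart, which is where the short downtime on persistent machines comes from.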

Related issue/s

Related PR/s

Checklist

  • I tested and it works
  • Service has resource limits and reservations
  • Service has placement constraints or is global
  • Service is restartable
  • Service restart is zero-downtime
  • Service is monitored (via prometheus and grafana)
  • Service is not bound to one specific node (e.g. via files or volumes)
  • Relevant OPS E2E Test are added
  • Service's Public URL is included in maintenance mode
  • Service's Public URL is included in testing mode

mrnicegyu11 and others added 27 commits October 15, 2024 16:18
Merge remote-tracking branch 'upstream/main'
…oundation#979)

* Introduce longhorn chart

* Further longhorn configuration

* Longhorn: further settings configuration

* Fix longhorn configuration bugs

Extra: introduce longhorn pv vales for portainer

* Add comment for deletion longhorn

* Further longhorn configuration

* Add README.md for Longhorn wit FAQ

* Update Longhorn readme

* Update readme

* Futher LH configuration

* Update LH's Readme

* Update Longhorn Readme

* Improve LH's Readme

* LH: Reduce reserved default disk space to 5%

Since we use a dedicated disk for LH, we can go ahead with 5%

* Use values to set Longhorn storage class

* Update LH's Readme

* LH Readme: add requirements reference

* PR Review: bring back portainer s3 pv

* LH: decrease portinaer volume size
@mrnicegyu11 mrnicegyu11 self-assigned this Jul 7, 2025
@mrnicegyu11 mrnicegyu11 added observability alerting/monitoring EPIC labels Jul 7, 2025
Contributor

@bisgaard-itis bisgaard-itis left a comment

Thanks a lot for the effort.

Collaborator

@YuryHrytsuk YuryHrytsuk left a comment

Thanks!

Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment

Sorry, I haven't gone through it thoroughly; I approve to unblock you.

@mrnicegyu11
Member Author

mrnicegyu11 commented Aug 11, 2025

All good, I checked with SAN and he had some very reasonable feedback, I will do one more round of tests
Misplaced comment on the wrong PR; this will be merged after the upcoming prod release

@sanderegg
Member

Computational clusters and the impact of this PR on them have not been fully evaluated, please comment on this quickly @sanderegg. If it doesn't impact them at all, even better.

@mrnicegyu11 not entirely sure here but from computational clusters end what is important is:

Member

@sanderegg sanderegg left a comment

commented

Collaborator

@YuryHrytsuk YuryHrytsuk left a comment

What is the state of the unchecked items? Could you add a note next to each one describing its state?

Also, do I understand correctly that once this is merged and deployed to master, all logging workflows will be broken until the machines are reprovisioned? The same goes for AMIs, I presume. So, once merged, DevOps probably must be on top of it.

All the infra changes would need to be applied on PROD, so we need clear, easy-to-follow (hard to get wrong) instructions to ensure the staging / prod releases go smoothly (if it gets to this stage ✊ ) @mrnicegyu11

Thanks for all the effort!

@YuryHrytsuk
Collaborator

@mrnicegyu11 it is safer if it only gets to master and does not block future staging / prod releases (these stages will not be affected). Thanks 🙏

@mrnicegyu11
Member Author

mrnicegyu11 commented Aug 13, 2025

  • Host Port leads to non-zero-downtime, is it avoidable? It is not, as the docker daemon cannot send logs "directly into the swarm / into containers"
  • Update-Policy is something to think about
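
Because the daemon can only reach fluentd through the host-exposed port, a simple reachability probe can help confirm that the port is back after a service update or daemon restart. The following is a minimal sketch using only the Python standard library. The helper name `port_is_reachable` is hypothetical, and port 24224 (fluentd's default forward port) is an assumption; the port configured in this PR may differ.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and unreachable hosts
        return False


if __name__ == "__main__":
    # Assumption: fluentd is exposed on the local host at its default forward port.
    print(port_is_reachable("127.0.0.1", 24224))
```

A successful TCP connect only shows that something is listening on the port; it does not prove that logs reach graylog or loki.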

@mrnicegyu11 mrnicegyu11 modified the milestones: Engage, Voyager Aug 13, 2025
@mrnicegyu11
Member Author

Re "Or to put it in other words, can it be only deployed in master or, as long as we merge this PR, will it go to all deployments with the next staging / prod release?": Sadly, it would propagate. We can revert it, but then the changes to this repo, ops-config and osparc-infra also need reverting, and so do the AMIs. So it is not impossible, but also not easy, to change back @YuryHrytsuk

@mrnicegyu11
Member Author

Re "But loki does need volumes. Am I missing something?": If loki should never lose any logs, it must be given a persistent volume. For now, I still consider graylog our main logging platform, and loki is to be tested. It is not yet clear whether loki will be able to replace graylog fully or whether it is only helpful in the context of grafana combined telemetry (logging, tracing, metrics).

@mrnicegyu11 mrnicegyu11 merged commit 1c70781 into ITISFoundation:main Aug 15, 2025
3 checks passed
@mrnicegyu11
Member Author

Required fix: bf6719e
