[AWS Master] Kubernetes: add logging stack #1063

YuryHrytsuk · 2025-05-23T11:17:37Z

What do these changes do?

Add Highly Available Logging Stack to Kubernetes (scraping, storing and querying)

Minor:

Fix typo in Longhorn's README
Reuse EBS storage class name (avoid hardcode)

Technology choice

Logging Backend / Frontend (read more in comments) --> Victoria Logs
Logging shipper (read more in comments) --> vector.dev

Next steps

We can use Victoria Logs Datasource to visualize and query logs in Grafana https://github.com/VictoriaMetrics/victorialogs-datasource
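Before the Grafana datasource is wired in, logs can also be queried directly over HTTP. The snippet below is a minimal sketch, assuming the VictoriaLogs `/select/logsql/query` endpoint with its `query` and `limit` arguments and a response body that carries one JSON object per line. The helper names and the service address in the usage note are hypothetical and not part of this PR.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, logsql: str, limit: int = 10) -> str:
    """Build a URL for the VictoriaLogs /select/logsql/query endpoint."""
    params = urlencode({"query": logsql, "limit": limit})
    return f"{base_url.rstrip('/')}/select/logsql/query?{params}"


def parse_json_lines(body: str) -> list[dict]:
    """Parse a response body that carries one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

To try it against a running instance, pass the result of `build_query_url("http://victoria-logs:9428", "error", limit=5)` to `urllib.request.urlopen` and feed the decoded body to `parse_json_lines`. The host name `victoria-logs` is a placeholder for whatever service name the chart creates.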

Related issue/s

Kubernetes: add logging system #1057

Related PR/s

configuration https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/1418

Checklist

I tested and it works

Service has resource limits and reservations
Service has placement constraints or is global
Service is restartable
Service restart is zero-downtime
~~Service is monitored (via prometheus and grafana)~~ not applicable atm
Service is not bound to one specific node (e.g. via files or volumes)
Relevant OPS E2E Test are added
~~Service's Public URL is included in maintenance mode~~ not applicable atm
~~Service's Public URL is included in testing mode~~ not applicable atm

YuryHrytsuk · 2025-05-23T12:15:52Z

Logging Stack Research

Candidates:

~~ELK/EFK~~ --> the only stack that does not use PVs (except ES itself which can be managed with operator). Too complicated to manage
~~Grafana Loki~~ --> easy to manage + stores Logs in S3 (but full text search can be an issue) --> nogo since it uses PVs
VictoriaLogs --> stores logs in PVs (does not support real databases to store data). But data can be replicated natively. The developers describe VL as resource-efficient and easy to manage. It also seems to be rising quickly in popularity (already popular atm)

Useful links:

Victoria Logs

no object storage (e.g. S3) support: "Object Storage" VictoriaMetrics/VictoriaMetrics#38
how to deal with logs storage https://www.reddit.com/r/VictoriaMetrics/comments/1kv3sj6/vl_on_kubernetes_how_do_you_deal_with_logs/
people mention lack of object storage https://www.reddit.com/r/kubernetes/comments/1i1qv77/hello_victoriametrics_and_logs/
Application Level data replication (workaround issue with local PVs): "VictoriaLogs application-level replication" VictoriaMetrics/VictoriaLogs#166
- Related: "Victoria Logs (single-node instance) High Availability with Helm" VictoriaMetrics/VictoriaLogs#33

Loki

https://www.reddit.com/r/grafana/comments/14vz8wa/is_loki_the_right_choice/
Loki configured with S3 still requires PVs https://www.reddit.com/r/grafana/comments/1kvsgro/loki_with_s3_still_needs_pvcs_pvs_really/ NOGO (robustly managing PVs cross-region for critical services is a task YH does not find worth the effort)
https://www.reddit.com/r/grafana/comments/vmv0li/confused_with_loki_kubernetes_deployment/

ELK

https://www.reddit.com/r/devops/comments/qt6isb/is_elk_stack_really_worth_it/

ELK vs Loki vs VictoriaLogs:

Summary
I tried to deploy ELK with a helm chart (Elastic Cloud on Kubernetes). It was a painful experience. First, our tiny kubernetes cluster died because the ELK stack ate up all the memory of a manager node and nothing worked. Once I had dealt with the resource issues, I kept spending a lot of time trying to understand how to configure the various parts of the stack (it was actually Elastic + Vector + Kibana). Kibana's GUI looked too complicated for our just-logs use case. All in all, ELK was, as expected, hard to bring up and manage. It should support data replication and HA natively, which is still nice and important.

I spent some time configuring Loki until I realized that Loki (even when configured to use S3) still uses PVs (https://www.reddit.com/r/grafana/comments/1kvsgro/loki_with_s3_still_needs_pvcs_pvs_really/). I expect these PVs to be high-ops, so we cannot use on-prem distributed storage for them since our networking is slow. I stopped considering Loki at this point since I don't see how to make these Persistent Volumes highly available. Maybe there is a way, but I stopped researching further.

I managed to deploy the Victoria Logs helm chart in 5 minutes. It had kubernetes logs sent via vector out of the box. It couldn't have been easier. I then spent 2 days trying to understand how to make it highly available (VictoriaMetrics/VictoriaLogs#33) and faced a few bugs / questions wrt. its helm charts (VictoriaMetrics/helm-charts#2214, VictoriaMetrics/helm-charts#2219). All in all, it seems to be a good solution that works out of the box, can be HA, is easy to maintain and is powerful (enough for us). Helm charts are still WIP (as far as I can see) but I am fine with this (after my experience with ELK helm charts).
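The application-level replication idea referenced above (the shipper sends the same logs to several independent instances, and reads are served by any instance that is up) can be illustrated with a small sketch. This is only a toy model under stated assumptions: `Replica`, `replicate_write` and `read_from_any` are hypothetical in-memory stand-ins, not the VictoriaLogs or Vector API, and a real shipper would buffer and retry instead of skipping a failed replica.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """In-memory stand-in for one independent log store instance."""
    name: str
    healthy: bool = True
    logs: list[str] = field(default_factory=list)

    def write(self, batch: list[str]) -> None:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.logs.extend(batch)


def replicate_write(replicas: list[Replica], batch: list[str]) -> list[str]:
    """Send the same batch to every replica; return the names that accepted it."""
    accepted = []
    for replica in replicas:
        try:
            replica.write(batch)
            accepted.append(replica.name)
        except ConnectionError:
            # A real shipper would buffer and retry; this sketch just skips.
            continue
    return accepted


def read_from_any(replicas: list[Replica]) -> list[str]:
    """Serve a query from the first healthy replica."""
    for replica in replicas:
        if replica.healthy:
            return list(replica.logs)
    raise RuntimeError("no healthy replica available")
```

With two replicas where one is down, `replicate_write` still lands the batch on the healthy one and `read_from_any` serves it from there, which is the property that makes local PVs tolerable in this setup.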

YuryHrytsuk · 2025-05-23T12:37:22Z

Log Shipper Research

Candidates:

~~Grafana Alloy~~ --> people complain about documentation
~~Fluent bit~~ --> lightweight version of Fluentd
~~Fluentd~~ in favor of vector
~~Promtail~~ --> deprecated
Vector --> good community feedback, people remark its performance + included by default in Victoria Logs helm chart

Useful links:

Fluent bit vs Fluentd:

https://www.reddit.com/r/kubernetes/comments/lyq387/testing_fluentbit_vs_fluentd/

Fluent bit:

official plugin for loki https://docs.fluentbit.io/manual/pipeline/outputs/loki

Grafana alloy:

people complain about documentation quality https://www.reddit.com/r/grafana/comments/1ix5rb5/alloy_documentation/

Vector:

https://www.reddit.com/r/kubernetes/comments/1kv42qk/what_is_your_experience_with_vectordev_for/
Is it production ready? Who's using Vector in production? vectordotdev/vector#790 --> yes 🤷‍♂️

Fluent bit vs Vector:

https://www.reddit.com/r/kubernetes/comments/1cuzuy4/comment/l4m4qiq/

valyala · 2025-05-28T17:07:44Z

> ELK/EFK --> the only stack that does not use PVs (except ES itself which can be managed with operator)

ELK stores all the ingested logs in PVs, and allows moving the historical data to object storage via snapshots - https://www.elastic.co/docs/reference/elasticsearch/index-lifecycle-actions/ilm-searchable-snapshot .

> Grafana Loki --> easy to manage + stores Logs in S3 (but full text search can be an issue)

Loki is very hard to manage compared to VictoriaLogs, since it consists of many interconnected micro-services with very complex configs, which are mostly undocumented and tend to break with every new release. See https://grafana.com/docs/loki/latest/get-started/architecture/ .

> VictoriaLogs --> stores logs in PVs (does not support real databases to store data)

VictoriaLogs stores data in a built-in database optimized for typical logs' workloads. ELK and Loki do exactly the same: they store data in built-in databases. The difference is that VictoriaLogs uses a more efficient database format, which needs less disk space, RAM and CPU compared to ELK and Loki. See https://itnext.io/how-do-open-source-solutions-for-logs-work-elasticsearch-loki-and-victorialogs-9f7097ecbc2f and https://itnext.io/why-victorialogs-is-a-better-alternative-to-grafana-loki-7e941567c4d5 for details.

A better approach to selecting a logging solution is to configure and run multiple candidates on your particular production workload and then choose the one that performs best on it.

I recommend evaluating the official Helm charts for every tested solution:

YuryHrytsuk · 2025-05-29T12:30:47Z

> > ELK/EFK --> the only stack that does not use PVs (except ES itself which can be managed with operator)
>
> ELK stores all the ingested logs in PVs, and allows moving the historical data to object storage via snapshots - https://www.elastic.co/docs/reference/elasticsearch/index-lifecycle-actions/ilm-searchable-snapshot .

Hi @valyala and thank you for your feedback. I have a question, however.

When you say that ELK stores logs in PVs, you mean that it is Elasticsearch that uses PVs, right? I am not aware of anything else using PVs in this stack (except Elasticsearch) 🤔

matusdrobuliak66

Thanks! nice work

sanderegg

nice work.
so I understand you want to replace Graylog with this stack.
Are we also getting all the same logging facilities that we have in Graylog?
at least from what I know:

the docker engine
the machines syslogs

mrnicegyu11

thanks a lot for this! if you gain more experience with configuring vector, talk to me in case you think it is "better" than fluentd for the docker-swarm usecase ;)

very nice indeed

mrnicegyu11 · 2025-06-04T09:04:53Z

> nice work. so I understand you want to replace Graylog with this stack. Are we also getting all the same logging facilities that we have in Graylog? at least from what I know:
>
> * the docker engine
> * the machines syslogs

this is the full list actually.
I will be curious to see if we can get alerts or so as well.
As I understood it (confirm or deny @YuryHrytsuk ), this is a step to have any, albeit robust and production ready, logging stack on k8s. Via vector, which is a competitor to fluentd as I read it, we could I guess add graylog additionally in case we really need it (I personally say good riddance graylog)

YuryHrytsuk · 2025-06-04T09:58:52Z

> nice work. so I understand you want to replace Graylog with this stack. Are we also getting all the same logging facilities that we have in Graylog? at least from what I know:
>
> * the docker engine
> * the machines syslogs

Machine syslogs: yes --> https://vector.dev/docs/reference/configuration/sources/syslog/

Docker engine (@sanderegg do you mean docker logs?) --> https://vector.dev/docs/reference/configuration/sources/docker_logs/

See all sources we can scrape https://vector.dev/docs/reference/configuration/sources/
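To make the two sources above concrete, here is a minimal sketch that builds a Vector configuration as a Python dict and prints it as JSON (one of the config formats Vector accepts). The `syslog` and `docker_logs` source types come from the linked docs. Everything else is an assumption for illustration: the component names, the listen address, and the choice of an `elasticsearch` sink pointed at a VictoriaLogs `/insert/elasticsearch/` endpoint. The chart deployed in this PR may wire Vector differently.

```python
import json


def build_vector_config(victoria_logs_url: str) -> dict:
    """Return a Vector configuration with syslog and docker_logs sources."""
    endpoint = f"{victoria_logs_url.rstrip('/')}/insert/elasticsearch/"
    return {
        "sources": {
            # Hypothetical names; the address and mode are placeholders.
            "host_syslog": {
                "type": "syslog",
                "mode": "udp",
                "address": "0.0.0.0:514",
            },
            "docker": {"type": "docker_logs"},
        },
        "sinks": {
            # Assumed sink wiring, not taken from this PR.
            "victoria_logs": {
                "type": "elasticsearch",
                "inputs": ["host_syslog", "docker"],
                "endpoints": [endpoint],
                "mode": "bulk",
                "api_version": "v8",
                "healthcheck": {"enabled": False},
            }
        },
    }


print(json.dumps(build_vector_config("http://victoria-logs:9428"), indent=2))
```

Adding Graylog or any other destination later would be a matter of declaring another sink with the same `inputs`, which is the flexibility discussed in this thread.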

YuryHrytsuk · 2025-06-04T10:40:51Z

> As I understood it (confirm or deny @YuryHrytsuk ), this is a step to have any, albeit robust and production ready, logging stack on k8s.

Robust, production ready and Highly Available, but also powerful and flexible. "any" doesn't fit well here.

> as I read it, we could I guess add graylog additionally in case we really need it

I believe so, yes: vectordotdev/vector#4868

> (I personally say good riddance graylog)

I agree. I don't see reasons atm why we would need graylog.

Add dir for grafana loki chart

c22636f

YuryHrytsuk added this to the Bazinga! milestone May 23, 2025

YuryHrytsuk self-assigned this May 23, 2025

YuryHrytsuk changed the title ~~Add grafana loki for Kubernetes Logging~~ Kubernetes : add grafana loki for Logging May 23, 2025

YuryHrytsuk changed the title ~~Kubernetes : add grafana loki for Logging~~ Kubernetes: add grafana loki for logging May 23, 2025

Switch from loki to ELK stack

ce9d5c7

YuryHrytsuk changed the title ~~Kubernetes: add grafana loki for logging~~ [AWS Master] Kubernetes: add logging May 28, 2025

YuryHrytsuk mentioned this pull request May 23, 2025

Kubernetes: add logging system #1057

Closed

1 task

Longhorn Readme. Fix typo

af0848f

YuryHrytsuk added 8 commits May 31, 2025 12:06

Further configuration

2660bc3

Final draft ELK configuration

092a830

add victoria logs

e06f730

Introduce victoria logs and auth

b990738

vl ha configuration with 2 charts + 1 vmauth chart

76547be

Converge on a single replicated victoria logs chart

85e06c5

Remove elastic stack chart

2551c61

Remove vector chart (already included in victora logs chart)

6047972

YuryHrytsuk marked this pull request as ready for review June 4, 2025 08:20

YuryHrytsuk requested a review from mrnicegyu11 as a code owner June 4, 2025 08:20

Merge remote-tracking branch 'upstream/main' into add-kubernetes-logg…

48fc459

…ing-stack

YuryHrytsuk requested review from matusdrobuliak66 and sanderegg June 4, 2025 08:20

YuryHrytsuk changed the title ~~[AWS Master] Kubernetes: add logging~~ [AWS Master] Kubernetes: add logging stack Jun 4, 2025

matusdrobuliak66 approved these changes Jun 4, 2025

View reviewed changes

sanderegg approved these changes Jun 4, 2025

View reviewed changes

mrnicegyu11 approved these changes Jun 4, 2025

View reviewed changes

Fix gui trailing slash issue with a href

c0334f1

YuryHrytsuk merged commit 55597cf into ITISFoundation:main Jun 4, 2025
3 checks passed

YuryHrytsuk deleted the add-kubernetes-logging-stack branch June 4, 2025 10:42

YuryHrytsuk mentioned this pull request Jun 4, 2025

Document k8s pv removal #1067

Merged

1 task
