
@YuryHrytsuk
Collaborator

@YuryHrytsuk YuryHrytsuk commented May 23, 2025

What do these changes do?

Add Highly Available Logging Stack to Kubernetes (scraping, storing and querying)

image

Minor:

  • Fix typo in Longhorn's README
  • Reuse EBS storage class name (avoid hardcode)

Technology choice

Logging Backend / Frontend (read more in comments) --> Victoria Logs
Logging shipper (read more in comments) --> vector.dev

Next steps

We can use Victoria Logs Datasource to visualize and query logs in Grafana https://github.com/VictoriaMetrics/victorialogs-datasource
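Until the Grafana datasource is wired up, logs can also be queried over HTTP. Below is a minimal sketch, assuming VictoriaLogs is reachable on its default port 9428 and exposes the `/select/logsql/query` endpoint, which streams one JSON object per line. The base URL, the helper names `build_query_url` and `parse_json_lines`, and the sample LogsQL query are illustrative assumptions, not part of this PR.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, logsql: str, limit: int = 10) -> str:
    """Build a URL for the VictoriaLogs LogsQL query endpoint."""
    # urlencode escapes the query so characters like ':' and spaces are safe.
    params = urlencode({"query": logsql, "limit": limit})
    return f"{base_url.rstrip('/')}/select/logsql/query?{params}"


def parse_json_lines(body: str) -> list[dict]:
    """Parse a JSON-lines response body into a list of log entries."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Assumed in-cluster address; replace with the real service URL.
    print(build_query_url("http://localhost:9428", "_time:5m error", limit=5))
```

The returned URL can be fetched with any HTTP client (for example `curl`), and the response body passed to `parse_json_lines`.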

Related issue/s

Related PR/s

Checklist

  • I tested and it works
  • Service has resource limits and reservations
  • Service has placement constraints or is global
  • Service is restartable
  • Service restart is zero-downtime
  • Service is monitored (via prometheus and grafana) (not applicable atm)
  • Service is not bound to one specific node (e.g. via files or volumes)
  • Relevant OPS E2E Test are added
  • Service's Public URL is included in maintenance mode (not applicable atm)
  • Service's Public URL is included in testing mode (not applicable atm)

@YuryHrytsuk YuryHrytsuk added this to the Bazinga! milestone May 23, 2025
@YuryHrytsuk YuryHrytsuk self-assigned this May 23, 2025
@YuryHrytsuk YuryHrytsuk changed the title Add grafana loki for Kubernetes Logging Kubernetes : add grafana loki for Logging May 23, 2025
@YuryHrytsuk YuryHrytsuk changed the title Kubernetes : add grafana loki for Logging Kubernetes: add grafana loki for logging May 23, 2025
@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented May 23, 2025

Logging Stack Research

Candidates:

  • ELK/EFK --> the only stack that does not use PVs (except ES itself which can be managed with operator). Too complicated to manage
  • Grafana Loki --> easy to manage + stores Logs in S3 (but full text search can be an issue) --> no-go since it still uses PVs
  • VictoriaLogs --> stores logs in PVs (does not support real databases to store data). But data can be replicated natively. The developers state that VL is resource-efficient and easy to manage. It also seems to be rising quickly in popularity (it is already popular atm)

Useful links:

Victoria Logs

Loki

ELK

ELK vs Loki vs VictoriaLogs:

Summary
I tried to deploy ELK with a helm chart (Elastic Cloud on Kubernetes). It was a painful experience. First, our tiny kubernetes cluster died because the ELK stack ate up all the memory of a manager node and nothing worked. Once I had dealt with the resource issues, I kept spending a lot of time trying to understand how to configure the various parts of the ELK stack (it was actually Elastic + Vector + Kibana). Kibana's GUI looked too complicated for our just-logs use case. All in all, ELK was, as expected, hard to bring up and manage. It presumably supports data replication and HA natively, which is still nice and important.

I spent some time configuring Loki until I realized that Loki (even when configured to use S3) still uses PVs (https://www.reddit.com/r/grafana/comments/1kvsgro/loki_with_s3_still_needs_pvcs_pvs_really/). I expect these PVs to be I/O-intensive, so we cannot use on-prem distributed storage for them since our networking is slow. I stopped considering Loki at this point since I don't see a way to make these Persistent Volumes highly available. Maybe there is a way; I just stopped researching further.

I managed to deploy the Victoria Logs helm chart in 5 minutes. It had kubernetes logs sent via vector out of the box. It couldn't have been easier. I then spent 2 days trying to understand how to make it highly available (VictoriaMetrics/VictoriaLogs#33) and faced a few bugs / questions wrt. its helm charts (VictoriaMetrics/helm-charts#2214, VictoriaMetrics/helm-charts#2219). All in all, it seems to be a good solution that works out of the box, can be HA, is easy to maintain and is powerful (enough for us). The helm charts are still WIP (as far as I can see) but I am fine with this (after my experience with the ELK helm charts).

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented May 23, 2025

Log Shipper Research

Candidates:

  • Grafana Alloy --> people complain about documentation
  • Fluent bit --> lightweight version of Fluentd
  • Fluentd --> dropped in favor of Vector
  • Promtail --> deprecated
  • Vector --> good community feedback, people remark its performance + included by default in Victoria Logs helm chart

Useful links:

Fluent bit vs Fluentd:

Fluent bit:

Grafana alloy:

Vector:

Fluent bit vs Vector:

@YuryHrytsuk YuryHrytsuk changed the title Kubernetes: add grafana loki for logging [AWS Master] Kubernetes: add logging May 28, 2025
@YuryHrytsuk YuryHrytsuk mentioned this pull request May 23, 2025
1 task
@valyala

valyala commented May 28, 2025

ELK/EFK --> the only stack that does not use PVs (except ES itself which can be managed with operator)

ELK stores all the ingested logs in PVs, and allows moving the historical data to object storage via snapshots - https://www.elastic.co/docs/reference/elasticsearch/index-lifecycle-actions/ilm-searchable-snapshot .

Grafana Loki --> easy to manage + stores Logs in S3 (but full text search can be an issue)

Loki is very hard to manage compared to VictoriaLogs, since it consists of many interconnected micro-services with very complex configs, which are mostly undocumented and tend to break with every new release. See https://grafana.com/docs/loki/latest/get-started/architecture/ .

VictoriaLogs --> stores logs in PVs (does not support real databases to store data)

VictoriaLogs stores data in a built-in database optimized for typical log workloads. ELK and Loki do exactly the same: they store data in built-in databases. The difference is that VictoriaLogs uses a more efficient database format, which needs less disk space, RAM and CPU compared to ELK and Loki. See https://itnext.io/how-do-open-source-solutions-for-logs-work-elasticsearch-loki-and-victorialogs-9f7097ecbc2f and https://itnext.io/why-victorialogs-is-a-better-alternative-to-grafana-loki-7e941567c4d5 for details.

A better approach to selecting a logging solution is to configure and run multiple solutions on your particular production workload and then choose the one that performs best on it.

I recommend evaluating the official Helm charts for every tested solution:

@YuryHrytsuk
Collaborator Author

ELK/EFK --> the only stack that does not use PVs (except ES itself which can be managed with operator)

ELK stores all the ingested logs in PVs, and allows moving the historical data to object storage via snapshots - https://www.elastic.co/docs/reference/elasticsearch/index-lifecycle-actions/ilm-searchable-snapshot .

Hi @valyala and thank you for your feedback. I have a question however.

When you say that ELK stores logs in PVs, you mean that it is Elasticsearch that uses PVs, right? I am not aware of anything else in this stack using PVs (except Elasticsearch) 🤔

@YuryHrytsuk YuryHrytsuk marked this pull request as ready for review June 4, 2025 08:20
@YuryHrytsuk YuryHrytsuk requested a review from mrnicegyu11 as a code owner June 4, 2025 08:20
@YuryHrytsuk YuryHrytsuk changed the title [AWS Master] Kubernetes: add logging [AWS Master] Kubernetes: add logging stack Jun 4, 2025
Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment


Thanks! nice work

Member

@sanderegg sanderegg left a comment


nice work.
so I understand you want to replace Graylog with this stack.
Are we also getting all the same logging facilities that we have in Graylog?
at least from what I know:

  • the docker engine
  • the machines' syslogs

Member

@mrnicegyu11 mrnicegyu11 left a comment


thanks a lot for this! if you gain more experience with configuring vector, talk to me in case you think it is "better" than fluentd for the docker-swarm usecase ;)

very nice indeed

@mrnicegyu11
Member

mrnicegyu11 commented Jun 4, 2025

nice work. so I understand you want to replace Graylog with this stack. Are we also getting all the same logging facilities that we have in Graylog? at least from what I know:

* the docker engine

* the machines' syslogs

this is the full list actually.
I will be curious to see if we can get alerts or so as well.
As I understood it (confirm or deny @YuryHrytsuk ), this is a step to have any, albeit robust and production ready, logging stack on k8s. Via vector, which is a competitor to fluentd as I read it, we could I guess add graylog additionally in case we really need it (I personally say good riddance graylog)

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Jun 4, 2025

nice work. so I understand you want to replace Graylog with this stack. Are we also getting all the same logging facilities that we have in Graylog? at least from what I know:

* the docker engine

* the machines' syslogs

Machine syslog: yes --> https://vector.dev/docs/reference/configuration/sources/syslog/

Docker engine (@sanderegg do you mean docker logs?) --> https://vector.dev/docs/reference/configuration/sources/docker_logs/

See all sources we can scrape https://vector.dev/docs/reference/configuration/sources/
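A minimal sketch of how these two sources might be wired together, written as a Python script that prints a Vector configuration as JSON (Vector also accepts YAML and TOML). The component names `host_syslog`, `docker` and `victorialogs`, the listen address, and the VictoriaLogs URL are assumptions for illustration, not values from this PR. The sink options assume the Elasticsearch-compatible ingestion endpoint of VictoriaLogs and should be checked against the deployed versions.

```python
import json


def build_vector_config(victorialogs_url: str) -> dict:
    """Return a Vector config with syslog and docker_logs sources."""
    return {
        "sources": {
            # Hypothetical component name: receives syslog messages over UDP.
            "host_syslog": {
                "type": "syslog",
                "mode": "udp",
                "address": "0.0.0.0:514",
            },
            # Hypothetical component name: reads container logs
            # from the local Docker daemon.
            "docker": {"type": "docker_logs"},
        },
        "sinks": {
            # Assumed sink: ships both sources to VictoriaLogs through
            # its Elasticsearch-compatible bulk endpoint.
            "victorialogs": {
                "type": "elasticsearch",
                "inputs": ["host_syslog", "docker"],
                "endpoints": [victorialogs_url],
                "mode": "bulk",
                "api_version": "v8",
                "healthcheck": {"enabled": False},
                "query": {
                    "_msg_field": "message",
                    "_time_field": "timestamp",
                    "_stream_fields": "host,container_name",
                },
            },
        },
    }


if __name__ == "__main__":
    # Assumed in-cluster service URL; adjust to the actual deployment.
    config = build_vector_config("http://victorialogs:9428/insert/elasticsearch/")
    print(json.dumps(config, indent=2))
```

The printed JSON could be saved as `vector.json` and passed to Vector with `--config`; on Kubernetes the equivalent settings would normally live in the helm chart values instead.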

@YuryHrytsuk
Collaborator Author

YuryHrytsuk commented Jun 4, 2025

As I understood it (confirm or deny @YuryHrytsuk ), this is a step to have any, albeit robust and production ready, logging stack on k8s.

Robust, production ready and Highly Available, but also powerful and flexible. "any" doesn't fit well here.

as I read it, we could I guess add graylog additionally in case we really need it

I believe so, yes: vectordotdev/vector#4868

(I personally say good riddance graylog)

I agree. I don't see reasons atm why we would need graylog.

@YuryHrytsuk YuryHrytsuk merged commit 55597cf into ITISFoundation:main Jun 4, 2025
3 checks passed
@YuryHrytsuk YuryHrytsuk deleted the add-kubernetes-logging-stack branch June 4, 2025 10:42
@YuryHrytsuk YuryHrytsuk mentioned this pull request Jun 4, 2025
1 task
5 participants