docs: add comprehensive troubleshooting section to README #4711
Open
ABHISHEK-DBZ wants to merge 3 commits into prometheus:main from ABHISHEK-DBZ:docs/add-troubleshooting-section
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -404,6 +404,91 @@ alerting: | |
|
|
||
| If running Alertmanager in high availability mode is not desired, setting `--cluster.listen-address=` prevents Alertmanager from listening to incoming peer requests. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Common Issues and Solutions | ||
|
|
||
| #### Cluster peers not connecting | ||
|
|
||
| **Symptoms:** Alertmanager instances cannot discover each other in cluster mode. | ||
|
|
||
| **Solutions:** | ||
| - Verify that the port in `--cluster.listen-address` (default: `0.0.0.0:9094`) is open for both UDP and TCP | ||
| - Check firewall rules and ensure the clustering port is whitelisted for both protocols | ||
| - Verify `--cluster.advertise-address` is set correctly and reachable from other peers | ||
| - Use `--cluster.peer` flag to explicitly specify initial peers | ||
| - Check logs for DNS resolution errors, especially if using hostnames | ||
| - Increase `--cluster.peers-resolve-timeout` if DNS lookups are slow (default: 15s) | ||
|
|
||
| Example of correct cluster setup: | ||
| ```bash | ||
| # Node 1 | ||
| ./alertmanager --cluster.listen-address=0.0.0.0:9094 \ | ||
| --cluster.advertise-address=192.168.1.10:9094 \ | ||
| --cluster.peer=192.168.1.11:9094 | ||
|
|
||
| # Node 2 | ||
| ./alertmanager --cluster.listen-address=0.0.0.0:9094 \ | ||
| --cluster.advertise-address=192.168.1.11:9094 \ | ||
| --cluster.peer=192.168.1.10:9094 | ||
| ``` | ||
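To confirm that the peers actually joined, one option is to query the status endpoint on each node and count the peers it reports. The sketch below assumes the v2 API is served on the default web port 9093 and that the response lists peers under `cluster.peers`, each with an `address` field. `count_peers` is a hypothetical helper name, and counting string matches is a rough heuristic rather than real JSON parsing.

```bash
# Hypothetical helper: count peer entries in an /api/v2/status response read from stdin.
# Heuristic only: it counts occurrences of the "address" key instead of parsing JSON.
count_peers() {
  grep -o '"address"' | wc -l | tr -d ' '
}

# Example invocation (assumes Alertmanager listens on localhost:9093):
#   curl -fsS http://localhost:9093/api/v2/status | count_peers
```

If `jq` is installed, `jq '.cluster.peers | length'` is a more robust way to get the same number. A count lower than the expected cluster size suggests that the clustering port is not reachable from at least one node.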
|
|
||
| #### Alerts not being received | ||
|
|
||
| **Symptoms:** Prometheus is sending alerts but Alertmanager shows no alerts. | ||
|
|
||
| **Solutions:** | ||
| - Verify Alertmanager is reachable from Prometheus: `curl http://<alertmanager>:9093/-/healthy` | ||
| - Check Prometheus alerting configuration points to correct Alertmanager endpoints | ||
| - Review Prometheus logs for connection errors to Alertmanager | ||
| - Ensure alerts are actually firing in Prometheus: check `/alerts` page | ||
| - Verify that no firewall is blocking traffic between Prometheus and Alertmanager | ||
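To separate Prometheus-side problems from Alertmanager-side ones, it can help to push a test alert straight to the Alertmanager API and see whether it shows up in the UI. This is a minimal sketch that assumes the v2 API on the default port 9093. `build_test_alert` is a hypothetical helper, and the `TestAlert` name and `severity` label are placeholders.

```bash
# Hypothetical helper: print a minimal JSON payload for POST /api/v2/alerts.
# $1 = alertname, $2 = severity (label values must not contain double quotes).
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"}}]' "$1" "$2"
}

# Example invocation (assumes Alertmanager listens on localhost:9093):
#   curl -fsS -X POST -H 'Content-Type: application/json' \
#     -d "$(build_test_alert TestAlert warning)" \
#     http://localhost:9093/api/v2/alerts
```

If the test alert appears in Alertmanager but alerts from Prometheus do not, the problem is more likely in the Prometheus `alerting` configuration or in the network path between the two.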
|
|
||
| #### Notifications not being sent | ||
|
|
||
| **Symptoms:** Alerts appear in Alertmanager UI but notifications are not delivered. | ||
|
|
||
| **Solutions:** | ||
| - Check Alertmanager logs for errors related to notification delivery | ||
| - Verify receiver configuration in `alertmanager.yml` is correct | ||
| - Test receiver credentials and endpoints manually | ||
| - Check if alerts are being inhibited or silenced | ||
| - Verify routing configuration matches alert labels | ||
| - Use `amtool config routes test` to verify routing logic | ||
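As a sketch of the routing check mentioned above, the wrapper below passes a config file and a label set to `amtool config routes test`. `check_route` and the `AMTOOL` override variable are hypothetical names introduced here, and the labels in the example are placeholders.

```bash
# Hypothetical wrapper around `amtool config routes test`.
# $1 = path to the Alertmanager config, remaining args = label=value pairs.
# AMTOOL may be set to point at a non-default amtool binary.
check_route() {
  config_file="$1"
  shift
  "${AMTOOL:-amtool}" config routes test --config.file="$config_file" "$@"
}

# Example invocation:
#   check_route alertmanager.yml severity=critical team=database
```

The command should report the receiver that the given labels route to. If that is not the receiver you expected, review the matchers in the routing tree before looking at the receiver itself.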
|
|
||
| #### High memory usage | ||
|
|
||
| **Symptoms:** Alertmanager consuming excessive memory. | ||
|
|
||
| **Solutions:** | ||
| - Check for alert storms, which create a large number of unique alert groups | ||
| - Review `group_by` labels in routing configuration | ||
| - Consider using more specific grouping to reduce alert group count | ||
|
Collaborator
Would this better read "broader", since it sounds like if you go for more specific, you'll get more groups, not fewer?
||
| - Monitor notification log size and configure retention as needed | ||
| - Check for a large number of active silences | ||
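One way to put numbers on the checks above is to scrape Alertmanager's own `/metrics` endpoint and keep only the gauges for alerts, silences, and aggregation groups. The metric names in this sketch (`alertmanager_alerts`, `alertmanager_silences`, `alertmanager_dispatcher_aggregation_groups`) are assumptions based on recent releases, so verify them against your version's output. `memory_related_metrics` is a hypothetical helper name.

```bash
# Hypothetical helper: filter a /metrics scrape read from stdin down to the
# gauges for alerts, silences, and dispatcher aggregation groups.
memory_related_metrics() {
  grep -E '^alertmanager_(alerts|silences|dispatcher_aggregation_groups)[{ ]'
}

# Example invocation (assumes Alertmanager listens on localhost:9093):
#   curl -fsS http://localhost:9093/metrics | memory_related_metrics
```

A steadily growing aggregation group count usually points at the `group_by` configuration, while a large silence count points at silences that are created but never expired.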
|
|
||
| #### DNS resolution timeouts | ||
|
|
||
| **Symptoms:** Alertmanager becomes unresponsive, readiness checks fail. | ||
|
|
||
| **Solutions:** | ||
| - Increase `--cluster.peers-resolve-timeout` (default: 15s) | ||
| - Use IP addresses instead of hostnames in `--cluster.peer` flags | ||
| - Check DNS server responsiveness and network connectivity | ||
| - Review DNS resolution logs in Alertmanager output | ||
| - Consider using a local DNS cache | ||
|
|
||
| #### Configuration reload fails | ||
|
|
||
| **Symptoms:** Configuration changes don't take effect or Alertmanager fails to reload. | ||
|
|
||
| **Solutions:** | ||
| - Validate configuration before reload: `amtool check-config alertmanager.yml` | ||
| - Check Alertmanager logs for specific configuration errors | ||
| - Verify file permissions on configuration file | ||
| - Ensure template files referenced in config exist and are readable | ||
| - Send SIGHUP signal manually: `kill -HUP <alertmanager-pid>` | ||
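The validate-then-reload sequence above can be combined so that a broken configuration is never sent to a running instance. This sketch uses the HTTP `/-/reload` endpoint instead of SIGHUP and assumes the default web port 9093. `safe_reload` and the `AMTOOL` override variable are hypothetical names.

```bash
# Hypothetical helper: reload Alertmanager only if the config validates.
# $1 = path to the config file, $2 = Alertmanager base URL (optional).
safe_reload() {
  config_file="$1"
  am_url="${2:-http://localhost:9093}"
  # Refuse to reload when amtool reports a configuration error.
  if ! "${AMTOOL:-amtool}" check-config "$config_file"; then
    echo "configuration check failed, not reloading" >&2
    return 1
  fi
  # POST to /-/reload has the same effect as sending SIGHUP to the process.
  curl -fsS -X POST "$am_url/-/reload"
}

# Example invocation:
#   safe_reload alertmanager.yml http://localhost:9093
```

If the reload request itself fails, the Alertmanager logs should contain the specific error, and the previously loaded configuration stays in effect.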
|
|
||
| ## Contributing | ||
|
|
||
| Check the [Prometheus contributing page](https://github.com/prometheus/prometheus/blob/main/CONTRIBUTING.md). | ||
|
|
||
We could remove this line, which doesn't specify how to review them, and merge it with the one below.