Description
Describe the bug
Hello! We have a problem with our InfluxDB 2 OSS instance, which sometimes (for unknown reasons) becomes unresponsive and dies.
On the Icinga side, the InfluxdbWriter runs into an exception. After this exception, the number of work queue items grows without bound. We then have only a short time window to detect this and restart the affected InfluxDB daemon with an event command, because once the number of an InfluxdbWriter's work queue items hits 10,000,000 or so (there seems to be an internal hard limit), the Icinga master becomes unresponsive and needs to be restarted afterwards.
[2025-11-07 07:13:49 +0100] warning/InfluxdbWriter: Unexpected response code: Internal Server Error
...
[2025-11-07 07:22:50 +0100] warning/InfluxdbWriter: Unexpected response code: Internal Server Error
[2025-11-07 07:22:57 +0100] warning/InfluxdbWriter: Failed to parse HTTP response from host '<influx_ip>' port '8086': Error: end of stream
Stacktrace:
0# __cxa_throw in /usr/lib64/icinga2/sbin/icinga2
1# void boost::throw_exception<boost::exception_detail::error_info_injector<boost::system::system_error> >(boost::exception_detail::error_info_injector<boost::system::system_error> const&) in /usr/lib64/icinga2/sbin/icinga2
2# void boost::exception_detail::throw_exception_<boost::system::system_error>(boost::system::system_error const&, char const*, char const*, int) in /usr/lib64/icinga2/sbin/icinga2
3# unsigned long boost::beast::http::read<icinga::Shared<boost::asio::buffered_stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp> > >, boost::beast::basic_flat_buffer<std::allocator<char> >, false, boost::beast::http::parser<false, boost::beast::http::basic_string_body<char, std::char_traits<char>, std::allocator<char> >, std::allocator<char> > >(icinga::Shared<boost::asio::buffered_stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp> > >&, boost::beast::basic_flat_buffer<std::allocator<char> >&, boost::beast::http::basic_parser<false, boost::beast::http::parser<false, boost::beast::http::basic_string_body<char, std::char_traits<char>, std::allocator<char> >, std::allocator<char> > >&) in /usr/lib64/icinga2/sbin/icinga2
4# icinga::InfluxdbCommonWriter::FlushWQ() in /usr/lib64/icinga2/sbin/icinga2
5# icinga::InfluxdbCommonWriter::FlushTimeoutWQ() in /usr/lib64/icinga2/sbin/icinga2
6# icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) in /usr/lib64/icinga2/sbin/icinga2
7# icinga::WorkQueue::WorkerThreadProc() in /usr/lib64/icinga2/sbin/icinga2
8# 0x00007F0BF49255E1 in /lib64/libboost_thread.so.1.66.0
9# 0x00007F0BF44F31CA in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6
[2025-11-07 07:22:57 +0100] critical/InfluxdbWriter: Exception during InfluxDB operation: Verify that your backend is operational!
[2025-11-07 07:22:57 +0100] warning/InfluxdbWriter: Can't connect to InfluxDB on host '<influx_ip>' port '8086'.
...
[2025-11-07 07:23:22 +0100] warning/InfluxdbWriter: Flush failed, cannot connect to InfluxDB: Error: connect: Connection refused
To Reproduce
- Have an unresponsive target for the InfluxdbWriter
- Feed it continuously with perfdata
- Wait until the work queue hits 10,000,000 items
Expected behavior
The InfluxdbWriter's work queue should not kill the whole monitoring system. Maybe gracefully dropping items would be better in this case.
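To illustrate what "graceful dropping" could mean, here is a minimal Python sketch of a bounded queue that discards its oldest entries once a fixed limit is reached and counts how many were lost. This is not how Icinga 2's WorkQueue is implemented; the class name `DroppingQueue` and its methods are hypothetical and exist only for this illustration.

```python
from collections import deque


class DroppingQueue:
    """Hypothetical bounded queue: drops the oldest item when full."""

    def __init__(self, max_items):
        if max_items <= 0:
            raise ValueError("max_items must be positive")
        self._items = deque()
        self._max_items = max_items
        self.dropped = 0  # number of items discarded so far

    def push(self, item):
        # When the limit is reached, discard the oldest entry instead of
        # letting the queue (and memory usage) grow without bound.
        if len(self._items) >= self._max_items:
            self._items.popleft()
            self.dropped += 1
        self._items.append(item)

    def drain(self):
        # Hand over everything that is still queued and empty the queue.
        items = list(self._items)
        self._items.clear()
        return items

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    queue = DroppingQueue(max_items=3)
    for point in range(5):
        queue.push(point)
    print(len(queue), queue.dropped)
```

With such a policy the newest perfdata survives an outage of the backend, at the cost of losing the oldest points; whether dropping the oldest or the newest items is preferable would be a design decision for the maintainers.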
Screenshots
To prevent our InfluxDB (and, as a side effect, Icinga) from dying, we check the work queues of both InfluxdbWriters via Icinga's API with a critical threshold of 4,000,000 and restart the corresponding InfluxDB with an event command, until a solution is found on the InfluxDB side.
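For anyone who wants to replicate this workaround, the threshold check could look roughly like the Python sketch below. It only evaluates an already-fetched status document. The response layout (`results[].status.influxdbwriter.<name>.work_queue_items`) is an assumption about what `/v1/status/InfluxdbWriter` returns and should be verified against your own instance; the function name `overloaded_writers` and the sample data are hypothetical.

```python
from __future__ import annotations

CRITICAL_THRESHOLD = 4_000_000  # threshold used in our check


def overloaded_writers(status_response: dict,
                       threshold: int = CRITICAL_THRESHOLD) -> list[str]:
    """Return the names of writers whose work queue is at or above threshold.

    Assumes the layout
    results[].status.influxdbwriter.<name>.work_queue_items
    (an assumption, verify it against your own API output).
    """
    names = []
    for result in status_response.get("results", []):
        writers = result.get("status", {}).get("influxdbwriter", {})
        for name, stats in writers.items():
            if stats.get("work_queue_items", 0) >= threshold:
                names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # Made-up sample data for illustration, not real output.
    sample = {
        "results": [
            {
                "name": "InfluxdbWriter",
                "status": {
                    "influxdbwriter": {
                        "influxdb01": {"work_queue_items": 4_500_000},
                        "influxdb02": {"work_queue_items": 12},
                    }
                },
            }
        ]
    }
    print(overloaded_writers(sample))
```

A check plugin built around this would exit with a critical state when the returned list is non-empty, which in turn triggers the event command that restarts the matching InfluxDB.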
Your Environment
Include as many relevant details about the environment you experienced the problem in
- Version used (icinga2 --version): 2.15.1
- Operating System and version: rhel 8.10
- Enabled features (icinga2 feature list): Enabled features: api checker icingadb influxdb mainlog notification
- Icinga Web 2 version and modules (System - About): 2.12.5
- Config validation (icinga2 daemon -C):
[2025-11-11 13:45:47 +0100] information/cli: Icinga application loader (version: r2.15.1-1)
[2025-11-11 13:45:47 +0100] information/cli: Loading configuration file(s).
[2025-11-11 13:45:52 +0100] information/ConfigItem: Committing config item(s).
[2025-11-11 13:45:52 +0100] information/ApiListener: My API identity: <icinga_master_02>
[2025-11-11 13:46:02 +0100] information/WorkQueue: #5 (DaemonUtility::LoadConfigFiles) items: 192, rate: 54.4/s (3264/min 3264/5min 3264/15min);
[2025-11-11 13:46:02 +0100] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:02 +0100] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:03 +0100] information/WorkQueue: #10 (InfluxdbWriter, influxdb02) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:03 +0100] information/WorkQueue: #9 (InfluxdbWriter, influxdb01) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 2 InfluxdbWriters.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 409 HostGroups.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 3 Users.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 7 TimePeriods.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 10 ServiceGroups.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 222680 Services.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 496 ScheduledDowntimes.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 6354 Zones.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 295581 Notifications.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 2 NotificationCommands.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 36526 Hosts.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 3 EventCommands.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1908 Downtimes.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 6360 Endpoints.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 14 Comments.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 14 ApiUsers.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 463 CheckCommands.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 IcingaDB.
[2025-11-11 13:46:24 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2025-11-11 13:46:24 +0100] information/cli: Finished validating the configuration file(s).
Additional context
Add any other context about the problem here.