work queue items grow infinite after influxdbwriter runs into an exception #10630

@K0nne

Describe the bug

Hello! We have a problem with our InfluxDB 2 OSS instance, which sometimes (for unknown reasons) becomes unresponsive and dies.

On the Icinga side, the InfluxdbWriter runs into an exception. After this exception, the number of work queue items grows without bound. We then have only a short time window to detect this and restart the affected InfluxDB daemon with an event command, because when the number of an InfluxdbWriter's work queue items hits 10,000,000 or so (there seems to be an internal hard limit), the Icinga master becomes unresponsive and needs to be restarted afterwards.

[2025-11-07 07:13:49 +0100] warning/InfluxdbWriter: Unexpected response code: Internal Server Error
...
[2025-11-07 07:22:50 +0100] warning/InfluxdbWriter: Unexpected response code: Internal Server Error
[2025-11-07 07:22:57 +0100] warning/InfluxdbWriter: Failed to parse HTTP response from host '<influx_ip>' port '8086': Error: end of stream
 
Stacktrace:
0# __cxa_throw in /usr/lib64/icinga2/sbin/icinga2
1# void boost::throw_exception<boost::exception_detail::error_info_injector<boost::system::system_error> >(boost::exception_detail::error_info_injector<boost::system::system_error> const&) in /usr/lib64/icinga2/sbin/icinga2
2# void boost::exception_detail::throw_exception_<boost::system::system_error>(boost::system::system_error const&, char const*, char const*, int) in /usr/lib64/icinga2/sbin/icinga2
3# unsigned long boost::beast::http::read<icinga::Shared<boost::asio::buffered_stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp> > >, boost::beast::basic_flat_buffer<std::allocator<char> >, false, boost::beast::http::parser<false, boost::beast::http::basic_string_body<char, std::char_traits<char>, std::allocator<char> >, std::allocator<char> > >(icinga::Shared<boost::asio::buffered_stream<boost::asio::basic_stream_socket<boost::asio::ip::tcp> > >&, boost::beast::basic_flat_buffer<std::allocator<char> >&, boost::beast::http::basic_parser<false, boost::beast::http::parser<false, boost::beast::http::basic_string_body<char, std::char_traits<char>, std::allocator<char> >, std::allocator<char> > >&) in /usr/lib64/icinga2/sbin/icinga2
4# icinga::InfluxdbCommonWriter::FlushWQ() in /usr/lib64/icinga2/sbin/icinga2
5# icinga::InfluxdbCommonWriter::FlushTimeoutWQ() in /usr/lib64/icinga2/sbin/icinga2
6# icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) in /usr/lib64/icinga2/sbin/icinga2
7# icinga::WorkQueue::WorkerThreadProc() in /usr/lib64/icinga2/sbin/icinga2
8# 0x00007F0BF49255E1 in /lib64/libboost_thread.so.1.66.0
9# 0x00007F0BF44F31CA in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6
[2025-11-07 07:22:57 +0100] critical/InfluxdbWriter: Exception during InfluxDB operation: Verify that your backend is operational!
[2025-11-07 07:22:57 +0100] warning/InfluxdbWriter: Can't connect to InfluxDB on host '<influx_ip>' port '8086'.
...
[2025-11-07 07:23:22 +0100] warning/InfluxdbWriter: Flush failed, cannot connect to InfluxDB: Error: connect: Connection refused

To Reproduce

  1. Have an unresponsive target for the InfluxdbWriter.
  2. Feed it continuously with perfdata.
  3. Wait until the queue hits 10,000,000 items.

Expected behavior

The InfluxdbWriter's work queue should not kill the whole monitoring system. Maybe gracefully dropping items would be better in this case.
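
To make "graceful dropping" concrete, here is a minimal sketch of a bounded queue that discards the oldest item once it is full and counts what it dropped. This is only an illustration of the idea in Python, not how Icinga's WorkQueue is implemented; the class name `DroppingQueue` and the `max_items` parameter are hypothetical.

```python
from collections import deque


class DroppingQueue:
    """Bounded FIFO that discards the oldest item once it is full.

    Hypothetical illustration only; not Icinga's WorkQueue.
    """

    def __init__(self, max_items):
        # deque(maxlen=...) evicts from the left when appending to a full deque.
        self._items = deque(maxlen=max_items)
        self.dropped = 0

    def put(self, item):
        # Count the eviction so the data loss stays visible to monitoring.
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(item)

    def drain(self):
        """Return all queued items in order and empty the queue."""
        items = list(self._items)
        self._items.clear()
        return items

    def __len__(self):
        return len(self._items)
```

With a limit in place, memory use stays bounded while the backend is down, at the cost of losing the oldest perfdata; the `dropped` counter is what a check could alert on instead of the queue length.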

Screenshots

To prevent our InfluxDB (and, as a side effect, Icinga) from dying, we check both InfluxdbWriter work queues via Icinga's API with a critical threshold of 4,000,000 and restart the corresponding InfluxDB with an event command until a solution is found on the InfluxDB side.
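
The threshold logic of that check can be sketched roughly as follows. This is a simplified illustration, not our actual check plugin: it only evaluates an already-decoded status document, and the JSON layout it expects (a `results` list whose entries carry `status.influxdbwriter.<name>.work_queue_items`) is an assumption about what the API's status query for InfluxdbWriter returns. The function name `overloaded_writers` is hypothetical.

```python
CRITICAL_ITEMS = 4_000_000  # critical threshold quoted above


def overloaded_writers(status, critical=CRITICAL_ITEMS):
    """Return {writer_name: items} for writers at or above the threshold.

    ASSUMPTION: `status` is a decoded JSON document shaped like
    {"results": [{"status": {"influxdbwriter":
        {"<name>": {"work_queue_items": <int>}}}}]}
    """
    hits = {}
    for result in status.get("results", []):
        writers = result.get("status", {}).get("influxdbwriter", {})
        for name, stats in writers.items():
            items = stats.get("work_queue_items", 0)
            if items >= critical:
                hits[name] = items
    return hits
```

For example, a document reporting 12 items for `influxdb01` and 4,500,000 for `influxdb02` would yield only `influxdb02`, which is the writer whose backend the event command would then restart.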

Your Environment

  • Version used (icinga2 --version): 2.15.1
  • Operating System and version: rhel 8.10
  • Enabled features (icinga2 feature list): Enabled features: api checker icingadb influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About): 2.12.5
  • Config validation (icinga2 daemon -C):
[2025-11-11 13:45:47 +0100] information/cli: Icinga application loader (version: r2.15.1-1)
[2025-11-11 13:45:47 +0100] information/cli: Loading configuration file(s).
[2025-11-11 13:45:52 +0100] information/ConfigItem: Committing config item(s).
[2025-11-11 13:45:52 +0100] information/ApiListener: My API identity: <icinga_master_02>
[2025-11-11 13:46:02 +0100] information/WorkQueue: #5 (DaemonUtility::LoadConfigFiles) items: 192, rate: 54.4/s (3264/min 3264/5min 3264/15min);
[2025-11-11 13:46:02 +0100] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:02 +0100] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:03 +0100] information/WorkQueue: #10 (InfluxdbWriter, influxdb02) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:03 +0100] information/WorkQueue: #9 (InfluxdbWriter, influxdb01) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 2 InfluxdbWriters.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 409 HostGroups.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 3 Users.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 7 TimePeriods.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 10 ServiceGroups.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 222680 Services.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 496 ScheduledDowntimes.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 6354 Zones.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 295581 Notifications.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 2 NotificationCommands.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 36526 Hosts.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 3 EventCommands.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1908 Downtimes.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 6360 Endpoints.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 14 Comments.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 14 ApiUsers.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 463 CheckCommands.
[2025-11-11 13:46:24 +0100] information/ConfigItem: Instantiated 1 IcingaDB.
[2025-11-11 13:46:24 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2025-11-11 13:46:24 +0100] information/cli: Finished validating the configuration file(s).
