Skip to content

Plugin does not recover after network issues #6

@ThomKoomen

Description

@ThomKoomen

When network issues occur while using td-agent with the fluent-plugin-beats plugin, it stops sending data, throwing the below exception(s) and stacktrace several times.
This is, ofcourse, expected behaviour, more so it is really helpful.
However, after these network issues (automatically) recover, the plugin does not resume sending data.
When trying to restart td-agent (service td-agent restart) it will hang indefinitely, requiring killing of the td-agent processes.
After eventually managing to get td-agent back up again, it will run without problems again.

td-agent log:

2017-10-24 00:34:16 +0000 [error]: unexpected error error="Resource temporarily unavailable"
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `sysread'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `read_socket'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:319:in `run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/fluent/plugin/in_beats.rb:97:in `block in run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `call'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `run_task'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (3 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `loop'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `block (2 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `catch'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `block in create_worker'
2017-10-24 00:34:16 +0000 [error]: unexpected error error="Bad file descriptor"
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `sysread'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:342:in `read_socket'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/lumberjack/beats/server.rb:319:in `run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-beats-0.1.2/lib/fluent/plugin/in_beats.rb:97:in `block in run'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `call'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:331:in `run_task'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:320:in `block (3 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `loop'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:303:in `block (2 levels) in create_worker'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `catch'
  2017-10-24 00:34:16 +0000 [error]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/concurrent-ruby-0.9.2/lib/concurrent/executor/ruby_thread_pool_executor.rb:302:in `block in create_worker'

metricbeat log

2017-10-24T02:34:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-fi
lesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1276 libb
eat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10
2017-10-24T02:34:36+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:36038->161.86.62.206:5044: i/o timeout
2017-10-24T02:34:36+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:36038->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:05+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=801 libbeat.logstash.published_but_not_acked_events=
10
2017-10-24T02:35:07+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54702->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:07+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54702->161.86.62.206:5044: i/o timeout
2017-10-24T02:35:13+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54922->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:13+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54922->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:23+02:00 ERR Failed to publish events caused by: read tcp 161.86.62.204:54992->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:23+02:00 INFO Error publishing events (retrying): read tcp 161.86.62.204:54992->161.86.62.206:5044: read: connection reset by peer
2017-10-24T02:35:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-fi
lesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.logstash.publish.read_bytes=24 libbeat
.logstash.publish.read_errors=3 libbeat.logstash.publish.write_bytes=2700 libbeat.logstash.published_and_acked_events=10 libbeat.logstash.published_but_not_acked_events=30 libbeat.outputs.messages_dropped=1 libbeat.publisher.messages_in_
worker_queues=10 libbeat.publisher.published_events=10
2017-10-24T02:36:05+02:00 INFO No non-zero metrics in the last 30s
2017-10-24T02:36:35+02:00 INFO Non-zero metrics in the last 30s: fetches.system-cpu.events=1 fetches.system-cpu.success=1 fetches.system-diskio.events=6 fetches.system-diskio.success=1 fetches.system-filesystem.events=2 fetches.system-fi
lesystem.success=1 fetches.system-memory.events=1 fetches.system-memory.success=1 fetches.system-network.events=2 fetches.system-network.success=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=60 libbeat
.logstash.publish.write_bytes=1566 libbeat.logstash.published_and_acked_events=10 libbeat.publisher.messages_in_worker_queues=10 libbeat.publisher.published_events=10

metricbeat config:

###################### Metricbeat Configuration #######################

#==========================  Modules configuration ============================
metricbeat.modules:

#------------------------------- System Module -------------------------------
- module: system
  metricsets:
    - cpu
    - diskio
    - filesystem
    - network
    - memory
  enabled: true
  period: 60s
  filters:
    - drop_event.when.regexp.mount_point: '^/(sys|cgroup|proc|dev|etc|host|run|mnt/home|var/lib/lxcfs)($|/)'
  fields:
    uuid: xxxxx

#================================ Processors ==================================
processors:
- drop_event:
    when:
       regexp:
           system.diskio.name: "fd0|sr0"

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ['xx:xx:xx:xx:xxxx']

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ['']

  # Certificate for SSL client authentication
  #ssl.certificate: ''

  # Client Certificate Key
  #ssl.key: ''


#================================ Logging =====================================

# Sets log level. The default log level is info.
# Available log levels are: critical, error, warning, info, debug
#logging.level: debug

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
#logging.selectors: ["*"]

td-agent config:

<source>
        @type beats
         metadata_as_tag
</source>
<filter *beat>
        @type record_transformer
        <record>
            tag ${tag}
        </record>
</filter>
<match metricbeat>
        ...
</match>

After looking at the code, in an effort trying to fix it myself, I feel like the error handling may possibly need a retry and I suspect the @thread.join to be preventing td-agent from properly restarting, however I'm not that experienced with Ruby so I may be completely wrong.
While waiting for a response I'll keep trying to fix it and will try to keep this ticket updated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions