Warp 10 output plugin: retry storms on token expiration cause DDoS-like behavior #18118

@FlorentinDUBOIS

Description

Relevant telegraf.conf

[agent]
debug = false
flush_interval = "45s"
hostname = "28349e03-4898-4d12-bb35-85cc48a1efc4"
interval = "60s"
metric_batch_size = 1000
metric_buffer_limit = 10000
quiet = true
round_interval = false

[global_tags]
datacenter = "par8"
deployment_id = "deployment_570baf0d-4f67-4e41-b252-22018943050d"
flavor_name = "M"
hypervisor = "hv-par8-018"
image_type = "rust"
image_variant = "rust"
instance_source = "apps"
vm_type = "volatile"
zone = "par"
[[inputs.conntrack]]

[[inputs.cpu]]
fieldexclude = ["time_*", "usage_idle"]
percpu = false
totalcpu = true

[[inputs.cpu]]
fieldinclude = ["usage_idle"]
interval = "45s"
percpu = false
totalcpu = true

[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs"]
interval = "5m"

[[inputs.exec]]
commands = ["echo 'series,application_name=lbaccess-logsarchiver user_application_name=\"lbaccess-logsarchiver\"'"]
data_format = "influx"
interval = "20m"
timeout = "5s"

[[inputs.exec]]
alias = "oomd"
commands = ["oomd -d"]
data_format = "json"
fieldexclude = ["oomd.dropin.*"]
interval = "20m"
name_override = "oomd"
timeout = "30s"

[[inputs.filestat]]
fieldinclude = ["md5_sum"]
files = ["/etc/passwd"]
interval = "1h"
md5 = true

[[inputs.http_response]]
follow_redirects = false
method = "GET"
response_timeout = "5s"
urls = ["http://127.0.0.1:8080"]

[inputs.http_response.headers]
Forwarded = "proto=https"
X-CleverCloud-Monitoring = "telegraf"
X-Forwarded-Proto = "https"

[[inputs.kernel]]
interval = "5m"

[[inputs.linux_sysctl_fs]]
fieldinclude = ["file-max"]
interval = "1h"

[[inputs.linux_sysctl_fs]]
fieldinclude = ["file-nr", "inode-nr", "inode-free-nr"]

[[inputs.mem]]
fieldexclude = ["available", "used_percent"]

[[inputs.mem]]
fieldinclude = ["available", "used_percent"]
interval = "45s"

[[inputs.net]]
fieldexclude = ["icmp_*", "icmpmsg_*", "ip_*", "tcp_*", "udp_*", "udplite_*"]

[[inputs.net_response]]
address = "127.0.0.1:8080"
protocol = "tcp"
timeout = "5s"

[[inputs.netstat]]

[[inputs.processes]]

[[inputs.procstat]]
cgroup = "system.slice/bas-deploy.service"
fieldinclude = ["pid_count"]
interval = "45s"

[[inputs.prometheus]]
metric_version = 2
response_timeout = "30s"
urls = ["http://localhost:8080/metrics"]

[[inputs.statsd]]
allowed_pending_messages = 100
delete_counters = false
name_prefix = "statsd."
service_address = "127.0.0.1:8125"

[[inputs.system]]
fieldinclude = ["load1", "uptime"]

[[inputs.system]]
fieldinclude = ["load1_per_cpu"]
interval = "45s"
[[outputs.warp10]]
print_error_body = true
token = "xxxx"
warp_url = "https://xxx"

Logs from Telegraf

I am sorry, but I did not save them.

System info

Telegraf 1.36.5-r500 on Exherbo Linux

Docker

No response

Steps to reproduce

  1. Set up the environment with a standalone Warp 10 instance from Docker
  2. Create a very short-lived token and paste it into the configuration above
  3. Update the endpoint (warp_url) as well
  4. Wait for the token to expire and watch Telegraf keep retrying its writes

Expected behavior

In my view, the plugin should stop retrying indefinitely, as other modules already do.

Actual behavior

It retries indefinitely, accumulating telemetry and putting a heavy load on the endpoint in both bandwidth and request count.

Additional info

Problem

When a Warp10 API token expires or gets revoked, the warp10 output plugin retries indefinitely, creating a DDoS-like effect against the Warp10 platform.

Impact

At Clever Cloud, we experienced this issue when token renewal failed. The telegraf agents kept retrying metrics against the Warp10 endpoint, which was returning authentication errors (Invalid token, Token Expired, Token revoked). This created excessive load on the Warp10 platform as thousands of agents simultaneously hammered the API with requests that would never succeed.

Root Cause

The warp10 plugin treated all errors as retryable. When the API returned authentication failures, metrics were kept in the retry buffer and continuously re-sent instead of being dropped.
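The difference between keeping and dropping a batch can be sketched as follows. This is a simplified model written for illustration, not Telegraf's actual buffer API: the `nonRetryableError` type, the `flush` function, and the two sample error values are all hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// nonRetryableError is a hypothetical marker type: wrapping an error in it
// signals that re-sending the same batch can never succeed.
type nonRetryableError struct{ err error }

func (e *nonRetryableError) Error() string { return e.err.Error() }
func (e *nonRetryableError) Unwrap() error { return e.err }

// Sample errors used by main; the messages come from the issue text.
var (
	errTokenExpired error = &nonRetryableError{errors.New("Token Expired")}
	errUnavailable        = errors.New("Server unavailable (503)")
)

// flush sends the buffered batch and returns what should stay buffered.
// Retryable failures keep the batch; non-retryable failures drop it.
func flush(buffer []string, write func([]string) error) ([]string, error) {
	if len(buffer) == 0 {
		return buffer, nil
	}
	err := write(buffer)
	if err == nil {
		return nil, nil // delivered, nothing left to send
	}
	var fatal *nonRetryableError
	if errors.As(err, &fatal) {
		return nil, err // drop: retrying cannot succeed
	}
	return buffer, err // keep for the next flush interval
}

func main() {
	batch := []string{"cpu usage_idle=97.1", "mem available=1024"}

	rest, err := flush(batch, func([]string) error { return errTokenExpired })
	fmt.Println(len(rest), err)

	rest, err = flush(batch, func([]string) error { return errUnavailable })
	fmt.Println(len(rest), err)
}
```

With the expired-token writer the buffer comes back empty, while the 503 writer leaves both metrics queued for the next flush. Treating every error like the second case is what produces the retry storm.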

Unlike REST APIs that use HTTP 401/403 status codes, Warp10 returns HTTP 500 with error details in the response body. The plugin wasn't parsing these responses to determine if errors were retryable.

Affected Error Types

Non-retryable errors (should drop metrics immediately):

  • Invalid token
  • Token Expired
  • Token revoked
  • Write token missing
  • Application suspended or closed

Retryable errors (should keep in buffer):

  • Exceeded Monthly Active Data Streams limit
  • Exceeded Daily Data Points limit
  • broken pipe
  • Server unavailable (503)
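A minimal sketch of how the two lists above could be told apart is shown below. Because Warp10 reports these failures in the response body rather than through the status code, the sketch matches on body substrings. The function name `shouldRetry`, the case-insensitive matching, and the rule that any unlisted error is treated as retryable are assumptions of this example, not the plugin's confirmed behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// nonRetryable holds the messages from the "Non-retryable errors" list,
// lowercased so that matching is case-insensitive (an assumption here).
var nonRetryable = []string{
	"invalid token",
	"token expired",
	"token revoked",
	"write token missing",
	"application suspended or closed",
}

// shouldRetry reports whether a failed write is worth re-sending, based on
// the error text found in the response body. Anything that is not a known
// permanent failure (quota limits, broken pipe, 503) stays retryable.
func shouldRetry(body string) bool {
	lower := strings.ToLower(body)
	for _, msg := range nonRetryable {
		if strings.Contains(lower, msg) {
			return false
		}
	}
	return true
}

func main() {
	for _, body := range []string{
		"Token Expired",
		"Exceeded Daily Data Points limit",
		"broken pipe",
	} {
		fmt.Printf("%q retry=%t\n", body, shouldRetry(body))
	}
}
```

Defaulting to "retryable" keeps the current behavior for unknown errors, so only the listed permanent failures cause metrics to be dropped.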

Workaround

If experiencing this issue, the only workaround is to restart telegraf after fixing the token, or to disable the warp10 output until the token is valid again.

Metadata

Labels

bug: unexpected problem or unintended behavior
