Description
Relevant telegraf.conf
[agent]
debug = false
flush_interval = "45s"
hostname = "28349e03-4898-4d12-bb35-85cc48a1efc4"
interval = "60s"
metric_batch_size = 1000
metric_buffer_limit = 10000
quiet = true
round_interval = false
[global_tags]
datacenter = "par8"
deployment_id = "deployment_570baf0d-4f67-4e41-b252-22018943050d"
flavor_name = "M"
hypervisor = "hv-par8-018"
image_type = "rust"
image_variant = "rust"
instance_source = "apps"
vm_type = "volatile"
zone = "par"
[[inputs.conntrack]]
[[inputs.cpu]]
fieldexclude = ["time_*", "usage_idle"]
percpu = false
totalcpu = true
[[inputs.cpu]]
fieldinclude = ["usage_idle"]
interval = "45s"
percpu = false
totalcpu = true
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs"]
interval = "5m"
[[inputs.exec]]
commands = ["echo 'series,application_name=lbaccess-logsarchiver user_application_name=\"lbaccess-logsarchiver\"'"]
data_format = "influx"
interval = "20m"
timeout = "5s"
[[inputs.exec]]
alias = "oomd"
commands = ["oomd -d"]
data_format = "json"
fieldexclude = ["oomd.dropin.*"]
interval = "20m"
name_override = "oomd"
timeout = "30s"
[[inputs.filestat]]
fieldinclude = ["md5_sum"]
files = ["/etc/passwd"]
interval = "1h"
md5 = true
[[inputs.http_response]]
follow_redirects = false
method = "GET"
response_timeout = "5s"
urls = ["http://127.0.0.1:8080"]
[inputs.http_response.headers]
Forwarded = "proto=https"
X-CleverCloud-Monitoring = "telegraf"
X-Forwarded-Proto = "https"
[[inputs.kernel]]
interval = "5m"
[[inputs.linux_sysctl_fs]]
fieldinclude = ["file-max"]
interval = "1h"
[[inputs.linux_sysctl_fs]]
fieldinclude = ["file-nr", "inode-nr", "inode-free-nr"]
[[inputs.mem]]
fieldexclude = ["available", "used_percent"]
[[inputs.mem]]
fieldinclude = ["available", "used_percent"]
interval = "45s"
[[inputs.net]]
fieldexclude = ["icmp_*", "icmpmsg_*", "ip_*", "tcp_*", "udp_*", "udplite_*"]
[[inputs.net_response]]
address = "127.0.0.1:8080"
protocol = "tcp"
timeout = "5s"
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.procstat]]
cgroup = "system.slice/bas-deploy.service"
fieldinclude = ["pid_count"]
interval = "45s"
[[inputs.prometheus]]
metric_version = 2
response_timeout = "30s"
urls = ["http://localhost:8080/metrics"]
[[inputs.statsd]]
allowed_pending_messages = 100
delete_counters = false
name_prefix = "statsd."
service_address = "127.0.0.1:8125"
[[inputs.system]]
fieldinclude = ["load1", "uptime"]
[[inputs.system]]
fieldinclude = ["load1_per_cpu"]
interval = "45s"
[[outputs.warp10]]
print_error_body = true
token = "xxxx"
warp_url = "https://xxx"
Logs from Telegraf
I am sorry, but I did not save them.
System info
Telegraf 1.36.5-r500 on Exherbo Linux
Docker
No response
Steps to reproduce
- Set up the environment with a standalone Warp 10 instance running in Docker
- Create a very short-lived token and paste it into the configuration above
- Update the endpoint as well
- See what happens...
Expected behavior
As other plugins already do, the warp10 output should stop retrying once the error cannot be recovered from, instead of retrying indefinitely.
Actual behavior
It retries indefinitely, accumulating telemetry and creating a heavy load in both bandwidth and request volume.
Additional info
Problem
When a Warp10 API token expires or gets revoked, the warp10 output plugin retries indefinitely, creating a DDoS-like effect against the Warp10 platform.
Impact
At Clever Cloud, we experienced this issue when token renewal failed. The telegraf agents kept retrying metrics against the Warp10 endpoint, which was returning authentication errors (Invalid token, Token Expired, Token revoked). This created excessive load on the Warp10 platform as thousands of agents simultaneously hammered the API with requests that would never succeed.
Root Cause
The warp10 plugin treated all errors as retryable. When the API returned authentication failures, metrics were kept in the retry buffer and continuously re-sent instead of being dropped.
Unlike REST APIs that use HTTP 401/403 status codes, Warp10 returns HTTP 500 with error details in the response body. The plugin wasn't parsing these responses to determine if errors were retryable.
Affected Error Types
Non-retryable errors (should drop metrics immediately):
- Invalid token
- Token Expired
- Token revoked
- Write token missing
- Application suspended or closed
Retryable errors (should keep in buffer):
- Exceeded Monthly Active Data Streams limit
- Exceeded Daily Data Points limit
- broken pipe
- Server unavailable (503)
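The split above could be sketched roughly as follows. This is only an illustration under assumptions, not the plugin's actual code: the function name `isRetryable` and the `nonRetryable` list are hypothetical, and it assumes the Warp10 response body contains the messages listed above as plain text, matched case-insensitively.

```go
package main

import "strings"

// nonRetryable holds lowercase fragments of Warp10 error bodies that will
// never succeed on retry. The exact wording is an assumption taken from the
// list in this issue, not from the Warp10 source.
var nonRetryable = []string{
	"invalid token",
	"token expired",
	"token revoked",
	"write token missing",
	"application suspended or closed",
}

// isRetryable reports whether a failed write should stay in the buffer.
// It returns false when the response body mentions one of the known
// non-retryable errors, and true for everything else (quota limits,
// broken pipe, server unavailable, unknown errors).
func isRetryable(body string) bool {
	lower := strings.ToLower(body)
	for _, msg := range nonRetryable {
		if strings.Contains(lower, msg) {
			return false
		}
	}
	return true
}
```

Unknown errors are treated as retryable here so that only the explicitly listed authentication and account failures cause metrics to be dropped.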
Workaround
If experiencing this issue, the only workaround is to restart telegraf after fixing the token, or to disable the warp10 output until the token is valid again.