Here's another one I'm trying to track down before I file a bug report. Our k8s cluster had a node failure, so I restarted our single Vector instance that handles message routing and distribution. The sinks in this instance all have disk buffers, backed by a large PVC.
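For context, the sinks look roughly like this (a minimal sketch; the sink name, input, endpoint, token variable, and buffer size below are placeholders, not our actual config):

```yaml
# Sketch of one of the disk-buffered sinks; all values are placeholders.
sinks:
  splunk_hec:
    type: splunk_hec_logs
    inputs: ["route_to_splunk"]               # hypothetical upstream route
    endpoint: "https://splunk.example.com:8088"
    default_token: "${SPLUNK_HEC_TOKEN}"
    acknowledgements:
      indexer_acknowledgements_enabled: true  # wait for Splunk indexer acks
    buffer:
      type: disk                  # persisted under data_dir, which lives on the PVC
      max_size: 10737418240       # ~10 GiB, placeholder
      when_full: block
```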
I've seen this only once before and never caught why. I perform a rolling restart, which initiates a normal Vector shutdown. Invariably some of the sinks never finish acknowledgements (Splunk HEC is notorious here, taking nearly 5 minutes to index-acknowledge), so the instance gets killed after the 60-second shutdown timeout. When the instance starts back up, it immediately crashes with the error message below. I have to completely delete the PVC containing all the disk buffers to get the instance to start again. Apparently the hard kill at the 60-second timeout corrupted a buffer. Any ideas on a workaround? Having to delete the PVC and start the buffers from scratch is... annoying. :) I'm willing to work on a PR if I could figure out what/where the disk buffer startup-recovery code might be failing.
2023-07-18T16:09:21.356944Z ERROR vector::topology: Configuration error. error=Sink "splunk_hec": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload
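To make the timing concrete, the kill window comes from two places: Vector's own graceful-shutdown limit and the pod's terminationGracePeriodSeconds. A sketch of stretching both so Vector gets more time to drain acknowledgements (assuming a release that exposes `--graceful-shutdown-limit-secs`; the image tag and timeout values are placeholders, not what we run):

```yaml
# Pod spec sketch: let Vector wait longer than the default 60 s for in-flight
# acknowledgements, and make sure the kubelet doesn't SIGKILL it first.
spec:
  terminationGracePeriodSeconds: 360         # must exceed Vector's shutdown limit
  containers:
    - name: vector
      image: timberio/vector:0.31.0-debian   # placeholder tag
      args:
        - --config-dir=/etc/vector
        - --graceful-shutdown-limit-secs=300
```

That only sidesteps the race, though; it doesn't explain why a hard kill leaves the buffer in a state the startup recovery can't handle.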