Here's another one I'm trying to track down before I file a bug report. Our k8s cluster had a node failure, so I restarted our single Vector instance that handles message routing and distribution. The sinks in this instance all have disk buffers, backed by a large PVC.
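For context, the sinks look roughly like this (a minimal sketch; the sink name, input, endpoint, token variable, and buffer size below are placeholders, not our actual config):

```yaml
# Sketch of one of the disk-buffered sinks; all values are placeholders.
sinks:
  splunk_hec:
    type: splunk_hec_logs
    inputs: ["route_to_splunk"]               # hypothetical upstream route
    endpoint: "https://splunk.example.com:8088"
    default_token: "${SPLUNK_HEC_TOKEN}"
    acknowledgements:
      indexer_acknowledgements_enabled: true  # wait for Splunk indexer acks
    buffer:
      type: disk                  # persisted under data_dir, which lives on the PVC
      max_size: 10737418240       # ~10 GiB, placeholder
      when_full: block
```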
I've seen this only once before and never caught why. I perform a rolling restart, which initiates a normal Vector shutdown. Invariably some of the sinks never finish acknowledgements (Splunk HEC is notorious here, taking nearly 5 minutes to index-acknowledge), so the instance gets killed after the 60-second shutdown timeout. When the instance starts back up, it immediately crashes with the error message below. I have to completely delete the PVC containing all the disk buffers to get the instance to start again. Apparently the hard kill at the 60-second timeout corrupted a buffer. Any ideas on a workaround? Having to delete the PVC and start the buffers from scratch is... annoying. :) I'm willing to work on a PR if I could figure out what/where the disk buffer startup-recovery code might be failing.
2023-07-18T16:09:21.356944Z ERROR vector::topology: Configuration error. error=Sink "splunk_hec": error occurred when building buffer: failed to build individual stage 0: failed to seek to position where reader left off: failed to decoded record: InvalidProtobufPayload
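To make the timing concrete, the kill window comes from two places: Vector's own graceful-shutdown limit and the pod's terminationGracePeriodSeconds. A sketch of stretching both so Vector gets more time to drain acknowledgements (assuming a release that exposes `--graceful-shutdown-limit-secs`; the image tag and timeout values are placeholders, not what we run):

```yaml
# Pod spec sketch: let Vector wait longer than the default 60 s for in-flight
# acknowledgements, and make sure the kubelet doesn't SIGKILL it first.
spec:
  terminationGracePeriodSeconds: 360         # must exceed Vector's shutdown limit
  containers:
    - name: vector
      image: timberio/vector:0.31.0-debian   # placeholder tag
      args:
        - --config-dir=/etc/vector
        - --graceful-shutdown-limit-secs=300
```

That only sidesteps the race, though; it doesn't explain why a hard kill leaves the buffer in a state the startup recovery can't handle.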