Trying to understand chunking, retry and buffer drop #4093

viralmutant · 2023-03-12T19:54:06Z

viralmutant
Mar 12, 2023

We are using version 1.10.2 with buf_file and have lately been having an issue where a lot of messages are being lost.
I am trying to understand exactly how the parameters work.
This is our td-agent.conf:

chunk_limit_size 2MB
queued_chunks_limit_size 1
total_limit_size 6GB
overflow_action drop_oldest_chunk
flush_mode interval
flush_interval 5s
retry_type periodic
retry_wait 15s
retry_timeout 15m
disable_chunk_backup true

We have a chunk_limit_size of 2MB, but I also see files of only a few KB. I thought each file (chunk) would grow to 2MB before a new file is started. Is that not the case?

Do retry_wait and retry_timeout apply to each chunk individually? With only 1 worker, does it work sequentially, so that one chunk is retried until the timeout is hit and then the next chunk is picked up?

I am also wondering how it works if retry_timeout > (retry_wait * retry_max_times). For example, what happens if someone sets only 2 retries, 15 seconds apart, but a timeout of, say, 10 minutes?

I see messages like these in the logs:

2023-02-17 00:01:00 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2023-02-17 00:01:13 +0000 chunk="5f4d8155df443c2f5224234c973d0ea9" error_class=Net::ReadTimeout error="Net::ReadTimeout"
2023-02-17 00:01:00 +0000 [warn]: #0 suppressed same stacktrace
2023-02-17 00:02:14 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2023-02-17 00:02:30 +0000 chunk="5f4d8155df443c2f5224234c973d0ea9" error_class=Net::ReadTimeout error="Net::ReadTimeout"
2023-02-17 00:02:14 +0000 [warn]: #0 suppressed same stacktrace
...
2023-02-17 00:14:47 +0000 [warn]: #0 suppressed same stacktrace
2023-02-17 00:15:48 +0000 [error]: #0 failed to flush the buffer, and hit limit for retries. dropping all chunks in the buffer queue. retry_times=12 records=168070 error_class=Net::ReadTimeout error="Net::ReadTimeout"

next_retry_seconds is about 15 seconds after the failure, as configured in the conf file, but the next retry actually appears after about 1 minute. Why is that?

And after 15 minutes, when it prints "dropping all chunks", what exactly is dropped?

daipom · 2023-03-13T02:16:32Z

daipom
Mar 13, 2023
Maintainer

That version is quite old; please consider upgrading to the latest Fluentd version.
Its behavior may differ from what I know, but I will answer as best I can.

We have a chunk_limit_size of 2MB, but I also see files of only a few KB. I thought each file (chunk) would grow to 2MB before a new file is started. Is that not the case?

flush_mode interval
flush_interval 5s

Since you set these options, Fluentd flushes chunks every 5 seconds, so the chunk size can be a few KBs.
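
A toy model may make this concrete. It is only a sketch under stated assumptions, not Fluentd's actual implementation: a staged chunk leaves the stage either when the next write would push it past chunk_limit_size or when it has been open for flush_interval, whichever comes first. The function name, the one-second tick, and the byte rates are invented for illustration.

```python
def simulate_chunks(bytes_per_second, seconds, chunk_limit_size, flush_interval):
    """Return the sizes of chunks that leave the stage in a toy buffer model.

    Assumption: data arrives once per second in equal-sized writes.
    """
    chunks = []        # sizes of chunks that have left the stage
    size, age = 0, 0   # the chunk currently being filled
    for _ in range(seconds):
        # Size trigger: the next write would not fit, so close the chunk first.
        if size > 0 and size + bytes_per_second > chunk_limit_size:
            chunks.append(size)
            size, age = 0, 0
        size += bytes_per_second
        age += 1
        # Time trigger: the chunk has been open for flush_interval seconds.
        if age >= flush_interval:
            chunks.append(size)
            size, age = 0, 0
    return chunks


TWO_MB = 2 * 1024 * 1024

# Low volume: the 5-second time trigger always fires first.
print(simulate_chunks(2_000, 20, TWO_MB, 5))

# High volume: the size trigger fires before 5 seconds have passed.
print(simulate_chunks(1_000_000, 6, TWO_MB, 5))
```

In this model, a low log volume means the time trigger always fires first, so chunks stay far below 2MB, which matches the few-KB files described in the question.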

next_retry_seconds is about 15 seconds after the failure, as configured in the conf file, but the next retry actually appears after about 1 minute. Why is that?

This is strange... It may be an issue in the old version.
Perhaps the flush thread only runs its processing loop once a minute, and that is the reason for the delay.
I think this has been improved in recent versions.
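
Another possible reading of the log, which nobody in the thread confirms: each warn line is written when an attempt fails, not when it starts. If the plugin uses Ruby's Net::HTTP with its default read_timeout of 60 seconds (an assumption; the plugin is not named in the thread), an attempt that starts at next_retry_seconds would only fail, and be logged, about a minute later. The scheduled waits of 13 and 16 seconds rather than exactly 15 would also fit retry_randomize, which is enabled by default. The sketch below only checks that arithmetic against the timestamps quoted in the question; the variable names are invented.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the first two warn lines in the question.
first_failure_logged = datetime.strptime("2023-02-17 00:01:00", FMT)
next_retry_scheduled = datetime.strptime("2023-02-17 00:01:13", FMT)
second_failure_logged = datetime.strptime("2023-02-17 00:02:14", FMT)

# Assumption: the request blocks for Net::HTTP's default 60 s read timeout.
ASSUMED_READ_TIMEOUT = timedelta(seconds=60)

scheduled_wait = next_retry_scheduled - first_failure_logged
attempt_duration = second_failure_logged - next_retry_scheduled
slack = second_failure_logged - (next_retry_scheduled + ASSUMED_READ_TIMEOUT)

print(scheduled_wait.total_seconds())    # wait chosen by the retry logic
print(attempt_duration.total_seconds())  # time until the failure was logged
print(slack.total_seconds())             # gap left unexplained by the assumption

# Rough retry count within retry_timeout if every cycle is wait + timeout.
RETRY_WAIT = 15            # seconds, from the posted config
RETRY_TIMEOUT = 15 * 60    # seconds, from the posted config
cycle = RETRY_WAIT + ASSUMED_READ_TIMEOUT.total_seconds()
estimated_retries = RETRY_TIMEOUT / cycle
print(estimated_retries)
```

Under that assumption a full cycle takes roughly 75 seconds, and the estimate of 12 cycles in 15 minutes lines up with retry_times=12 in the final error line. Whether the flush thread interval also plays a part in 1.10.2 is not something this arithmetic can settle.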

Do retry_wait and retry_timeout apply to each chunk individually? With only 1 worker, does it work sequentially, so that one chunk is retried until the timeout is hit and then the next chunk is picked up?

I think it works sequentially in one flush thread.
By default, there is 1 flush thread in each worker, but you can change it. (I don't know if 1.10.2 supports this...)

https://docs.fluentd.org/output#flush_thread_count

And after 15 minutes, when it prints "dropping all chunks", what exactly is dropped?

It means the data is lost.
Please use secondary to avoid losing data.

https://docs.fluentd.org/output#secondary-output

2 replies

viralmutant Mar 13, 2023
Author

Thanks for the answer.

Since you set these options, Fluentd flushes chunks every 5 seconds, so the chunk size can be a few KBs.

But that part confuses me. I deliberately introduced a URL error in the plugin that posts the logs to a remote server. Chunks now start to accumulate in the directory, but they are of varying sizes. Some are 8-10 KB, some 40-50 KB, and some even swell to 150 KB, well past the 5-second window, and the file size keeps increasing. If a chunk is cut off at every flush_interval window, the chunk files should not grow after 5 seconds, right? And the files are deleted even before 15 minutes have passed.

It means the data is lost.

I understand there would be data loss, but would it be only that chunk or everything in the buffer directory?

daipom Mar 13, 2023
Maintainer

Since you set these options, Fluentd flushes chunks every 5 seconds, so the chunk size can be a few KBs.

But that part confuses me. I deliberately introduced a URL error in the plugin that posts the logs to a remote server. Chunks now start to accumulate in the directory, but they are of varying sizes. Some are 8-10 KB, some 40-50 KB, and some even swell to 150 KB, well past the 5-second window, and the file size keeps increasing. If a chunk is cut off at every flush_interval window, the chunk files should not grow after 5 seconds, right? And the files are deleted even before 15 minutes have passed.

If Fluentd successfully queues the chunk after 5 seconds, the size of the chunk should not increase.

https://docs.fluentd.org/output#overview

If Fluentd is running without stalling, every chunk should be queued after 5 seconds.
So perhaps an already queued chunk is stuck, and the other chunks can't be queued on time.

Note: queued_chunks_limit_size is the number of chunks that can be queued at once, and it defaults to the same value as flush_thread_count (1 by default).

It means the data is lost.

I understand there would be data loss, but would it be only that chunk or everything in the buffer directory?

I don't know about 1.10, but perhaps only the already queued chunks would be dropped.

viralmutant · 2023-03-13T08:16:08Z

viralmutant
Mar 13, 2023
Author

So in my case, since queued_chunks_limit_size is 1, only a single chunk will be queued, tried/retried and flushed, right? Once this chunk is processed, the next chunk would be picked up.
And the other chunk's size might keep increasing because it has not been queued yet?

1 reply

daipom Mar 13, 2023
Maintainer

Yes, I think so!
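
The behavior agreed on above can be sketched as a toy model. This is an assumption-based illustration, not Fluentd's code: a staged chunk is only enqueued once it is at least flush_interval old and the queue has room, and a queued chunk stays in the queue until the destination recovers. All names and numbers are hypothetical.

```python
def enqueue_sizes(bytes_per_second, seconds, flush_interval,
                  queued_chunks_limit_size, stuck_seconds):
    """Return each chunk's size at the moment it was enqueued (toy model).

    Assumption: the destination is unreachable for the first
    `stuck_seconds` seconds, so queued chunks cannot be written until then.
    """
    queue = []        # chunks waiting to be written out
    enqueued = []     # size of every chunk when it entered the queue
    size, age = 0, 0  # the staged chunk that is still being appended to
    for t in range(1, seconds + 1):
        # Once the destination is reachable, the head of the queue is written.
        if queue and t > stuck_seconds:
            queue.pop(0)
        size += bytes_per_second
        age += 1
        # The staged chunk is old enough, but it can only move if there is room.
        if age >= flush_interval and len(queue) < queued_chunks_limit_size:
            queue.append(size)
            enqueued.append(size)
            size, age = 0, 0
    return enqueued


# 1000 bytes/s, flush_interval 5 s, queue limit 1, destination down for 20 s.
print(enqueue_sizes(1000, 30, 5, 1, 20))
```

In this run the first chunk is enqueued at 5000 bytes, the second one keeps growing to 16000 bytes while the queue is blocked, and the third is back to 5000 bytes once the queue drains.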

viralmutant · 2023-03-16T03:37:28Z

viralmutant
Mar 16, 2023
Author

What's the relation between total_limit_size and overflow_action? These two seem contradictory to me, or do these parameters refer to different things?

total_limit_size says that once the buffer is full, any newly generated data is lost, while overflow_action offers the option drop_oldest_chunk. That would mean the oldest data is lost and the new data is added to the buffer.

0 replies

daipom · 2023-03-16T03:54:44Z

daipom
Mar 16, 2023
Maintainer

Are you referring to these documents?

total_limit_size
Once the total size of stored buffer reached this threshold, all append operations will fail with error (and data will be lost)

overflow_action
Default: throw_exception
How does output plugin behave when its buffer queue is full?

I don't think these explanations are contradictory.

By default, it should behave as described in total_limit_size, and that can be changed by setting overflow_action.
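
How the two settings interact can be illustrated with a small sketch. It is a simplified model under assumptions, not Fluentd's implementation: total_limit_size decides when the buffer counts as full, and overflow_action decides what happens at that point. Only throw_exception and drop_oldest_chunk are modeled (Fluentd also has a block action), and the function, the exception class, and the sizes are invented for the example.

```python
class BufferOverflowError(Exception):
    """Raised by the toy model when an append is rejected."""


def append_chunk(chunks, new_chunk_size, total_limit_size,
                 overflow_action="throw_exception"):
    """Return the buffer (chunk sizes, oldest first) after an append attempt."""
    chunks = list(chunks)  # work on a copy so the caller's list is untouched
    while sum(chunks) + new_chunk_size > total_limit_size:
        if overflow_action == "drop_oldest_chunk" and chunks:
            chunks.pop(0)  # the oldest data is lost to make room
        else:
            # Default behavior in this model: the new data is rejected.
            raise BufferOverflowError("total_limit_size exceeded")
    chunks.append(new_chunk_size)
    return chunks


full_buffer = [2, 2, 2]  # already at a total_limit_size of 6

# drop_oldest_chunk: the oldest chunk goes, the new one is stored.
print(append_chunk(full_buffer, 2, 6, "drop_oldest_chunk"))

# throw_exception: the append fails and the buffer is unchanged.
try:
    append_chunk(full_buffer, 2, 6, "throw_exception")
except BufferOverflowError as error:
    print("rejected:", error)
```

Seen this way, the total_limit_size description covers the default overflow_action, and changing overflow_action changes which data is lost once that limit is reached.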

1 reply

viralmutant Mar 16, 2023
Author

Ah, so setting one modifies the behavior of the other. That part was not clear to me from the documentation.

I think something similar is true for retry_timeout and retry_max_times. Say I set retry_max_times to 3 with retry_wait at 15 seconds, and retry_timeout is left at its default of 72 hours. The chunk would be discarded after the 3rd attempt and would not wait until the timeout.
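
That reasoning can be written down as a short sketch, assuming periodic retries and that whichever limit is reached first ends the retrying. It ignores retry randomization and the time each attempt takes, and the function name is made up for the example.

```python
def give_up_after(retry_wait, retry_timeout, retry_max_times=None):
    """Seconds after the first failure at which retrying stops (toy model).

    Assumption: retry_type periodic, so every wait is retry_wait seconds.
    """
    if retry_max_times is None:
        # No retry count limit: only retry_timeout applies.
        return retry_timeout
    # Time needed to use up all allowed retries.
    by_count = retry_wait * retry_max_times
    # Whichever limit is hit first ends the retrying.
    return min(retry_timeout, by_count)


# 3 retries, 15 s apart, default retry_timeout of 72 hours.
print(give_up_after(15, 72 * 3600, retry_max_times=3))

# The earlier question: 2 retries, 15 s apart, 10-minute timeout.
print(give_up_after(15, 10 * 60, retry_max_times=2))

# The posted config: no retry_max_times, retry_timeout 15m.
print(give_up_after(15, 15 * 60))
```

Under these assumptions the retry count wins in the first two cases (45 and 30 seconds), and the 15-minute timeout wins for the posted config.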

daipom · 2023-03-16T03:59:33Z

daipom
Mar 16, 2023
Maintainer

If you think this can be explained more clearly, we would welcome a PR!

https://github.com/fluent/fluentd-docs-gitbook/tree/1.0

0 replies
