Description
If a node exceeds the flood-stage disk watermark then we block further writes to its indices but we allow merges to continue. Merges can temporarily consume a very large amount of disk space, more than enough to fill up the gap between any reasonable flood-stage watermark and a completely full disk. When a node completely fills its disk, it basically dies.
We could pause merge-related writes in this situation, for instance by overriding Store$StoreDirectory#createOutput and adjusting the output's behaviour according to the supplied IOContext.
We probably don't want to do this for all writes, because (e.g.) a primary may need to refresh before it can relocate itself elsewhere, and because blocking arbitrary write threads seems like a recipe for deadlocks. Blocking merge threads seems OK though. We may also need to be sensitive to the size of the write (see IOContext.mergeInfo.estimatedMergeBytes and IOContext.flushInfo.estimatedSegmentSize) since smaller merges may soon be triggered by the merge-on-refresh feature.
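To make the idea concrete, here is a rough, self-contained sketch of the gating logic only. It is not wired into Store$StoreDirectory or Lucene: `MergeWriteGate`, `WriteKind`, `shouldBlock`, `awaitSpace` and `onDiskUsageChanged` are all hypothetical names, `WriteKind` merely stands in for what an overridden createOutput could derive from the supplied IOContext, and the free-space supplier and threshold are assumptions. Which signal should actually drive the gate is still an open question.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;

// Hypothetical sketch: none of these names exist in Elasticsearch or Lucene.
final class MergeWriteGate {

    // Stand-in for the kind of write that an IOContext describes.
    enum WriteKind { MERGE, FLUSH, OTHER }

    private final LongSupplier usableBytes; // assumed source of free disk space
    private final long minFreeBytes;        // assumed headroom to preserve
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition spaceAvailable = lock.newCondition();

    MergeWriteGate(LongSupplier usableBytes, long minFreeBytes) {
        this.usableBytes = usableBytes;
        this.minFreeBytes = minFreeBytes;
    }

    // True if this write should wait: only merges are ever held back, and only
    // when their estimated size would eat into the preserved headroom.
    boolean shouldBlock(WriteKind kind, long estimatedBytes) {
        if (kind != WriteKind.MERGE) {
            return false; // flushes/refreshes must proceed, e.g. before a relocation
        }
        return usableBytes.getAsLong() - estimatedBytes < minFreeBytes;
    }

    // Called on a merge thread before it opens a new output; parks the thread
    // until enough space is free again.
    void awaitSpace(WriteKind kind, long estimatedBytes) throws InterruptedException {
        lock.lock();
        try {
            while (shouldBlock(kind, estimatedBytes)) {
                spaceAvailable.await();
            }
        } finally {
            lock.unlock();
        }
    }

    // Called whenever disk usage is re-sampled, so parked merges re-check.
    void onDiskUsageChanged() {
        lock.lock();
        try {
            spaceAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

With this shape a small merge (such as one triggered by merge-on-refresh) passes as long as its estimate fits above the headroom, a large merge parks its own merge thread, and no non-merge write is ever blocked.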
It's unclear whether to do this based on the read_only_allow_delete block (which affects other nodes below the flood-stage watermark) or the actual disk usage on the node (which may not know the flood-stage watermark that the master is using).
NB we can also consider reducing the flood-stage max headroom once we have better protection against merges consuming all the remaining space.