[BUG] Shard indexing pressure results in request corruption #20724

@varunbharadwaj

Describe the bug

OpenSearch uses unsafe buffers for bulk indexing requests by default. This was added for performance reasons and overrides the default Netty logic so that buffers are released back to the buffer pool only when the HTTP response is ready. Holding the buffer until then prevents it from being reused while the request is in progress, which protects the request payload.

However, shard indexing pressure request rejection can corrupt the indexing payload in the following scenario:

  1. Assume a bulk request targets shards local to the coordinating node, along with other shards that are overloaded.
  2. The local shard is handled first, and its shard request is submitted for processing.
  3. Later, when the overloaded shard is reached, it is rejected due to shard indexing pressure. This raises an exception, which triggers the failure callback.
  4. The bulk request is failed, and its buffers are released.
  5. At this point the local shard request is still in progress, but the released buffer is recycled and overwritten.
  6. This results in corrupted indexing payloads at the engine layer. It is also possible for the engine to index successfully while the corruption is visible only in the translog entry, which then fails on recovery.
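The sequence above can be illustrated with a deterministic, single-threaded sketch. It does not use OpenSearch or Netty code: `Pool`, `simulate`, and the payload strings are hypothetical stand-ins, and the zero-copy reference held by the local shard request is an assumption about how the payload is shared.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;

class BufferReuseSketch {

    /** Hypothetical stand-in for a pooled allocator that recycles fixed-size buffers. */
    static final class Pool {
        private final ArrayDeque<byte[]> free = new ArrayDeque<>();

        byte[] acquire() {
            byte[] recycled = free.poll();
            return recycled != null ? recycled : new byte[32];
        }

        void release(byte[] buffer) {
            free.push(buffer);
        }
    }

    /** Overwrites the whole buffer with the payload, padded with spaces. */
    static void write(byte[] buffer, String payload) {
        Arrays.fill(buffer, (byte) ' ');
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(bytes, 0, buffer, 0, bytes.length);
    }

    static String read(byte[] buffer) {
        return new String(buffer, StandardCharsets.UTF_8).trim();
    }

    /** Returns what the still-running local shard request sees after the early release. */
    static String simulate() {
        Pool pool = new Pool();

        // Steps 1-2: the bulk request arrives and the local shard request is
        // submitted, holding a zero-copy reference to the pooled buffer.
        byte[] bulkBuffer = pool.acquire();
        write(bulkBuffer, "{\"doc\":\"first\"}");
        byte[] localShardView = bulkBuffer;

        // Steps 3-4: the overloaded shard is rejected, the failure callback
        // fires, and the buffer goes back to the pool too early.
        pool.release(bulkBuffer);

        // Step 5: an unrelated request recycles the same buffer.
        byte[] nextRequest = pool.acquire();
        write(nextRequest, "{\"doc\":\"OTHER\"}");

        // Step 6: the local shard request now reads another request's bytes.
        return read(localShardView);
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints {"doc":"OTHER"}
    }
}
```

In the real system the overwrite races with the in-flight shard request, so the payload may be only partially replaced, which would be consistent with the parsing errors mentioned in this report.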

The same issue also occurs on the replica, though less often than on the primary, depending on timing.

One way to verify this is to move the shard indexing pressure checks ahead of any shard request submission. Alternatively, we can wait for the already submitted shard requests to finish before failing the bulk request.
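A rough sketch of the two options, under stated assumptions: `Outcome`, `checkWhileSubmitting`, `checkBeforeSubmitting`, and `DeferredRelease` are hypothetical names for illustration and are not OpenSearch APIs. Shards are modeled as plain strings and the pressure check as a predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class BulkRejectionSketch {

    /** Hypothetical result type: was the bulk request rejected, and which shards were already submitted? */
    static final class Outcome {
        final boolean rejected;
        final List<String> submitted;

        Outcome(boolean rejected, List<String> submitted) {
            this.rejected = rejected;
            this.submitted = submitted;
        }
    }

    /** Behavior described in this report: check and submit shard by shard. */
    static Outcome checkWhileSubmitting(List<String> shards, Predicate<String> overloaded) {
        List<String> submitted = new ArrayList<>();
        for (String shard : shards) {
            if (overloaded.test(shard)) {
                // Rejection arrives while earlier shard requests are still in flight.
                return new Outcome(true, submitted);
            }
            submitted.add(shard);
        }
        return new Outcome(false, submitted);
    }

    /** Option 1: run every pressure check before submitting anything. */
    static Outcome checkBeforeSubmitting(List<String> shards, Predicate<String> overloaded) {
        for (String shard : shards) {
            if (overloaded.test(shard)) {
                return new Outcome(true, new ArrayList<>());
            }
        }
        return new Outcome(false, new ArrayList<>(shards));
    }

    /** Option 2: release buffers only after the request and all submitted shards are done. */
    static final class DeferredRelease {
        // Starts at 1 to account for the bulk request itself.
        private final AtomicInteger pending = new AtomicInteger(1);
        private final Runnable release;

        DeferredRelease(Runnable release) {
            this.release = release;
        }

        void shardSubmitted() {
            pending.incrementAndGet();
        }

        void shardFinished() {
            done();
        }

        void requestFinished() {
            done();
        }

        private void done() {
            if (pending.decrementAndGet() == 0) {
                release.run();
            }
        }
    }
}
```

With option 1 a rejected request never has a shard request in flight, so releasing its buffers in the failure callback is safe. Option 2 keeps the existing check order but delays the release until the last submitted shard request reports back.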

We tried the first option to confirm that it resolves the payload corruption. As shown below, the parsing errors stop after this change.

Image

Related component

Indexing

To Reproduce

The issue can be seen on OpenSearch 3.x versions: it was observed on 3.4 and also exists on the latest main branch (unreleased 3.6). It depends on a race condition in which a bulk request fails partially and new requests overwrite in-flight payloads.

Expected behavior

There should be no request payload corruption.

Additional Details

No response

Labels

Indexing, bug
