Hadoop 19641: [ABFS][ReadAheadV2] First Read should bypass ReadBufferManager #7835
base: trunk
Conversation
Pull Request Overview
This PR optimizes the read-ahead behavior in ABFS input streams by bypassing the ReadBufferManager for the first read operation, reducing unnecessary overhead when the read pattern is not yet established.
- Introduces a bypass mechanism for the first read to avoid synchronous prefetch overhead
- Reduces the number of read-ahead requests from the configured queue depth to 1 for the first read
- Updates existing tests to account for the new first-read bypass behavior
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| AbfsInputStream.java | Implements first-read bypass logic by setting numReadAheads to 1 and adds an isFirstRead() method |
| ReadBufferManagerV1.java | Prevents removal of prefetch buffers from the queue when triggered by the first read |
| TestAbfsInputStream.java | Updates test expectations and mocking to accommodate the new first-read behavior |
```java
 * For the first read of this input stream, we don't read ahead but keep
 * the current read data in cache as pattern might not be sequential.
 */
int numReadAheads = firstRead? 1 : this.readAheadQueueDepth;
```
[nitpick] The ternary operator lacks spaces around the `?` operator, which deviates from Java coding conventions. It should be `firstRead ? 1 : this.readAheadQueueDepth`.
Suggested change:
```java
int numReadAheads = firstRead ? 1 : this.readAheadQueueDepth;
```
```diff
 if (-1 == fCursorAfterLastRead || fCursorAfterLastRead == fCursor || b.length >= bufferSize) {
-  LOG.debug("Sequential read with read ahead size of {}", bufferSize);
+  // Sequential read pattern detected. Enable read ahead.
+  LOG.debug("Sequential read with read size of {} and read ahead enabled", bufferSize);
```
The log message mentions 'read size of {}' but logs `bufferSize` instead of the actual read size (`b.length`). This could be misleading when debugging read operations.
Suggested change:
```java
LOG.debug("Sequential read with read size of {} and read ahead enabled", b.length);
```
Description of PR
JIRA: https://issues.apache.org/jira/browse/HADOOP-19641
We have observed across multiple workload runs that the first read arriving at an input stream must be served synchronously, even if a prefetch request was triggered for that offset. Most of the time we end up doing the extra work of checking whether the prefetch was triggered and removing it from the pending queue, only to then perform a direct remote read in the workload thread anyway.

To avoid this overhead, we always bypass read ahead for the very first read of each input stream and trigger read aheads from the second read onwards.
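The bypass described above can be sketched as a small standalone class. This is a hypothetical, simplified illustration, not the actual `AbfsInputStream` implementation: the names `firstRead`, `readAheadQueueDepth`, and `numReadAheads` mirror the PR diff, while the class, method, and the assumed depth of 4 are invented for demonstration.

```java
// Simplified sketch of the first-read bypass (hypothetical class, not Hadoop code).
public class FirstReadBypassSketch {
  private boolean firstRead = true;
  private final int readAheadQueueDepth = 4; // assumed configured queue depth

  /** Returns how many read-ahead requests to plan for the current read. */
  int planReadAheads() {
    // On the very first read the access pattern is unknown, so issue only a
    // single (synchronous) read and skip queuing prefetch requests entirely.
    int numReadAheads = firstRead ? 1 : readAheadQueueDepth;
    firstRead = false;
    return numReadAheads;
  }

  public static void main(String[] args) {
    FirstReadBypassSketch stream = new FirstReadBypassSketch();
    System.out.println(stream.planReadAheads()); // first read: 1
    System.out.println(stream.planReadAheads()); // subsequent reads: full depth (4)
  }
}
```

Each input stream starts with `firstRead = true`, so the one-request path runs exactly once per stream; every later read falls through to the configured read-ahead depth.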
How was this patch tested?
TBA