Hadoop 19641: [ABFS][ReadAheadV2] First Read should bypass ReadBufferManager #7835
base: trunk
Conversation
Pull Request Overview
This PR optimizes the read-ahead behavior in ABFS input streams by bypassing the ReadBufferManager for the first read operation, reducing unnecessary overhead when the read pattern is not yet established.
- Introduces a bypass mechanism for the first read to avoid synchronous prefetch overhead
- Reduces the number of read-ahead requests from the configured queue depth to 1 for the first read
- Updates existing tests to account for the new first-read bypass behavior
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| AbfsInputStream.java | Implements first-read bypass logic by setting numReadAheads to 1 and adds an isFirstRead() method |
| ReadBufferManagerV1.java | Prevents removal of prefetch buffers from the queue when triggered by the first read |
| TestAbfsInputStream.java | Updates test expectations and mocking to accommodate the new first-read behavior |
```java
 * For the first read of this input stream, we don't read ahead but keep
 * the current read data in cache as pattern might not be sequential.
 */
int numReadAheads = firstRead? 1 : this.readAheadQueueDepth;
```
[nitpick] The ternary operator lacks spaces around the `?` operator, which deviates from Java coding conventions. It should be `firstRead ? 1 : this.readAheadQueueDepth`.
Suggested change:
```java
int numReadAheads = firstRead ? 1 : this.readAheadQueueDepth;
```
```diff
 if (-1 == fCursorAfterLastRead || fCursorAfterLastRead == fCursor || b.length >= bufferSize) {
-  LOG.debug("Sequential read with read ahead size of {}", bufferSize);
+  // Sequential read pattern detected. Enable read ahead.
+  LOG.debug("Sequential read with read size of {} and read ahead enabled", bufferSize);
```
The log message mentions 'read size of {}' but logs `bufferSize` instead of the actual read size (`b.length`). This could be misleading when debugging read operations.
Suggested change:
```java
LOG.debug("Sequential read with read size of {} and read ahead enabled", b.length);
```
Description of PR
JIRA: https://issues.apache.org/jira/browse/HADOOP-19641
We have observed across multiple workload runs that the first read arriving at an input stream must be served synchronously, even if a prefetch request was triggered for that offset. Most of the time we end up doing the extra work of checking whether the prefetch was triggered and removing it from the pending queue, only to then perform a direct remote read in the workload thread anyway.

To avoid this overhead, we always bypass read ahead for the very first read of each input stream and trigger read aheads from the second read onwards.
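The bypass described above can be sketched as a small standalone class. This is a hypothetical, simplified illustration, not the actual `AbfsInputStream` implementation: the names `firstRead`, `readAheadQueueDepth`, and `numReadAheads` mirror the PR diff, while the class, method, and the assumed depth of 4 are invented for demonstration.

```java
// Simplified sketch of the first-read bypass (hypothetical class, not Hadoop code).
public class FirstReadBypassSketch {
  private boolean firstRead = true;
  private final int readAheadQueueDepth = 4; // assumed configured queue depth

  /** Returns how many read-ahead requests to plan for the current read. */
  int planReadAheads() {
    // On the very first read the access pattern is unknown, so issue only a
    // single (synchronous) read and skip queuing prefetch requests entirely.
    int numReadAheads = firstRead ? 1 : readAheadQueueDepth;
    firstRead = false;
    return numReadAheads;
  }

  public static void main(String[] args) {
    FirstReadBypassSketch stream = new FirstReadBypassSketch();
    System.out.println(stream.planReadAheads()); // first read: 1
    System.out.println(stream.planReadAheads()); // subsequent reads: full depth (4)
  }
}
```

Each input stream starts with `firstRead = true`, so the one-request path runs exactly once per stream; every later read falls through to the configured read-ahead depth.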
How was this patch tested?
TBA