minor: include taskAttemptId in log messages #2467
acquireMemory returns less memory than requested [WIP]
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2467 +/- ##
============================================
+ Coverage 56.12% 58.33% +2.20%
- Complexity 976 1438 +462
============================================
Files 119 146 +27
Lines 11743 13518 +1775
Branches 2251 2350 +99
============================================
+ Hits 6591 7886 +1295
- Misses 4012 4400 +388
- Partials 1140 1232 +92
comphead
left a comment
lgtm thanks @andygrove
| if used != 0 { | ||
| warn!("CometUnifiedMemoryPool dropped with {used} bytes still reserved"); | ||
| warn!( | ||
| "Task {} dropped CometUnifiedMemoryPool with {used} bytes still reserved", |
| "Task {} dropped CometUnifiedMemoryPool with {used} bytes still reserved", | |
| "Task ID {} dropped CometUnifiedMemoryPool with {used} bytes still reserved", |
| .unwrap_or_else(|_| panic!("Failed to release {size} bytes")); | ||
| if let Err(e) = self.release_to_spark(size) { | ||
| panic!( | ||
| "Task {} failed to return {size} bytes to Spark: {e:?}", |
| "Task {} failed to return {size} bytes to Spark: {e:?}", | |
| "Task ID {} failed to return {size} bytes to Spark: {e:?}", |
I was following Spark's convention for logging so that we can use grep to search for Task 1234 and view the combined logging from both Spark and Comet.
```
~/git/apache/apache-spark-3.5.6$ find . -name "*.scala" -exec grep taskAttemptId {} \; | grep log
logDebug(s"Starting pushing blocks for the task ${context.taskAttemptId()}")
logWarning(s"Task ${taskAttemptId.get} already completed, not releasing lock for $blockId")
logTrace(s"Task $taskAttemptId trying to acquire read lock for $blockId")
logTrace(s"Task $taskAttemptId acquired read lock for $blockId")
logTrace(s"Task $taskAttemptId trying to acquire write lock for $blockId")
logTrace(s"Task $taskAttemptId acquired write lock for $blockId")
logTrace(s"Task $taskAttemptId downgrading write lock for $blockId")
logTrace(s"Task $taskAttemptId releasing lock for $blockId")
logTrace(s"Task $taskAttemptId trying to remove block $blockId")
logInfo(s"Task ${TaskContext.get().taskAttemptId} force spilling in-memory map to disk " +
logInfo(s"Task ${context.taskAttemptId} force spilling in-memory map to disk and " +
logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) has entered " +
logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) waiting " +
logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) finished " +
logInfo(s"Task $taskAttemptId from Stage $stageId(Attempt $stageAttemptNumber) failed " +
```
Sounds good!
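To illustrate why a shared `Task <id>` prefix helps, here is a minimal Rust sketch of tagging a message and then pulling every line for one task out of a combined log. Both `task_message` and `lines_for_task` are hypothetical helpers written for this illustration and are not part of Comet or Spark. The filter compares whole tokens, which is an assumption about how one would want to search; a plain `grep "Task 1234"` would also match `Task 12345`.

```rust
/// Hypothetical helper: prefix a message with the Spark-style "Task <id>" tag.
fn task_message(task_attempt_id: i64, detail: &str) -> String {
    format!("Task {task_attempt_id} {detail}")
}

/// Hypothetical helper: keep only the log lines that mention exactly this task.
/// Comparing whole whitespace-separated tokens keeps "Task 1234" from
/// matching a line about "Task 12345".
fn lines_for_task<'a>(log: &'a str, task_attempt_id: i64) -> Vec<&'a str> {
    let id = task_attempt_id.to_string();
    log.lines()
        .filter(|line| {
            let tokens: Vec<&str> = line.split_whitespace().collect();
            // Look for the token "Task" immediately followed by the id.
            tokens.windows(2).any(|w| w[0] == "Task" && w[1] == id)
        })
        .collect()
}
```

With the same prefix on both sides, a line produced by `task_message(1234, ...)` in native code and a Spark line such as `Task 1234 force spilling in-memory map to disk` are both returned by `lines_for_task(log, 1234)`.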
| { | ||
| panic!("overflow when releasing {size} of {prev} bytes"); | ||
| panic!( | ||
| "Task {} overflow when releasing {size} of {prev} bytes", |
| "Task {} overflow when releasing {size} of {prev} bytes", | |
| "Task ID {} overflow when releasing {size} of {prev} bytes", |
| return Err(resources_datafusion_err!( | ||
| "Failed to acquire {} bytes, only got {}. Reserved: {}", | ||
| "Task {} failed to acquire {} bytes, only got {}. Reserved: {}", |
| "Task {} failed to acquire {} bytes, only got {}. Reserved: {}", | |
| "Task ID {} failed to acquire {} bytes, only got {}. Reserved: {}", |
| { | ||
| return Err(resources_datafusion_err!( | ||
| "Failed to acquire {} bytes due to overflow. Reserved: {}", | ||
| "Task {} failed to acquire {} bytes due to overflow. Reserved: {}", |
| "Task {} failed to acquire {} bytes due to overflow. Reserved: {}", | |
| "Task ID {} failed to acquire {} bytes due to overflow. Reserved: {}", |
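The Rust hunks above all follow one pattern: the pool knows its task attempt ID and includes it in every acquire and release failure message. The sketch below shows that pattern in isolation. `TaskPool` is a hypothetical stand-in and not the Comet memory pool, the `granted` parameter is an assumed way to model the host granting fewer bytes than requested, and the methods return `Result<(), String>` instead of panicking or building a DataFusion error so the messages are easy to inspect.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a per-task memory pool (not the Comet type).
struct TaskPool {
    task_attempt_id: i64,
    used: AtomicUsize,
}

impl TaskPool {
    fn new(task_attempt_id: i64) -> Self {
        Self { task_attempt_id, used: AtomicUsize::new(0) }
    }

    /// Reserve `size` bytes, given that the host granted `granted` bytes.
    /// (`granted` is an assumption used to model a short allocation.)
    fn acquire(&self, size: usize, granted: usize) -> Result<(), String> {
        if granted < size {
            return Err(format!(
                "Task {} failed to acquire {} bytes, only got {}. Reserved: {}",
                self.task_attempt_id,
                size,
                granted,
                self.used.load(Ordering::Relaxed)
            ));
        }
        // checked_add returns None on overflow, which makes fetch_update
        // return Err with the value that was stored at the time.
        self.used
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |old| old.checked_add(size))
            .map(|_| ())
            .map_err(|prev| {
                format!(
                    "Task {} failed to acquire {} bytes due to overflow. Reserved: {}",
                    self.task_attempt_id, size, prev
                )
            })
    }

    /// Return `size` bytes; fails if more is released than is reserved.
    fn release(&self, size: usize) -> Result<(), String> {
        self.used
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |old| old.checked_sub(size))
            .map(|_| ())
            .map_err(|prev| {
                format!(
                    "Task {} overflow when releasing {} of {} bytes",
                    self.task_attempt_id, size, prev
                )
            })
    }
}
```

Storing the ID on the pool once means every message gets the prefix without each call site having to pass it in.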
| // Returns the actual amount of memory (in bytes) granted. | ||
| public long acquireMemory(long size) { | ||
| if (logger.isTraceEnabled()) { | ||
| logger.trace("Task {} requested {} bytes", taskAttemptId, size); |
| logger.trace("Task {} requested {} bytes", taskAttemptId, size); | |
| logger.trace("Task ID {} requested {} bytes", taskAttemptId, size); |
| long newUsed = used.addAndGet(acquired); | ||
| if (acquired < size) { | ||
| logger.warn( | ||
| "Task {} requested {} bytes but only received {} bytes. Current allocation is {} and " |
| "Task {} requested {} bytes but only received {} bytes. Current allocation is {} and " | |
| "Task ID {} requested {} bytes but only received {} bytes. Current allocation is {} and " |
| // Called by Comet native through JNI | ||
| public void releaseMemory(long size) { | ||
| if (logger.isTraceEnabled()) { | ||
| logger.trace("Task {} released {} bytes", taskAttemptId, size); |
| logger.trace("Task {} released {} bytes", taskAttemptId, size); | |
| logger.trace("Task ID {} released {} bytes", taskAttemptId, size); |
| if (newUsed < 0) { | ||
| logger.error( | ||
| "Used memory is negative: " + newUsed + " after releasing memory chunk of: " + size); | ||
| "Task {} used memory is negative ({}) after releasing {} bytes", |
| "Task {} used memory is negative ({}) after releasing {} bytes", | |
| "Task ID {} used memory is negative ({}) after releasing {} bytes", |
parthchandra
left a comment
lgtm
Which issue does this PR close?
Part of #2453
Rationale for this change
This additional information in log messages and exceptions has been helpful to me in tracking down memory issues. I think it could be helpful in the future.
What changes are included in this PR?
How are these changes tested?