@mattkduran mattkduran commented Aug 3, 2025

Description of PR

The ABFS driver's auto-throttling feature (fs.azure.enable.autothrottling=true) creates Timer threads in AbfsClientThrottlingAnalyzer that are never cancelled. The resulting thread leak grows with every ABFS client created and eventually causes OutOfMemoryError in long-running applications such as Hive Metastore.

Impact:

  • Thread count grows indefinitely (observed >100,000 timer threads)
  • Affects any long-running service that creates multiple ABFS filesystem instances

Root Cause:

AbfsClientThrottlingAnalyzer creates Timer objects in its constructor but provides no mechanism to cancel them. When AbfsClient instances are closed, the associated timer threads continue running indefinitely.
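
To illustrate the pattern described above, here is a minimal sketch that reproduces it outside Hadoop. The class names (`LeakyAnalyzerSketch`, `TimerLeakDemo`) and the thread-name prefix are hypothetical and are not the actual ABFS code; the sketch only assumes that each analyzer owns a `java.util.Timer` with a repeating task and never calls `cancel()`.

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical stand-in for an analyzer that starts a Timer and never cancels it. */
class LeakyAnalyzerSketch {
  private final Timer timer;

  LeakyAnalyzerSketch(String threadName) {
    // Each Timer owns one background thread, started in the Timer constructor.
    this.timer = new Timer(threadName, true);
    // A repeating task keeps the timer's queue non-empty, so the thread keeps
    // running even after the last reference to this object is dropped.
    this.timer.schedule(new TimerTask() {
      @Override
      public void run() {
        // periodic analysis would happen here
      }
    }, 1000L, 1000L);
  }
}

class TimerLeakDemo {
  /** Counts live threads whose name starts with the given prefix. */
  static long countThreads(String prefix) {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith(prefix))
        .count();
  }

  /** Creates n analyzers, drops them, and returns how many timer threads remain. */
  static long leak(String prefix, int n) {
    for (int i = 0; i < n; i++) {
      new LeakyAnalyzerSketch(prefix + i); // reference dropped immediately
    }
    return countThreads(prefix);
  }
}
```

Calling `TimerLeakDemo.leak("leak-demo-", 3)` leaves three timer threads alive, one per discarded analyzer, since nothing ever cancels them.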

Solution

Implement proper resource cleanup by making the throttling components implement Closeable and ensuring timers are cancelled when ABFS clients are closed.

Changes Made

  1. AbfsClientThrottlingAnalyzer.java
  • Added: implements Closeable
  • Added: close() method that calls timer.cancel() and timer.purge()
  • Purpose: Ensures timer threads are properly terminated when analyzer is no longer needed
  2. AbfsThrottlingIntercept.java (Interface)
  • Added: extends Closeable
  • Added: close() method signature
  • Purpose: Establishes cleanup contract for all throttling intercept implementations
  3. AbfsClientThrottlingIntercept.java
  • Added: close() method that closes both readThrottler and writeThrottler
  • Purpose: Coordinates cleanup of both read and write throttling analyzers
  4. AbfsNoOpThrottlingIntercept.java
  • Added: No-op close() method
  • Purpose: Satisfies interface contract for no-op implementation
  5. AbfsClient.java
  • Added: IOUtils.cleanupWithLogger(LOG, intercept) in existing close() method
  • Purpose: Integrates throttling cleanup into existing client resource management
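
The shape of these changes can be sketched roughly as follows. This is a simplified illustration, not the patch itself: `AnalyzerSketch`, `InterceptSketch`, and `CleanupSketch` are hypothetical names, and the real classes carry throttling logic that is omitted here. Only the `readThrottler` and `writeThrottler` field names and the `timer.cancel()` / `timer.purge()` calls come from the description above.

```java
import java.io.Closeable;
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical analyzer that owns a Timer and releases it on close(). */
class AnalyzerSketch implements Closeable {
  private final Timer timer;

  AnalyzerSketch(String threadName) {
    this.timer = new Timer(threadName, true);
    this.timer.schedule(new TimerTask() {
      @Override
      public void run() {
        // periodic analysis would happen here
      }
    }, 1000L, 1000L);
  }

  /** Safe to call repeatedly: Timer.cancel() has no effect after the first call. */
  @Override
  public void close() {
    timer.cancel(); // stops the timer thread and discards scheduled tasks
    timer.purge();  // removes any cancelled tasks still held in the queue
  }
}

/** Hypothetical intercept that coordinates cleanup of both analyzers. */
class InterceptSketch implements Closeable {
  private final AnalyzerSketch readThrottler;
  private final AnalyzerSketch writeThrottler;

  InterceptSketch(String prefix) {
    this.readThrottler = new AnalyzerSketch(prefix + "read");
    this.writeThrottler = new AnalyzerSketch(prefix + "write");
  }

  @Override
  public void close() {
    readThrottler.close();
    writeThrottler.close();
  }
}

class CleanupSketch {
  /** Counts live threads whose name starts with the given prefix. */
  static long countThreads(String prefix) {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith(prefix))
        .count();
  }

  /** Polls until no matching thread is alive; returns false on timeout. */
  static boolean awaitNoThreads(String prefix, long timeoutMillis) {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (countThreads(prefix) > 0) {
      if (System.currentTimeMillis() > deadline) {
        return false;
      }
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }
}
```

The timer thread exits shortly after `cancel()` rather than instantly, which is why the helper polls instead of checking the count once.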

https://github.com/mattkduran/ABFSleaktest
https://www.mail-archive.com/[email protected]/msg43483.html

How was this patch tested?

Standalone Validation Tool

This fix was validated using a standalone reproduction and testing tool that directly exercises the ABFS auto-throttling components outside of a full Hadoop deployment.
Repository: ABFSLeakTest

Testing Scope

  • Problem reproduction confirmed - demonstrates the timer thread leak
  • Fix validation confirmed - proves close() method resolves the leak
  • Resource cleanup verified - shows proper timer cancellation
  • Limited integration testing - standalone tool, not full Hadoop test suite

Test Results

Leak Reproduction Evidence

# Without fix: Timer threads accumulate over filesystem creation cycles
Cycle    Total Threads    ABFS Timer Threads    Status
1        50->52           0->2                  LEAK DETECTED
50       150->152         98->100               LEAK GROWING
200      250->252         398->400              LEAK CONFIRMED

Final Analysis: 400 leaked timer threads named "abfs-timer-client-throttling-analyzer-*"
Memory Impact: ~90MB additional heap usage

# Direct analyzer testing:
🔴 Without close(): +3 timer threads (LEAKED)
✅ With close():    +0 timer threads (NO LEAK)

Test Environment

  • Java Version: OpenJDK 11.0.x
  • Hadoop Version: 3.3.6/3.4.1 (both affected)
  • Test Duration: 200 filesystem creation/destruction cycles
  • Thread Monitoring: JMX ThreadMXBean
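
For reference, counting timer threads through ThreadMXBean can be done along these lines. This is a sketch of the general approach rather than the code from the test tool; `ThreadCountProbe` is a hypothetical name, and the prefix passed in would be the thread-name prefix reported above ("abfs-timer-client-throttling-analyzer-").

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadCountProbe {
  /** Counts live threads whose name starts with the given prefix. */
  static long countByPrefix(String prefix) {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    long count = 0;
    for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds())) {
      // getThreadInfo returns null entries for threads that have already exited.
      if (info != null && info.getThreadName().startsWith(prefix)) {
        count++;
      }
    }
    return count;
  }
}
```

Sampling `ThreadCountProbe.countByPrefix(...)` before and after each filesystem creation cycle gives the kind of before/after pairs shown in the results table.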

For code changes:

  • [ X ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus
💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 20m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 46m 41s trunk passed
+1 💚 compile 0m 42s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 36s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 32s trunk passed
+1 💚 mvnsite 0m 41s trunk passed
+1 💚 javadoc 0m 42s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 34s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 10s trunk passed
+1 💚 shadedclient 41m 9s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
-1 ❌ mvninstall 0m 21s /patch-mvninstall-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
-1 ❌ compile 0m 23s /patch-compile-hadoop-tools_hadoop-azure-jdkUbuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04.txt hadoop-azure in the patch failed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04.
-1 ❌ javac 0m 23s /patch-compile-hadoop-tools_hadoop-azure-jdkUbuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04.txt hadoop-azure in the patch failed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04.
-1 ❌ compile 0m 21s /patch-compile-hadoop-tools_hadoop-azure-jdkPrivateBuild-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09.txt hadoop-azure in the patch failed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09.
-1 ❌ javac 0m 21s /patch-compile-hadoop-tools_hadoop-azure-jdkPrivateBuild-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09.txt hadoop-azure in the patch failed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09.
-1 ❌ blanks 0m 0s /blanks-eol.txt The patch has 3 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 💚 checkstyle 0m 21s the patch passed
-1 ❌ mvnsite 0m 23s /patch-mvnsite-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
-1 ❌ javadoc 0m 22s /patch-javadoc-hadoop-tools_hadoop-azure-jdkUbuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04.txt hadoop-azure in the patch failed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04.
-1 ❌ javadoc 0m 26s /patch-javadoc-hadoop-tools_hadoop-azure-jdkPrivateBuild-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09.txt hadoop-azure in the patch failed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09.
-1 ❌ spotbugs 0m 22s /patch-spotbugs-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
+1 💚 shadedclient 44m 10s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 0m 25s /patch-unit-hadoop-tools_hadoop-azure.txt hadoop-azure in the patch failed.
+1 💚 asflicense 0m 36s The patch does not generate ASF License warnings.
160m 49s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7852/1/artifact/out/Dockerfile
GITHUB PR #7852
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 39b56ca5b682 5.15.0-143-generic #153-Ubuntu SMP Fri Jun 13 19:10:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / fee9861
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7852/1/testReport/
Max. process+thread count 536 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7852/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Contributor

@anujmodi2021 anujmodi2021 left a comment

Thanks for the patch and thorough testing of the issue @mattkduran
I have a few suggestions and comments. Please do take a look at them.

Also, we need at least one test (unit or integration) to be included in this patch. Can you plan for one? The idea is to have coverage of the code impacted here.

I also see a few PR checks failing. If you click the link on each -1 reported by hadoop-yetus, you should be able to see the issue and fix it.

Once all of this is done, we can wait for a few more reviews and get this checked in.

Thanks again for all the efforts.

@@ -26,7 +26,7 @@
*/
@InterfaceAudience.Private
@InterfaceStability.Unstable
-public interface AbfsThrottlingIntercept {
+public interface AbfsThrottlingIntercept extends Closable {

Shouldn't this be extends Closeable?

@@ -40,4 +42,12 @@ public void updateMetrics(final AbfsRestOperationType operationType,
public void sendingRequest(final AbfsRestOperationType operationType,
final AbfsCounters abfsCounters) {
}

/**
* No-op implementation of close method.

Nit: the Javadoc should include @throws.


/**
* Closes the throttling intercept and releases associated resources.
* This method closes both the read and write throttling analyzers.

Nit: the Javadoc should include @throws.

/**
* Closes the throttling analyzer and releases associated resources.
* This method cancels the internal timer and cleans up any pending timer tasks.
* It is safe to call this method multiple times.

Fix the Javadoc here as well.
