Skip to content

Conversation

deejgregor
Copy link
Contributor

@deejgregor deejgregor commented Jul 27, 2025

What Does This Do

This adds a pool.waiting span when waiting for an available database connection from a Hikari (>= 2.4.0) or DBCP2 (>= 2.10.0) connection pool.

Motivation

  1. Currently, request delays due to database connection pool exhaustion cannot be definitively diagnosed using traces alone. A significant gap between a parent span and a database query span can be a clear indicator this might be the problem, but it is not definitive, nor is it readily apparent to most users viewing a trace. Metrics or wall clock profiles might show the issue (1) if they are available and (2) users know to look there.
  2. Proactively monitor for connection pool shortages with a trace analytics monitor (or possibly trace metrics?).

Additional Notes

Is this a contribution you would be interested in?

It would definitely be helpful to us at $WORK if this could be in the tracer, as we use connection pools heavily and delays due to exhausted connection pools tend to be not be quick to identify.

Concerns

For modern versions of HIkari, the instrumentation to determine blocking in HikariBlockedTrackingSynchronousQueue replaces the SynchronousQueue created by ConcurrentBag.<init>, and assumes that true is always passed when creating the SynchronousQueue. This sets the fairness policy and it has been set this way since 2017, so the likelihood of it changing in future versions of HIkari is very low.

Instrumented Code

Hikari

There is a common method, ConcurrentBag.borrow, that is used for both blocking and non-blocking paths. This method is instrumented along with deeper methods that indicate when the blocking path was taken. Depending on the Hikari version, there are two different downstream methods that are instrumented. A ThreadLocal is used with the help of HikariBlockedTracker to communicate between the various instrumentations when the blocking path was taken. Apool.waiting span is created when ConcurrentBag.borrow returns, but only if blocking occurred.

Apache DBCP2

In commons-pool >= 2.10.0 and later, this is straightforward, as LinkedBlockingDeque.pollFirst(Duration) is called only in blocking cases by GenericObjectPool.borrowObject from a number of places in dbcp2. A CallDepthThreadLocalMap is used to ensure a span is only created when LinkedBlockingDeque.pollFirst is called from dbcp2 and a test method is added to JDBCInstrumentationTestBase to ensure this is the case.

This code can be extended to work with earlier versions of commons-pool, however determining the blocking signature of poolFirst will require a little bit more work and might require differentiating between versions of commons-pool to make sure only the appropriate signature is instrumented (or looking for another way to determine blocking).

Test cases

A test method was added in JDBCInstrumentationTestBase.

Additional work

Although this seems to work functionally in with the included test, there are a few items still being worked on:

  • Make sure commons-pool LinkedBlockingDeque span is only created when called from dbcp2 (use CallDepth?)
  • Spotless, code guidelines
  • Testing in deployed environments with Hikari
  • Testing in deployed environments with DBCP2
  • Review component names
  • Review package structure
  • Review configuration (it will be disabled by default)
  • Clarify which versions of common-dbcp / common-pool are supported (e.g.: dbcp 2.10.0 requires commons-pool 2.11.1)
  • Docs?
  • Generate a trace metric?

Other considerations

  • Should downstream getConnection() calls to database drivers be traced from connection pools to indicate when a request delay is caused by new connection is being opened? This might be particularly helpful for forking database backends like PostgreSQL that have longer connection times.
  • Should other connection pools be added?
  • Should earlier versions of dbcp2 be supported?

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@deejgregor deejgregor force-pushed the hikaricp-blocked-tracking branch from df910b9 to fec7eac Compare July 27, 2025 15:54
@deejgregor
Copy link
Contributor Author

Also: see support case 2199108

@ValentinZakharov
Copy link
Contributor

Thanks for the contribution @deejgregor!
Please let me review the proposed changes shortly and get back to you with feedback

@deejgregor
Copy link
Contributor Author

I've done a little bit more experimenting and I've found a few things along the way, so I wanted to provide an update and ask a few questions.

First, my update: I simplified (I think) HikariConcurrentBagInstrumentation using an AsmVisitorWrapper to identify when a blocking call is happening by tweaking the method body of ConcurrentBag.borrow. Everyone needs to learn JVM bytecode at some point in their life, right? This eliminates HikariBlockedTrackingSynchronousQueue and HikariQueuedSequenceSynchronizerInstrumentation. I also moved HikariWaitingTracker into the instrumentation class--I don't know if that's the best way to do that or not, but it is no longer used by another other instrumentation classes, so I thought I'd do it for now and check if that's a good way to go.

My questions:

  1. Do you have a preference between (a) editing the borrow method, or (b) going back to the previous method advice?
  2. It looks like I should probably try to merge the test case I added with JDBCInstrumentationTestBase (I have no idea how I missed this first time around).
  3. Any thoughts/updates/gut feels from your side on my overall approach? Any guidance at this point?
  4. How to configure this? (see below)

Configuration options and tracing

Today

jdbc-datasource not enabled (default)

Database pool and database driver connection operations are not visible: no database.connection spans.

jdbc-datasource enabled

database.connection spans for all calls to any DataSource.getConnection implementation, regardless of whether they happen on a pool or a database driver talking to a real database. A request satisfied from a pool will have a single database.connection call to the pool (or more, as I've seen with dbcp2, I think). Waiting on the pool woudl be noticeable based on the duration of that span into the pool. A request satisfied through the pool when a new connection is added will have an additional span from the database driver when it connects to the real database.

Future

I wanted to propose an option or two of how to configure this, but I haven't been terribly happy with anything I came up with.

Here are my main thoughts:

  1. I'd like to get the pool.waiting spans (or another clear indication of connection pool blocking), preferably by default.
  2. I'd prefer to not get database.connection spans for getConnection calls into the connection pool, at least when they are quick and non-blocking.
  3. I would like to get database.connection spans for getConnection calls to a real database DataSource--anything made from the pool to the database or anything that doesn't go through a pool. Those should maybe even have a span.kind of client.
  4. APM -> DBM integration still works. The documentation says that jdbc-datasource integration needs to be enabled, but from looking at the code, I can't tell why.

That combination gets me three things:

  1. It doesn't create any extra spans for the happy path (quickly getting a connection from the pool).
  2. It creates a span when a new connection is made to a real database.
  3. It creates a span if we have to wait for a connection from the pool to free up.

One option could be to just enable the jdbc-datasource integration everywhere for us, throw away this PR (but treat it as a good learning experience), accept the extra database.connection spans, and use duration to try to catch pool exhaustion. I wasn't a big fan of this before, but now that I'm struggling with how to configure this to achieve what I listed above, I'm beginning to accept this option. :-/

@deejgregor deejgregor force-pushed the hikaricp-blocked-tracking branch 2 times, most recently from 9515216 to a9a65c0 Compare August 27, 2025 17:27
Copy link
Contributor

@ValentinZakharov ValentinZakharov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, let me check your latest improvements...

@ValentinZakharov
Copy link
Contributor

Thanks a lot for this contribution - the pool.waiting span is a very valuable addition and helps explain connection pool exhaustion delays much more clearly. A few comments:

  1. ASM vs Advice
    We’d prefer to stick with the advice-based approach. Rewriting the body of ConcurrentBag.borrow via ASM makes the integration brittle against future Hikari upgrades. The accuracy gain doesn’t seem worth the long-term maintenance cost. If possible, please revert to the previous advice/thread-local tracking solution.

  2. Tests
    It would be great to merge SaturatedPoolBlockingTest into JDBCInstrumentationTestBase

  3. DBCP2 instrumentation
    I have one concern about the current approach: since we’re intercepting LinkedBlockingDeque.pollFirst(Duration) (a JDK class), there’s a risk of creating pool.waiting spans for unrelated usages of LinkedBlockingDeque outside of DBCP2.
    Should we add any stronger guard checks (e.g., verifying the call comes from within DBCP2 packages), or do you consider the existing safeguards sufficient?

  4. Configuration
    The proposed configuration looks good to me.

I really like your vision for future improvements - especially the idea of adding more pool-level metrics (e.g. provider, max size, active usage). That would make the traces much more insightful.

I also share your concern about generating a large number of short spans in high-traffic systems. Sampling or aggregating metrics instead of emitting every single pool.waiting span could be a good direction.

One question for you: do you prefer to merge the current changes as they are (so we can start getting value from pool.waiting spans right away), and then work on the additional logic in a follow-up PR? Or would you rather extend this PR with those improvements before merging?

- Use dd.trace.experimental.jdbc.pool.waiting.enabled=true to enable.
- Change instrumentation name from jdbc-datasource to jdbc.
- Record exceptions.
- Use default instrumentation name (jdbc).
- Set resource name to {dbcp2,hikari}.waiting
@deejgregor deejgregor force-pushed the hikaricp-blocked-tracking branch from a9a65c0 to b5e714a Compare August 31, 2025 18:57
@deejgregor
Copy link
Contributor Author

I just pushed updates that I think should address the first two items below.

  1. ASM vs Advice
    We’d prefer to stick with the advice-based approach. Rewriting the body of ConcurrentBag.borrow via ASM makes the integration brittle against future Hikari upgrades. The accuracy gain doesn’t seem worth the long-term maintenance cost. If possible, please revert to the previous advice/thread-local tracking solution.

Done.

  1. Tests
    It would be great to merge SaturatedPoolBlockingTest into JDBCInstrumentationTestBase

Done.

  1. DBCP2 instrumentation
    I have one concern about the current approach: since we’re intercepting LinkedBlockingDeque.pollFirst(Duration) (a JDK class), there’s a risk of creating pool.waiting spans for unrelated usages of LinkedBlockingDeque outside of DBCP2.
    Should we add any stronger guard checks (e.g., verifying the call comes from within DBCP2 packages), or do you consider the existing safeguards sufficient?

Good news!

  1. The LinkedBlockingDeque here is not a JDK class--it's Apache Pool 2's class. It's either org.apache.commons.pool2.impl.LinkedBlockingDeque or org.apache.tomcat.dbcp.pool2.impl.LinkedBlockingDeque.
  2. There is already a guard check in Dbcp2LinkedBlockingDequeInstrumentation using a CallDepthThreadLocalMap paired with the other Dbcp2*Instrumentation classes to make sure the calls come from DBCP2, and there is a test in JDBCInstrumentationTestBase, as well. When you suggest adding a stronger guard check, what are you thinking of? (I apologize, I'm not too familiar with the options here)

I also share your concern about generating a large number of short spans in high-traffic systems. Sampling or aggregating metrics instead of emitting every single pool.waiting span could be a good direction.

The more that I think of it, I've gotten to be pretty unconcerned about creating pool.waiting spans, since they indicate a wait condition is happening instead of other work. Based on the issues I've seen when pools run out of capacity, the impact of another span is minuscule compared to the inbound request backups and increase in worker threads in the impacted app.

@deejgregor deejgregor marked this pull request as ready for review August 31, 2025 19:20
@deejgregor deejgregor requested review from a team as code owners August 31, 2025 19:20
@deejgregor deejgregor requested a review from mcculls August 31, 2025 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants