Skip to content

Add a span when waiting for an available database connection from a pool #9251

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

deejgregor
Copy link

@deejgregor deejgregor commented Jul 27, 2025

What Does This Do

This adds a pool.waiting span when waiting for an available database connection from a Hikari (>= 2.4.0) or DBCP2 (>= 2.10.0) connection pool.

Motivation

  1. Currently, request delays due to database connection pool exhaustion cannot be definitively diagnosed using traces alone. A significant gap between a parent span and a database query span can be a clear indicator this might be the problem, but it is not definitive, nor is it readily apparent to most users viewing a trace. Metrics or wall clock profiles might show the issue (1) if they are available and (2) users know to look there.
  2. Proactively monitor for connection pool shortages with a trace analytics monitor (or possibly trace metrics?).

Additional Notes

Is this a contribution you would be interested in?

It would definitely be helpful to us at $WORK if this could be in the tracer, as we use connection pools heavily and delays due to exhausted connection pools tend to be not be quick to identify.

Concerns

For modern versions of HIkari, the instrumentation to determine blocking in HikariBlockedTrackingSynchronousQueue replaces the SynchronousQueue created by ConcurrentBag.<init>, and assumes that true is always passed when creating the SynchronousQueue. This sets the fairness policy and it has been set this way since 2017, so the likelihood of it changing in future versions of HIkari is very low.

Instrumented Code

Hikari

There is a common method, ConcurrentBag.borrow, that is used for both blocking and non-blocking paths. This method is instrumented along with deeper methods that indicate when the blocking path was taken. Depending on the Hikari version, there are two different downstream methods that are instrumented. A ThreadLocal is used with the help of HikariBlockedTracker to communicate between the various instrumentations when the blocking path was taken. Apool.waiting span is created when ConcurrentBag.borrow returns, but only if blocking occurred.

Apache DBCP2

In commons-pool >= 2.10.0 and later, this is straightforward, as LinkedBlockingDeque.pollFirst(Duration) is called only in blocking cases by GenericObjectPool.borrowObject from a number of places in dbcp2. Additional work needs to be done to ensure that a span is only created when LinkedBlockingDeque.pollFirst is called from dbcp2. This is done.

This code can be extended to work with earlier versions of commons-pool, however determining the blocking signature of poolFirst will require a little bit more work and might require differentiating between versions of commons-pool to make sure only the appropriate signature is instrumented (or looking for another way to determine blocking).

Test cases

A test class was added in https://github.com/deejgregor/dd-trace-java/blob/hikaricp-blocked-tracking/dd-java-agent/instrumentation/jdbc/src/test/groovy/SaturatedPoolBlockingTest.groovy.

Additional work

Although this seems to work functionally in with the included test, there are a few items still being worked on:

  • Make sure commons-pool LinkedBlockingDeque span is only created when called from dbcp2 (use CallDepth?)
  • Spotless, code guidelines
  • Testing in deployed environments with Hikari
  • Testing in deployed environments with DBCP2
  • Review component names
  • Review package structure
  • Review configuration (should it be enabled by default?)
  • Clarify which versions of common-dbcp / common-pool are supported (e.g.: dbcp 2.10.0 requires commons-pool 2.11.1)
  • Docs?
  • Generate a trace metric?

Other considerations

  • Should downstream getConnection() calls to database drivers be traced from connection pools to indicate when a request delay is caused by new connection is being opened? This might be particularly helpful for forking database backends like PostgreSQL that have longer connection times.
  • Should other connection pools be added?
  • Should earlier versions of dbcp2 be supported?

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@deejgregor deejgregor force-pushed the hikaricp-blocked-tracking branch from df910b9 to fec7eac Compare July 27, 2025 15:54
@deejgregor
Copy link
Author

Also: see support case 2199108

@ValentinZakharov
Copy link
Contributor

Thanks for the contribution @deejgregor!
Please let me review the proposed changes shortly and get back to you with feedback

@deejgregor
Copy link
Author

I've done a little bit more experimenting and I've found a few things along the way, so I wanted to provide an update and ask a few questions.

First, my update: I simplified (I think) HikariConcurrentBagInstrumentation using an AsmVisitorWrapper to identify when a blocking call is happening by tweaking the method body of ConcurrentBag.borrow. Everyone needs to learn JVM bytecode at some point in their life, right? This eliminates HikariBlockedTrackingSynchronousQueue and HikariQueuedSequenceSynchronizerInstrumentation. I also moved HikariWaitingTracker into the instrumentation class--I don't know if that's the best way to do that or not, but it is no longer used by another other instrumentation classes, so I thought I'd do it for now and check if that's a good way to go.

My questions:

  1. Do you have a preference between (a) editing the borrow method, or (b) going back to the previous method advice?
  2. It looks like I should probably try to merge the test case I added with JDBCInstrumentationTestBase (I have no idea how I missed this first time around).
  3. Any thoughts/updates/gut feels from your side on my overall approach? Any guidance at this point?
  4. How to configure this? (see below)

Configuration options and tracing

Today

jdbc-datasource not enabled (default)

Database pool and database driver connection operations are not visible: no database.connection spans.

jdbc-datasource enabled

database.connection spans for all calls to any DataSource.getConnection implementation, regardless of whether they happen on a pool or a database driver talking to a real database. A request satisfied from a pool will have a single database.connection call to the pool (or more, as I've seen with dbcp2, I think). Waiting on the pool woudl be noticeable based on the duration of that span into the pool. A request satisfied through the pool when a new connection is added will have an additional span from the database driver when it connects to the real database.

Future

I wanted to propose an option or two of how to configure this, but I haven't been terribly happy with anything I came up with.

Here are my main thoughts:

  1. I'd like to get the pool.waiting spans (or another clear indication of connection pool blocking), preferably by default.
  2. I'd prefer to not get database.connection spans for getConnection calls into the connection pool, at least when they are quick and non-blocking.
  3. I would like to get database.connection spans for getConnection calls to a real database DataSource--anything made from the pool to the database or anything that doesn't go through a pool. Those should maybe even have a span.kind of client.
  4. APM -> DBM integration still works. The documentation says that jdbc-datasource integration needs to be enabled, but from looking at the code, I can't tell why.

That combination gets me three things:

  1. It doesn't create any extra spans for the happy path (quickly getting a connection from the pool).
  2. It creates a span when a new connection is made to a real database.
  3. It creates a span if we have to wait for a connection from the pool to free up.

One option could be to just enable the jdbc-datasource integration everywhere for us, throw away this PR (but treat it as a good learning experience), accept the extra database.connection spans, and use duration to try to catch pool exhaustion. I wasn't a big fan of this before, but now that I'm struggling with how to configure this to achieve what I listed above, I'm beginning to accept this option. :-/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants