Skip to content

Make MetricsITBase tests serial instead of parallelizable (CASSJAVA-19) #2048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: 4.x
Choose a base branch
from

Conversation

Bouncheck
Copy link

@Bouncheck Bouncheck commented Jul 23, 2025

This change fixes random failures by removing MicrometerMetricsIT, MicroProfileMetricsIT and DropwizardMetricsIT from the ParallelizableTests category.
It is not the only solution but seems to be the simplest workaround.

The reasoning is as follows:
should_evict_down_node_metrics_when_timeout_fires() has two places where it manipulates AbstractMetricUpdater.MIN_EXPIRE_AFTER in order to set DefaultDriverOption.METRICS_NODE_EXPIRE_AFTER to 1 second during session initialization. Otherwise MIN_EXPIRE_AFTER would not allow for that and warning log would be printed about using higher value than what user wanted to set.
It lowers it before session initialization:

And sets it back to 5 mintues at the end of the test:
AbstractMetricUpdater.MIN_EXPIRE_AFTER = Duration.ofMinutes(5);
The code comment in AbstractMetricUpdater also mentions that this variable is intentionally not made final for testing purposes. I believe this is what causes the intermittent failures. When those three tests that extend this class run in parallel it is possible that test B sets the minimum back to 5 minutes after the test A lowers it to 1 second but before test A initializes its session.
The test waits around 10 seconds for the change to happen, but in such case it would need to wait for 5 minutes.
Warnings with this pattern should be visible when the test fails on the CI:

c.d.o.d.i.c.m.AbstractMetricUpdater - [s6] Value too low for advanced.metrics.node.expire-after: PT1S. Forcing to PT5M instead.

I can confirm that this is what was happening when I was testing this locally, but I did not try to search for the logs on cassandra-java-driver's CI.

This change fixes random failures by removing `MicrometerMetricsIT`,
`MicroProfileMetricsIT` and `DropwizardMetricsIT` from the
`ParallelizableTests` category.
It is not the only solution but seems to be the simplest workaround.

The reasoning is as follows:
`should_evict_down_node_metrics_when_timeout_fires()` has two places where it
manipulates `AbstractMetricUpdater.MIN_EXPIRE_AFTER` in order to set
`DefaultDriverOption.METRICS_NODE_EXPIRE_AFTER` to 1 second during session
initialization. Otherwise `MIN_EXPIRE_AFTER` would not allow for that and
warning log would be printed about using higher value than what user wanted to
set.
It lowers it before session initialization:
https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java#L157
And sets it back to 5 mintues at the end of the test:
https://github.com/apache/cassandra-java-driver/blob/17ebe6092e2877d8c524e07489c4c3d005cfeea5/integration-tests/src/test/java/com/datastax/oss/driver/core/metrics/MetricsITBase.java#L186
The code comment in AbstractMetricUpdater also mentions that this variable is
intentionally not made final for testing purposes.
I believe this is what causes the intermittent failures. When those three tests
that extend this class run in parallel it is possible that test B sets the
minimum back to 5 minutes after the test A lowers it to 1 second but before
test A initializes its session.
The test waits around 10 seconds for the change to happen, but in such case it
would need to wait for 5 minutes.
Warnings with this pattern should be visible when the test fails on the CI:
```
c.d.o.d.i.c.m.AbstractMetricUpdater - [s6] Value too low for advanced.metrics.node.expire-after: PT1S. Forcing to PT5M instead.
```
I can confirm that this is what was happening when I was testing this locally,
but I did not try to search for the logs on cassandra-java-driver's CI.
@Bouncheck Bouncheck force-pushed the upstream-4.x-CASSJAVA-19 branch from 771cc50 to 441389f Compare July 23, 2025 10:11
@Bouncheck
Copy link
Author

Bouncheck commented Jul 23, 2025

Here's also a handy script for reproducing this issue. I think I needed around 20 runs on version before the change to trigger the bug. (note: exit code 0 does not mean no test failed in general, but here it's sufficient)

#!/bin/bash

mvn clean install -P fast

echo "Running tests in a loop until failure detection..."
counter=1

while true; do
    echo "=============================================="
    echo "Attempt #$counter"
    echo "=============================================="

    # Run the test command and show output in real-time while also capturing it
    mvn -pl integration-tests integration-test -Dit.test='MicrometerMetricsIT, MicroProfileMetricsIT, DropwizardMetricsIT' | tee /tmp/maven_output.log

    # Capture the exit status of the tee command (which will be the exit status of the Maven command)
    maven_status=${PIPESTATUS[0]}

    # Check for explicit error containing the method name in the captured output
    if grep -q "\[ERROR\].*should_evict_down_node_metrics_when_timeout_fires" /tmp/maven_output.log; then
        echo ""
        echo "Found test failure on attempt #$counter"
        echo "Failing test output:"
        grep -A 5 -B 2 "should_evict_down_node_metrics_when_timeout_fires" /tmp/maven_output.log
        break
    fi

    # Also check standard Maven error exit code
    if [ $maven_status -ne 0 ]; then
        echo ""
        echo "Test command failed on attempt #$counter, but the specific error wasn't found."
        break
    fi

    echo ""
    echo "Test passed. Continuing..."
    ((counter++))

    # Optional: small delay between runs
    sleep 1
done

echo ""
echo "Tests failed after $counter attempts."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant