Pull Request: Fix Unreliable CI Acceptance Tests

Overview

This PR addresses the reliability issues in Brighter's CI acceptance tests for MessagingGateways, Inboxes, and Outboxes. The tests were exhibiting erratic behavior in GitHub Actions, often failing due to timing-related issues that were difficult to reproduce locally.

Problem

The CI build (.github/workflows/ci.yml) runs acceptance tests that require various middleware (Kafka, RabbitMQ, Redis, MQTT) and databases (PostgreSQL, MySQL, SQL Server, MongoDB, DynamoDB). These tests frequently failed in CI but worked locally, suggesting infrastructure timing issues rather than code bugs.

Root Causes

Missing Service Health Checks: Many services lacked health checks, causing tests to start before services were ready
Inadequate Retry Counts: Health checks had too few retries (3-5) for CI environment variability
Kafka Blind Wait: A fixed 30-second sleep instead of active readiness verification
Short Job Timeouts: 5-minute timeouts were insufficient for slower CI environments
No Documentation: No record of known issues or improvement strategies

Solution

1. Comprehensive Service Health Checks

Added or improved health checks for all services in .github/workflows/ci.yml:

Service	Health Check Method	Retries	Max Wait
Kafka	`kafka-broker-api-versions`	15	~150s
Zookeeper	`nc localhost 2181` check	10	~100s
Schema Registry	HTTP endpoint	10	~100s
RabbitMQ	`rabbitmqctl node_health_check`	10 ↑	~100s
Redis	`redis-cli ping`	10 ↑	~100s
PostgreSQL	`pg_isready`	10 ↑	~100s
MySQL/MariaDB	`healthcheck.sh`	10 ↑	~100s
SQL Server	`sqlcmd` query	10	~100s
MongoDB	`mongosh` ping	15 ↑	~300s
MQTT	`mosquitto_sub`	10	~100s
DynamoDB	HTTP endpoint	10	~100s

(↑ indicates increased from previous value)

2. Active Kafka Readiness Verification

Before:

- name: Sleep to let Kafka spin up
  uses: jakejarvis/wait-action@master
  with:
    time: '30s'

After:

- name: Wait for Kafka to be ready
  run: |
    max_attempts=30
    while [ $attempt -lt $max_attempts ]; do
      if kafkacat -b localhost:9092 -L > /dev/null 2>&1; then
        echo "Kafka is ready!"
        break
      fi
      sleep 2
    done

Benefits:

Active verification instead of blind waiting
Can complete in <30s if Kafka is ready quickly
Can wait up to 60s if needed
Fails fast with clear error message
Better troubleshooting through logging

3. Increased Job Timeouts

All acceptance test jobs increased from 5 minutes → 8 minutes:

redis-ci, mqtt-ci, rabbitmq-ci
postgres-ci, sqlserver-ci, mysql-ci
dynamo-ci, localstack-ci, mongodb-ci
aws-ci, aws-scheduler-ci, azure-ci
sqlite-ci, gcp-ci

Note: kafka-ci already had 20 minutes (unchanged)

4. Comprehensive Documentation

Added two documentation files:

docs/CI-Test-Reliability.md (171 lines)
- Detailed analysis of issues
- Explanation of all improvements
- Recommendations for future work
- Success metrics to track
RELIABILITY-IMPROVEMENTS-SUMMARY.md (140 lines)
- Executive summary
- Quick reference tables
- Validation and rollback plans
- Next steps

Impact

Expected Improvements

✅ Service Startup Reliability

Services verified ready before tests run
Longer retry windows accommodate CI variability

✅ Kafka Test Reliability

Active readiness check replaces blind wait
Better handling of slow Kafka initialization

✅ Test Execution Success

8-minute timeouts prevent premature cancellation
Tests have adequate time to complete

✅ Maintainability

Documentation preserves knowledge
Clear patterns for future improvements

Tests Expected to Improve

Kafka tests (13 marked [Trait("Fragile", "CI")])
RabbitMQ tests (6 marked fragile)
Database initialization tests
Message broker timing tests

Testing & Validation

Pre-merge Validation

✅ CI workflow YAML syntax validated ✅ All changes backward compatible ✅ No test or application code modified ✅ Changes isolated to infrastructure

Post-merge Monitoring

Recommended monitoring approach:

Track Pass Rates: Monitor CI success over 10+ builds
Measure Duration: Check if 8-minute timeouts are adequate
Review Logs: Verify Kafka readiness checks succeed
Re-enable Tests: Gradually remove "Fragile" trait from stable tests
Optimize Timeouts: Adjust based on observed durations

Success Metrics

Metric	Current	Target
CI Pass Rate	TBD	>95%
False Positive Rate	TBD	<5%
Kafka Startup Time	30s (blind)	10-30s (actual)
Job Duration (P95)	TBD	<7 min

Rollback Plan

If issues arise:

Revert the 4 commits from this PR
All changes are in CI config and docs only
No code changes require rollback
Previous behavior fully restored

Future Work

Short Term

Monitor CI success rates
Remove "Fragile" trait from stable tests
Optimize timeout values based on data

Medium Term

Implement retry helpers in tests
Add test-level readiness checks
Use environment variables for CI-specific timeouts

Long Term

Improve test fixtures with proper initialization
Re-evaluate test parallelization restrictions
Create CI-specific test configuration profiles

Files Changed

.github/workflows/ci.yml             | 94 insertions(+), 26 deletions(-)
docs/CI-Test-Reliability.md          | 171 new file
RELIABILITY-IMPROVEMENTS-SUMMARY.md  | 140 new file

Total: 3 files changed, 379 insertions(+), 26 deletions(-)

Related Issues

[Chore] Unreliable Acceptance Tests - Original issue

Commits

6d99514 - Improve CI service health checks and remove fixed delays
c4b8f5c - Increase CI job timeouts to accommodate slower test environments
b3440bd - Add documentation for CI test reliability improvements
b1dea5e - Add comprehensive summary of reliability improvements

Reviewer Notes

Key Changes to Review

Health Check Additions: Verify health check commands are appropriate
Kafka Readiness Script: Review the bash logic for correctness
Timeout Values: Confirm 8 minutes is reasonable but not excessive
Documentation: Ensure documentation is accurate and helpful

Questions to Consider

Are the health check retry counts adequate?
Is the Kafka readiness script robust enough?
Should any tests be re-enabled immediately?
Are there other services that need health checks?

Conclusion

This PR makes infrastructure-only changes to improve CI test reliability without modifying any test or application code. The changes are conservative, well-documented, and easily reversible. The improvements address root causes rather than symptoms, providing a foundation for long-term reliability.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pull Request: Fix Unreliable CI Acceptance Tests

Overview

Problem

Root Causes

Solution

1. Comprehensive Service Health Checks

2. Active Kafka Readiness Verification

3. Increased Job Timeouts

4. Comprehensive Documentation

Impact

Expected Improvements

Tests Expected to Improve

Testing & Validation

Pre-merge Validation

Post-merge Monitoring

Success Metrics

Rollback Plan

Future Work

Short Term

Medium Term

Long Term

Files Changed

Related Issues

Commits

Reviewer Notes

Key Changes to Review

Questions to Consider

Conclusion

FilesExpand file tree

PR-DESCRIPTION.md

Latest commit

History

PR-DESCRIPTION.md

File metadata and controls

Pull Request: Fix Unreliable CI Acceptance Tests

Overview

Problem

Root Causes

Solution

1. Comprehensive Service Health Checks

2. Active Kafka Readiness Verification

3. Increased Job Timeouts

4. Comprehensive Documentation

Impact

Expected Improvements

Tests Expected to Improve

Testing & Validation

Pre-merge Validation

Post-merge Monitoring

Success Metrics

Rollback Plan

Future Work

Short Term

Medium Term

Long Term

Files Changed

Related Issues

Commits

Reviewer Notes

Key Changes to Review

Questions to Consider

Conclusion