This PR addresses the reliability issues in Brighter's CI acceptance tests for MessagingGateways, Inboxes, and Outboxes. The tests were exhibiting erratic behavior in GitHub Actions, often failing due to timing-related issues that were difficult to reproduce locally.
The CI build (.github/workflows/ci.yml) runs acceptance tests that require various middleware (Kafka, RabbitMQ, Redis, MQTT) and databases (PostgreSQL, MySQL, SQL Server, MongoDB, DynamoDB). These tests frequently failed in CI but worked locally, suggesting infrastructure timing issues rather than code bugs.
- Missing Service Health Checks: Many services lacked health checks, causing tests to start before services were ready
- Inadequate Retry Counts: Health checks had too few retries (3-5) for CI environment variability
- Kafka Blind Wait: A fixed 30-second sleep instead of active readiness verification
- Short Job Timeouts: 5-minute timeouts were insufficient for slower CI environments
- No Documentation: No record of known issues or improvement strategies
Added or improved health checks for all services in .github/workflows/ci.yml:
| Service | Health Check Method | Retries | Max Wait |
|---|---|---|---|
| Kafka | kafka-broker-api-versions | 15 | ~150s |
| Zookeeper | nc localhost 2181 check | 10 | ~100s |
| Schema Registry | HTTP endpoint | 10 | ~100s |
| RabbitMQ | rabbitmqctl node_health_check | 10 ↑ | ~100s |
| Redis | redis-cli ping | 10 ↑ | ~100s |
| PostgreSQL | pg_isready | 10 ↑ | ~100s |
| MySQL/MariaDB | healthcheck.sh | 10 ↑ | ~100s |
| SQL Server | sqlcmd query | 10 | ~100s |
| MongoDB | mongosh ping | 15 ↑ | ~300s |
| MQTT | mosquitto_sub | 10 | ~100s |
| DynamoDB | HTTP endpoint | 10 | ~100s |
(↑ indicates increased from previous value)
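
For reviewers checking the table, here is a rough sketch of how the "Max Wait" column relates to the retry count. It assumes the health checks are passed to the service containers as Docker `--health-*` flags and that the worst-case wait is roughly retries × interval. The helper names, the 5s timeout, and the intervals (10s, and 20s for MongoDB) are assumptions inferred from the table, not values copied from `ci.yml`.

```shell
# Hypothetical helper: print Docker health-check flags for a service container.
# The timeout default (5s) is an assumption for illustration.
health_options() {
  local cmd=$1 retries=$2 interval=$3 timeout=${4:-5}
  printf -- '--health-cmd "%s" --health-interval %ss --health-timeout %ss --health-retries %s\n' \
    "$cmd" "$interval" "$timeout" "$retries"
}

# Approximate worst-case wait before a container is declared unhealthy.
max_wait_seconds() {
  local retries=$1 interval=$2
  echo $((retries * interval))
}

health_options "pg_isready" 10 10
echo "PostgreSQL max wait: ~$(max_wait_seconds 10 10)s"
echo "MongoDB max wait:    ~$(max_wait_seconds 15 20)s"
```

Under these assumptions the 10-retry rows come out at about 100s and the MongoDB row at about 300s, which matches the table.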
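Before:

    - name: Sleep to let Kafka spin up
      uses: jakejarvis/wait-action@master
      with:
        time: '30s'

After:

    - name: Wait for Kafka to be ready
      run: |
        max_attempts=30
        attempt=0
        while [ $attempt -lt $max_attempts ]; do
          if kafkacat -b localhost:9092 -L > /dev/null 2>&1; then
            echo "Kafka is ready!"
            break
          fi
          attempt=$((attempt + 1))
          echo "Waiting for Kafka (attempt $attempt/$max_attempts)..."
          sleep 2
        done
        # Fail the step if Kafka never became reachable
        if [ $attempt -ge $max_attempts ]; then
          echo "Kafka did not become ready after $max_attempts attempts" >&2
          exit 1
        fi

Benefits: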
- Active verification instead of blind waiting
- Can complete in <30s if Kafka is ready quickly
- Can wait up to 60s if needed
- Fails fast with clear error message
- Better troubleshooting through logging
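
The same pattern could be pulled into a reusable function if other services end up needing an explicit wait step. The following is a minimal sketch, not part of this PR; `wait_for` is a hypothetical name, and the attempt count and delay are passed in by the caller.

```shell
# Hypothetical helper: retry a readiness command until it succeeds
# or the attempts run out.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      echo "Ready after $attempt attempt(s): $*"
      return 0
    fi
    echo "Not ready (attempt $attempt/$max_attempts): $*"
    # Only sleep if another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "Gave up after $max_attempts attempts: $*" >&2
  return 1
}

# Quick demonstration with a command that always succeeds.
wait_for 3 0 true

# A call mirroring the Kafka step (30 attempts, 2s apart) would look like:
#   wait_for 30 2 kafkacat -b localhost:9092 -L
```

Returning a non-zero status on timeout lets the workflow step fail with the error message rather than continuing into tests against a service that is not ready.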
Timeouts for the following acceptance test jobs increased from 5 minutes to 8 minutes:
- redis-ci, mqtt-ci, rabbitmq-ci
- postgres-ci, sqlserver-ci, mysql-ci
- dynamo-ci, localstack-ci, mongodb-ci
- aws-ci, aws-scheduler-ci, azure-ci
- sqlite-ci, gcp-ci
Note: kafka-ci already had 20 minutes (unchanged)
Added two documentation files:
- `docs/CI-Test-Reliability.md` (171 lines)
  - Detailed analysis of issues
  - Explanation of all improvements
  - Recommendations for future work
  - Success metrics to track
- `RELIABILITY-IMPROVEMENTS-SUMMARY.md` (140 lines)
  - Executive summary
  - Quick reference tables
  - Validation and rollback plans
  - Next steps
✅ Service Startup Reliability
- Services verified ready before tests run
- Longer retry windows accommodate CI variability
✅ Kafka Test Reliability
- Active readiness check replaces blind wait
- Better handling of slow Kafka initialization
✅ Test Execution Success
- 8-minute timeouts prevent premature cancellation
- Tests have adequate time to complete
✅ Maintainability
- Documentation preserves knowledge
- Clear patterns for future improvements
Affected tests:
- Kafka tests (13 marked `[Trait("Fragile", "CI")]`)
- RabbitMQ tests (6 marked fragile)
- Database initialization tests
- Message broker timing tests
- ✅ CI workflow YAML syntax validated
- ✅ All changes backward compatible
- ✅ No test or application code modified
- ✅ Changes isolated to infrastructure
Recommended monitoring approach:
- Track Pass Rates: Monitor CI success over 10+ builds
- Measure Duration: Check if 8-minute timeouts are adequate
- Review Logs: Verify Kafka readiness checks succeed
- Re-enable Tests: Gradually remove "Fragile" trait from stable tests
- Optimize Timeouts: Adjust based on observed durations
| Metric | Current | Target |
|---|---|---|
| CI Pass Rate | TBD | >95% |
| False Positive Rate | TBD | <5% |
| Kafka Startup Time | 30s (blind) | 10-30s (actual) |
| Job Duration (P95) | TBD | <7 min |
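
As a rough illustration of how the CI pass rate could be tracked over the last N builds, the sketch below computes a percentage from a list of run conclusions. `pass_rate` is a hypothetical helper and the sample data is made up. In practice the conclusions might come from something like `gh run list --workflow ci.yml --limit 10 --json conclusion --jq '.[].conclusion'`, which is an assumption about tooling rather than something this PR sets up.

```shell
# Hypothetical helper: read one run conclusion per line on stdin
# ("success", "failure", ...) and print the integer pass percentage.
pass_rate() {
  local total=0 passed=0 conclusion
  while IFS= read -r conclusion; do
    [ -z "$conclusion" ] && continue
    total=$((total + 1))
    if [ "$conclusion" = "success" ]; then
      passed=$((passed + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "no runs" >&2
    return 1
  fi
  echo $((passed * 100 / total))
}

# Ten made-up conclusions (not real Brighter CI data).
printf '%s\n' success success failure success success \
              success success success success success | pass_rate
```

With one failure in ten runs this reports 90, which would fall short of the >95% target in the table.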
If issues arise:
- Revert the 4 commits from this PR
- All changes are in CI config and docs only
- No code changes require rollback
- Previous behavior fully restored
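
One way the revert could be done is sketched below. It assumes the four commit hashes listed in this PR are given oldest first, and it reverts them newest first to avoid conflicts between dependent changes. `revert_newest_first` and the `APPLY` switch are hypothetical; by default the function only prints the command.

```shell
# Hypothetical helper: print (or, with APPLY=1, run) a revert of the given
# commits, newest first. Pass the hashes oldest first.
revert_newest_first() {
  local reversed="" hash
  for hash in "$@"; do
    reversed="$hash${reversed:+ $reversed}"
  done
  if [ "${APPLY:-0}" = "1" ]; then
    # Word splitting is intended: each hash becomes one argument.
    git revert --no-edit $reversed
  else
    echo "git revert --no-edit $reversed"
  fi
}

# Dry run with the commits from this PR (assumed oldest first).
revert_newest_first 6d99514 c4b8f5c b3440bd b1dea5e
```

Running it with `APPLY=1` inside a checkout would create one revert commit per original commit.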
Next steps:
- Monitor CI success rates
- Remove "Fragile" trait from stable tests
- Optimize timeout values based on data
- Implement retry helpers in tests
- Add test-level readiness checks
- Use environment variables for CI-specific timeouts
- Improve test fixtures with proper initialization
- Re-evaluate test parallelization restrictions
- Create CI-specific test configuration profiles
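
To make the "environment variables for CI-specific timeouts" idea concrete, here is a minimal sketch for CI scripts; the same approach would apply inside the test code itself. `effective_timeout` and `TEST_TIMEOUT_SECONDS` are hypothetical names, not something Brighter defines today. The sketch relies on GitHub Actions setting `CI=true`, and the default values are placeholders.

```shell
# Hypothetical helper: pick a timeout (seconds) for a readiness wait.
# Precedence: explicit TEST_TIMEOUT_SECONDS, then a CI default, then a local default.
effective_timeout() {
  local local_default=${1:-10} ci_default=${2:-60}
  if [ -n "${TEST_TIMEOUT_SECONDS:-}" ]; then
    echo "$TEST_TIMEOUT_SECONDS"
  elif [ "${CI:-}" = "true" ]; then
    echo "$ci_default"
  else
    echo "$local_default"
  fi
}

echo "Using timeout: $(effective_timeout)s"
```

Keeping the override in one variable means timeouts can be tuned from the workflow file without touching the scripts that consume them.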
.github/workflows/ci.yml | 94 insertions(+), 26 deletions(-)
docs/CI-Test-Reliability.md | 171 new file
RELIABILITY-IMPROVEMENTS-SUMMARY.md | 140 new file
Total: 3 files changed, 379 insertions(+), 26 deletions(-)
- [Chore] Unreliable Acceptance Tests - Original issue
- `6d99514` - Improve CI service health checks and remove fixed delays
- `c4b8f5c` - Increase CI job timeouts to accommodate slower test environments
- `b3440bd` - Add documentation for CI test reliability improvements
- `b1dea5e` - Add comprehensive summary of reliability improvements
- Health Check Additions: Verify health check commands are appropriate
- Kafka Readiness Script: Review the bash logic for correctness
- Timeout Values: Confirm 8 minutes is reasonable but not excessive
- Documentation: Ensure documentation is accurate and helpful
- Are the health check retry counts adequate?
- Is the Kafka readiness script robust enough?
- Should any tests be re-enabled immediately?
- Are there other services that need health checks?
This PR makes infrastructure-only changes to improve CI test reliability without modifying any test or application code. The changes are conservative, well-documented, and easily reversible. The improvements address root causes rather than symptoms, providing a foundation for long-term reliability.