chore: fix Docker cqlsh install and improve simulation diagnostics #7691

Closed
zawadzkidiana wants to merge 1 commit into cadence-workflow:master from zawadzkidiana:diana/infra-docker-simulation-fixes
Conversation

@zawadzkidiana

What changed?
Fix Docker cadence-auto-setup image build and improve replication simulation test diagnostics.

  • Install py3-setuptools, py3-wheel, and ca-certificates in Docker images so pip3 install cqlsh succeeds on newer Alpine/pip
  • Add --no-build-isolation to cqlsh pip install
  • Return structured JSON from simulation worker /health endpoint with cluster info, ready domain count, and last errors
  • Add per-endpoint error/status logging in simulation health check with 1s HTTP client timeout to prevent hangs
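The structured /health response described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `healthReport` and `workerState` types, their JSON field names, and the cap of five retained errors are all assumptions chosen for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// maxLastErrors caps how many recent errors are reported (assumed value).
const maxLastErrors = 5

// healthReport is a hypothetical response body; field names are assumptions.
type healthReport struct {
	Cluster      string   `json:"cluster"`
	ReadyDomains int      `json:"ready_domains"`
	TotalDomains int      `json:"total_domains"`
	Ready        bool     `json:"ready"`
	LastErrors   []string `json:"last_errors,omitempty"`
}

// workerState is shared between worker goroutines and the HTTP handler,
// so every access goes through the mutex.
type workerState struct {
	mu           sync.RWMutex
	cluster      string
	totalDomains int
	readyDomains int
	lastErrors   []string
}

// markDomainReady records that one more domain finished starting up.
func (s *workerState) markDomainReady() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.readyDomains++
}

// recordError keeps only the most recent maxLastErrors messages.
func (s *workerState) recordError(msg string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastErrors = append(s.lastErrors, msg)
	if len(s.lastErrors) > maxLastErrors {
		s.lastErrors = s.lastErrors[len(s.lastErrors)-maxLastErrors:]
	}
}

// snapshot copies the current state so the handler never holds the lock
// while writing to the network.
func (s *workerState) snapshot() healthReport {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return healthReport{
		Cluster:      s.cluster,
		ReadyDomains: s.readyDomains,
		TotalDomains: s.totalDomains,
		Ready:        s.readyDomains >= s.totalDomains,
		LastErrors:   append([]string(nil), s.lastErrors...),
	}
}

// statusFor keeps the 200/503 contract while the body carries the details.
func statusFor(r healthReport) int {
	if r.Ready {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// handleHealth writes the JSON report with the matching status code.
func (s *workerState) handleHealth(w http.ResponseWriter, _ *http.Request) {
	report := s.snapshot()
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(statusFor(report))
	_ = json.NewEncoder(w).Encode(report)
}

func main() {
	// Hypothetical cluster with two domains, one of which is still failing.
	state := &workerState{cluster: "cluster0", totalDomains: 2}
	state.markDomainReady()
	state.recordError("domain test-domain-2: not registered yet")

	rec := httptest.NewRecorder()
	state.handleHealth(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	fmt.Println(rec.Code, rec.Body.String())
}
```

Keeping the status code unchanged means existing callers that only check for 200 continue to work, while the body gives CI logs something to print on failure.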

Why?
These changes were extracted from the replication histogram metric PRs (#7683, #7684) where they were originally included to unblock CI. They are prerequisites for those PRs to pass CI, but are logically independent infrastructure fixes.

The Docker cadence-auto-setup image build was broken on Alpine 3.18 with newer pip versions. pip3 install cqlsh fails because setuptools and wheel are no longer bundled by default — they must be explicitly installed as system packages. The --no-build-isolation flag is also needed because cqlsh's build dependencies cannot be resolved in an isolated environment without pre-installed setuptools. The ca-certificates addition in the dockerize stage ensures HTTPS downloads (e.g. the dockerize release tarball via wget) succeed reliably. Without these fixes, all CI jobs that build or use the Docker image — including the replication simulation tests that validate the histogram additions — fail before any test code runs.

The simulation worker /health endpoint previously returned empty 200/503 responses with no body, making it impossible to diagnose readiness failures in CI logs. When a health check failed, the test logged only "Workers are not reporting healthy yet" with no indication of whether the failure was DNS resolution, connection refused, HTTP 503, or a specific domain not being ready. The structured JSON response and per-endpoint logging make CI failures immediately actionable.
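The per-endpoint logging on the test side can be sketched along these lines. It is an illustration under assumptions rather than the PR's implementation: `probeResult`, `probe`, and `describe` are hypothetical names, and the endpoints in `main` are stand-ins. Only the 1s client timeout and the failure categories (DNS, connection errors, non-200 status) come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeResult records what happened for a single worker endpoint.
type probeResult struct {
	URL    string
	Status int    // 0 when no HTTP response was received
	Body   string // response body, truncated by probe
	Err    error  // transport-level error, if any
}

// healthy reports whether the endpoint answered 200 without a transport error.
func (p probeResult) healthy() bool {
	return p.Err == nil && p.Status == http.StatusOK
}

// describe turns a result into one log line that names the failure mode.
func (p probeResult) describe() string {
	var dnsErr *net.DNSError
	var netErr net.Error
	switch {
	case p.healthy():
		return fmt.Sprintf("%s: healthy", p.URL)
	case p.Err == nil:
		return fmt.Sprintf("%s: HTTP %d, body=%q", p.URL, p.Status, p.Body)
	case errors.As(p.Err, &dnsErr):
		return fmt.Sprintf("%s: DNS resolution failed: %v", p.URL, p.Err)
	case errors.As(p.Err, &netErr) && netErr.Timeout():
		return fmt.Sprintf("%s: timed out: %v", p.URL, p.Err)
	default:
		// Covers connection refused and other transport errors.
		return fmt.Sprintf("%s: request failed: %v", p.URL, p.Err)
	}
}

// probe performs one GET and captures status, a bounded body, or the error.
func probe(client *http.Client, url string) probeResult {
	resp, err := client.Get(url)
	if err != nil {
		return probeResult{URL: url, Err: err}
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
	return probeResult{URL: url, Status: resp.StatusCode, Body: string(body)}
}

func main() {
	// A stand-in worker that is not ready yet.
	notReady := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, `{"ready":false}`)
	}))
	defer notReady.Close()

	// The 1s timeout bounds each request so a hung worker cannot stall the check.
	client := &http.Client{Timeout: time.Second}

	// The second address is a placeholder expected to be unreachable.
	for _, u := range []string{notReady.URL + "/health", "http://127.0.0.1:1/health"} {
		fmt.Println(probe(client, u).describe())
	}
}
```

Logging one line per endpoint, rather than a single aggregate message, is what lets a reader of the CI log tell an unreachable worker apart from one that is up but still waiting on a domain.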

Signed-off-by: Diana Zawadzki <dzawa@live.de>
@gitar-bot
gitar-bot bot commented Feb 10, 2026

Wrapping up

🔍 CI failure analysis for 06b0385: The codecov test failure in TestPollForDecisionTasks is a flaky test unrelated to this PR's Docker and simulation changes.

Issue

Test TestMatchingEngineSuite/TestPollForDecisionTasks failed in service/matching/handler/engine_integration_test.go with an assertion error:

Expected: PollerWaitTimeInMs: 0
Actual:   PollerWaitTimeInMs: 3

Root Cause

This failure is unrelated to the PR changes. This PR modifies:

  • Dockerfile - Docker image build configuration for cqlsh installation
  • simulation/replication/replication_simulation_test.go - Health check diagnostics
  • simulation/replication/worker/cmd/main.go - JSON health endpoint

The failing test is in the matching engine service testing poller behavior, which has no connection to Docker builds or simulation health checks.

Details

Evidence of Flakiness
  1. All related tests passed: All 18 Replication Simulation tests (directly affected by this PR) passed successfully
  2. Isolated failure: 38 other CI checks passed, only codecov failed
  3. Timing-sensitive assertion: The PollerWaitTimeInMs difference (0 vs 3 milliseconds) indicates a race condition or timing variation in the test environment
  4. No logical connection: Docker build configuration and simulation diagnostics cannot affect matching engine poller timing logic
  5. The test is an integration test: integration tests are more susceptible to environmental timing variations than unit tests

The test expects a poller to wait 0ms but it waited 3ms, suggesting a timing race in the test setup or teardown.

Code Review ✅ Approved

Clean infrastructure PR: Docker build fixes for Alpine 3.18 pip compatibility and improved simulation health check diagnostics with proper concurrency handling. No issues found.

Rules ❌ No requirements met

Repository Rules

PR Description Quality Standards: Add **[How did you test it?]**, **[Potential risks]**, **[Release notes]**, and **[Documentation Changes]** sections per PR template
@zawadzkidiana
fixed with: #7690