
feat: dashboard + Helm/release + bundled CI flake fixes #3

Merged
debsahu merged 52 commits into main from combined/dashboard-and-ci-fixes on May 4, 2026
@debsahu debsahu commented May 4, 2026

Combined PR against the fork — bundles three upstream PRs into one mergeable branch so the fork stays usable even if upstream stalls. The three upstream branches remain authoritative and untouched; this is a --no-ff merge of all three.

| Upstream PR | Branch | What it adds |
|---|---|---|
| #150 | feat/helm-chart | Helm chart, GoReleaser config, GHCR release workflow |
| #151 | feat/web-dashboard | Web dashboard at /dashboard/ |
| #152 | fix/ci-flakes | etcd Peer.Stop deadlock + QUIC test port collision |

What's in here

Helm + release pipeline (from feat/helm-chart)

Helm chart for Kubernetes deploys, GoReleaser config for multi-arch binaries, and a GHCR workflow that publishes container images.

Web dashboard (from feat/web-dashboard)

  • htmx + Tailwind dashboard at /dashboard/
  • Overview cards, sparklines, cluster topology view, SSE live event feed
  • Clients / subscriptions / topics / retained / sessions / blacklist / tools / settings pages
  • Auth: file or Redis cred store, lockout, password expiry, role-based access
  • Cluster mode: redis pub/sub bridge fans events between nodes
  • Wired into both cmd/single and cmd/cluster behind cfg.Dashboard.Enabled
  • Cleanup hook returned by dashboard.Routes() to stop the rate sampler + bridge goroutines on shutdown (fixes a goroutine leak caught by cmd/single's TestLeaks)

CI flake fixes (from fix/ci-flakes)

  • cluster/raft/etcd: serveRaft now closes httpDoneC, Peer.Stop() is sync.Once-guarded, and serveChannels owns its own teardown. Fixes the 600s TestJoinAndLeave timeout and the close of nil channel / nil-pointer panics in TestProposeAndLookup / TestGetLeader.
  • mqtt/listeners: QUIC tests use :0 instead of fixed :22222, eliminating intermittent bind: address already in use flakes.

Test plan

  • go build ./...
  • go test -timeout 180s -count=1 ./cluster/raft/etcd/ ./mqtt/listeners/ ./mqtt/dashboard/... ./cmd/single/... — 209/209 pass locally
  • Fork CI green

debsahu added 30 commits May 4, 2026 11:16
Closes wind-c#137. This change adds three deliverables:

1. A maintained Helm chart at deploy/helm/comqtt supporting both
   single-node (Deployment) and clustered (StatefulSet + Raft + Gossip)
   modes. The chart includes:
   - values.schema.json with conditional rules (cluster mode requires odd
     replicaCount and persistence.enabled=true)
   - A runtime entrypoint shim that renders the broker config from a
     ConfigMap template, computing seed members from replicaCount + the
     headless Service FQDN, and enabling --raft-bootstrap only when the
     pod is pod-0 AND the Raft data dir is empty (idempotent on restart)
   - PodDisruptionBudget pinned to Raft quorum (ceil((n+1)/2))
   - Soft pod anti-affinity by default; hard via cluster.hardAntiAffinity
   - Per-replica PVCs via volumeClaimTemplates
   - liveness, readiness, and startup probes (all tunable)
   - Optional dashboard Ingress with documented MQTT-TCP caveats
   - Optional ServiceMonitor for kube-prometheus-stack
   - helm test pod that performs a mosquitto pub/sub round trip
   - chart README with full values reference, upgrade notes, and
     limitations (no operator, no automatic Raft member eviction)

2. .goreleaser.yaml building cmd/single + cmd/cluster across linux,
   darwin, windows × amd64, arm64. Builds multi-arch (amd64 + arm64)
   Docker images via goreleaser dockers + docker_manifests, publishing
   to ghcr.io/wind-c/comqtt with semver, minor, and latest tags.

3. .github/workflows/{release,chart-lint-test,chart-release}.yaml:
   - release.yaml triggers on tag push, runs GoReleaser, publishes
     binaries to GitHub Releases and images to GHCR
   - chart-lint-test runs ct lint + helm template + a kind boot test
     across single and cluster CI value files
   - chart-release runs helm/chart-releaser-action on push to main

Implementation note: cmd/cluster/main.go replaces all CLI flag values
with the loaded --conf file when one is supplied, so passing
--raft-bootstrap as a CLI flag alongside --conf is silently ignored.
The chart works around this by templating the config file itself at
runtime via the entrypoint shim.

Bitnami sub-charts (Redis/MySQL/Postgres) are intentionally NOT bundled
because the public bitnami/* Docker images now require authentication.
The chart documents bring-your-own and ships an example Valkey manifest
at deploy/helm/comqtt/ci/valkey.yaml as the recommended OSS RESP store.

Verification gauntlet (all passed locally on kind v0.31.0):
- helm lint deploy/helm/comqtt
- helm template (single + cluster)
- kubectl --dry-run=client apply
- helm install single + helm test (MQTT pub/sub round trip)
- helm install cluster (3-node Raft, leader election + member join)
- Cross-node MQTT: subscribe on cluster-comqtt-0, publish to
  cluster-comqtt-2, message delivered via shared Valkey state
- Bootstrap idempotency: kubectl delete pod cluster-comqtt-0 ->
  "genesis pod but raft dir is non-empty; bootstrap suppressed"
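
Two rules from the Helm chart commit above are small enough to sketch: the PodDisruptionBudget quorum and the bootstrap decision. This is a rough Go illustration, not the chart's entrypoint shim; the function names `raftQuorum` and `shouldBootstrap` are assumptions made for this example. For integer n, `n/2 + 1` gives the same value as the `ceil((n+1)/2)` expression quoted in the commit.

```go
package main

import "fmt"

// raftQuorum returns the smallest majority of n voting members.
// With integer division, n/2+1 equals ceil((n+1)/2).
func raftQuorum(n int) int {
	return n/2 + 1
}

// shouldBootstrap mirrors the rule described in the commit message:
// only the genesis pod (ordinal 0) bootstraps, and only while its Raft
// data dir is still empty, so a restarted pod-0 does not bootstrap again.
func shouldBootstrap(podOrdinal int, raftDirEmpty bool) bool {
	return podOrdinal == 0 && raftDirEmpty
}

func main() {
	for _, n := range []int{1, 3, 5, 7} {
		fmt.Printf("replicas=%d quorum=%d\n", n, raftQuorum(n))
	}
	fmt.Println("fresh pod-0:", shouldBootstrap(0, true))
	fmt.Println("restarted pod-0:", shouldBootstrap(0, false))
	fmt.Println("fresh pod-2:", shouldBootstrap(2, true))
}
```
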
Add GET /api/v1/mqtt/sessions for paginated online + offline session
listing (with ?online=true|false filter) and DELETE
/api/v1/mqtt/sessions/{id} to disconnect a connected client. Offline
sessions are pulled from the storage backend through the new
Server.Hooks() accessor, which exposes the previously-unexported hook
bus so REST handlers can inspect stored state without the broker
reaching back into them.

DELETE is best-effort for v1: if the client is online it is
disconnected, otherwise the stored entry is acknowledged but
expiration is left to the existing hook eviction path.
Add the dashboard landing page rendering hero cards (Connections,
Subscriptions, Retained, Inflight, Msg In/sec, Msg Out/sec) that
refresh every 2s via htmx polling against a fragment endpoint.
Per-second rates are derived in-process by RateSampler, which
snapshots the cumulative MessagesReceived / MessagesSent counters
once per second and reports the delta. SSE wiring lands in Phase 3.
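
The delta computation described above can be sketched as follows. This is a minimal illustration under assumptions, not the dashboard's RateSampler: the type `rateSampler` and its `Sample` method are hypothetical, and the once-per-second ticker is left out so the example stays deterministic.

```go
package main

import "fmt"

// rateSampler is a hypothetical stand-in for the dashboard's RateSampler.
// It remembers the last cumulative counter value it saw.
type rateSampler struct {
	last   int64
	primed bool
}

// Sample takes the current cumulative counter and returns the change since
// the previous call. The first call only primes the sampler and returns 0.
func (s *rateSampler) Sample(total int64) int64 {
	if !s.primed {
		s.last, s.primed = total, true
		return 0
	}
	delta := total - s.last
	s.last = total
	return delta
}

func main() {
	var in rateSampler
	// Pretend these are MessagesReceived snapshots taken one second apart.
	for _, total := range []int64{100, 130, 130, 190} {
		fmt.Println(in.Sample(total))
	}
}
```

When driven by a one-second ticker, each returned delta reads directly as messages per second.
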
Adds GET/POST /dashboard/users for listing and creating dashboard
users, plus per-user POST endpoints to toggle role and delete.
Refuses to delete or demote the last admin. Extends the auth.CredStore
interface with SetRole, implemented on FileStore via the existing
mutate helper.
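
The last-admin rule above might look roughly like this. It is a sketch only: the `user` type, the `guardLastAdmin` helper, and the literal role strings are assumptions for illustration and are not taken from the auth.CredStore interface.

```go
package main

import (
	"errors"
	"fmt"
)

var errLastAdmin = errors.New("refusing to delete or demote the last admin")

// user is a hypothetical, simplified dashboard user record.
type user struct {
	Name string
	Role string // "admin" or "viewer" in this sketch
}

// countAdmins returns how many users currently hold the admin role.
func countAdmins(users []user) int {
	n := 0
	for _, u := range users {
		if u.Role == "admin" {
			n++
		}
	}
	return n
}

// guardLastAdmin returns errLastAdmin when target is the only remaining
// admin, so callers can reject a delete or a demotion before mutating.
func guardLastAdmin(users []user, target string) error {
	for _, u := range users {
		if u.Name == target && u.Role == "admin" && countAdmins(users) == 1 {
			return errLastAdmin
		}
	}
	return nil
}

func main() {
	users := []user{{"alice", "admin"}, {"bob", "viewer"}}
	fmt.Println(guardLastAdmin(users, "alice")) // only admin: rejected
	fmt.Println(guardLastAdmin(users, "bob"))   // viewer: allowed
}
```
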
Add a read-only `/dashboard/account` page showing the current user's
username, role, password-set timestamp, and status, with a link to the
existing password-change form. Visible to all authenticated users.
debsahu added 21 commits May 4, 2026 13:54
Render events as inline <li> HTML fragments by default for direct
htmx-sse swap; opt into JSON via ?as=json for programmatic consumers.
Add a right-column events feed beside the overview cards.
Add a pure-Go inline SVG sparkline generator and render a 60-second
rolling sparkline under each numeric Overview card. RateSampler now
keeps a 60-bucket ring per metric (connections, subscriptions,
retained, msg in/sec, msg out/sec) and exposes History() so the cards
fragment can populate a new Spark field. The .sparkline class is
safelisted in the Tailwind config since it lives inside template.HTML
strings rather than scannable templates.
Adds four detail pages to the web dashboard backed by the existing REST
data-walking patterns: subscription list with topic/clientid filters,
topic trie tree, retained-message list with admin-only clear, and online
sessions list with admin-only disconnect.
- Vercel theme via CSS variables (light + dark): replaces brand/slate/rose/
  emerald palette with a monochrome shadcn-style token system. .dark is
  safelisted so the rule survives Tailwind's purge pass.
- Cluster topology SVG on Overview: monochrome ring layout with leader
  crown, follower hollow rings, "this node" halo, full-mesh edges.
  Sized 240x160 to fit alongside the cards.
- (*cluster.Agent).Leader() now falls back to matching the leader's
  address against the membership list when raft.LocalID does not equal
  the discovery node name. Fixes the "all nodes show as follower" bug.
- Per-page template tree isolation: each page parses with shared
  partials + its own file so the {{define "content"}} block stops
  colliding across pages (previously the alphabetically last template's
  "content" definition overrode the others, so the wrong page rendered).
- Flash field added on subscriptions/topics/retained/sessions page data
  structs so _flash.html no longer trips on missing fields.
- Retained page: hide $SYS topics by default, with a checkbox to bring
  them back via ?include_sys=1.
- Logout cookie attributes now match login (HttpOnly, SameSite=Lax,
  Secure when TLS) so strict browsers actually expire the session.
- Mobile-friendly nav drawer: hamburger top-bar on screens narrower
  than lg, slide-in nav with backdrop, auto-collapse on link click,
  Esc-to-close.
- Account link in the sidebar so viewer-role users can find their
  profile.
Routes() spawned a RateSampler ticker goroutine (and a redis pub/sub
Bridge in cluster mode) but never exposed a way to stop them. CI's
TestLeaks in cmd/single caught the leaked goroutine.

Return a cleanup func from Routes() and invoke it from cmd/single and
cmd/cluster before server.Close(). Update dashboard tests to defer
cleanup so they don't leak either.
…tdown

Two related shutdown bugs caused TestJoinAndLeave to deadlock for 600s
in CI and TestProposeAndLookup to panic with nil-pointer dereferences:

1. serveRaft never closed httpDoneC, so stopHTTP's "<-p.httpDoneC"
   blocked forever. Add a deferred close so the http goroutine signals
   completion in all exit paths.

2. The propose-loop goroutine and the main event loop both racily read
   p.node, p.confChangeC, etc. while Stop() concurrently nil'd them,
   producing "close of nil channel" panics and segfaults on next tick.

Centralize teardown in serveChannels via a deferred shutdown(); make
Stop() a thin idempotent signal (sync.Once -> close stopC) so any
goroutine can call it safely. The propose-loop now uses local channel
variables and exits cleanly on stopC without re-closing channels it
doesn't own.
The four QUIC tests that bind a real UDP socket all shared the
package-wide testAddr (":22222"). When CI ran them sequentially, the
previous test's socket wasn't always released by quic-go before the
next test tried to bind, producing intermittent
"listen udp :22222: bind: address already in use" failures.

Switch the QUIC tests to ":0" so the OS picks a fresh port each time.
The bound port is read back via l.listen.Addr() for the dialer, so the
tests still work end-to-end. testAddr stays unchanged for the other
listener tests, which depend on the literal value via string
concatenation in URLs.
@debsahu debsahu changed the title feat(dashboard): web dashboard with bundled CI flake fixes feat: dashboard + Helm/release + bundled CI flake fixes May 4, 2026
The test-connection pod had hook-delete-policy "before-hook-creation,
hook-succeeded", which deletes the pod immediately on success. The CI
step `helm test single --logs` then errors out because the pod is gone
before its logs can be fetched.

Drop hook-succeeded; the pod still gets cleaned up before the next
`helm test` run via before-hook-creation.
@debsahu debsahu merged commit ebb17e2 into main May 4, 2026
4 checks passed