feat: dashboard + Helm/release + bundled CI flake fixes #3
Merged
Closes wind-c#137. This change adds three deliverables:

1. A maintained Helm chart at deploy/helm/comqtt supporting both single-node (Deployment) and clustered (StatefulSet + Raft + Gossip) modes. The chart includes:
   - values.schema.json with conditional rules (cluster mode requires an odd replicaCount and persistence.enabled=true)
   - A runtime entrypoint shim that renders the broker config from a ConfigMap template, computing seed members from replicaCount + the headless Service FQDN, and enabling --raft-bootstrap only when the pod is pod-0 AND the Raft data dir is empty (idempotent on restart)
   - PodDisruptionBudget pinned to Raft quorum (ceil((n+1)/2))
   - Soft pod anti-affinity by default; hard via cluster.hardAntiAffinity
   - Per-replica PVCs via volumeClaimTemplates
   - Liveness, readiness, and startup probes (all tunable)
   - Optional dashboard Ingress with documented MQTT-TCP caveats
   - Optional ServiceMonitor for kube-prometheus-stack
   - helm test pod that performs a mosquitto pub/sub round trip
   - Chart README with full values reference, upgrade notes, and limitations (no operator, no automatic Raft member eviction)
2. .goreleaser.yaml building cmd/single + cmd/cluster across linux, darwin, windows × amd64, arm64. Builds multi-arch (amd64 + arm64) Docker images via goreleaser dockers + docker_manifests, publishing to ghcr.io/wind-c/comqtt with semver, minor, and latest tags.
3. .github/workflows/{release,chart-lint-test,chart-release}.yaml:
   - release.yaml triggers on tag push, runs GoReleaser, publishes binaries to GitHub Releases and images to GHCR
   - chart-lint-test runs ct lint + helm template + a kind boot test across single and cluster CI value files
   - chart-release runs helm/chart-releaser-action on push to main

Implementation note: cmd/cluster/main.go replaces all CLI flag values with the loaded --conf file when one is supplied, so passing --raft-bootstrap as a CLI flag alongside --conf is silently ignored. The chart works around this by templating the config file itself at runtime via the entrypoint shim.

Bitnami sub-charts (Redis/MySQL/Postgres) are intentionally NOT bundled because the public bitnami/* Docker images now require authentication. The chart documents bring-your-own and ships an example Valkey manifest at deploy/helm/comqtt/ci/valkey.yaml as the recommended OSS RESP store.

Verification gauntlet (all passed locally on kind v0.31.0):
- helm lint deploy/helm/comqtt
- helm template (single + cluster)
- kubectl --dry-run=client apply
- helm install single + helm test (MQTT pub/sub round trip)
- helm install cluster (3-node Raft, leader election + member join)
- Cross-node MQTT: subscribe on cluster-comqtt-0, publish to cluster-comqtt-2, message delivered via shared Valkey state
- Bootstrap idempotency: kubectl delete pod cluster-comqtt-0 -> "genesis pod but raft dir is non-empty; bootstrap suppressed"
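The quorum, bootstrap, and seed-member rules described for the chart can be sketched in Go as below. This is only an illustration of the arithmetic and the decision rule, not the chart's entrypoint shim (which is not shown here); the function names, the headless Service name `cluster-comqtt-headless`, and the port 7946 are assumptions made for the example. The DNS pattern is the standard one for StatefulSet pods behind a headless Service with the default cluster domain.

```go
package main

import "fmt"

// raftQuorum returns the majority size for n voters: ceil((n+1)/2),
// which is n/2 + 1 in integer arithmetic.
func raftQuorum(n int) int {
	return n/2 + 1
}

// shouldBootstrap applies the rule from the commit message: bootstrap only
// on the genesis pod (ordinal 0) and only while its Raft data dir is empty.
func shouldBootstrap(ordinal int, raftDirEmpty bool) bool {
	return ordinal == 0 && raftDirEmpty
}

// seedMembers lists one stable DNS name per replica, following the standard
// StatefulSet pattern <set>-<ordinal>.<service>.<namespace>.svc.cluster.local.
// The port argument is illustrative, not taken from the chart.
func seedMembers(set, service, namespace string, replicas, port int) []string {
	members := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		members = append(members, fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local:%d",
			set, i, service, namespace, port))
	}
	return members
}

func main() {
	fmt.Println("quorum for 3 replicas:", raftQuorum(3))
	fmt.Println("pod-0, empty dir:", shouldBootstrap(0, true))
	fmt.Println("pod-0 after restart:", shouldBootstrap(0, false))
	// Service name and port below are hypothetical placeholders.
	for _, m := range seedMembers("cluster-comqtt", "cluster-comqtt-headless", "default", 3, 7946) {
		fmt.Println(m)
	}
}
```

With three replicas the quorum is two, so a PodDisruptionBudget pinned to that value allows at most one voluntary disruption at a time.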
Add GET /api/v1/mqtt/sessions for paginated online + offline session
listing (with ?online=true|false filter) and DELETE
/api/v1/mqtt/sessions/{id} to disconnect a connected client. Offline
sessions are pulled from the storage backend through the new
Server.Hooks() accessor, which exposes the previously-unexported hook
bus so REST handlers can inspect stored state without the broker
reaching back into them.
DELETE is best-effort for v1: if the client is online it is
disconnected, otherwise the stored entry is acknowledged but
expiration is left to the existing hook eviction path.
Add the dashboard landing page rendering hero cards (Connections, Subscriptions, Retained, Inflight, Msg In/sec, Msg Out/sec) that refresh every 2s via htmx polling against a fragment endpoint. Per-second rates are derived in-process by RateSampler, which snapshots the cumulative MessagesReceived / MessagesSent counters once per second and reports the delta. SSE wiring lands in Phase 3.
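The delta computation behind the per-second rates can be sketched as follows. This is a minimal stand-in, not the actual RateSampler: the `deltaSampler` type and its methods are hypothetical, and the sketch is fed samples by hand where the real sampler would be driven by a once-per-second ticker reading the cumulative counters.

```go
package main

import (
	"fmt"
	"sync"
)

// deltaSampler turns a cumulative counter into a per-interval rate.
// It is a hypothetical stand-in for the RateSampler described above.
type deltaSampler struct {
	mu     sync.Mutex
	primed bool  // false until the first snapshot has been taken
	last   int64 // cumulative value seen at the previous snapshot
	rate   int64 // delta between the two most recent snapshots
}

// Sample records a new cumulative value. When called once per second,
// the stored delta is a messages-per-second figure.
func (s *deltaSampler) Sample(cumulative int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.primed {
		s.rate = cumulative - s.last
	}
	s.last = cumulative
	s.primed = true
}

// Rate returns the most recently computed delta.
func (s *deltaSampler) Rate() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.rate
}

func main() {
	var s deltaSampler
	// Simulated cumulative MessagesReceived values, one per tick.
	for _, total := range []int64{100, 130, 175} {
		s.Sample(total)
		fmt.Println("rate:", s.Rate())
	}
}
```

The first snapshot only primes the sampler, so the rate stays at zero until a second snapshot arrives; otherwise the whole lifetime total would show up as a one-second spike.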
Adds GET/POST /dashboard/users for listing and creating dashboard users, plus per-user POST endpoints to toggle role and delete. Refuses to delete or demote the last admin. Extends the auth.CredStore interface with SetRole, implemented on FileStore via the existing mutate helper.
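The last-admin guard might look roughly like the sketch below. The `store` type is a hypothetical in-memory map, not auth.CredStore or FileStore, and the role names `admin` and `viewer` are assumed from the roles mentioned elsewhere in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

var errLastAdmin = errors.New("refusing to delete or demote the last admin")

// store is a hypothetical in-memory user table keyed by username.
type store struct {
	roles map[string]string
}

// adminCount returns how many users currently hold the admin role.
func (s *store) adminCount() int {
	n := 0
	for _, r := range s.roles {
		if r == "admin" {
			n++
		}
	}
	return n
}

// SetRole changes a user's role unless that would leave zero admins.
func (s *store) SetRole(name, role string) error {
	cur, ok := s.roles[name]
	if !ok {
		return fmt.Errorf("unknown user %q", name)
	}
	if cur == "admin" && role != "admin" && s.adminCount() == 1 {
		return errLastAdmin
	}
	s.roles[name] = role
	return nil
}

// Delete removes a user unless that user is the only remaining admin.
func (s *store) Delete(name string) error {
	cur, ok := s.roles[name]
	if !ok {
		return fmt.Errorf("unknown user %q", name)
	}
	if cur == "admin" && s.adminCount() == 1 {
		return errLastAdmin
	}
	delete(s.roles, name)
	return nil
}

func main() {
	s := &store{roles: map[string]string{"root": "admin", "bob": "viewer"}}
	fmt.Println(s.SetRole("root", "viewer")) // blocked: root is the only admin
	fmt.Println(s.SetRole("bob", "admin"))   // allowed
	fmt.Println(s.SetRole("root", "viewer")) // allowed now that bob is admin
}
```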
Add a read-only `/dashboard/account` page showing the current user's username, role, password-set timestamp, and status, with a link to the existing password-change form. Visible to all authenticated users.
Render events as inline <li> HTML fragments by default for direct htmx-sse swap; opt into JSON via ?as=json for programmatic consumers. Add a right-column events feed beside the overview cards.
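One way the two output modes could be selected is sketched below. The `event` type, its fields, and the function names are hypothetical; only the `<li>` default and the `?as=json` opt-in come from the commit message. The `data: ...` framing followed by a blank line is the standard Server-Sent Events wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"html"
)

// event is a hypothetical payload; the real event shape is not shown in this PR text.
type event struct {
	Kind     string `json:"kind"`
	ClientID string `json:"client_id"`
}

// renderEvent returns an HTML list item by default, or JSON when as == "json".
// User-controlled fields are escaped before being placed into markup.
func renderEvent(e event, as string) (string, error) {
	if as == "json" {
		b, err := json.Marshal(e)
		return string(b), err
	}
	return "<li>" + html.EscapeString(e.Kind) + ": " + html.EscapeString(e.ClientID) + "</li>", nil
}

// sseFrame wraps a single-line payload in Server-Sent Events framing.
func sseFrame(payload string) string {
	return "data: " + payload + "\n\n"
}

func main() {
	e := event{Kind: "connect", ClientID: "sensor-1"}
	for _, mode := range []string{"", "json"} {
		out, err := renderEvent(e, mode)
		if err != nil {
			fmt.Println("render error:", err)
			continue
		}
		fmt.Print(sseFrame(out))
	}
}
```

Escaping matters here because MQTT client IDs are chosen by clients and the fragment is swapped directly into the page.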
Add a pure-Go inline SVG sparkline generator and render a 60-second rolling sparkline under each numeric Overview card. RateSampler now keeps a 60-bucket ring per metric (connections, subscriptions, retained, msg in/sec, msg out/sec) and exposes History() so the cards fragment can populate a new Spark field. The .sparkline class is safelisted in the Tailwind config since it lives inside template.HTML strings rather than scannable templates.
Adds four detail pages to the web dashboard backed by the existing REST data-walking patterns: subscription list with topic/clientid filters, topic trie tree, retained-message list with admin-only clear, and online sessions list with admin-only disconnect.
- Vercel theme via CSS variables (light + dark): replaces brand/slate/rose/
emerald palette with a monochrome shadcn-style token system. .dark is
safelisted so the rule survives Tailwind's purge pass.
- Cluster topology SVG on Overview: monochrome ring layout with leader
crown, follower hollow rings, "this node" halo, full-mesh edges.
Sized 240x160 to fit alongside the cards.
- (*cluster.Agent).Leader() now falls back to matching the leader's
address against the membership list when raft.LocalID does not equal
the discovery node name. Fixes the "all nodes show as follower" bug.
- Per-page template tree isolation: each page parses with shared
partials + its own file so the {{define "content"}} block stops
colliding across pages (was rendering the wrong page after the last
alphabetical template won the global dispatch).
- Flash field added on subscriptions/topics/retained/sessions page data
structs so _flash.html no longer trips on missing fields.
- Retained page: hide $SYS topics by default, with a checkbox to bring
them back via ?include_sys=1.
- Logout cookie attributes now match login (HttpOnly, SameSite=Lax,
Secure when TLS) so strict browsers actually expire the session.
- Mobile-friendly nav drawer: hamburger top-bar on screens narrower
than lg, slide-in nav with backdrop, auto-collapse on link click,
Esc-to-close.
- Account link in the sidebar so viewer-role users can find their
profile.
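The per-page template isolation in the list above can be illustrated with html/template. The sketch parses the shared layout once and clones it per page, so each page gets its own "content" definition. Using Clone is an assumption made for brevity; the commit only says each page parses shared partials plus its own file, and the template names and bodies here are invented.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// layout is a stand-in for the shared partials; it delegates to "content".
const layout = `{{define "layout"}}<main>{{template "content" .}}</main>{{end}}`

// buildPages gives every page its own template tree: the shared layout
// plus that page's own {{define "content"}} block.
func buildPages(shared string, pages map[string]string) (map[string]*template.Template, error) {
	base, err := template.New("base").Parse(shared)
	if err != nil {
		return nil, err
	}
	out := make(map[string]*template.Template, len(pages))
	for name, src := range pages {
		t, err := base.Clone() // independent copy; base itself is never executed
		if err != nil {
			return nil, err
		}
		if _, err := t.Parse(src); err != nil {
			return nil, err
		}
		out[name] = t
	}
	return out, nil
}

// render executes the shared layout within one page's tree.
func render(t *template.Template) (string, error) {
	var b strings.Builder
	err := t.ExecuteTemplate(&b, "layout", nil)
	return b.String(), err
}

func main() {
	pages, err := buildPages(layout, map[string]string{
		"sessions": `{{define "content"}}sessions page{{end}}`,
		"topics":   `{{define "content"}}topics page{{end}}`,
	})
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"sessions", "topics"} {
		html, err := render(pages[name])
		fmt.Println(name, "->", html, err)
	}
}
```

Because each tree holds exactly one "content" definition, there is no single global definition for pages to overwrite.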
Routes() spawned a RateSampler ticker goroutine (and a redis pub/sub Bridge in cluster mode) but never exposed a way to stop them. CI's TestLeaks in cmd/single caught the leaked goroutine. Return a cleanup func from Routes() and invoke it from cmd/single and cmd/cluster before server.Close(). Update dashboard tests to defer cleanup so they don't leak either.
…tdown
Two related shutdown bugs caused TestJoinAndLeave to deadlock for 600s in CI and TestProposeAndLookup to panic with nil-pointer dereferences:
1. serveRaft never closed httpDoneC, so stopHTTP's "<-p.httpDoneC" blocked forever. Add a deferred close so the http goroutine signals completion in all exit paths.
2. The propose-loop goroutine and the main event loop both racily read p.node, p.confChangeC, etc. while Stop() concurrently nil'd them, producing "close of nil channel" panics and segfaults on next tick. Centralize teardown in serveChannels via a deferred shutdown(); make Stop() a thin idempotent signal (sync.Once -> close stopC) so any goroutine can call it safely. The propose-loop now uses local channel variables and exits cleanly on stopC without re-closing channels it doesn't own.
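Both fixes follow a common Go shutdown pattern, sketched here with a hypothetical `peer` type that is much smaller than the real one in cluster/raft/etcd. The serving goroutine closes its done channel in a defer so waiters are released on every exit path, and Stop() only closes a signal channel under sync.Once, leaving teardown to the goroutine that owns the state.

```go
package main

import (
	"fmt"
	"sync"
)

// peer is a hypothetical, stripped-down stand-in for the Raft peer.
type peer struct {
	stopC    chan struct{} // closed exactly once to request shutdown
	doneC    chan struct{} // closed by serve when it has fully exited
	stopOnce sync.Once
}

func newPeer() *peer {
	return &peer{stopC: make(chan struct{}), doneC: make(chan struct{})}
}

// serve owns its own teardown: the deferred close runs on every exit
// path, so Wait can never block forever once serve returns.
func (p *peer) serve(proposals <-chan string) {
	defer close(p.doneC)
	for {
		select {
		case _, ok := <-proposals:
			if !ok {
				return // input closed by its owner; nothing to re-close here
			}
		case <-p.stopC:
			return
		}
	}
}

// Stop is a thin idempotent signal; it never touches serve's state.
func (p *peer) Stop() {
	p.stopOnce.Do(func() { close(p.stopC) })
}

// Wait blocks until serve has exited.
func (p *peer) Wait() {
	<-p.doneC
}

func main() {
	p := newPeer()
	go p.serve(make(chan string))
	p.Stop()
	p.Stop() // safe: the channel is closed only once
	p.Wait()
	fmt.Println("stopped cleanly")
}
```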
The four QUIC tests that bind a real UDP socket all shared the
package-wide testAddr (":22222"). When CI ran them sequentially, the
previous test's socket wasn't always released by quic-go before the
next test tried to bind, producing intermittent
"listen udp :22222: bind: address already in use" failures.
Switch the QUIC tests to ":0" so the OS picks a fresh port each time.
The bound port is read back via l.listen.Addr() for the dialer, so the
tests still work end-to-end. testAddr stays unchanged for the other
listener tests, which depend on the literal value via string
concatenation in URLs.
…nd-c#150) # Conflicts: # .gitignore
The test-connection pod had hook-delete-policy "before-hook-creation, hook-succeeded", which deletes the pod immediately on success. The CI step `helm test single --logs` then errors out because the pod is gone before its logs can be fetched. Drop hook-succeeded; the pod still gets cleaned up before the next `helm test` run via before-hook-creation.
Combined PR against the fork: bundles three upstream PRs into one mergeable branch so the fork stays usable even if upstream stalls. The three upstream branches remain authoritative and untouched; this is a `--no-ff` merge of all three:
- `feat/helm-chart`
- `feat/web-dashboard` (`/dashboard/`)
- `fix/ci-flakes`

What's in here

Helm + release pipeline (from `feat/helm-chart`)
- Helm chart for Kubernetes deploys, GoReleaser config for multi-arch binaries, and a GHCR workflow that publishes container images.

Web dashboard (from `feat/web-dashboard`)
- Dashboard served at `/dashboard/`
- Available in `cmd/single` and `cmd/cluster` behind `cfg.Dashboard.Enabled`
- `dashboard.Routes()` returns a cleanup func to stop the rate sampler + bridge goroutines on shutdown (fixes a goroutine leak caught by `cmd/single`'s `TestLeaks`)

CI flake fixes (from `fix/ci-flakes`)
- `cluster/raft/etcd`: `serveRaft` now closes `httpDoneC`, `Peer.Stop()` is `sync.Once`-guarded, and `serveChannels` owns its own teardown. Fixes the 600s `TestJoinAndLeave` timeout and the `close of nil channel` / nil-pointer panics in `TestProposeAndLookup` / `TestGetLeader`.
- `mqtt/listeners`: QUIC tests use `:0` instead of fixed `:22222`, eliminating intermittent `bind: address already in use` flakes.

Test plan
- `go build ./...`
- `go test -timeout 180s -count=1 ./cluster/raft/etcd/ ./mqtt/listeners/ ./mqtt/dashboard/... ./cmd/single/...` (209/209 pass locally)