Commit c643156

stephnangue and Stephane NANGUE authored
feat: add HA with standby nodes (#54)
* first version of HA with standby node implemented with reverse proxy for request forwarding

* feat: HA with standby nodes, e2e test suite, and documentation

- Implement active/standby HA with mTLS reverse proxy forwarding
- Standby nodes forward requests to leader, serve health endpoints locally
- Preserve original Host header through proxy for AWS SigV4 compatibility
- Add cluster listener with self-signed TLS certificate generation
- Add leader advertisement and cluster address discovery
- Add comprehensive e2e test suite (11 packages, ~80 tests) covering
  cluster health, forwarding, SigV4, credentials, rotation, namespaces,
  providers, auth, audit, seal, concurrency
- Add e2e infrastructure: 3-node cluster setup/teardown/reset scripts,
  Go test helpers, and Docker Compose for dependencies
- Add Makefile targets: test-e2e, test-e2e-setup, test-e2e-teardown,
  test-e2e-reset
- Remove unused test-integration target
- Update README with HA section and CONTRIBUTING with e2e guide

* docs: add CHANGELOG.md with release history

Add changelog covering v0.2.0, v0.1.1, and v0.1.0 releases following
Keep a Changelog conventions.

---------

Co-authored-by: Stephane NANGUE <snangue@MacBook-Pro-de-Stephane.local>
1 parent f4aee9b commit c643156

85 files changed, +11068 -226 lines changed

.github/workflows/ci.yml

Lines changed: 23 additions & 0 deletions
@@ -23,3 +23,26 @@ jobs:
 
       - name: Build
         run: go build -o warden .
+
+  e2e:
+    runs-on: ubuntu-latest
+    needs: [test]
+    timeout-minutes: 45
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v6
+
+      - name: Set up Go
+        uses: actions/setup-go@v6
+        with:
+          go-version-file: go.mod
+
+      - name: Start E2E cluster
+        run: bash e2e/setup.sh
+
+      - name: Run E2E tests
+        run: go test -tags e2e -v -count=1 -p 1 -timeout 45m ./e2e/cluster/ ./e2e/provider/ ./e2e/ha/ ./e2e/namespace/ ./e2e/forwarding/ ./e2e/credential/ ./e2e/rotation/ ./e2e/auth/ ./e2e/audit/ ./e2e/seal/ ./e2e/concurrency/
+
+      - name: Teardown E2E cluster
+        if: always()
+        run: bash e2e/teardown.sh

CHANGELOG.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+# Changelog
+
+All notable changes to Warden are documented in this file.
+
+## [v0.2.0] — 2026-03-04
+
+### New Features
+
+- **High Availability with Standby Nodes** — Active/standby HA using PostgreSQL advisory locks for leader election. Standby nodes forward requests to the leader via mTLS reverse proxy. Automatic failover when the leader becomes unavailable, with sealed-node protection to prevent forwarding to unhealthy nodes. Health and status endpoints (`sys/health`, `sys/leader`, `sys/seal-status`, `sys/init`, `sys/ready`) are served locally by standby nodes without forwarding. (#54)
+- **OpenAI AI Provider** — Native OpenAI provider with transparent gateway mode. (#52)
+- **Mistral AI Provider** — Native Mistral AI provider with transparent gateway mode. (#50)
+- **Opt-in Request Body Parsing for Streaming Requests** — Streaming requests can now opt in to request body parsing for policy evaluation while preserving the original stream. (#49)
+- **E2E Test Suite** — Comprehensive end-to-end tests running against a 3-node HA cluster covering cluster health, HA failover, request forwarding, provider integration, credential management, rotation, namespaces, seal/unseal, authentication, audit logging, and concurrency.
+
+### Bug Fixes
+
+- **SigV4 Host Header Preservation** — Fixed AWS SigV4 signature verification failure when requests are forwarded through standby nodes. The reverse proxy no longer rewrites the `Host` header, preserving the original value needed for signature verification. (#54)
+- **Dependabot Unblocked** — Fixed broken OpenBao sub-module references that prevented Dependabot from running. (#35)
+
+### Infrastructure
+
+- **Go 1.26.0** — Upgraded from Go 1.25.1. (#48)
+- **CI Updates** — Bumped `actions/checkout` to v6, `actions/setup-go` to v6, `goreleaser/goreleaser-action` to v7. (#36, #37, #38)
+- **Dependency Updates** — Updated `github.com/cloudflare/circl`, `github.com/go-chi/chi`, and various Go module dependencies. (#41, #42, #44, #47)
+
+## [v0.1.1] — 2025-12-22
+
+### Bug Fix
+
+- **fix: handle custom dev root tokens in LookupToken** — `LookupToken` failed with `"failed to detect token type"` when using `--dev-root-token` with a custom value that lacks a standard prefix. Added the same dev-mode fallback that `ResolveToken` already had. (#33)
+
+## [v0.1.0] — 2025-12-21
+
+Initial release. See the [v0.1.0 release notes](https://github.com/stephnangue/warden/releases/tag/v0.1.0) for the full feature list.
+
+### Highlights
+
+- Identity-aware egress gateway for cloud and SaaS services
+- Providers: AWS, Azure, GCP, GitHub, GitLab, Vault/OpenBao
+- Transparent and explicit gateway modes
+- JWT authentication with JWKS validation
+- Capability-based policy enforcement
+- Request-level audit trail
+- IP-bound sessions
+- Two-stage credential rotation
+- Seal/unseal with envelope encryption
+- Namespace isolation
+- Storage backends: in-memory, file, PostgreSQL
+- Docker image published to `ghcr.io/stephnangue/warden`
+- Pre-built binaries for Linux, macOS, and Windows
+
+[v0.2.0]: https://github.com/stephnangue/warden/compare/v0.1.1...v0.2.0
+[v0.1.1]: https://github.com/stephnangue/warden/compare/v0.1.0...v0.1.1
+[v0.1.0]: https://github.com/stephnangue/warden/releases/tag/v0.1.0
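Review note on the "SigV4 Host Header Preservation" entry: the Go sketch below illustrates the general technique with the standard library's `httputil.ReverseProxy`. It is not Warden's forwarding code, which this excerpt does not show. The function names `newForwardingProxy` and `hostSeenByLeader` and the sample hostname are assumptions made for illustration, and the sketch uses plain HTTP test servers rather than mTLS.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newForwardingProxy (hypothetical name) returns a reverse proxy that routes
// requests to leader while keeping the Host header the client sent.
func newForwardingProxy(leader *url.URL) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Rewrite: func(pr *httputil.ProxyRequest) {
			// SetURL routes to the leader and rewrites Out.Host to the leader's host.
			pr.SetURL(leader)
			// Restore the inbound Host so a signature computed over it still verifies.
			pr.Out.Host = pr.In.Host
		},
	}
}

// hostSeenByLeader sends one request with the given Host through a
// standby-like proxy and returns the Host the leader-like backend observed.
func hostSeenByLeader(clientHost string) (string, error) {
	leader := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host) // echo the Host the backend received
	}))
	defer leader.Close()

	leaderURL, err := url.Parse(leader.URL)
	if err != nil {
		return "", err
	}
	standby := httptest.NewServer(newForwardingProxy(leaderURL))
	defer standby.Close()

	req, err := http.NewRequest(http.MethodGet, standby.URL+"/", nil)
	if err != nil {
		return "", err
	}
	req.Host = clientHost // the value a SigV4 client would have signed
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	got, err := hostSeenByLeader("sts.us-east-1.amazonaws.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

Removing the `pr.Out.Host = pr.In.Host` line makes the backend see the leader's own address instead, which is the kind of rewrite that would invalidate a signature covering the `Host` header.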

CONTRIBUTING.md

Lines changed: 49 additions & 3 deletions
@@ -51,16 +51,62 @@ make brd-fast
 
 ## Running Tests
 
+### Unit Tests
+
 ```bash
 # Unit tests with race detection and coverage
 make test-unit
-
-# Integration tests
-make test-integration
 ```
 
 Coverage output is generated in `coverage.out`.
 
+### End-to-End Tests
+
+E2E tests run against a live 3-node HA cluster with Vault, Hydra, and PostgreSQL.
+
+| Command | Description |
+|---------|-------------|
+| `make test-e2e` | Start the cluster, run all e2e tests, tear down |
+| `make test-e2e-setup` | Start the e2e cluster only |
+| `make test-e2e-teardown` | Stop the e2e cluster |
+| `make test-e2e-reset` | Reset and restart the e2e cluster |
+
+To run a specific test suite or single test against an already-running cluster:
+
+```bash
+make test-e2e-setup
+go test -tags e2e -v ./e2e/forwarding/
+go test -tags e2e -run TestSigV4ThroughStandbyForwarding ./e2e/forwarding/ -v
+make test-e2e-teardown
+```
+
+#### E2E Test Suites
+
+| Package | Focus |
+|---------|-------|
+| `e2e/cluster` | Split-brain detection |
+| `e2e/ha` | Leader election, step-down, failover, node rejoin |
+| `e2e/forwarding` | Standby-to-leader request forwarding, SigV4 preservation |
+| `e2e/provider` | Vault transparent/non-transparent gateway, JWT validation |
+| `e2e/credential` | Credential issuance, caching, TTL expiry, cross-namespace |
+| `e2e/rotation` | Credential source rotation, activation delay, failover persistence |
+| `e2e/namespace` | Namespace CRUD and isolation |
+| `e2e/seal` | Seal/unseal operations |
+| `e2e/auth` | Authentication flows |
+| `e2e/audit` | Audit logging |
+| `e2e/concurrency` | Concurrent request handling |
+
+#### Writing E2E Tests
+
+- Use the `//go:build e2e` build tag
+- Import helpers: `h "github.com/stephnangue/warden/e2e/helpers"`
+- Use `h.GetLeaderPort(t)` and `h.GetStandbyPort(t)` to discover cluster topology
+- Register cleanup via `t.Cleanup` **before** creating resources to avoid orphans on partial failure
+- Accept `409` (conflict) in setup to be idempotent across test reruns
+- Use `h.GetLeaderPort(t)` at cleanup time (not a captured port) to handle leader changes
+
+See [e2e/README.md](e2e/README.md) for full cluster architecture and configuration details.
+
 ## Development Workflow
 
 1. **Start dependencies**:

Makefile

Lines changed: 32 additions & 7 deletions
@@ -1,4 +1,4 @@
-.PHONY: help build build-fast brd brd-fast build-no-cache up up-logs down logs logs-tail restart clean clean-all shell test test-unit test-integration query status rebuild rebuild-quick watch cache-info edit-config deps-up deps-down deps-logs reset-warden-db reset-warden-db-force warden-db-logs warden-db-shell
+.PHONY: help build build-fast brd brd-fast build-no-cache up up-logs down logs logs-tail restart clean clean-all shell test test-unit test-e2e test-e2e-setup test-e2e-teardown test-e2e-reset query status rebuild rebuild-quick watch cache-info edit-config deps-up deps-down deps-logs reset-warden-db reset-warden-db-force warden-db-logs warden-db-shell
 
 # Enable BuildKit for faster builds
 export DOCKER_BUILDKIT=1
@@ -46,10 +46,15 @@ help:
 	@echo ""
 	@echo "Testing:"
 	@echo " make test-unit - Run Go unit tests"
-	@echo " make test-integration - Run integration tests"
 	@echo " make test - Test proxy connection"
 	@echo " make query - Run sample query"
 	@echo ""
+	@echo "E2E Testing (3-node HA cluster):"
+	@echo " make test-e2e-setup - Start the e2e cluster"
+	@echo " make test-e2e - Run all e2e tests"
+	@echo " make test-e2e-teardown - Stop the e2e cluster"
+	@echo " make test-e2e-reset - Reset and restart the e2e cluster"
+	@echo ""
 	@echo "Maintenance:"
 	@echo " make clean - Clean containers and volumes"
 	@echo " make clean-all - Deep clean (including cache)"
@@ -63,11 +68,31 @@ test-unit:
 	@go test -v -race -coverprofile=coverage.out ./...
 	@echo "✓ All tests passed"
 
-# Run integration tests (if you have a separate integration test suite)
-test-integration:
-	@echo "Running integration tests..."
-	@go test -v -tags=integration ./...
-	@echo "✓ Integration tests passed"
+# Start the e2e 3-node HA cluster
+test-e2e-setup:
+	@echo "Starting e2e cluster..."
+	@bash e2e/setup.sh
+	@echo "✓ E2E cluster ready"
+
+# Run all e2e tests (starts cluster, runs tests, tears down)
+test-e2e: test-e2e-setup
+	@echo "Running e2e tests..."
+	@go test -tags e2e -v ./e2e/... || (bash e2e/teardown.sh && exit 1)
+	@bash e2e/teardown.sh
+	@echo "✓ E2E tests passed"
+
+# Stop the e2e cluster
+test-e2e-teardown:
+	@echo "Tearing down e2e cluster..."
+	@bash e2e/teardown.sh
+	@echo "✓ E2E cluster stopped"
+
+# Reset and restart the e2e cluster
+test-e2e-reset:
+	@echo "Resetting e2e cluster..."
+	@bash e2e/reset.sh
+	@bash e2e/setup.sh
+	@echo "✓ E2E cluster reset and ready"
 
 # Normal build with cache (runs tests first)
 build: test-unit

README.md

Lines changed: 33 additions & 1 deletion
@@ -202,6 +202,7 @@ Explicit mode is **required for AWS** (SigV4 request signing means Warden must h
 | Open source | Yes (MPL-2.0) | Gateway only (MIT) | No |
 | Credential rotation | Two-stage async (prepare → activate) | N/A — virtual keys only | Automated via SaaS |
 | Streaming support | Full HTTP streaming (SSE, chunked) | Yes | N/A |
+| High availability | Active/standby with automatic failover | SaaS-managed | SaaS-managed |
 | Identity model | Single JWT for all providers | RBAC + SSO (Enterprise) | Per-provider identity mapping |
 
 ## Architecture
@@ -214,6 +215,7 @@ Warden is a reverse proxy written in Go. Each provider is registered as a stream
 - **IP-bound sessions** — Sessions are tied to the caller's IP. A stolen session token is useless from a different machine.
 - **Two-stage credential rotation** — Rotation is split into PREPARE (mint new credentials while old ones remain valid) and ACTIVATE (commit new credentials, destroy old ones). For eventually-consistent providers like AWS and Azure, Warden defers activation by a configurable propagation delay, eliminating the polling loops that cloud SDKs typically require.
 - **Seal/unseal model** — Like Vault, Warden protects secrets at rest using envelope encryption. Supports dev mode (in-memory) and production mode with multiple seal types (Shamir, Transit, AWS KMS, GCP KMS, Azure Key Vault, OCI KMS, PKCS11, KMIP) and PostgreSQL storage.
+- **Active/standby HA** — Multiple Warden nodes share a storage backend and use lock-based leader election. One node is active; the rest are hot standbys that forward requests and automatically promote on leader failure. Sealed nodes are prevented from acquiring the leadership lock, eliminating cluster stalls.
 - **Namespace isolation** — Every credential source, policy, and mount point is scoped to a namespace with hard boundaries. Policies cannot leak across namespaces.
 
 ## Getting Started
@@ -281,6 +283,35 @@ Production mode requires a configuration file and external dependencies (Postgre
 ./warden server --config=./warden.hcl
 ```
 
+### High Availability
+
+Warden supports active/standby HA. Multiple nodes share the same storage backend and use PostgreSQL advisory locks for leader election. One node becomes the active leader; the rest are hot standbys that automatically promote on leader failure.
+
+**How it works:**
+
+- **Standby forwarding** — Standby nodes forward all write and read requests to the active leader via mTLS reverse proxy. Clients can send requests to any node; the response is the same regardless of which node receives it.
+- **Automatic failover** — If the leader fails, a standby acquires the lock and promotes itself. Standby nodes detect the leader change and redirect their forwarding proxy to the new leader.
+- **Sealed node protection** — Sealed nodes are prevented from acquiring the leadership lock, ensuring only fully operational nodes can become leader.
+
+**Configuration** — each node needs `api_addr` (its own API address, used by the leader to advertise itself), `cluster_addr` (its mTLS cluster address for inter-node communication), and a shared storage backend with `ha_enabled`:
+
+```hcl
+api_addr = "http://10.0.1.1:8400"
+cluster_addr = "https://10.0.1.1:8401"
+
+storage "postgres" {
+  connection_url = "postgres://warden:password@db:5432/warden?sslmode=require"
+  table = "warden_store"
+  ha_table = "warden_ha_locks"
+  ha_enabled = "true"
+}
+
+listener "tcp" {
+  address = "0.0.0.0:8400"
+  tls_enabled = false
+}
+```
+
 ### Configuration
 
 Warden uses HCL configuration files. See `warden.local.hcl` for a full example covering storage backend, listener, providers, and auth methods.
@@ -299,7 +330,7 @@ Warden uses HCL configuration files. See `warden.local.hcl` for a full example c
 
 **Security** — Rate limiting per identity, mTLS to upstream providers, audit log tamper detection
 
-**Operations** — Active/standby HA, Helm chart, Docker Compose quick start, Terraform module
+**Operations** — Helm chart, Docker Compose quick start, Terraform module
 
 **Developer experience** — Web UI, Swagger/OpenAPI spec, MCP server for AI agent frameworks, SDKs (Go, Python, TypeScript)
 
@@ -314,6 +345,7 @@ make deps-up # Start development dependencies
 make brd-fast # Build and run (skip tests)
 make dev-watch # Hot reload for development
 make test-unit # Run unit tests with race detection
+make test-e2e # Run e2e tests (3-node HA cluster)
 ```
 
 ## License
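Review note on the README's "Sealed node protection" and "Automatic failover" bullets: the in-memory Go sketch below models how a sealed-node guard interacts with lock-based election. It is an assumption-laden illustration, not Warden's implementation. `leaderLock`, `node`, and `tryPromote` are hypothetical names, and the mutex-backed lock merely stands in for the PostgreSQL advisory lock the README mentions (in PostgreSQL, a non-blocking attempt would typically go through `pg_try_advisory_lock`).

```go
package main

import (
	"fmt"
	"sync"
)

// leaderLock is a hypothetical in-memory stand-in for the shared lock that
// the storage backend would provide.
type leaderLock struct {
	mu     sync.Mutex
	holder string
}

// TryAcquire grants the lock to id if it is free or already held by id.
func (l *leaderLock) TryAcquire(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" || l.holder == id {
		l.holder = id
		return true
	}
	return false
}

// Release frees the lock if id holds it; this models leader failure or step-down.
func (l *leaderLock) Release(id string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == id {
		l.holder = ""
	}
}

// node is a simplified cluster member.
type node struct {
	id     string
	sealed bool
}

// tryPromote attempts to become leader. A sealed node never contends for the
// lock, so it cannot win an election it would be unable to serve.
func (n node) tryPromote(l *leaderLock) bool {
	if n.sealed {
		return false
	}
	return l.TryAcquire(n.id)
}

func main() {
	lock := &leaderLock{}
	a := node{id: "node-a"}
	b := node{id: "node-b", sealed: true}
	c := node{id: "node-c"}

	fmt.Println("a promoted:", a.tryPromote(lock)) // lock is free, a is unsealed
	fmt.Println("c promoted:", c.tryPromote(lock)) // a already holds the lock

	lock.Release("node-a") // simulate leader failure

	fmt.Println("b promoted:", b.tryPromote(lock)) // sealed, so it does not contend
	fmt.Println("c promoted:", c.tryPromote(lock)) // free lock goes to an unsealed standby
}
```

The guard sits in front of the lock attempt rather than after it, so a sealed node never holds the lock even briefly; that ordering is what avoids the stall the README describes.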

auth/method/jwt/backend.go

Lines changed: 6 additions & 2 deletions
@@ -6,6 +6,7 @@ import (
 	"crypto/x509"
 	"fmt"
 	"net/http"
+	"strings"
 	"sync"
 	"time"
 
@@ -259,8 +260,11 @@ func verifyJWKSURLReachable(ctx context.Context, jwksURL string, caPEM string) e
 	return verifyURLReachable(ctx, jwksURL, caPEM)
 }
 
-// verifyOIDCDiscoveryURLReachable checks that the OIDC discovery URL is reachable
+// verifyOIDCDiscoveryURLReachable checks that the OIDC discovery endpoint is reachable.
+// The oidcDiscoveryURL is the issuer URL (e.g., http://localhost:4444); this function
+// appends /.well-known/openid-configuration to match what the cap/jwt library does.
 func verifyOIDCDiscoveryURLReachable(ctx context.Context, oidcDiscoveryURL string, caPEM string) error {
-	return verifyURLReachable(ctx, oidcDiscoveryURL, caPEM)
+	wellKnown := strings.TrimSuffix(oidcDiscoveryURL, "/") + "/.well-known/openid-configuration"
+	return verifyURLReachable(ctx, wellKnown, caPEM)
 }
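Review note on the `verifyOIDCDiscoveryURLReachable` change: the URL construction can be exercised in isolation, as in the small sketch below. `wellKnownURL` is a hypothetical helper name that wraps the same expression as the diff; it is not a function in the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// wellKnownURL (hypothetical helper) builds the OIDC discovery document URL
// from an issuer URL using the same expression as the change above.
func wellKnownURL(issuer string) string {
	return strings.TrimSuffix(issuer, "/") + "/.well-known/openid-configuration"
}

func main() {
	// An issuer with and without a trailing slash yields the same URL.
	for _, issuer := range []string{"http://localhost:4444", "http://localhost:4444/"} {
		fmt.Println(wellKnownURL(issuer))
	}
}
```

`strings.TrimSuffix` removes at most one trailing slash, so an issuer ending in `//` would still produce a doubled slash in the result; whether that case needs handling depends on how issuer URLs are validated elsewhere.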
