
Commit a6af263

explore the design space with claude and have it record what we found
1 parent: ecebfdf


2 files changed (+94, -1 lines)


.beads/issues.jsonl

Lines changed: 8 additions & 1 deletion
```diff
@@ -29,10 +29,17 @@
 {"id":"epithet-37","title":"Fill in the rest of the %C connection fields in broker.go","description":"In pkg/broker/broker.go:30, the ConnectionInfo struct needs the remaining fields from the %C hash (local hostname, remote hostname, port, username, ProxyJump)","status":"closed","priority":2,"issue_type":"task","created_at":"2025-10-22T16:09:07.715489-07:00","updated_at":"2025-10-22T16:09:07.715489-07:00","closed_at":"2025-10-22T16:18:48.190311982Z"}
 {"id":"epithet-38","title":"Wire up epithet match arguments to broker MatchRequest RPC","description":"Add CLI arguments to MatchCLI (--host, --port, --user, --hash) and implement RPC call to broker passing all MatchRequest fields including derived LocalHost/LocalUser","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-22T16:09:07.715489-07:00","updated_at":"2025-10-22T16:09:07.715489-07:00","closed_at":"2025-10-22T16:22:47.109966271Z","dependencies":[{"issue_id":"epithet-38","depends_on_id":"epithet-37","type":"discovered-from","created_at":"2025-10-22T16:09:07.724121-07:00","created_by":"import"}]}
 {"id":"epithet-39","title":"Implement initial authentication in broker.Match when no token exists","description":"","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-25T12:00:33.253643-07:00","updated_at":"2025-10-25T12:05:49.332212-07:00","closed_at":"2025-10-25T12:05:49.332212-07:00","dependencies":[{"issue_id":"epithet-39","depends_on_id":"epithet-1","type":"blocks","created_at":"2025-10-25T12:00:33.256094-07:00","created_by":"daemon"}]}
-{"id":"epithet-40","title":"Handle re-authentication when CA returns 403/Forbidden","description":"","status":"open","priority":1,"issue_type":"task","created_at":"2025-10-25T12:00:36.648331-07:00","updated_at":"2025-10-25T12:00:36.648331-07:00","dependencies":[{"issue_id":"epithet-40","depends_on_id":"epithet-1","type":"blocks","created_at":"2025-10-25T12:00:36.650962-07:00","created_by":"daemon"}]}
+{"id":"epithet-40","title":"Handle re-authentication when CA returns 403/Forbidden","description":"","design":"Use standard HTTP semantics for auth vs authz failures:\n\n401 from CA/policy server:\n- Means: Token is invalid, expired, or missing\n- Broker action: Clear auth.token, call auth.Run() to re-authenticate, retry cert request with new token\n- User experience: Brief pause while re-authenticating, then connection proceeds (or fails for other reason)\n\n403 from CA/policy server:\n- Means: Token is valid (user authenticated), but not authorized for this access\n- May be temporary (approval workflow pending) or permanent (user lacks permission)\n- Broker action: Keep token (still valid!), return policy server error message to user, do NOT auto-retry\n- User experience: See error message explaining why (e.g., 'Approval required from ops-team'), can manually retry after approval granted\n\nThis distinction is important because:\n- 403 might resolve later (approval granted) with same token - don't force re-auth\n- 401 won't resolve without new token - must re-auth\n- Follows standard HTTP semantics (401=authentication, 403=authorization)","status":"open","priority":1,"issue_type":"task","created_at":"2025-10-25T12:00:36.648331-07:00","updated_at":"2025-10-25T18:53:44.682229-07:00","dependencies":[{"issue_id":"epithet-40","depends_on_id":"epithet-1","type":"blocks","created_at":"2025-10-25T12:00:36.650962-07:00","created_by":"daemon"}]}
 {"id":"epithet-41","title":"Pass connection details to auth command for mustache template rendering","description":"Currently auth.Run() is called with nil. We should pass MatchRequest fields (host, user, port, etc.) so auth commands can use mustache templates like {{host}} or {{user}} in their command line configuration.","status":"open","priority":3,"issue_type":"task","created_at":"2025-10-25T12:11:30.932645-07:00","updated_at":"2025-10-25T12:11:30.932645-07:00","dependencies":[{"issue_id":"epithet-41","depends_on_id":"epithet-1","type":"blocks","created_at":"2025-10-25T12:11:30.934792-07:00","created_by":"daemon"}]}
 {"id":"epithet-42","title":"Update caserver for v2: accept match data and return policy with certificate","description":"Currently CreateCertRequest only has token+publicKey. Need to add match data (RemoteHost, RemoteUser, Port, etc.) and pass to policy server. CreateCertResponse should return both certificate and policy pattern so broker can store it.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-25T12:16:55.355488-07:00","updated_at":"2025-10-25T12:18:32.616775-07:00","closed_at":"2025-10-25T12:18:32.616775-07:00"}
 {"id":"epithet-43","title":"Add comprehensive concurrency documentation and fix race conditions","description":"Add locking invariants documentation to Broker, Agent, Auth, and CertificateStore. Remove unused Agent.lock field. Fix race condition in sshd test helper by wrapping bytes.Buffer with thread-safe safeBuffer. All code now passes race detector tests.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-10-25T15:43:03.632286-07:00","updated_at":"2025-10-25T15:43:08.355558-07:00","closed_at":"2025-10-25T15:43:08.355558-07:00"}
+{"id":"epithet-44","title":"Expand policy matching beyond hostname to support per-user certificates","description":"Current limitation: Policy only matches on hostPattern, but different remote users connecting to the same host may need different certificates with different principals (e.g., deploy@server vs root@server).\n\nProblem: If user SSHs to deploy@server.example.com (gets cert with principal 'deploy'), then later SSHs to root@server.example.com, the cert store lookup finds the existing cert (matches *.example.com) but it only has 'deploy' principal, not 'root', causing SSH to fail.\n\nSolution: Expand Policy struct to match on additional connection fields beyond just hostname:\n- remoteUser (e.g., 'deploy', 'root', '*')\n- Potentially localUser, port, etc.\n\nDesign questions to resolve:\n1. Should matching be AND logic (all fields must match) or pattern-based with wildcards per field?\n2. Should lookup prefer exact matches first, then fall back to broader patterns? Or most-specific-match-wins?\n3. How do we handle wildcard patterns in multiple dimensions?\n\nThis affects:\n- pkg/policy/policy.go - Policy struct and Matches() method\n- pkg/broker/certs.go - CertificateStore.Lookup() logic\n- Policy server API - may need to return more granular policies","status":"open","priority":2,"issue_type":"task","created_at":"2025-10-25T18:39:38.843754-07:00","updated_at":"2025-10-25T18:39:38.843754-07:00"}
+{"id":"epithet-45","title":"Make broker socket path configurable to support multiple concurrent brokers","description":"Currently the broker socket path defaults to ~/.epithet/broker.sock, which prevents running multiple brokers concurrently. Need to make this configurable.\n\nUse case: User may want separate brokers for work vs personal, different CA servers, different match patterns, etc.\n\nExample setup:\n- Work broker: --broker-sock ~/.epithet/work-broker.sock --match *.work.example.com\n- Personal broker: --broker-sock ~/.epithet/personal-broker.sock --match *.personal.example.com\n\nChanges needed:\n1. epithet agent command: Already has default, keep it configurable via flag/config\n2. epithet match command: Add --broker flag to specify which broker to connect to (currently hardcoded in match.go)\n3. Update SSH config examples to show broker selection\n\nNote: Agent socket directory is already configurable via --agent-sock-dir flag ✓\n\nFiles to update:\n- cmd/epithet/match.go - Add --broker flag, use instead of hardcoded path\n- cmd/epithet/agent.go - Already has --broker-sock flag ✓\n- SSH config examples in docs\n","status":"open","priority":2,"issue_type":"task","created_at":"2025-10-25T18:57:24.191353-07:00","updated_at":"2025-10-25T18:57:24.191353-07:00"}
+{"id":"epithet-46","title":"Explore SSH Match exec behavior: can we fail just the Match or does it abort the entire SSH connection?","description":"","status":"open","priority":2,"issue_type":"task","created_at":"2025-10-25T19:44:34.301361-07:00","updated_at":"2025-10-25T19:44:34.301361-07:00"}
+{"id":"epithet-47","title":"Implement 401 retry logic in broker: clear token, re-auth, retry cert request (limit retries to prevent infinite loops)","description":"When CA returns 401 Unauthorized: 1) Clear the current token, 2) Invoke auth plugin (which may use refresh token from state or do full re-auth), 3) Retry cert request with new token. Limit retries to prevent infinite loops with buggy auth plugins (suggest max 2-3 attempts). Use immediate retries (no backoff delay) - if there's a persistent issue, user will see the error and can retry the SSH connection. If retries exhausted, fail the Match and log error to stderr per epithet-48.","status":"open","priority":1,"issue_type":"task","created_at":"2025-10-25T19:46:17.050687-07:00","updated_at":"2025-10-25T20:00:20.730178-07:00"}
+{"id":"epithet-48","title":"When epithet cannot obtain certificate, fail the Match (return non-zero) and log clear error to stderr","description":"When epithet cannot obtain a certificate (auth failures, CA errors, etc), the Match exec should: 1) Log clear, user-friendly error message to stderr explaining what went wrong (verbosity matching configured log level - helpful by default, more detail with -v flags), 2) Exit with non-zero status to fail the Match, 3) Allow SSH to fall through to subsequent Match blocks or default config. This enables breakglass/fallback scenarios where users have epithet Match blocks first, followed by special-case configs (e.g., breakglass@host with specific IdentityFile). Trade-off: May leak connection attempts to fallback systems, but this is acceptable to enable legitimate escape hatches.","status":"open","priority":1,"issue_type":"task","created_at":"2025-10-25T19:51:28.702502-07:00","updated_at":"2025-10-25T20:01:48.450958-07:00"}
+{"id":"epithet-49","title":"Handle agent creation failures: keep cert in store, fail Match with clear error about local system issue","description":"When certificate is valid but agent creation fails (socket directory permissions, disk space, etc): 1) Keep certificate in cert store (it's valid and may work on retry or for other connections), 2) Fail the Match with clear error explaining the agent creation problem (not a cert/auth issue), 3) User can fix local issue and retry. Agent creation failures are typically local system problems, not certificate/policy problems.","status":"open","priority":2,"issue_type":"task","created_at":"2025-10-25T20:03:10.81255-07:00","updated_at":"2025-10-25T20:03:10.81255-07:00"}
+{"id":"epithet-50","title":"Parse SSH certificate ValidBefore field to get actual expiry time instead of hardcoding 5 minutes","description":"Currently broker.Match() and ensureAgent() hardcode 5-minute expiry with TODO comments. Need to parse the SSH certificate to extract ValidBefore timestamp and use that for expiration tracking in agentEntry and PolicyCert. The golang.org/x/crypto/ssh library provides this functionality. Certificate is the source of truth for expiry time.","status":"open","priority":0,"issue_type":"task","created_at":"2025-10-25T20:05:43.536749-07:00","updated_at":"2025-10-25T20:06:06.813792-07:00"}
 {"id":"epithet-7","title":"Implement auth command invocation (stdin/stdout protocol)","description":"","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-22T16:09:07.715489-07:00","updated_at":"2025-10-22T16:09:07.715489-07:00","closed_at":"2025-10-22T22:18:39.64510665Z","dependencies":[{"issue_id":"epithet-7","depends_on_id":"epithet-1","type":"parent-child","created_at":"2025-10-22T16:09:07.72441-07:00","created_by":"import"}]}
 {"id":"epithet-8","title":"Implement auth state storage (map of user identity → state blob)","description":"","design":"Auth type already implements state cycling correctly in auth.go:127-171. Task is to change broker from single 'auth *Auth' to 'auths map[string]*Auth' where key is user identity (probably LocalUser from MatchRequest). Need method like GetOrCreateAuth(userID) that returns *Auth for that user.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-22T16:09:07.715489-07:00","updated_at":"2025-10-22T16:09:07.715489-07:00","closed_at":"2025-10-22T22:16:49.792978424Z","dependencies":[{"issue_id":"epithet-8","depends_on_id":"epithet-1","type":"parent-child","created_at":"2025-10-22T16:09:07.724683-07:00","created_by":"import"}]}
 {"id":"epithet-9","title":"Implement certificate storage (map of connection hash → certificate + expiry)","description":"","status":"closed","priority":1,"issue_type":"task","created_at":"2025-10-22T16:09:07.715489-07:00","updated_at":"2025-10-25T11:32:56.612656-07:00","closed_at":"2025-10-25T11:32:56.612656-07:00","dependencies":[{"issue_id":"epithet-9","depends_on_id":"epithet-1","type":"parent-child","created_at":"2025-10-22T16:09:07.724945-07:00","created_by":"import"}]}
```
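
epithet-44 leaves its design questions open. As one possible answer to question 1, the sketch below assumes AND logic with a glob pattern per field, using Go's standard `path.Match` for wildcards. The `Policy` fields `HostPattern` and `RemoteUserPattern` are hypothetical; the real struct in `pkg/policy/policy.go` may look different, and questions 2 and 3 (precedence between overlapping patterns) are not addressed here.

```go
package main

import (
	"fmt"
	"path"
)

// Policy is a hypothetical extension of epithet's policy type (the real one
// lives in pkg/policy/policy.go). A connection matches only if every
// non-empty pattern matches (AND logic); an empty pattern means "any".
type Policy struct {
	HostPattern       string // e.g. "*.example.com"
	RemoteUserPattern string // e.g. "deploy", "root", "*"
}

// fieldMatches reports whether value matches a glob pattern.
// An empty pattern matches anything; a malformed pattern matches nothing.
func fieldMatches(pattern, value string) bool {
	if pattern == "" {
		return true
	}
	ok, err := path.Match(pattern, value)
	return err == nil && ok
}

// Matches applies AND logic across all fields of the policy.
func (p Policy) Matches(host, remoteUser string) bool {
	return fieldMatches(p.HostPattern, host) &&
		fieldMatches(p.RemoteUserPattern, remoteUser)
}

func main() {
	deploy := Policy{HostPattern: "*.example.com", RemoteUserPattern: "deploy"}
	fmt.Println(deploy.Matches("server.example.com", "deploy")) // true
	fmt.Println(deploy.Matches("server.example.com", "root"))   // false: root needs its own cert
}
```

With AND logic, `deploy@server.example.com` and `root@server.example.com` resolve to different policies, so the cert-store lookup no longer hands out a certificate that lacks the `root` principal.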

CLAUDE.md

Lines changed: 86 additions & 0 deletions
@@ -191,6 +191,92 @@ The broker communicates with auth plugins using **keyed netstrings** (Type-Lengt

The CA uses cryptographic signing (via Rekor/Sigstore SSH signing) to authenticate requests to the policy server.

### Error Handling and Match Behavior

**IMPORTANT**: These design decisions determine how epithet interacts with SSH's `Match exec` behavior.

#### SSH Config Precedence
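- SSH uses **first match wins** for configuration parameters: every Match block whose criteria hold is applied, but for most parameters the first value obtained is kept and later blocks cannot override it
- More specific Match blocks should therefore appear before general ones
- When a `Match exec` command exits non-zero, that Match block does not apply and SSH continues to the next Match block or the default config
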
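To make the precedence rules concrete, here is a toy Go model of how ssh resolves settings across Match blocks. It is an illustration, not ssh's parser: the `block` type is hypothetical, directive names are compared case-sensitively, and directives that accumulate instead of keeping the first value (such as `IdentityFile`) are not modelled.

```go
package main

import "fmt"

// block is a toy stand-in for one ssh_config Match block: whether its
// criteria held (for Match exec: did the command exit 0?) and the
// directives it would set.
type block struct {
	applies    bool
	directives map[string]string
}

// resolve walks blocks in file order, skips blocks whose criteria failed,
// and keeps the first value seen for each directive.
func resolve(blocks []block) map[string]string {
	cfg := map[string]string{}
	for _, b := range blocks {
		if !b.applies {
			continue // e.g. `epithet match` exited non-zero
		}
		for key, value := range b.directives {
			if _, seen := cfg[key]; !seen {
				cfg[key] = value
			}
		}
	}
	return cfg
}

func main() {
	// epithet could not get a cert, so its block is skipped and the
	// breakglass block supplies the identity instead.
	cfg := resolve([]block{
		{applies: false, directives: map[string]string{"IdentityAgent": "~/.epithet/sockets/%C"}},
		{applies: true, directives: map[string]string{"IdentityFile": "~/.ssh/breakglass_cert"}},
	})
	fmt.Println(cfg) // map[IdentityFile:~/.ssh/breakglass_cert]
}
```

Had `epithet match` exited 0, both blocks could apply for the `breakglass` user; only the first value of each individual directive wins.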
#### Match Failure Strategy
When epithet cannot obtain a certificate (auth failure, CA error, or agent creation failure), `epithet match` should:
1. **Log a clear error to stderr** - a user-friendly message explaining what went wrong, with verbosity matching the configured log level
2. **Exit with non-zero status** - this fails the Match, so SSH falls through to the next config
3. **Allow SSH fallback** - this enables breakglass/escape-hatch scenarios

**Rationale:**
- Enables breakglass accounts: users can put epithet Match blocks first, followed by special-case configs (e.g., `Match host *.example.com user breakglass` with a specific IdentityFile)
- If epithet fails the Match, SSH can try other auth methods (default keys, other agents)
- Trade-off: connection attempts may leak to fallback systems, but this is acceptable to enable legitimate escape hatches
- Users who need strict security can configure SSH with no fallbacks after the epithet Match blocks

**Recommended SSH Config Structure:**
```ssh_config
# Epithet handling - first so it gets priority
Match exec "epithet match --host %h --port %p --user %r --hash %C"
    IdentityAgent ~/.epithet/sockets/%C

# Breakglass/special cases - after epithet
Match host *.example.com user breakglass
    IdentityFile ~/.ssh/breakglass_cert

# Default config last
```

#### CA Error Handling

**HTTP 401 Unauthorized** - Token is invalid, expired, or missing:
1. Clear the current token
2. Invoke the auth plugin (it may use a refresh token from its state or do a full re-auth)
3. Retry the cert request with the new token
4. Limit retries (2-3 attempts) to prevent infinite loops with buggy auth plugins
5. Retry immediately (no backoff); if the issue persists, the user sees the error and can retry the SSH connection
6. If retries are exhausted, fail the Match with a clear error

**HTTP 403 Forbidden** - Authentication succeeded but policy denied the request:
1. Keep the token (it is valid, just not authorized for this connection)
2. Fail the Match with a clear error explaining the policy denial, including the policy server's message
3. Do not auto-retry (the denial may be permanent, or may lift later, e.g. once an approval is granted, after which the user retries manually)

**HTTP 5xx Server Error** - Transient CA/policy server issue:
1. Keep the token
2. Fail the Match with a clear error
3. The user can retry the SSH connection

**HTTP 4xx Client Error** (other than 401/403):
1. Keep the token
2. Fail the Match with a clear error
3. Do not retry (likely a permanent client-side issue)

#### Auth Plugin Error Handling

**Exit 0 with error field** - User-facing auth failure (cancelled flow, failed MFA, invalid credentials):
1. Keep the existing state (do not clear it)
2. Fail the Match with the error message from the auth plugin
3. The user can retry the SSH connection when ready

**Non-zero exit** - Unexpected error (network issue, plugin crash, etc.):
1. Keep the existing state
2. Retry up to the limit (same as the CA 401 retry limit)
3. If retries are exhausted, fail the Match with an error
4. Retry immediately (no backoff)

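The two plugin outcomes differ only in whether they are retried. A sketch, assuming a hypothetical `pluginResult` that holds the exit status and the error field already decoded from the plugin's netstring output (decoding is not shown), and a limit of 3 attempts:

```go
package main

import (
	"errors"
	"fmt"
)

// pluginResult is a hypothetical summary of one auth plugin run: its exit
// status plus fields already decoded from its netstring output.
type pluginResult struct {
	ExitCode int
	ErrorMsg string // user-facing error field; empty on success
	Token    string
}

const maxPluginAttempts = 3 // assumption: same bound as the CA 401 loop

// runAuth retries only unexpected failures (non-zero exit), immediately and
// without backoff. A user-facing failure (exit 0 plus error field) ends at
// once with the plugin's own message.
func runAuth(run func() pluginResult) (string, error) {
	var last pluginResult
	for attempt := 0; attempt < maxPluginAttempts; attempt++ {
		last = run()
		switch {
		case last.ExitCode != 0:
			continue // crash or network issue: try again
		case last.ErrorMsg != "":
			return "", errors.New(last.ErrorMsg) // cancelled flow, failed MFA, ...
		default:
			return last.Token, nil
		}
	}
	return "", fmt.Errorf("auth plugin failed %d times (last exit status %d)", maxPluginAttempts, last.ExitCode)
}

func main() {
	calls := 0
	token, err := runAuth(func() pluginResult {
		calls++
		if calls == 1 {
			return pluginResult{ExitCode: 1} // transient failure
		}
		return pluginResult{Token: "fresh-token"}
	})
	fmt.Println(token, err, calls) // fresh-token <nil> 2
}
```

State handling sits outside the sketch: on both failure paths the caller leaves the stored plugin state as it was.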
#### Certificate and Agent Management

**Certificate Storage:**
- Always store certificates obtained from the CA, and keep them in the store even if agent creation later fails (the cert is still valid)
- Certificates are bound to policies (hostPattern), not to individual agents
- Multiple agents (different connection hashes) may reuse the same certificate if the policy matches

**Agent Creation Failures:**
- Typically caused by local system issues (permissions, disk space, socket directory problems)
- Keep the certificate in the store (it is valid and may work on retry)
- Fail the Match with a clear error explaining the local issue (not a cert/auth problem)
- The user can fix the local issue and retry

## Development Commands

### Building
