Commit 1dddc0e

docs: add feature specification for review issues (001-fix-review-issues)
- spec.md: 6 user stories addressing 13 review findings
- plan.md: implementation phases with constitutional alignment
- research.md: technical decisions (Raft state, snapshot mechanism, test patterns)
- data-model.md: backup format V2 specification
- tasks.md: 24 tasks organized by user story and priority
- quickstart.md: implementation order and validation
- contracts/backup-format-v2.md: V2 backup format contract
- checklists/requirements.md: specification quality validation (all items pass)
1 parent b58c9ce commit 1dddc0e

File tree

23 files changed

+2547
-106
lines changed

.github/workflows/test.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -37,4 +37,4 @@ jobs:
3737
mix deps.get
3838
3939
- name: Run tests
40-
run: mix test
40+
run: mix test --cover

.specify/memory/constitution.md

Lines changed: 112 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -3,16 +3,14 @@
33
<!--
44
Sync Impact Report
55
==================
6-
Version change: 0.0.0 → 1.0.0 (initial ratification)
7-
Modified principles: N/A (initial)
6+
Version change: 1.0.0 → 1.1.0
7+
Modified principles: None renamed
88
Added sections:
9-
- Core Principles (7 principles)
10-
- Architectural Constraints
11-
- Development Workflow
12-
- Governance
13-
Removed sections: N/A
9+
- Principle VIII: Deterministic State Machine (6 correctness invariants)
10+
Removed sections: None
1411
Templates requiring updates:
15-
- .specify/templates/plan-template.md ✅ (Constitution Check section compatible)
12+
- .specify/templates/plan-template.md ✅ (Constitution Check section is
13+
dynamic — new principle auto-included)
1614
- .specify/templates/spec-template.md ✅ (Requirements section compatible)
1715
- .specify/templates/tasks-template.md ✅ (Phase structure compatible)
1816
Follow-up TODOs: None
@@ -22,70 +20,92 @@ Follow-up TODOs: None
2220

2321
### I. Consistency First
2422

25-
All write operations MUST go through Raft consensus to ensure strong consistency across the cluster. The system is designed as a CP (Consistent + Partition-tolerant) system where:
23+
All write operations MUST go through Raft consensus to ensure strong
24+
consistency across the cluster. The system is designed as a CP
25+
(Consistent + Partition-tolerant) system where:
2626

2727
- Writes require quorum acknowledgment before returning success
28-
- Reads default to leader consistency but support configurable consistency levels (`:eventual`, `:leader`, `:strong`)
29-
- No operation may sacrifice consistency for availability during network partitions
28+
- Reads default to leader consistency but support configurable
29+
consistency levels (`:eventual`, `:leader`, `:strong`)
30+
- No operation may sacrifice consistency for availability during
31+
network partitions
3032

31-
**Rationale**: As a distributed coordination system, incorrect data is worse than unavailable data. Applications relying on Concord for configuration, feature flags, or coordination require absolute certainty about data accuracy.
33+
**Rationale**: As a distributed coordination system, incorrect data is
34+
worse than unavailable data. Applications relying on Concord for
35+
configuration, feature flags, or coordination require absolute
36+
certainty about data accuracy.
3237

3338
### II. Embedded by Design
3439

35-
Concord MUST function as an embedded library that starts with the host application. This means:
40+
Concord MUST function as an embedded library that starts with the host
41+
application. This means:
3642

3743
- No separate infrastructure or external processes required
3844
- Application lifecycle controls Concord lifecycle
39-
- Configuration follows Elixir conventions (config files, environment variables)
45+
- Configuration follows Elixir conventions (config files, environment
46+
variables)
4047
- Zero operational overhead for single-node development
4148

42-
**Rationale**: Lowering the barrier to entry enables adoption. Developers should be able to add distributed coordination to their apps as easily as adding any other dependency.
49+
**Rationale**: Lowering the barrier to entry enables adoption.
50+
Developers should be able to add distributed coordination to their
51+
apps as easily as adding any other dependency.
4352

4453
### III. Performance Without Compromise
4554

46-
The system MUST maintain microsecond-level performance for reads and low-millisecond performance for writes:
55+
The system MUST maintain microsecond-level performance for reads and
56+
low-millisecond performance for writes:
4757

48-
- Read operations: target <10μs for ETS lookups
58+
- Read operations: target <10us for ETS lookups
4959
- Write operations: target <20ms for quorum commits
5060
- Throughput: maintain 600K+ ops/sec under load
5161
- All performance-critical paths MUST avoid blocking operations
5262

53-
**Rationale**: A coordination layer that introduces latency becomes a bottleneck. Performance MUST be a feature, not an afterthought.
63+
**Rationale**: A coordination layer that introduces latency becomes a
64+
bottleneck. Performance MUST be a feature, not an afterthought.
5465

5566
### IV. Observability as Infrastructure
5667

57-
Every operation MUST emit telemetry events. Observability is not optional:
68+
Every operation MUST emit telemetry events. Observability is not
69+
optional:
5870

5971
- All API operations emit `[:concord, :api, :*]` events
6072
- All internal operations emit `[:concord, :operation, :*]` events
6173
- State changes emit `[:concord, :state, :*]` events
6274
- OpenTelemetry tracing MUST be available for distributed debugging
6375
- Prometheus metrics MUST be exportable
6476

65-
**Rationale**: Distributed systems are inherently harder to debug. Without comprehensive observability, production issues become impossible to diagnose.
77+
**Rationale**: Distributed systems are inherently harder to debug.
78+
Without comprehensive observability, production issues become
79+
impossible to diagnose.
6680

6781
### V. Secure Defaults
6882

6983
Security MUST be enabled by default in production environments:
7084

7185
- Authentication required for all operations when `auth_enabled: true`
72-
- Token-based authentication with cryptographically secure token generation
86+
- Token-based authentication with cryptographically secure token
87+
generation
7388
- RBAC (Role-Based Access Control) for fine-grained permissions
7489
- TLS support for transport security
7590
- Audit logging for compliance requirements
7691

77-
**Rationale**: Security vulnerabilities in coordination systems can compromise entire application fleets. Secure-by-default prevents accidental exposure.
92+
**Rationale**: Security vulnerabilities in coordination systems can
93+
compromise entire application fleets. Secure-by-default prevents
94+
accidental exposure.
7895

7996
### VI. Test-Driven Quality
8097

8198
All features MUST have corresponding tests before merge:
8299

83100
- Unit tests for isolated component behavior
84-
- E2E tests for distributed scenarios (leader election, network partitions, node failures)
101+
- E2E tests for distributed scenarios (leader election, network
102+
partitions, node failures)
85103
- Tests run with `async: false` to avoid Ra cluster conflicts
86104
- State machine changes require cluster restart verification
87105

88-
**Rationale**: Distributed systems have subtle failure modes. Comprehensive testing is the only way to maintain confidence in correctness.
106+
**Rationale**: Distributed systems have subtle failure modes.
107+
Comprehensive testing is the only way to maintain confidence in
108+
correctness.
89109

90110
### VII. API Stability
91111

@@ -94,9 +114,53 @@ Public API changes MUST follow semantic versioning:
94114
- MAJOR: Breaking changes to `Concord.*` public functions
95115
- MINOR: New features, new optional parameters
96116
- PATCH: Bug fixes, performance improvements
97-
- State machine version changes MUST be backward compatible or include migration paths
98-
99-
**Rationale**: Applications depend on Concord for critical coordination. Breaking changes without warning erode trust.
117+
- State machine version changes MUST be backward compatible or include
118+
migration paths
119+
120+
**Rationale**: Applications depend on Concord for critical
121+
coordination. Breaking changes without warning erode trust.
122+
123+
### VIII. Deterministic State Machine
124+
125+
The Ra state machine (`Concord.StateMachine`) MUST remain
126+
deterministic and serialization-safe at all times. These six invariants
127+
are non-negotiable:
128+
129+
1. **Deterministic replay**: `apply/3` MUST be a pure function of
130+
`(meta, command, state)`. Time MUST come from `meta.system_time`
131+
(leader-assigned milliseconds), NEVER from `System.system_time` or
132+
any other wall-clock source. Use `meta_time(meta)` to convert to
133+
seconds.
134+
135+
2. **No anonymous functions in Raft state or log**: Index extractors,
136+
conditions, and any data entering the Raft log MUST use declarative
137+
specs (tuples like `{:map_get, :email}`, `{:nested, [:a, :b]}`,
138+
`{:identity}`, `{:element, n}`). Closures cause `:badfun` on
139+
deserialization across code versions or nodes.
140+
141+
3. **All mutations through Raft**: Auth tokens, RBAC roles/grants/ACLs,
142+
tenant definitions, backup restores, and any other state change MUST
143+
route through `:ra.process_command`. Direct ETS writes are ONLY
144+
acceptable as fallback when the cluster is not yet ready (`:noproc`).
145+
146+
4. **ETS tables are materialized views**: ETS is rebuilt from
147+
authoritative Raft state on `snapshot_installed/4`. ETS MUST NEVER
148+
be treated as the source of truth.
149+
150+
5. **Snapshots via `release_cursor` effect**: Ra does NOT have a
151+
`snapshot/1` callback. Snapshots MUST be emitted every 1000 commands
152+
as `{:release_cursor, index, state}` effects. The state MUST include
153+
ETS data captured by `build_release_cursor_state/1`.
154+
155+
6. **Pre-consensus evaluation**: `put_if`/`delete_if` MUST evaluate
156+
condition functions at the API layer, then convert to CAS commands
157+
with `expected: current_value` before entering the Raft log. This
158+
keeps the log deterministic.
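
Invariant 6 can be illustrated with a minimal sketch. The module name `CasSketch`, the `{:cas, key, expected, new_value}` command shape, and the error atoms are assumptions made for illustration and are not taken from the Concord codebase. The point is only that the closure runs at the API layer and the command that would enter the log contains plain data.

```elixir
defmodule CasSketch do
  # Hypothetical API-layer helper: the condition function is evaluated here,
  # before anything is handed to Raft, so no closure reaches the log.
  def put_if_command(store, key, new_value, condition) when is_function(condition, 1) do
    current = Map.get(store, key)

    if condition.(current) do
      # Only plain terms go into the command (assumed shape).
      {:ok, {:cas, key, current, new_value}}
    else
      {:error, :condition_failed}
    end
  end

  # Hypothetical state-machine side: compares plain data only, so every
  # replica reaches the same decision when replaying the same command.
  def apply_cas({:cas, key, expected, new_value}, state) do
    if Map.get(state, key) == expected do
      {Map.put(state, key, new_value), :ok}
    else
      {state, {:error, :conflict}}
    end
  end
end
```

If the value changes between the API-layer check and the commit, the compare step fails on every replica in the same way, which is what keeps replay deterministic.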
159+
160+
**Rationale**: Violating any of these invariants causes state
161+
divergence between Raft replicas, data corruption on log replay, or
162+
deserialization failures across nodes. These are the most critical
163+
rules in the entire codebase.
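
A rough sketch of invariants 1, 2, and 5 follows. It is not the project's `Concord.StateMachine`: the module name `DeterministicSketch`, the function `apply_command/3` (standing in for `apply/3`), the `{:put, key, value, ttl_seconds}` command, and the choice to trigger a snapshot when `meta.index` is a multiple of 1000 are all assumptions. It uses a plain map as state so it runs without Ra.

```elixir
defmodule DeterministicSketch do
  # Assumed interval; the text above says "every 1000 commands".
  @snapshot_interval 1000

  # Leader-assigned milliseconds converted to seconds.
  def meta_time(%{system_time: ms}), do: div(ms, 1000)

  # Declarative extractor specs: plain tuples instead of closures,
  # so they can be serialized into the log without :badfun risk.
  def extract({:map_get, key}, value), do: Map.get(value, key)
  def extract({:nested, path}, value), do: get_in(value, path)
  def extract({:identity}, value), do: value

  # Stand-in for apply/3: the result depends only on (meta, command, state).
  # Expiry is derived from meta.system_time, never from the wall clock.
  def apply_command(meta, {:put, key, value, ttl_seconds}, state) do
    expires_at = meta_time(meta) + ttl_seconds
    new_state = Map.put(state, key, {value, expires_at})
    {new_state, :ok, snapshot_effects(meta, new_state)}
  end

  # Emit a release_cursor effect on every 1000th log index (assumed policy).
  defp snapshot_effects(%{index: index}, state)
       when rem(index, @snapshot_interval) == 0 do
    [{:release_cursor, index, state}]
  end

  defp snapshot_effects(_meta, _state), do: []
end
```

Calling `apply_command/3` twice with the same arguments returns identical results, which is the property log replay relies on.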
100164

101165
## Architectural Constraints
102166

@@ -118,10 +182,16 @@ Public API changes MUST follow semantic versioning:
118182

119183
### Data Flow Invariants
120184

121-
1. All writes flow through: Client API → Auth → Validation → `:ra.process_command` → State Machine → ETS
122-
2. Reads may bypass Raft log via `:ra.consistent_query` or `:ra.local_query`
123-
3. Server ID format MUST be `{:concord_cluster, node()}` (not module-based)
124-
4. Query functions return `{:ok, result}`; Ra wraps as `{:ok, {:ok, result}, leader_info}`
185+
1. All writes flow through: Client API -> Auth -> Validation ->
186+
`:ra.process_command` -> State Machine -> ETS
187+
2. Reads may bypass Raft log via `:ra.consistent_query` or
188+
`:ra.local_query`
189+
3. Server ID format MUST be `{:concord_cluster, node()}`
190+
(not module-based)
191+
4. Query functions return `{:ok, result}`; Ra wraps as
192+
`{:ok, {:ok, result}, leader_info}`
193+
5. State machine correctness invariants per Principle VIII MUST be
194+
enforced at every layer
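
Invariant 4 implies that callers see a nested tuple. A small helper along these lines could flatten it; `QueryUnwrap` is a hypothetical module, and the error and timeout shapes it handles are assumptions about what a caller might receive rather than a statement of Ra's full return contract.

```elixir
defmodule QueryUnwrap do
  # Flattens the shape described in invariant 4:
  # {:ok, {:ok, result}, leader_info} becomes {:ok, result}.
  def unwrap({:ok, {:ok, result}, _leader_info}), do: {:ok, result}

  # Assumed: a query function may also return {:error, reason} inside the wrapper.
  def unwrap({:ok, {:error, reason}, _leader_info}), do: {:error, reason}

  # Assumed failure shapes passed through in a uniform error form.
  def unwrap({:error, reason}), do: {:error, reason}
  def unwrap({:timeout, server}), do: {:error, {:timeout, server}}
end
```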
125195

126196
## Development Workflow
127197

@@ -141,25 +211,31 @@ Public API changes MUST follow semantic versioning:
141211

142212
### Commit Standards
143213

144-
- Semantic commit messages: `feat:`, `fix:`, `docs:`, `chore:`, `test:`, `refactor:`
214+
- Semantic commit messages: `feat:`, `fix:`, `docs:`, `chore:`,
215+
`test:`, `refactor:`
145216
- No auto-generated footers (Claude Code, Co-Authored-By)
146217
- Each commit should be atomic and independently reversible
147218

148219
## Governance
149220

150-
This constitution supersedes all other development practices for the Concord project.
221+
This constitution supersedes all other development practices for the
222+
Concord project.
151223

152224
### Amendment Process
153225

154-
1. Propose changes via pull request to `.specify/memory/constitution.md`
226+
1. Propose changes via pull request to
227+
`.specify/memory/constitution.md`
155228
2. Document rationale for each change
156229
3. Update dependent templates if principles change
157230
4. Increment version according to semantic rules
158231

159232
### Compliance
160233

161234
- All PRs MUST verify adherence to Core Principles
162-
- Complexity additions MUST be justified against Principle VII (simplicity via API stability)
235+
- Complexity additions MUST be justified against Principle VII
236+
(simplicity via API stability)
237+
- State machine changes MUST be reviewed against Principle VIII
238+
invariants
163239
- Constitution violations require explicit exception documentation
164240

165241
### Version Policy
@@ -168,4 +244,4 @@ This constitution supersedes all other development practices for the Concord pro
168244
- MINOR: New principle or section added
169245
- PATCH: Clarifications, wording improvements
170246

171-
**Version**: 1.0.0 | **Ratified**: 2025-12-03 | **Last Amended**: 2025-12-03
247+
**Version**: 1.1.0 | **Ratified**: 2025-12-03 | **Last Amended**: 2026-03-03

CLAUDE.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -121,3 +121,10 @@ See `docs/` for architectural documents:
121121
- `docs/DESIGN.md` — Original design blueprint
122122
- `docs/API_DESIGN.md` — HTTP API design
123123
- `docs/API_USAGE_EXAMPLES.md` — HTTP API usage examples
124+
125+
## Active Technologies
126+
- Elixir 1.18 / OTP 28 + Ra 2.17.1 (Raft), libcluster 3.5.0, Bandit 1.8.0, Plug 1.18.1 (001-fix-review-issues)
127+
- ETS (in-memory) with Ra snapshots for persistence (001-fix-review-issues)
128+
129+
## Recent Changes
130+
- 001-fix-review-issues: Added Elixir 1.18 / OTP 28 + Ra 2.17.1 (Raft), libcluster 3.5.0, Bandit 1.8.0, Plug 1.18.1

README.md

Lines changed: 10 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@
1212
## Key Features
1313

1414
- **Strong Consistency** — Raft consensus ensures all nodes agree on data
15-
- **600K-870K ops/sec** — 1-7us latency for typical operations
15+
- **High Performance** — ETS-backed reads with microsecond-level latency
1616
- **Embedded Design** — Starts with your app, no external infrastructure
1717
- **HTTP REST API** — Complete API with OpenAPI/Swagger documentation
1818
- **Configurable Consistency** — Choose eventual, leader, or strong per operation
@@ -81,13 +81,7 @@ iex --name n3@127.0.0.1 --cookie concord -S mix # Terminal 3
8181

8282
## Performance
8383

84-
| Operation | Throughput | Latency |
85-
|-----------|-----------|---------|
86-
| Small Values (100B) | 621K-870K ops/sec | 1-2us |
87-
| Medium Values (1KB) | 134K-151K ops/sec | 6-7us |
88-
| TTL Operations | 943K-25M ops/sec | 0.04-1us |
89-
| HTTP Health Check | 5K req/sec | 197us |
90-
| Memory Overhead | ~10 bytes per item ||
84+
Performance varies significantly depending on hardware, cluster size, network topology, and consistency level. ETS-backed reads are inherently fast, but actual throughput and latency depend on your deployment. Run `mix run benchmarks/run_benchmarks.exs` on your own hardware to get representative numbers.
9185

9286
## When to Use Concord
9387

@@ -102,6 +96,14 @@ iex --name n3@127.0.0.1 --cookie concord -S mix # Terminal 3
10296
| Primary Database | Avoid (use PostgreSQL) |
10397
| Large Blob Storage | Avoid (use S3) |
10498

99+
## Known Limitations
100+
101+
- **Bootstrap ETS Fallback**: Auth, RBAC, and tenant data written via ETS fallback during the bootstrap window (before a Raft cluster forms) is not replicated. Once the cluster establishes quorum, subsequent writes go through Raft consensus normally.
102+
103+
- **Node-Local Rate Limiting**: Multi-tenancy rate limiting is tracked per-node. A tenant can exceed its configured quota by up to N× across N nodes in the cluster.
104+
105+
- **Query TTL Clock Sensitivity**: TTL expiration checks in queries use wall-clock time (`System.system_time`) which may differ from leader-assigned time (`meta.system_time`) during clock drift between nodes.
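
As a worked example of the node-local rate limiting bound, the worst case is simply the per-node quota multiplied by the node count. `QuotaSketch` is a hypothetical helper written for illustration, not part of Concord.

```elixir
defmodule QuotaSketch do
  # Worst case when each of `nodes` nodes enforces `quota` independently
  # and a tenant spreads its requests evenly across all of them.
  def worst_case_admitted(quota, nodes)
      when is_integer(quota) and quota >= 0 and is_integer(nodes) and nodes >= 1 do
    quota * nodes
  end
end

# A quota of 100 requests per window on a 3-node cluster.
IO.inspect(QuotaSketch.worst_case_admitted(100, 3))
```

With a quota of 100 and three nodes, up to 300 requests could be admitted in one window.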
106+
105107
## Comparison
106108

107109
| Feature | Concord | etcd | Consul | ZooKeeper |
