
Settla — Project TODO

Rule: Never move to the next stage until every success criterion in the current stage passes.


Phase 1: Project Foundation ✅

Goal: Repo structure, tooling, containerized infrastructure, multi-tenant schema.

  • 1.1 — Monorepo Setup & Tooling

    • go build ./... compiles
    • cd api/gateway && pnpm install && pnpm build succeeds
    • Structure matches layout
    • README mentions B2B, 50M/day scale, and high-throughput patterns
    • Every Go package has doc.go
    • Both Go binaries start and shut down gracefully
    • Gateway responds to GET /health
  • 1.2 — Docker & Infrastructure

    • make docker-up — all services healthy including TigerBeetle and PgBouncer
    • PgBouncer accepts connections on 6433, 6434, 6435
    • TigerBeetle responds on port 3001
    • NATS JetStream enabled
    • Application env vars point to PgBouncer, not raw Postgres
    • make docker-reset gives clean slate
  • 1.3 — Database Migrations & SQLC Setup

    • All migrations apply cleanly
    • 6 monthly partitions created per partitioned table
    • Tenant isolation: UNIQUE constraints are per-tenant
    • Capacity comments in migration files
    • SQLC generates and compiles
    • Seed creates both demo tenants

Phase 2: Domain Core (Go) ✅

Goal: Domain types, dual-backend ledger, settlement engine, in-memory treasury, smart router.

  • 2.1 — Domain Types & Interfaces

    • All tests pass (56/56), zero infrastructure imports
    • Ledger interface documented for dual-backend
    • TreasuryManager uses Reserve/Release (not Lock/Unlock)
    • All types include TenantID
    • Coverage 81.7% (>80%)
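As a minimal sketch of the 2.1 criteria, the shapes below show TenantID carried on every domain type and a TreasuryManager exposing Reserve/Release rather than Lock/Unlock. The struct fields and method signatures are illustrative assumptions, not the project's actual API:

```go
package main

import "fmt"

// TenantID is attached to every domain type so no operation can be
// tenant-ambiguous (an assumed representation).
type TenantID string

type Reservation struct {
	TenantID TenantID
	Currency string
	Amount   int64 // minor units
}

// TreasuryManager uses Reserve/Release semantics, per the checklist.
// The exact signatures here are hypothetical.
type TreasuryManager interface {
	Reserve(r Reservation) (reservationID string, err error)
	Release(reservationID string) error
}

func main() {
	r := Reservation{TenantID: "acme", Currency: "GBP", Amount: 10_000}
	fmt.Println(r.TenantID, r.Currency, r.Amount)
}
```

Because the interface lives in the domain core with zero infrastructure imports, both the in-memory and DB-backed implementations can satisfy it.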
  • 2.2 — Settla Ledger (Dual Backend)

    • go test ./ledger/... -v -race — all 22 tests pass
    • PostEntries writes to TigerBeetle, not Postgres
    • GetBalance reads from TigerBeetle (authoritative)
    • GetEntries reads from Postgres (query layer)
    • Sync consumer populates Postgres from TigerBeetle
    • Idempotency works end-to-end
    • Write batching reduces round-trips (batch test confirms fewer TB calls)
    • System degrades gracefully if Postgres read-side is down
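The idempotency criterion above can be sketched with an in-memory stand-in: replaying an entry with a key the ledger has already seen is a no-op, so retries from the saga layer never double-post. The type names are assumptions; the real write path goes to TigerBeetle, not a slice:

```go
package main

import "fmt"

type Entry struct {
	IdempotencyKey string
	Amount         int64
}

// Ledger is a toy stand-in for the dual-backend ledger: "tb" represents
// the TigerBeetle write path, "seen" the idempotency guard.
type Ledger struct {
	seen map[string]bool
	tb   []Entry
}

// PostEntries is idempotent: the same key applied twice writes once.
func (l *Ledger) PostEntries(e Entry) bool {
	if l.seen[e.IdempotencyKey] {
		return false // duplicate, already applied
	}
	l.seen[e.IdempotencyKey] = true
	l.tb = append(l.tb, e)
	return true
}

func main() {
	l := &Ledger{seen: map[string]bool{}}
	fmt.Println(l.PostEntries(Entry{"k1", 500})) // true: first apply
	fmt.Println(l.PostEntries(Entry{"k1", 500})) // false: deduplicated
}
```

In the real system the dedupe key would be checked against TigerBeetle's own transfer IDs rather than a process-local map, so it survives restarts.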
  • 2.3 — Settla Core (Settlement Engine)

    • All tests pass with -race
    • Tenant validation enforced
    • Uses Reserve/Release (not Lock)
    • Ledger entries use tenant account codes
    • go list confirms no imports of concrete modules
  • 2.4 — Settla Treasury (In-Memory Reservation)

    • go test ./treasury/... -v -race — all pass
    • Reserve takes <1μs (benchmark)
    • 10,000 concurrent reserves: no over-reservation
    • Complete tenant isolation
    • Background flush writes to DB
    • Crash recovery works (restart from DB state)
  • 2.5 — Settla Rail & Mock Providers

    • Routes sorted by score, insufficient liquidity filtered
    • Different tenants get different fees
    • Mock providers support GBP↔NGN corridor via USDT
    • All tests pass with -race
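The "routes sorted by score, insufficient liquidity filtered" behaviour can be sketched as a filter-then-sort. The Route fields are assumed names for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

type Route struct {
	Provider  string
	Score     float64
	Liquidity int64
}

// rankRoutes drops routes whose liquidity cannot cover the amount,
// then orders the remainder best score first.
func rankRoutes(routes []Route, amount int64) []Route {
	out := make([]Route, 0, len(routes))
	for _, r := range routes {
		if r.Liquidity >= amount {
			out = append(out, r)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	routes := []Route{
		{"a", 0.70, 5_000},
		{"b", 0.95, 100}, // filtered: cannot cover the amount
		{"c", 0.80, 9_000},
	}
	for _, r := range rankRoutes(routes, 1_000) {
		fmt.Println(r.Provider, r.Score)
	}
	// c 0.8
	// a 0.7
}
```

Per-tenant fees would feed into Score before this step, which is how different tenants end up with different rankings.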

Phase 2.5: Settla Provider — Real Testnet Blockchain Integration (In Progress)

Goal: Real blockchain transactions on testnets. Fiat simulated, crypto real.

  • 2.5.1 — Wallet Management (WP-1 through WP-3)

    • BIP-44 HD derivation: Tron, Solana, Ethereum, Base
    • AES-256-GCM key encryption at rest
    • System wallets (system/{chain}/hot) and tenant wallets (tenant/{slug}/{chain})
    • Faucet integration: Tron Nile (automated), Solana Devnet (automated), Sepolia/Base (manual)
    • Private keys never appear in logs or error messages
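The AES-256-GCM at-rest encryption above can be done entirely with the standard library. This sketch assumes a 32-byte key-encryption key (KEK) is already available; where it comes from (env, KMS) is out of scope, and neither the KEK nor the plaintext should ever reach a log line:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// seal encrypts a private key under the KEK with AES-256-GCM,
// prepending the random nonce so open can recover it.
func seal(kek, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek) // 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open splits off the nonce and decrypts; GCM also authenticates,
// so tampered ciphertext fails rather than decrypting to garbage.
func open(kek, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(kek)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	kek := make([]byte, 32)
	rand.Read(kek)
	ct, _ := seal(kek, []byte("private-key-bytes"))
	pt, _ := open(kek, ct)
	fmt.Println(string(pt))
}
```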
  • 2.5.2 — Blockchain Clients (WP-4 through WP-7)

    • Tron Nile client: TRX + TRC20 balance, send, get tx, subscribe
    • Ethereum Sepolia + Base Sepolia client: ETH/ERC20, gas estimation, nonce management
    • Solana Devnet client: SOL + SPL token transfers, ATA creation
    • Blockchain registry: GetClient(chain), RPC failover, circuit breaker
    • Explorer URL generation for all four testnets
  • 2.5.3 — Settla Provider: FX Oracle & Fiat Simulator (WP-8, WP-9)

    • FX oracle: NGN/GBP/EUR/GHS/USD with ±0.15% jitter, cross rates, thread-safe
    • Fiat simulator: collection (PENDING → PROCESSING → COLLECTED) + payout (PAYOUT_INITIATED → COMPLETED)
    • Per-currency delays: NGN 3–5s, GBP 5–10s, USD 10–30s, EUR/GHS 5–10s
    • Configurable failure rate (default 2%)
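A thread-safe oracle with ±0.15% jitter and cross rates can be sketched as below. The USD-based rate table and field names are assumptions for illustration; cross rates are derived via USD:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// Oracle serves FX rates under a read lock; usdRates holds units of
// each currency per 1 USD (illustrative values).
type Oracle struct {
	mu       sync.RWMutex
	usdRates map[string]float64
}

// Rate returns the from→to cross rate via USD with ±0.15% jitter.
func (o *Oracle) Rate(from, to string) float64 {
	o.mu.RLock()
	defer o.mu.RUnlock()
	cross := o.usdRates[to] / o.usdRates[from]
	jitter := 1 + (rand.Float64()*2-1)*0.0015 // uniform in [-0.15%, +0.15%]
	return cross * jitter
}

func main() {
	o := &Oracle{usdRates: map[string]float64{"USD": 1, "GBP": 0.79, "NGN": 1500}}
	fmt.Printf("GBP→NGN ≈ %.2f\n", o.Rate("GBP", "NGN")) // ~1898.73, ±0.15%
}
```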
  • 2.5.4 — On-Ramp Provider (WP-10)

    • ID() → "settla-onramp", fiat → stablecoin pairs (GBP/NGN/USD/EUR/GHS → USDT/USDC)
    • 30bps spread + minimum fee applied to quotes
    • Async flow: fiat collection → real blockchain send → GetStatus polling
    • Explorer URL in all transaction metadata
    • USDT defaults to Tron, USDC defaults to Ethereum/Base
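The "30bps spread + minimum fee" line is simple integer math in minor units; the minimum used below is an illustrative value, not the project's configured one:

```go
package main

import "fmt"

// quoteFee charges 30 basis points of the fiat amount, floored at a
// per-currency minimum. All values are in minor units (e.g. pence).
func quoteFee(amountMinor, minFeeMinor int64) int64 {
	fee := amountMinor * 30 / 10_000 // 30 bps = 0.30%
	if fee < minFeeMinor {
		fee = minFeeMinor
	}
	return fee
}

func main() {
	fmt.Println(quoteFee(1_000_00, 50)) // £1,000.00 → 300 (£3.00)
	fmt.Println(quoteFee(10_00, 50))    // £10.00 → 30 bps is 3, below minimum, so 50
}
```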
  • 2.5.5 — Off-Ramp Provider (WP-11)

    • ID() → "settla-offramp", stablecoin → fiat pairs (USDT/USDC → GBP/NGN/USD/EUR/GHS)
    • 30bps spread applied (rate < 1 for provider profit on stablecoin→fiat)
    • Async flow: crypto receipt verification → fiat payout simulation
    • Falls back to simulated receipt when RPC unavailable (graceful degradation)
    • Deposit address (system hot wallet) returned on Execute
    • Explorer URL in all transaction metadata
  • 2.5.6 — Provider Registry & Router Integration (WP-12, WP-13)

    • SETTLA_PROVIDER_MODE env var: mock | testnet | live
    • Registry wires Settla on/off-ramp based on mode
    • Transfer API response includes blockchain_transactions with explorer URLs
    • Router includes explorer URLs in RouteInfo
  • 2.5.7 — Testnet Setup & Makefile (WP-14)

    • make testnet-setup initialises and funds wallets
    • make provider-mode-mock / make provider-mode-testnet
    • .env.example updated with all testnet variables
    • docker-compose.yml updated with testnet env vars

Phase 3: Event-Driven Infrastructure ✅

Goal: Partitioned NATS for parallel processing, Redis with local cache.

  • 3.1 — Settla Node (Partitioned NATS Workers)

    • Events partitioned by tenant hash
    • Same tenant's events always route to same partition
    • Different tenants processed in parallel
    • Full saga works through partitioned routing
    • Dev mode: single instance handles all partitions
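The tenant-hash routing above can be sketched with a stable hash: the same tenant always lands on the same partition (preserving per-tenant ordering), while different tenants spread across partitions. FNV-1a is an assumed choice here; any stable hash works:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a tenant deterministically onto one of N partitions,
// so one tenant's events are always consumed by the same worker.
func partitionFor(tenantID string, partitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return h.Sum32() % partitions
}

func main() {
	fmt.Println(partitionFor("acme", 8) == partitionFor("acme", 8)) // true: stable
	fmt.Println(partitionFor("acme", 8), partitionFor("globex", 8)) // usually differ
}
```

In dev mode a single instance simply subscribes to all N partition subjects.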
  • 3.2 — Redis & Local Cache

    • Local cache auth lookup <1μs (benchmark) — 107ns measured
    • Two-level cache: local → Redis → DB
    • Rate limits approximate but correct over 5-second windows
    • Tenant isolation on all cache operations
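The local → Redis → DB lookup order can be sketched with maps standing in for the two remote tiers; a miss falls through one level at a time and the result is back-filled into the faster tiers:

```go
package main

import "fmt"

// TwoLevel models the cache hierarchy: local is in-process
// (sub-microsecond), redis is a network hop, db is the source of truth.
type TwoLevel struct {
	local map[string]string
	redis map[string]string
	db    map[string]string
}

func (c *TwoLevel) Get(key string) (string, bool) {
	if v, ok := c.local[key]; ok {
		return v, true
	}
	if v, ok := c.redis[key]; ok {
		c.local[key] = v // back-fill the local tier
		return v, true
	}
	if v, ok := c.db[key]; ok {
		c.redis[key] = v // back-fill both tiers
		c.local[key] = v
		return v, true
	}
	return "", false
}

func main() {
	c := &TwoLevel{
		local: map[string]string{},
		redis: map[string]string{},
		db:    map[string]string{"apikey:abc": "tenant-acme"},
	}
	v, _ := c.Get("apikey:abc") // DB hit, then back-filled
	fmt.Println(v, c.local["apikey:abc"] == v)
}
```

Tenant isolation falls out of key construction: every cache key embeds the tenant, so a lookup can never cross tenants. (TTLs and invalidation are omitted from this sketch.)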

Phase 4: Settla API (TypeScript) ✅

Goal: Fastify gateway with local tenant cache, gRPC connection pool, per-tenant webhooks.

  • 4.1 — Protocol Buffers & gRPC

    • make proto generates Go + TypeScript
    • gRPC server starts with high-throughput config
    • All tenant-scoped RPCs include tenant_id
  • 4.2 — Settla API Gateway (Fastify)

    • gRPC connection pool working (not per-request)
    • Auth resolves from local cache in <1ms on cache hit
    • Tenant isolation verified
    • Response serialization uses schema (not JSON.stringify)
    • OpenAPI spec valid
  • 4.3 — Webhook Dispatcher

    • Correct tenant's URL and HMAC secret
    • Retry and dead letter work
    • Worker pool handles concurrent delivery
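The per-tenant HMAC signing above amounts to signing the payload with that tenant's secret and comparing in constant time on receipt. Header name and encoding below are assumptions, not the project's documented scheme:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign computes a hex-encoded HMAC-SHA256 of the payload under the
// tenant's webhook secret.
func sign(secret, payload []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes and compares in constant time (hmac.Equal),
// so timing cannot leak the expected signature.
func verify(secret, payload []byte, signature string) bool {
	return hmac.Equal([]byte(sign(secret, payload)), []byte(signature))
}

func main() {
	secret := []byte("tenant-acme-webhook-secret")
	body := []byte(`{"event":"transfer.completed"}`)
	sig := sign(secret, body)
	fmt.Println(verify(secret, body, sig))                 // true
	fmt.Println(verify([]byte("other-tenant"), body, sig)) // false: wrong secret
}
```

Because each tenant has its own secret, a signature minted for one tenant can never validate a delivery to another.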

Phase 5: Dashboard & Observability

Goal: Ops console with capacity monitoring, per-tenant metrics.

  • 5.1 — Settla Dashboard

    • Capacity page shows live throughput metrics
    • TigerBeetle write rate visible
    • Treasury flush lag visible
    • NATS partition queue depths visible
    • Per-tenant volume vs limit
  • 5.2 — Observability

    • Structured logging: slog (Go) with JSON/text handler, pino (TS) — service, version, tenant_id on every log
    • Prometheus metrics: Go (settla-server :8080/metrics, settla-node :9091/metrics), TS (gateway :3000/metrics, webhook :3001/metrics)
    • TigerBeetle write metrics (settla_ledger_tb_writes_total, _write_latency, _batch_size)
    • Treasury reservation latency metric (settla_treasury_reserve_latency_seconds, sub-microsecond buckets)
    • Treasury flush metrics (settla_treasury_flush_lag_seconds, _flush_duration)
    • Treasury balance/locked gauges per tenant/currency/location
    • PG sync lag metric (settla_ledger_pg_sync_lag_seconds)
    • NATS partition metrics (settla_nats_messages_total, _partition_queue_depth)
    • Transfer metrics (settla_transfers_total, _transfer_duration_seconds) with tenant/status/corridor labels
    • Provider metrics (settla_provider_requests_total, _latency_seconds)
    • gRPC interceptor metrics (settla_grpc_requests_total, _request_latency_seconds)
    • Gateway HTTP metrics (settla_gateway_requests_total, _request_duration_seconds, auth cache hits/misses)
    • Webhook delivery metrics (settla_webhook_deliveries_total, _delivery_duration_seconds)
    • Docker: Prometheus (prom/prometheus:v2.51.0, :9092) + Grafana (grafana:10.4.1, :3002)
    • 5 provisioned Grafana dashboards: Overview, Capacity Planning, Treasury Health, API Performance, Tenant Health
    • No PII in logs; metrics use low-cardinality labels

Phase 6: Integration & Demo

Goal: Wire everything, E2E tests, demo, capacity documentation.

  • 6.1 — End-to-End Integration

    • Both corridors work end-to-end (GBP→NGN, NGN→GBP)
    • TigerBeetle receives ledger writes, Postgres has synced data
    • Treasury reservations work under concurrent load
    • Complete tenant isolation
    • Per-tenant fees and limits enforced
    • 100 concurrent transfers: no over-reservation
    • Import boundaries enforced
  • 6.2 — Demo Script & Documentation

    • make demo runs all 5 scenarios
    • Burst scenario shows concurrent handling
    • README leads with B2B positioning and 50M/day scale
    • Capacity planning doc has real math
    • All 13 ADRs present with threshold-driven reasoning

Phase 7: Benchmarking & Capacity Proof

Goal: Prove 50M txn/day with measured results. Numbers for README and articles.

  • 7.1 — Component Benchmarks (Go)

    • make bench runs all benchmarks and produces bench-results.txt
    • All targets met (threshold comparison script shows all PASS — 76/76)
    • Treasury Reserve ~1.5-2μs measured (>500K/sec, 100x above 5K TPS needed)
    • Ledger batch throughput measured with mock TB (real TB: 1M+ TPS)
    • Concurrent reservation: no over-reservation detected
    • All benchmarks include allocation reporting (-benchmem)
    • Results reproducible across runs (targets set with variance headroom)
  • 7.2 — Integration Load Tests

    • make loadtest-quick completes in <5 minutes, all checks pass
    • Peak load (5,000 TPS): sustained for 10 min with p99 <50ms
    • Post-test verification: all consistency checks pass
    • Live dashboard shows real-time metrics during test
    • Report generated with throughput, latency percentiles, error rates
    • No goroutine leaks after test completion
    • Single tenant flood: no over-reservation detected
  • 7.3 — Soak Test & Profiling

    • make soak-short (15 min) passes all stability checks
    • No memory leaks detected (RSS growth <50MB)
    • No goroutine leaks (count stable ±5%)
    • No PgBouncer connection exhaustion
    • p99 latency degradation <20% from baseline
    • Report generated with all metrics
    • Profile comparison shows stable CPU/heap patterns
  • 7.4 — Chaos Testing

    • TigerBeetle restart: no money lost, transfers fail/refund cleanly
    • Postgres pause: system continues, catches up after recovery
    • NATS restart: no duplicates, all transfers complete eventually
    • Redis down: transfers still work (degraded caching)
    • Server crash: recovery from DB state, no over-reservation
    • PgBouncer saturation: queues but doesn't crash
    • ALL scenarios: ledger balanced, treasury consistent after recovery
  • 7.5 — Benchmark Report & Capacity Documentation

    • make report generates complete benchmark report
    • All sections show measured data (not estimates)
    • Extrapolation math is sound (measured peak → daily capacity)
    • README updated with real numbers
    • Capacity planning doc has measured vs required comparison
    • Report is reproducible
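The extrapolation math in 7.5 is a one-liner: measured sustained peak TPS times seconds per day. At the 5,000 TPS sustained in 7.2:

```go
package main

import "fmt"

func main() {
	const measuredTPS = 5_000
	daily := measuredTPS * 86_400 // seconds in a day
	fmt.Println(daily)            // 432,000,000/day, comfortable headroom over the 50M target
}
```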

Final Validation (all 18 checks)

git clone && cp .env.example .env        # 1. Clean clone
make build                                # 2. Build
make docker-up && sleep 25                # 3. Infrastructure
make migrate-up && make db-seed           # 4. Database + tenants
make test                                 # 5. Unit tests
make test-integration                     # 6. Integration tests
make bench                                # 7. Component benchmarks
make loadtest-quick                       # 8. Load test (quick)
make soak-short                           # 9. Soak test (short)
make chaos                                # 10. Chaos tests
make report                               # 11. Full benchmark report
make demo                                 # 12. Demo
# 13. API verification (curl gateway)
# 14. Tenant isolation proof (cross-tenant 404)
# 15. Observability (Prometheus metrics)
# 16. Dashboard (capacity page)
make lint && go test -race ./...          # 17. Code quality
# 18. Module boundaries (no core→concrete imports)