## Context

Teranode's catchup process is currently controlled by two independent, optional services:

- **P2P service** (`services/p2p/`): `SyncCoordinator` monitors the FSM, selects a sync peer from its internal `PeerRegistry`, and sends a Kafka message to BlockValidation with the peer's DataHub URL. Catchup happens over HTTP.
- **Legacy service** (`services/legacy/`): `SyncManager` in netsync detects height gaps and calls `CatchUpBlocks()` on the blockchain service directly. Blocks arrive via the Bitcoin wire protocol.

When both services run simultaneously:

1. P2P sends Kafka → BlockValidation catches up from Peer Alice (HTTP), FSM → `CATCHING_BLOCKS`
2. Legacy detects gap → calls `CatchUpBlocks()`, silently rejected (already catching up)
3. Legacy continues pushing blocks from Peer Bob via wire protocol
4. BlockValidation receives headers from Alice + blocks from Bob = mismatch/failure

The FSM has separate states (`CATCHING_BLOCKS` vs `LEGACY_SYNCING`), but they are indicators, not a coordination mechanism. The peer registry lives exclusively in the P2P service (`peer_registry.go`); Legacy has its own separate `peerStates` map. Neither service knows what the other is doing.

## Goals / Non-Goals

**Goals:**

- Single source of truth for all known peers (P2P and Legacy) with uniform tracking
- Single coordination point for catchup, so the two services no longer fight over it
- Legacy peers visible in the peer registry with bytes tracking, reputation, and health metrics
- Transport-agnostic catchup: BlockValidation doesn't care whether a peer speaks HTTP or the wire protocol
- Eliminate `LEGACY_SYNCING` as a separate FSM state

**Non-Goals:**

- Replacing the Legacy service entirely (it still provides backward compatibility with Bitcoin P2P nodes)
- Changing the wire protocol or the HTTP catchup protocol themselves
- Modifying block validation logic (this changes only where blocks come from, not how they're validated)
- Peer discovery changes (both services keep discovering peers their own way)
- Distributed/replicated peer registry (the registry stays single-node and in-memory, with file backup)

## Decisions

### 1. Blockchain service hosts the centralized peer registry

**Choice**: Move the peer registry into the blockchain service. Both P2P and Legacy register their discovered peers via gRPC. The registry is the single source of truth.

**Alternatives considered**:

- Keep it in P2P and expose it via gRPC: P2P is optional, so if only Legacy is running there is no registry, which defeats the purpose.
- New standalone coordination service: over-engineering for a registry plus a selector, and it adds deployment complexity.
- BlockValidation owns it: BlockValidation is a consumer of peer info, not a discovery service. Wrong responsibility.

**Rationale**: The blockchain service is always running (it's the FSM owner), already has gRPC interfaces to both P2P and Legacy, and is the natural coordination point since it owns the chain state that determines when catchup is needed.
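
Behind the gRPC surface, the registry itself can be little more than a mutex-guarded map. The sketch below illustrates that shape; all type, field, and method names are assumptions for illustration, not Teranode's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// PeerInfo is a simplified stand-in for a registry entry; the real
// struct would also carry reputation, byte counters, and health metrics.
type PeerInfo struct {
	ID        string
	Height    int32
	Transport string // "http" or "wire"
}

// PeerRegistry is the single source of truth hosted by the blockchain
// service. Both P2P and Legacy register peers here via gRPC.
type PeerRegistry struct {
	mu    sync.RWMutex
	peers map[string]PeerInfo
}

func NewPeerRegistry() *PeerRegistry {
	return &PeerRegistry{peers: make(map[string]PeerInfo)}
}

// RegisterPeer adds or updates a peer; services re-register the same
// peer to refresh its height and metrics.
func (r *PeerRegistry) RegisterPeer(p PeerInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.peers[p.ID] = p
}

// PeerCount reports how many peers are known across both transports.
func (r *PeerRegistry) PeerCount() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.peers)
}

func main() {
	reg := NewPeerRegistry()
	// A P2P-discovered HTTP peer and a Legacy-discovered wire peer land
	// in the same store, which is the whole point of Decision 1.
	reg.RegisterPeer(PeerInfo{ID: "alice", Height: 800100, Transport: "http"})
	reg.RegisterPeer(PeerInfo{ID: "bob", Height: 800200, Transport: "wire"})
	fmt.Println(reg.PeerCount())
}
```

Because registration is idempotent on peer ID, the dual-registration period in Phase 1 of the migration plan needs no special handling.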

### 2. PeerInfo includes transport type

**Choice**: Add a `TransportType` field to `PeerInfo` with values `HTTP` and `WIRE_PROTOCOL`. The peer selector returns the best peer regardless of transport. BlockValidation uses the transport type to dispatch to the right fetcher.

```go
type TransportType string

const (
	TransportHTTP         TransportType = "http"
	TransportWireProtocol TransportType = "wire"
)
```

**Alternatives considered**:

- Separate registries per transport: loses the ability to compare and rank peers across transports.
- Transport preference setting: too rigid; the node should prefer the best peer, not the best transport.

**Rationale**: The peer selector already evaluates health, reputation, and height. Adding transport as a field lets the selector remain transport-agnostic while giving BlockValidation the information it needs to dispatch correctly.
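
A hedged sketch of transport-agnostic selection, assuming a simplified `PeerInfo` and an illustrative ranking (health first, then reputation, then height); the real selector's criteria and weighting may differ:

```go
package main

import (
	"fmt"
	"sort"
)

type TransportType string

const (
	TransportHTTP         TransportType = "http"
	TransportWireProtocol TransportType = "wire"
)

// PeerInfo is a simplified stand-in for the registry entry.
type PeerInfo struct {
	ID         string
	Height     int32
	Reputation int // 0-100
	Healthy    bool
	Transport  TransportType
}

// selectPeer filters to healthy peers, then ranks by reputation and
// height. Transport is deliberately NOT a ranking input: the best peer
// wins regardless of how it speaks, and the caller dispatches on
// the Transport field afterwards.
func selectPeer(peers []PeerInfo) (PeerInfo, bool) {
	var healthy []PeerInfo
	for _, p := range peers {
		if p.Healthy {
			healthy = append(healthy, p)
		}
	}
	if len(healthy) == 0 {
		return PeerInfo{}, false
	}
	sort.Slice(healthy, func(i, j int) bool {
		if healthy[i].Reputation != healthy[j].Reputation {
			return healthy[i].Reputation > healthy[j].Reputation
		}
		return healthy[i].Height > healthy[j].Height
	})
	return healthy[0], true
}

func main() {
	best, _ := selectPeer([]PeerInfo{
		{ID: "alice", Height: 800200, Reputation: 40, Healthy: true, Transport: TransportHTTP},
		{ID: "bob", Height: 800150, Reputation: 70, Healthy: true, Transport: TransportWireProtocol},
	})
	// bob wins on reputation despite the lower height.
	fmt.Println(best.ID, best.Transport)
}
```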
| 67 | + |
| 68 | +### 3. BlockValidation controls catchup, services become peer providers |
| 69 | + |
| 70 | +**Choice**: Remove catchup initiation from both `SyncCoordinator` (P2P) and `SyncManager` (Legacy). BlockValidation subscribes to "peer available" events from the centralized registry and initiates catchup when: |
| 71 | + |
| 72 | +- FSM is `RUNNING` |
| 73 | +- A registered peer has a height greater than local chain tip |
| 74 | +- No catchup is already in progress |
| 75 | + |
| 76 | +P2P and Legacy become peer providers: they discover peers, register them in the centralized registry, and respond to block/header fetch requests through their respective transports. |
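
The three conditions collapse into a single guard. The function and parameter names below are illustrative, not the actual BlockValidation code:

```go
package main

import "fmt"

// shouldInitiateCatchup mirrors the three conditions above: the FSM is
// RUNNING, a registered peer is ahead of the local tip, and no catchup
// is already in flight. Any one failing vetoes initiation.
func shouldInitiateCatchup(fsmState string, localHeight, peerHeight int32, catchupActive bool) bool {
	return fsmState == "RUNNING" && peerHeight > localHeight && !catchupActive
}

func main() {
	fmt.Println(shouldInitiateCatchup("RUNNING", 800000, 800100, false))          // peer ahead, idle: start
	fmt.Println(shouldInitiateCatchup("CATCHING_BLOCKS", 800000, 800100, false)) // wrong FSM state
	fmt.Println(shouldInitiateCatchup("RUNNING", 800000, 800100, true))          // already catching up
}
```

The last condition is what replaces today's "silently rejected" `CatchUpBlocks()` call: instead of a second service racing in and being ignored, there is a single initiator that simply waits.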

**Alternatives considered**:

- P2P controls catchup and Legacy requests through P2P: doesn't work when only Legacy is running.
- Blockchain service initiates catchup: blockchain would need to know about peer heights and transports, creating tight coupling, and BlockValidation already has the catchup logic.
- Keep the current dual initiation with a lock: doesn't solve the "blocks from the wrong peer" problem.

**Rationale**: BlockValidation already does the actual catchup work. Giving it initiation control eliminates the coordination gap, and the centralized registry provides the peer selection data it needs.

### 4. Transport interface for block/header fetching

**Choice**: Define a `CatchupTransport` interface in BlockValidation:

```go
type CatchupTransport interface {
	FetchHeaders(ctx context.Context, peer *PeerInfo, locator []*chainhash.Hash) ([]*wire.BlockHeader, error)
	FetchBlock(ctx context.Context, peer *PeerInfo, hash *chainhash.Hash) (*wire.MsgBlock, error)
	FetchSubtrees(ctx context.Context, peer *PeerInfo, hash *chainhash.Hash) ([]*Subtree, error)
}
```

Two implementations:

- `HTTPTransport`: current behavior; fetches from the peer's DataHub URL (existing code in `catchup.go`)
- `WireTransport`: delegates to the Legacy service via gRPC; Legacy fetches using the Bitcoin wire protocol and returns the result

**Alternatives considered**:

- Legacy pushes blocks to a channel: the current broken approach; blocks arrive unsolicited from a peer BlockValidation didn't choose.
- Direct wire protocol in BlockValidation: would duplicate Legacy's wire protocol implementation.
- Abstract at the Kafka level: too loose; catchup needs request/response semantics, not fire-and-forget.

**Rationale**: Clean separation of concerns. BlockValidation decides what to fetch and from whom; the transport implementations handle how. Legacy's wire protocol expertise stays in Legacy.

### 5. Remove `LEGACY_SYNCING` FSM state, unify to `CATCHING_BLOCKS`

**Choice**: Delete the `LEGACY_SYNCING` state and the `LEGACYSYNC` event. All catchup uses `CATCHING_BLOCKS`. Quick validation is controlled by the existing `blockvalidation_catchup_allow_quick_validation` setting, not by which FSM state is active.

**Alternatives considered**:

- Keep both states for backward compatibility: two states for the same thing is the source of the confusion; a clean break is better.
- Add a third "coordinated catchup" state: just renames the problem.

**Rationale**: `LEGACY_SYNCING` only existed because Legacy initiated sync separately. With centralized orchestration, there is one catchup path and one FSM state for it.

### 6. Legacy peers get full PeerInfo tracking

**Choice**: When Legacy connects to a Bitcoin P2P peer, it registers that peer in the centralized registry with:

- `TransportType: "wire"`
- bytes sent/received tracking (from Legacy's existing per-peer counters)
- block height (from the version message exchange)
- a reputation score (starts at 50 and is updated on catchup success/failure, like P2P peers)

**Alternatives considered**:

- Minimal registration (just ID + height): loses the ability to properly rank Legacy peers against P2P peers.
- Legacy manages its own metrics and the registry just stores the ID: splits the responsibility and is harder to maintain.

**Rationale**: For the peer selector to make good choices across transports, it needs comparable data for all peers. Legacy already tracks most of this internally; it just needs to report it to the centralized registry.
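
As a sketch, Legacy's registration might assemble the full entry from data it already holds. Function and field names here are illustrative, not the actual types:

```go
package main

import "fmt"

// PeerInfo mirrors the fields Decision 6 requires for Legacy peers.
type PeerInfo struct {
	ID            string
	Transport     string
	Height        int32
	BytesSent     uint64
	BytesReceived uint64
	Reputation    int
}

// registerLegacyPeer builds the registry entry from what Legacy already
// tracks: per-peer byte counters and the height learned during the
// version message exchange. Reputation starts at the same neutral 50
// as P2P-discovered peers so the selector can rank them comparably.
func registerLegacyPeer(addr string, versionHeight int32, sent, recv uint64) PeerInfo {
	return PeerInfo{
		ID:            addr,
		Transport:     "wire",
		Height:        versionHeight,
		BytesSent:     sent,
		BytesReceived: recv,
		Reputation:    50, // neutral start, adjusted on catchup success/failure
	}
}

func main() {
	p := registerLegacyPeer("203.0.113.7:8333", 800123, 1024, 4096)
	fmt.Println(p.Transport, p.Reputation)
}
```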

## Risks / Trade-offs

- **New gRPC dependency between Legacy/P2P and Blockchain** → Both services need a new blockchain client method for peer registration. Mitigation: the blockchain service is already a dependency for both, so adding peer registration methods is incremental.

- **Legacy wire transport adds a gRPC round-trip** → When BlockValidation catches up from a Legacy peer, it calls Legacy via gRPC, which fetches via the wire protocol: an extra hop. Mitigation: this only affects the control path (request/response), not data throughput, and the alternative (blocks arriving unsolicited) is worse.

- **Breaking change: `LEGACY_SYNCING` removal** → Any code checking `FSMStateType_LEGACYSYNCING` breaks. Mitigation: search the codebase for all references; the impact is limited to blockchain FSM callers and monitoring/alerting code.

- **Blockchain service grows in responsibility** → Adding the peer registry increases the blockchain service's scope. Mitigation: the registry is a simple data store with well-defined interfaces; it adds state tracking to blockchain, not business logic.

- **Migration period with both old and new code** → During rollout, some deployments may have old P2P/Legacy code that still tries to initiate catchup. Mitigation: phase the rollout: centralized registry first (additive), then catchup initiation (behavioral change), then removal of the old paths.

## Migration Plan

1. **Phase 1 - Centralized peer registry** (additive, no behavioral changes):

   - Add peer registration gRPC methods to the blockchain service
   - P2P registers its peers in the blockchain registry (in addition to its local registry)
   - Legacy registers its peers in the blockchain registry
   - Both continue operating as before: dual registration, no behavior change
   - Verify: all peers are visible through the blockchain service API

2. **Phase 2 - Transport interface** (additive):

   - Implement the `CatchupTransport` interface in BlockValidation
   - Implement `HTTPTransport` (extracted from the existing catchup code)
   - Implement `WireTransport` (a gRPC bridge to the Legacy service)
   - Add new gRPC methods to Legacy for on-demand block/header fetching
   - Verify: both transports work independently

3. **Phase 3 - Catchup orchestration migration** (behavioral change, feature-flagged):

   - BlockValidation reads peers from the centralized registry
   - BlockValidation initiates catchup using the transport interface
   - Feature flag: `catchup_use_centralized_orchestration` (default false)
   - When enabled, P2P's SyncCoordinator and Legacy's SyncManager stop initiating catchup
   - Verify: catchup works with the flag on, over both transports

4. **Phase 4 - Cleanup**:

   - Remove the `LEGACY_SYNCING` FSM state
   - Remove catchup initiation code from P2P's SyncCoordinator
   - Remove catchup initiation code from Legacy's SyncManager
   - Remove P2P's local peer registry (use the centralized one only)
   - Remove the feature flag, making centralized orchestration the only path

## Open Questions

- Should the centralized registry live in the blockchain service or in a new lightweight coordination service? Blockchain is practical, but this increases its scope.
- Should Legacy's `WireTransport` support parallel block fetching (multiple `getdata` messages in flight), or is sequential fetching sufficient for catchup?
- What is the migration story for existing deployments that only run Legacy (no P2P)? The centralized registry needs to work when only Legacy registers peers.
- Should the peer selector offer a transport preference setting for operators who know their network topology (e.g., "prefer wire protocol peers" or "prefer HTTP peers")?