Commit cd79d69

Add improved legacy vs peer syncing proposal
1 parent f699145 commit cd79d69

File tree

8 files changed

+648
-0
lines changed

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
schema: spec-driven
2+
created: 2026-03-09
Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
1+
## Context
2+
3+
Teranode's catchup process is currently controlled by two independent, optional services:
4+
5+
- **P2P service** (`services/p2p/`): `SyncCoordinator` monitors the FSM, selects a sync peer from its internal `PeerRegistry`, and sends a Kafka message to BlockValidation with the peer's DataHub URL. Catchup happens via HTTP.
6+
- **Legacy service** (`services/legacy/`): `SyncManager` in netsync detects height gaps and calls `CatchUpBlocks()` on the blockchain service directly. Blocks arrive via Bitcoin wire protocol.
7+
8+
When both services run simultaneously:
9+
10+
1. P2P sends a Kafka message → BlockValidation catches up from Peer Alice (HTTP), and the FSM moves to `CATCHING_BLOCKS`
11+
2. Legacy detects the gap → calls `CatchUpBlocks()`, which is silently rejected (already catching up)
12+
3. Legacy continues pushing blocks from Peer Bob via wire protocol
13+
4. BlockValidation receives headers from Alice + blocks from Bob = mismatch/failure
14+
15+
The FSM has separate states (`CATCHING_BLOCKS` vs `LEGACY_SYNCING`) but they're not a coordination mechanism — just indicators. The peer registry lives exclusively in the P2P service (`peer_registry.go`); Legacy has its own separate `peerStates` map. Neither service knows what the other is doing.
16+
17+
## Goals / Non-Goals
18+
19+
**Goals:**
20+
21+
- Single source of truth for all known peers (P2P and Legacy) with uniform tracking
22+
- Single coordination point for catchup — no more fighting between services
23+
- Legacy peers visible in the peer registry with bytes tracking, reputation, and health metrics
24+
- Transport-agnostic catchup — BlockValidation doesn't care if the peer speaks HTTP or wire protocol
25+
- Eliminate `LEGACY_SYNCING` as a separate FSM state
26+
27+
**Non-Goals:**
28+
29+
- Replacing the Legacy service entirely (it still provides backward compatibility with Bitcoin P2P nodes)
30+
- Changing the wire protocol or HTTP catchup protocol themselves
31+
- Modifying block validation logic (only where blocks come from, not how they're validated)
32+
- Peer discovery changes (both services keep discovering peers their own way)
33+
- Distributed/replicated peer registry (the registry stays single-node and in-memory, with file backup)
34+
35+
## Decisions
36+
37+
### 1. Blockchain service hosts the centralized peer registry
38+
39+
**Choice**: Move the peer registry into the blockchain service. Both P2P and Legacy register their discovered peers via gRPC. The registry is the single source of truth.
40+
41+
**Alternatives considered**:
42+
43+
- Keep in P2P, expose via gRPC: P2P is optional — if only Legacy is running, there's no registry. Defeats the purpose.
44+
- New standalone coordination service: Over-engineering for a registry + selector. Adds deployment complexity.
45+
- BlockValidation owns it: BlockValidation is a consumer of peer info, not a discovery service. Wrong responsibility.
46+
47+
**Rationale**: Blockchain service is always running (it's the FSM owner), already has gRPC interfaces to both P2P and Legacy, and is the natural coordination point since it owns the chain state that determines when catchup is needed.
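
To make the registry's role concrete, here is a minimal sketch of an in-memory store that both services could register peers into. It is an illustration only: the `PeerRegistry`, `Register`, `Get`, and `Count` names and the reduced `PeerInfo` fields are assumptions, and the gRPC layer the proposal calls for is replaced by direct method calls.

```go
package main

import (
	"fmt"
	"sync"
)

// TransportType tags how a peer is reached, as in the proposal.
type TransportType string

const (
	TransportHTTP         TransportType = "http"
	TransportWireProtocol TransportType = "wire"
)

// PeerInfo is a reduced, hypothetical registry entry.
type PeerInfo struct {
	ID        string
	Height    uint32
	Transport TransportType
}

// PeerRegistry is a mutex-guarded in-memory store (hypothetical shape).
type PeerRegistry struct {
	mu    sync.RWMutex
	peers map[string]PeerInfo
}

// NewPeerRegistry returns an empty registry.
func NewPeerRegistry() *PeerRegistry {
	return &PeerRegistry{peers: make(map[string]PeerInfo)}
}

// Register inserts or updates a peer keyed by ID.
func (r *PeerRegistry) Register(p PeerInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.peers[p.ID] = p
}

// Get returns the stored entry and whether the peer is known.
func (r *PeerRegistry) Get(id string) (PeerInfo, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	p, ok := r.peers[id]
	return p, ok
}

// Count reports how many distinct peers are registered.
func (r *PeerRegistry) Count() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.peers)
}

func main() {
	reg := NewPeerRegistry()
	// P2P and Legacy would each call Register for the peers they discover.
	reg.Register(PeerInfo{ID: "alice", Height: 900, Transport: TransportHTTP})
	reg.Register(PeerInfo{ID: "bob", Height: 905, Transport: TransportWireProtocol})
	fmt.Println("registered peers:", reg.Count())
}
```

Keying by peer ID means a repeated registration from either service updates the entry rather than duplicating it, which is what lets the registry act as a single source of truth.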
48+
49+
### 2. PeerInfo includes transport type
50+
51+
**Choice**: Add a `TransportType` field to `PeerInfo` with two values, `TransportHTTP` (`"http"`) and `TransportWireProtocol` (`"wire"`). The peer selector returns the best peer regardless of transport. BlockValidation uses the transport type to dispatch to the right fetcher.
52+
53+
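```go
// TransportType identifies how a registered peer is reached during catchup.
type TransportType string

const (
	TransportHTTP         TransportType = "http" // DataHub peers (P2P service)
	TransportWireProtocol TransportType = "wire" // Bitcoin wire protocol peers (Legacy service)
)
```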
60+
61+
**Alternatives considered**:
62+
63+
- Separate registries per transport: Loses the ability to compare/rank peers across transports
64+
- Transport preference setting: Too rigid — should prefer the best peer, not the best transport
65+
66+
**Rationale**: The peer selector already evaluates health, reputation, and height. Adding transport as a field lets the selector remain transport-agnostic while giving BlockValidation the info it needs to dispatch correctly.
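
The split between transport-agnostic selection and transport-aware dispatch can be sketched as follows. The ranking rule (healthy peers ahead of the local tip, highest reputation first, height as tie-breaker) and the `SelectBest` and `FetcherFor` helpers are assumptions made for illustration; the proposal does not specify the selector's scoring.

```go
package main

import "fmt"

type TransportType string

const (
	TransportHTTP         TransportType = "http"
	TransportWireProtocol TransportType = "wire"
)

// PeerInfo carries only the fields this sketch needs (hypothetical shape).
type PeerInfo struct {
	ID         string
	Height     uint32
	Reputation float64
	Healthy    bool
	Transport  TransportType
}

// SelectBest picks a peer without looking at Transport at all.
// Assumed ranking: healthy and ahead of localHeight, then highest
// reputation, then highest height.
func SelectBest(peers []PeerInfo, localHeight uint32) (PeerInfo, bool) {
	var best PeerInfo
	found := false
	for _, p := range peers {
		if !p.Healthy || p.Height <= localHeight {
			continue
		}
		better := p.Reputation > best.Reputation ||
			(p.Reputation == best.Reputation && p.Height > best.Height)
		if !found || better {
			best, found = p, true
		}
	}
	return best, found
}

// FetcherFor names the fetcher BlockValidation would dispatch to.
func FetcherFor(p PeerInfo) (string, error) {
	switch p.Transport {
	case TransportHTTP:
		return "HTTPTransport", nil
	case TransportWireProtocol:
		return "WireTransport", nil
	default:
		return "", fmt.Errorf("unknown transport %q for peer %s", p.Transport, p.ID)
	}
}

func main() {
	peers := []PeerInfo{
		{ID: "alice", Height: 910, Reputation: 60, Healthy: true, Transport: TransportHTTP},
		{ID: "bob", Height: 905, Reputation: 80, Healthy: true, Transport: TransportWireProtocol},
	}
	if best, ok := SelectBest(peers, 900); ok {
		fetcher, _ := FetcherFor(best)
		fmt.Println(best.ID, "via", fetcher)
	}
}
```

Only `FetcherFor` reads the transport field, so a wire protocol peer can outrank an HTTP peer purely on merit.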
67+
68+
### 3. BlockValidation controls catchup, services become peer providers
69+
70+
**Choice**: Remove catchup initiation from both `SyncCoordinator` (P2P) and `SyncManager` (Legacy). BlockValidation subscribes to "peer available" events from the centralized registry and initiates catchup when:
71+
72+
- FSM is `RUNNING`
73+
- A registered peer has a height greater than local chain tip
74+
- No catchup is already in progress
75+
76+
P2P and Legacy become peer providers: they discover peers, register them in the centralized registry, and respond to block/header fetch requests through their respective transports.
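
The three initiation conditions listed above can be expressed as a single guarded check. This is a sketch under assumptions: `CatchupGate`, `TryStart`, and `Finish` are hypothetical names, and the FSM state is passed in as a plain value rather than read from the blockchain service.

```go
package main

import (
	"fmt"
	"sync"
)

// FSMState is a simplified stand-in for the blockchain FSM state.
type FSMState string

const (
	StateRunning        FSMState = "RUNNING"
	StateCatchingBlocks FSMState = "CATCHING_BLOCKS"
)

// CatchupGate ensures at most one catchup runs at a time (hypothetical type).
type CatchupGate struct {
	mu         sync.Mutex
	inProgress bool
}

// TryStart reports whether catchup may begin and, if so, marks it as running.
// All three conditions from the proposal must hold.
func (g *CatchupGate) TryStart(state FSMState, localTip, peerHeight uint32) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if state != StateRunning || peerHeight <= localTip || g.inProgress {
		return false
	}
	g.inProgress = true
	return true
}

// Finish clears the in-progress marker once catchup ends, successfully or not.
func (g *CatchupGate) Finish() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inProgress = false
}

func main() {
	var gate CatchupGate
	// Peer ahead of the tip, FSM running, nothing in progress.
	fmt.Println(gate.TryStart(StateRunning, 100, 120))
	// A second attempt while the first is still running.
	fmt.Println(gate.TryStart(StateRunning, 100, 130))
	gate.Finish()
	// FSM is not in RUNNING.
	fmt.Println(gate.TryStart(StateCatchingBlocks, 100, 130))
}
```

Checking and setting the in-progress marker under one lock is what closes the window in which two "peer available" events could both start a catchup.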
77+
78+
**Alternatives considered**:
79+
80+
- P2P controls catchup, Legacy requests through P2P: Doesn't work when only Legacy is running
81+
- Blockchain service initiates catchup: Blockchain would need to know about peer heights and transport, creating tight coupling. BlockValidation already has the catchup logic.
82+
- Keep current dual-initiation with a lock: Doesn't solve the "blocks from wrong peer" problem
83+
84+
**Rationale**: BlockValidation already does the actual catchup work. Giving it initiation control eliminates the coordination gap. The centralized registry provides the peer selection data it needs.
85+
86+
### 4. Transport interface for block/header fetching
87+
88+
**Choice**: Define a `CatchupTransport` interface in BlockValidation:
89+
90+
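```go
// CatchupTransport abstracts how BlockValidation fetches catchup data from a chosen peer.
type CatchupTransport interface {
	// FetchHeaders returns block headers from the peer, starting from the given locator.
	FetchHeaders(ctx context.Context, peer *PeerInfo, locator []*chainhash.Hash) ([]*wire.BlockHeader, error)
	// FetchBlock returns the block identified by hash.
	FetchBlock(ctx context.Context, peer *PeerInfo, hash *chainhash.Hash) (*wire.MsgBlock, error)
	// FetchSubtrees returns the subtrees associated with the given hash.
	FetchSubtrees(ctx context.Context, peer *PeerInfo, hash *chainhash.Hash) ([]*Subtree, error)
}
```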
97+
98+
Two implementations:
99+
100+
- `HTTPTransport`: Current behavior — fetches from peer's DataHub URL (existing code in `catchup.go`)
101+
- `WireTransport`: Delegates to Legacy service via gRPC — Legacy fetches using Bitcoin wire protocol and returns the result
102+
103+
**Alternatives considered**:
104+
105+
- Legacy pushes blocks to a channel: Current broken approach — blocks arrive unsolicited from a peer BlockValidation didn't choose
106+
- Direct wire protocol in BlockValidation: Would duplicate Legacy's wire protocol implementation
107+
- Abstract at Kafka level: Too loose — need request/response semantics, not fire-and-forget
108+
109+
**Rationale**: Clean separation of concerns. BlockValidation decides what to fetch and from whom. The transport implementations handle how. Legacy's wire protocol expertise stays in Legacy.
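
A reduced sketch of how BlockValidation might route a request through the interface is shown below. To stay self-contained it uses a headers-only `HeaderFetcher` interface with string hashes in place of `CatchupTransport`, and fake transports in place of `HTTPTransport` and `WireTransport`; `Dispatcher`, `fakeTransport`, and `fetchWithBackground` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type TransportType string

const (
	TransportHTTP         TransportType = "http"
	TransportWireProtocol TransportType = "wire"
)

type PeerInfo struct {
	ID        string
	Transport TransportType
}

// HeaderFetcher is a headers-only stand-in for the proposal's CatchupTransport.
type HeaderFetcher interface {
	FetchHeaders(ctx context.Context, peer PeerInfo, locator []string) ([]string, error)
}

// fakeTransport returns a canned header so the routing is observable.
type fakeTransport struct{ name string }

func (f fakeTransport) FetchHeaders(ctx context.Context, peer PeerInfo, _ []string) ([]string, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // respect cancellation before doing any work
	}
	return []string{f.name + ":" + peer.ID}, nil
}

// ErrNoTransport is returned when a peer's transport has no implementation.
var ErrNoTransport = errors.New("no transport registered")

// Dispatcher routes each request by the chosen peer's transport type.
type Dispatcher struct {
	transports map[TransportType]HeaderFetcher
}

func (d Dispatcher) FetchHeaders(ctx context.Context, peer PeerInfo, locator []string) ([]string, error) {
	t, ok := d.transports[peer.Transport]
	if !ok {
		return nil, fmt.Errorf("%w for peer %s (%q)", ErrNoTransport, peer.ID, peer.Transport)
	}
	return t.FetchHeaders(ctx, peer, locator)
}

// fetchWithBackground is a convenience wrapper for the demo.
func fetchWithBackground(d Dispatcher, peer PeerInfo) ([]string, error) {
	return d.FetchHeaders(context.Background(), peer, nil)
}

func main() {
	d := Dispatcher{transports: map[TransportType]HeaderFetcher{
		TransportHTTP:         fakeTransport{name: "http"},
		TransportWireProtocol: fakeTransport{name: "wire"},
	}}
	headers, err := fetchWithBackground(d, PeerInfo{ID: "bob", Transport: TransportWireProtocol})
	fmt.Println(headers, err)
}
```

Because the caller names the peer on every request, headers and blocks for one catchup always come from the peer BlockValidation selected.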
110+
111+
### 5. Remove `LEGACY_SYNCING` FSM state, unify to `CATCHING_BLOCKS`
112+
113+
**Choice**: Delete `LEGACY_SYNCING` state and the `LEGACYSYNC` event. All catchup uses `CATCHING_BLOCKS`. Quick validation is controlled by the existing `blockvalidation_catchup_allow_quick_validation` setting, not by which FSM state is active.
114+
115+
**Alternatives considered**:
116+
117+
- Keep both states for backward compatibility: Two states for the same thing is the source of confusion. Clean break is better.
118+
- Add a third "coordinated catchup" state: Just renaming the problem
119+
120+
**Rationale**: `LEGACY_SYNCING` only existed because Legacy initiated sync separately. With centralized orchestration, there's one catchup path and one FSM state for it.
121+
122+
### 6. Legacy peers get full PeerInfo tracking
123+
124+
**Choice**: When Legacy connects to a Bitcoin P2P peer, it registers that peer in the centralized registry with:
125+
126+
- `TransportType: "wire"`
127+
- Bytes sent/received tracking (from Legacy's existing per-peer counters)
128+
- Block height (from version message exchange)
129+
- Reputation score (starts at 50, updated based on catchup success/failure like P2P peers)
130+
131+
**Alternatives considered**:
132+
133+
- Minimal registration (just ID + height): Loses the ability to properly rank Legacy peers against P2P peers
134+
- Legacy manages its own metrics, registry just stores ID: Splits the responsibility, harder to maintain
135+
136+
**Rationale**: For the peer selector to make good choices across transports, it needs comparable data for all peers. Legacy already tracks most of this internally — it just needs to report it to the centralized registry.
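
As an illustration of what Legacy would report, the sketch below builds a registry entry for a wire protocol peer and adjusts its reputation after catchup attempts. The starting score of 50 comes from the proposal; the step sizes (+5 on success, -10 on failure) and the 0 to 100 clamp are assumptions chosen only to make the example concrete, as are the `NewLegacyPeer` and `RecordCatchup` names.

```go
package main

import "fmt"

// PeerInfo holds the fields the proposal lists for Legacy peers (hypothetical shape).
type PeerInfo struct {
	ID            string
	Transport     string
	Height        uint32
	BytesSent     uint64
	BytesReceived uint64
	Reputation    float64
}

const (
	initialReputation = 50.0  // starting score stated in the proposal
	successDelta      = 5.0   // assumed step, not from the proposal
	failureDelta      = -10.0 // assumed step, not from the proposal
	minReputation     = 0.0   // assumed bounds
	maxReputation     = 100.0
)

// NewLegacyPeer builds the entry Legacy would register on connect.
func NewLegacyPeer(id string, height uint32, sent, received uint64) PeerInfo {
	return PeerInfo{
		ID:            id,
		Transport:     "wire",
		Height:        height, // taken from the version message exchange
		BytesSent:     sent,
		BytesReceived: received,
		Reputation:    initialReputation,
	}
}

// RecordCatchup moves the score up or down and clamps it to the assumed bounds.
func (p *PeerInfo) RecordCatchup(success bool) {
	delta := failureDelta
	if success {
		delta = successDelta
	}
	p.Reputation += delta
	if p.Reputation < minReputation {
		p.Reputation = minReputation
	}
	if p.Reputation > maxReputation {
		p.Reputation = maxReputation
	}
}

func main() {
	peer := NewLegacyPeer("bob", 905, 1024, 4096)
	peer.RecordCatchup(true)
	peer.RecordCatchup(false)
	fmt.Printf("%s reputation after one success and one failure: %.0f\n", peer.ID, peer.Reputation)
}
```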
137+
138+
## Risks / Trade-offs
139+
140+
- **[New gRPC dependency between Legacy/P2P and Blockchain]** → Both services need a new blockchain client method for peer registration. Mitigation: Blockchain service is already a dependency for both. Adding peer registration methods is incremental.
141+
142+
- **[Legacy wire transport adds gRPC round-trip]** → When BlockValidation catches up from a Legacy peer, it calls Legacy via gRPC, which fetches via wire protocol. Extra hop. Mitigation: This only affects the control path (request/response), not data throughput. The alternative (blocks arriving unsolicited) is worse.
143+
144+
- **[Breaking change: LEGACY_SYNCING removal]** → Any code checking `FSMStateType_LEGACYSYNCING` breaks. Mitigation: Search codebase for all references. Limited to blockchain FSM callers and monitoring/alerting code.
145+
146+
- **[Blockchain service grows in responsibility]** → Adding peer registry to blockchain service increases its scope. Mitigation: The registry is a simple data store with well-defined interfaces. It doesn't add business logic to blockchain — just state tracking.
147+
148+
- **[Migration period with both old and new code]** → During rollout, some deployments may have old P2P/Legacy that still try to initiate catchup. Mitigation: Phase the rollout — centralized registry first (additive), then migrate catchup initiation (behavioral change), then remove old paths.
149+
150+
## Migration Plan
151+
152+
1. **Phase 1 - Centralized peer registry** (additive, no behavioral changes):
153+
154+
- Add peer registration gRPC methods to blockchain service
155+
- P2P registers its peers in blockchain registry (in addition to its local registry)
156+
- Legacy registers its peers in blockchain registry
157+
- Both continue operating as before — dual registration, no behavior change
158+
- Verify: all peers visible through blockchain service API
159+
160+
2. **Phase 2 - Transport interface** (additive):
161+
162+
- Implement `CatchupTransport` interface in BlockValidation
163+
- Implement `HTTPTransport` (extract from existing catchup code)
164+
- Implement `WireTransport` (gRPC bridge to Legacy service)
165+
- Add new gRPC methods to Legacy for on-demand block/header fetching
166+
- Verify: both transports work independently
167+
168+
3. **Phase 3 - Catchup orchestration migration** (behavioral change, feature-flagged):
169+
170+
- BlockValidation reads peers from centralized registry
171+
- BlockValidation initiates catchup using transport interface
172+
- Feature flag: `catchup_use_centralized_orchestration` (default false)
173+
- When enabled: P2P's SyncCoordinator and Legacy's SyncManager stop initiating catchup
174+
- Verify: catchup works with flag on, both transports
175+
176+
4. **Phase 4 - Cleanup**:
177+
178+
- Remove `LEGACY_SYNCING` FSM state
179+
- Remove catchup initiation code from P2P SyncCoordinator
180+
- Remove catchup initiation code from Legacy SyncManager
181+
- Remove P2P's local peer registry (use centralized only)
182+
- Remove feature flag, make centralized orchestration the only path
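
The Phase 3 feature flag can be read as a single switch over who is allowed to initiate catchup. The sketch below encodes that reading; the `Settings` struct, the `Initiator` values, and `MayInitiateCatchup` are hypothetical, and the assumption that BlockValidation does not self-initiate while the flag is off is an interpretation of the plan rather than something it states.

```go
package main

import "fmt"

// Settings holds the Phase 3 flag; the zero value matches the default (false).
type Settings struct {
	UseCentralizedOrchestration bool // catchup_use_centralized_orchestration
}

// Initiator names a component that may try to start catchup.
type Initiator string

const (
	InitiatorP2PSyncCoordinator Initiator = "p2p-sync-coordinator"
	InitiatorLegacySyncManager  Initiator = "legacy-sync-manager"
	InitiatorBlockValidation    Initiator = "blockvalidation"
)

// MayInitiateCatchup reports whether the caller is allowed to start catchup.
// Flag on: only BlockValidation. Flag off: only the pre-existing initiators.
func MayInitiateCatchup(s Settings, who Initiator) bool {
	if s.UseCentralizedOrchestration {
		return who == InitiatorBlockValidation
	}
	return who == InitiatorP2PSyncCoordinator || who == InitiatorLegacySyncManager
}

func main() {
	all := []Initiator{InitiatorP2PSyncCoordinator, InitiatorLegacySyncManager, InitiatorBlockValidation}
	for _, flag := range []bool{false, true} {
		s := Settings{UseCentralizedOrchestration: flag}
		for _, who := range all {
			fmt.Printf("flag=%v %s allowed=%v\n", flag, who, MayInitiateCatchup(s, who))
		}
	}
}
```

Routing every initiation attempt through one predicate makes Phase 4 a matter of deleting the flag-off branch.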
183+
184+
## Open Questions
185+
186+
- Should the centralized registry live in blockchain service or in a new lightweight coordination service? Blockchain is practical but increases its scope.
187+
- Should Legacy's `WireTransport` support parallel block fetching (multiple `getdata` messages in flight), or is sequential sufficient for catchup?
188+
- What's the migration story for existing deployments that only run Legacy (no P2P)? The centralized registry needs to work when only Legacy registers peers.
189+
- Should the peer selector have a transport preference setting for operators who know their network topology (e.g., "prefer wire protocol peers" or "prefer HTTP peers")?
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
1+
## Why
2+
3+
When both P2P and Legacy services are running, they independently initiate catchup and fight over the blockchain FSM. P2P's `SyncCoordinator` sends a Kafka message to trigger catchup from one peer (via HTTP), while Legacy's `SyncManager` calls `CatchUpBlocks()` directly and pushes blocks from a different peer (via wire protocol). The FSM silently rejects the second caller, but blocks from both peers can still arrive at BlockValidation, causing header/block mismatches, failed catchups, and confused state.
4+
5+
The root cause is that catchup coordination is split across two optional services with no shared state: the peer registry lives exclusively in the P2P service, Legacy has its own separate peer tracking, and neither knows what the other is doing. The FSM (`CATCHING_BLOCKS` vs `LEGACY_SYNCING`) is a state indicator, not a coordination mechanism.
6+
7+
Both services are optional — you can run one, both, or neither. The single-service paths work fine. The dual-service path is broken.
8+
9+
## What Changes
10+
11+
- **Centralize peer registry**: Move peer tracking out of the optional P2P service into a shared location (blockchain service or new coordination layer) so both P2P and Legacy peers are tracked uniformly. Legacy peers appear in the registry with proper bytes tracking, reputation scores, and the same metrics as P2P peers.
12+
- **Centralize catchup orchestration**: One coordination point decides which peer to sync from, regardless of whether that peer speaks HTTP (P2P) or wire protocol (Legacy). BlockValidation controls the catchup process; P2P and Legacy register as peer providers rather than independently triggering catchup.
13+
- **Unified peer model**: Legacy peers show up as first-class entries in the peer registry with a transport type field (HTTP vs wire protocol), bytes sent/received tracking, reputation scoring, and health metrics — same as P2P peers.
14+
- **Eliminate FSM state duplication**: Remove the separate `LEGACY_SYNCING` state. Catchup is catchup regardless of transport. Quick validation is controlled by the `blockvalidation_catchup_allow_quick_validation` setting, not by which service initiated the sync.
15+
- **Transport abstraction for catchup**: BlockValidation's catchup code fetches headers and blocks through a transport interface that supports both HTTP (current P2P path) and wire protocol (current Legacy path), selected based on the chosen peer's transport type.
16+
17+
## Capabilities
18+
19+
### New Capabilities
20+
21+
- `centralized-peer-registry`: Shared peer registry accessible to both P2P and Legacy services, tracking all peers uniformly with transport type, bytes metrics, reputation, and health scores.
22+
- `catchup-orchestration`: Single coordination point for catchup that selects the best peer from the unified registry and dispatches catchup through the appropriate transport, eliminating the P2P/Legacy fight.
23+
- `peer-transport-abstraction`: Transport interface for catchup operations supporting HTTP (P2P DataHub) and wire protocol (Legacy Bitcoin P2P) behind a common API.
24+
25+
### Modified Capabilities
26+
27+
## Impact
28+
29+
- **Code**: `services/p2p/peer_registry.go` (extract and generalize), `services/p2p/sync_coordinator.go` (becomes peer provider, not catchup initiator), `services/legacy/netsync/manager.go` (becomes peer provider, not catchup initiator), `services/blockvalidation/catchup.go` (uses transport interface), `services/blockchain/Server.go` (hosts centralized registry, FSM simplification), `services/blockchain/fsm.go` (remove `LEGACY_SYNCING` state)
30+
- **Configuration**: New settings for centralized peer registry persistence, transport selection preferences
31+
- **APIs**: New gRPC methods on blockchain service for peer registration and query; internal API changes for catchup transport
32+
- **Breaking**: `LEGACY_SYNCING` FSM state removed; any code checking for this state needs updating.
33+
- **Dependencies**: No new external dependencies
