Commit d117b28

2color and lidel authored
feat: add active peer probing and caching (#90)
* feat: add cached peer book with higher ttls
* feat: initial implementation of active peer probing
* feat: use the cached router
* chore: go mod tidy
* feat: log probe duration
* chore: log in probe loop
* fix: update peer state if doesn't exist
* fix: add addresses to cached address book
* fix: wrap with cached router only if available
* feat: make everything a little bit better
* chore: small refinements
* test: add test for cached addr book
* chore: rename files
* feat: add options to cached addr book
  fix test by allowing private ips
* feat: add instrumentation
* fix: thread safety
* docs: update changelog
* fix: small fixes
* fix: simplify cached router
* feat(metric): cached_router_peer_addr_lookups
  this adds metric for evaluating all addr lookups someguy_cached_router_peer_addr_lookups{cache="unused|hit|miss",origin="providers|peers"}
  I've also wired up FindPeers for completeness.
* Apply suggestions from code review
  Co-authored-by: Marcin Rataj <[email protected]>
* Update CHANGELOG.md
  Co-authored-by: Marcin Rataj <[email protected]>
* chore: use service name for namespace
* fix: type errors and missing imports
* feat: add queue probe
* Revert "feat: add queue probe"
  This reverts commit 75f1bf2.
* chore: simplify composite literal
* fix: implement custom cache fallback iterator
* fix: add cancel and simplify
* fix: move select to Val function
* fix: concurrency bug from the ongoingLookups
* chore: clean up comments
* fix: add lint ignores
* docs: update changelog
* fix: increase bucket sizes for probe duration
* chore: remove unused peer state fields
  save some memory
* feat: enable caching for FindPeer in cached router
* fix: handle peer not found case
* Apply suggestions from code review
  Co-authored-by: Marcin Rataj <[email protected]>
* fix: wait longer during cleanup function
* test: remove bitswap record test
* refactor: extract connectedness checks to a func
* fix: set ttl for both signed and unsigned addrs
* fix: prevent race condition
* feat: use 2q-lru cache for peer state
  2q-lru tracks both frequently and recently used entries separately
* chore: remove return count
  we don't need the return count with the 2q-lru cache and the peerAddrLookups metric
* test: improve reliability of tests
  mock the libp2p host and use a real event bus
* fix: record failed connections
* feat: add exponential backoff for probes/peer lookups
* fix: return peers with no addrs that won't probe
* fix: brittle test
* feat: add probed peers counter
* fix: adjust probe duration metric buckets
* fix: prevent race conditions
* feat: increase cache size and add max backoff
* fix: omit providers whose peer cannot be found
* chore: remove unused function
* deps: upgrade go-libp2p
* fix: avoid using the cache in FindPeers
* fix: do not return cached results for FindPeers
* refactor: small optimisation
* chore: re-add comment
* Apply suggestions from code review
  Co-authored-by: Marcin Rataj <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Marcin Rataj <[email protected]>
* fix: use separate context for dispatched jobs
* fix: ensure proper cleanup of cache fallback iter
* Update main.go
  Co-authored-by: Marcin Rataj <[email protected]>
* fix: formatting
* fix: let consumer handle cleanup
* fix: remove from address book when removed from peer state
* fix: use normal lru cache instead of 2Q
* fix: update the metric when removing from the peer cache
* fix: increase max backoff to 48 hours
  When the max backoff duration is reached and a connection attempt fails we clear the cached addresses and state. Since this state is useful to prevent unnecessary attempts to dispatch a find peer, we should keep it for as long as a provider record is valid for.
* feat: add env var for recently connected ttl
* feat: add env var to control active probing
* fix: bug from closing the iterator twice
  no need to close the channel, just the source iterator
* docs: update comment
* docs: improve changelog
* test: fix background test
* feat(metrics): track online vs offline probe ratio

---------

Co-authored-by: Daniel N <[email protected]>
Co-authored-by: Marcin Rataj <[email protected]>
1 parent 4023bba commit d117b28

File tree

11 files changed (+1354, -32 lines)

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -15,6 +15,14 @@ The following emojis are used to highlight certain changes:

### Added

- Peer addresses are cached for 48h to match [provider record expiration on Amino DHT](https://github.com/libp2p/go-libp2p-kad-dht/blob/v0.28.1/amino/defaults.go#L40-L43).
- In the background, someguy probes cached peers at most once per hour (`PeerProbeThreshold`) by attempting to dial them to keep their multiaddrs up to date. If a peer is not reachable, an exponential backoff is applied to reduce the frequency of probing. If a cached peer is unreachable for more than 48h (`MaxBackoffDuration`), it is removed from the cache.
- Someguy now augments providers missing addresses in `FindProviders` with cached addresses. If a peer is encountered with no cached addresses, `FindPeer` is dispatched in the background and the result is streamed in the response. Providers for which no addresses can be found are omitted from the response.
  - This can be enabled via `SOMEGUY_CACHED_ADDR_BOOK=true|false` (enabled by default)
  - Two additional configuration options for the `cachedAddrBook` implementation:
    - `SOMEGUY_CACHED_ADDR_BOOK_ACTIVE_PROBING` controls whether to actively probe cached peers in the background to keep their multiaddrs up to date.
    - `SOMEGUY_CACHED_ADDR_BOOK_RECENT_TTL` adjusts the TTL for cached addresses of recently connected peers.

### Changed

### Removed
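The two `SOMEGUY_CACHED_ADDR_BOOK_*` options above map onto the `AddrBookOption` constructors added in `cached_addr_book.go` (shown further down). The actual env-var plumbing, including the top-level `SOMEGUY_CACHED_ADDR_BOOK` toggle, lives in `main.go`, which is not included in this excerpt, so the sketch below is illustrative only; `cachedAddrBookFromEnv` is a hypothetical helper, not code from this commit.

package main

import (
	"os"
	"strconv"
	"time"
)

// Hypothetical helper: shows how the documented env vars could be translated
// into AddrBookOption values before calling newCachedAddrBook.
func cachedAddrBookFromEnv() (*cachedAddrBook, error) {
	var opts []AddrBookOption

	// SOMEGUY_CACHED_ADDR_BOOK_ACTIVE_PROBING: enable/disable background probing.
	if v := os.Getenv("SOMEGUY_CACHED_ADDR_BOOK_ACTIVE_PROBING"); v != "" {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return nil, err
		}
		opts = append(opts, WithActiveProbing(enabled))
	}

	// SOMEGUY_CACHED_ADDR_BOOK_RECENT_TTL: TTL for addresses of recently connected peers.
	if v := os.Getenv("SOMEGUY_CACHED_ADDR_BOOK_RECENT_TTL"); v != "" {
		ttl, err := time.ParseDuration(v)
		if err != nil {
			return nil, err
		}
		opts = append(opts, WithRecentlyConnectedTTL(ttl))
	}

	return newCachedAddrBook(opts...)
}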

cached_addr_book.go

Lines changed: 354 additions & 0 deletions
@@ -0,0 +1,354 @@
package main

import (
	"context"
	"io"
	"sync"
	"sync/atomic"
	"time"

	lru "github.com/hashicorp/golang-lru/v2"
	"github.com/ipfs/boxo/routing/http/types"
	"github.com/libp2p/go-libp2p-kad-dht/amino"
	"github.com/libp2p/go-libp2p/core/event"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/peerstore"
	"github.com/libp2p/go-libp2p/p2p/host/peerstore/pstoremem"
	"github.com/libp2p/go-libp2p/p2p/protocol/circuitv2/relay"
	ma "github.com/multiformats/go-multiaddr"
	manet "github.com/multiformats/go-multiaddr/net"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

const (
	Subsystem = "cached_addr_book"

	// The default TTL to keep recently connected peers' multiaddrs for
	DefaultRecentlyConnectedAddrTTL = amino.DefaultProvideValidity

	// Connected peers don't expire until they disconnect
	ConnectedAddrTTL = peerstore.ConnectedAddrTTL

	// How long to wait since last connection before probing a peer again
	PeerProbeThreshold = time.Hour

	// How often to run the probe peers loop
	ProbeInterval = time.Minute * 15

	// How many concurrent probes to run at once
	MaxConcurrentProbes = 20

	// How long to wait for a connect in a probe to complete.
	// The worst case is a peer behind a relay, so we use the relay connect timeout.
	ConnectTimeout = relay.ConnectTimeout

	// How many peers to cache in the peer state cache
	// 1_000_000 is 10x the default number of signed peer records cached by the memory address book.
	PeerCacheSize = 1_000_000

	// Maximum backoff duration for probing a peer. After this duration, we will stop
	// trying to connect to the peer and remove it from the cache.
	MaxBackoffDuration = amino.DefaultProvideValidity

	probeResult        = "result"
	probeResultOnline  = "online"
	probeResultOffline = "offline"
)

var (
	probeDurationHistogram = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:      "probe_duration_seconds",
		Namespace: name,
		Subsystem: Subsystem,
		Help:      "Duration of peer probing operations in seconds",
		// Buckets probe durations from 5s to 15 minutes
		Buckets: []float64{5, 10, 30, 60, 120, 300, 600, 900},
	})

	probedPeersCounter = promauto.NewCounterVec(prometheus.CounterOpts{
		Name:      "probed_peers",
		Subsystem: Subsystem,
		Namespace: name,
		Help:      "Number of peers probed",
	},
		[]string{probeResult},
	)

	peerStateSize = promauto.NewGauge(prometheus.GaugeOpts{
		Name:      "peer_state_size",
		Subsystem: Subsystem,
		Namespace: name,
		Help:      "Number of peer objects currently in the peer state",
	})
)

type peerState struct {
	lastConnTime       time.Time // last time we successfully connected to this peer
	lastFailedConnTime time.Time // last time we failed to find or connect to this peer
	connectFailures    uint      // number of times we've failed to connect to this peer
}

type cachedAddrBook struct {
	addrBook             peerstore.AddrBook             // memory address book
	peerCache            *lru.Cache[peer.ID, peerState] // LRU cache with additional metadata about peer
	probingEnabled       bool
	isProbing            atomic.Bool
	allowPrivateIPs      bool // for testing
	recentlyConnectedTTL time.Duration
}

type AddrBookOption func(*cachedAddrBook) error

func WithAllowPrivateIPs() AddrBookOption {
	return func(cab *cachedAddrBook) error {
		cab.allowPrivateIPs = true
		return nil
	}
}

func WithRecentlyConnectedTTL(ttl time.Duration) AddrBookOption {
	return func(cab *cachedAddrBook) error {
		cab.recentlyConnectedTTL = ttl
		return nil
	}
}

func WithActiveProbing(enabled bool) AddrBookOption {
	return func(cab *cachedAddrBook) error {
		cab.probingEnabled = enabled
		return nil
	}
}

func newCachedAddrBook(opts ...AddrBookOption) (*cachedAddrBook, error) {
	peerCache, err := lru.New[peer.ID, peerState](PeerCacheSize)
	if err != nil {
		return nil, err
	}

	cab := &cachedAddrBook{
		peerCache:            peerCache,
		addrBook:             pstoremem.NewAddrBook(),
		recentlyConnectedTTL: DefaultRecentlyConnectedAddrTTL, // Set default value
	}

	for _, opt := range opts {
		err := opt(cab)
		if err != nil {
			return nil, err
		}
	}
	logger.Infof("Using TTL of %s for recently connected peers", cab.recentlyConnectedTTL)
	logger.Infof("Probing enabled: %t", cab.probingEnabled)
	return cab, nil
}

func (cab *cachedAddrBook) background(ctx context.Context, host host.Host) {
	sub, err := host.EventBus().Subscribe([]interface{}{
		&event.EvtPeerIdentificationCompleted{},
		&event.EvtPeerConnectednessChanged{},
	})
	if err != nil {
		logger.Errorf("failed to subscribe to peer identification events: %v", err)
		return
	}
	defer sub.Close()

	probeTicker := time.NewTicker(ProbeInterval)
	defer probeTicker.Stop()

	for {
		select {
		case <-ctx.Done():
			cabCloser, ok := cab.addrBook.(io.Closer)
			if ok {
				errClose := cabCloser.Close()
				if errClose != nil {
					logger.Warnf("failed to close addr book: %v", errClose)
				}
			}
			return
		case ev := <-sub.Out():
			switch ev := ev.(type) {
			case event.EvtPeerIdentificationCompleted:
				pState, exists := cab.peerCache.Peek(ev.Peer)
				if !exists {
					pState = peerState{}
				}
				pState.lastConnTime = time.Now()
				pState.lastFailedConnTime = time.Time{} // reset failed connection time
				pState.connectFailures = 0              // reset connect failures on successful connection
				cab.peerCache.Add(ev.Peer, pState)
				peerStateSize.Set(float64(cab.peerCache.Len())) // update metric

				ttl := cab.getTTL(host.Network().Connectedness(ev.Peer))
				if ev.SignedPeerRecord != nil {
					logger.Debug("Caching signed peer record")
					cab, ok := peerstore.GetCertifiedAddrBook(cab.addrBook)
					if ok {
						_, err := cab.ConsumePeerRecord(ev.SignedPeerRecord, ttl)
						if err != nil {
							logger.Warnf("failed to consume signed peer record: %v", err)
						}
					}
				} else {
					logger.Debug("No signed peer record, caching listen addresses")
					// We don't have a signed peer record, so we use the listen addresses
					cab.addrBook.AddAddrs(ev.Peer, ev.ListenAddrs, ttl)
				}
			case event.EvtPeerConnectednessChanged:
				// If the peer is not connected or limited, we update the TTL
				if !hasValidConnectedness(ev.Connectedness) {
					cab.addrBook.UpdateAddrs(ev.Peer, ConnectedAddrTTL, cab.recentlyConnectedTTL)
				}
			}
		case <-probeTicker.C:
			if !cab.probingEnabled {
				logger.Debug("Probing disabled, skipping")
				continue
			}
			if cab.isProbing.Load() {
				logger.Debug("Skipping peer probe, still running")
				continue
			}
			logger.Debug("Starting to probe peers")
			cab.isProbing.Store(true)
			go cab.probePeers(ctx, host)
		}
	}
}

// Loops over all peers with addresses and probes them if they haven't been probed recently
func (cab *cachedAddrBook) probePeers(ctx context.Context, host host.Host) {
	defer cab.isProbing.Store(false)

	start := time.Now()
	defer func() {
		duration := time.Since(start).Seconds()
		probeDurationHistogram.Observe(duration)
		logger.Debugf("Finished probing peers in %.2fs", duration)
	}()

	var wg sync.WaitGroup
	// semaphore channel to limit the number of concurrent probes
	semaphore := make(chan struct{}, MaxConcurrentProbes)

	for i, p := range cab.addrBook.PeersWithAddrs() {
		if hasValidConnectedness(host.Network().Connectedness(p)) {
			continue // don't probe connected peers
		}

		if !cab.ShouldProbePeer(p) {
			continue
		}

		addrs := cab.addrBook.Addrs(p)

		if !cab.allowPrivateIPs {
			addrs = ma.FilterAddrs(addrs, manet.IsPublicAddr)
		}

		if len(addrs) == 0 {
			continue // no addresses to probe
		}

		wg.Add(1)
		semaphore <- struct{}{}
		go func() {
			defer func() {
				<-semaphore // Release semaphore
				wg.Done()
			}()
			ctx, cancel := context.WithTimeout(ctx, ConnectTimeout)
			defer cancel()
			logger.Debugf("Probe %d: PeerID: %s, Addrs: %v", i+1, p, addrs)
			// if connect succeeds and identify runs, the background loop will take care of updating the peer state and cache
			err := host.Connect(ctx, peer.AddrInfo{
				ID:    p,
				Addrs: addrs,
			})
			if err != nil {
				logger.Debugf("failed to connect to peer %s: %v", p, err)
				cab.RecordFailedConnection(p)
				probedPeersCounter.WithLabelValues(probeResultOffline).Inc()
			} else {
				probedPeersCounter.WithLabelValues(probeResultOnline).Inc()
			}
		}()
	}
	wg.Wait()
}

// Returns the cached addresses for a peer
func (cab *cachedAddrBook) GetCachedAddrs(p peer.ID) []types.Multiaddr {
	cachedAddrs := cab.addrBook.Addrs(p)

	if len(cachedAddrs) == 0 {
		return nil
	}

	result := make([]types.Multiaddr, 0, len(cachedAddrs)) // convert to local Multiaddr type 🙃
	for _, addr := range cachedAddrs {
		result = append(result, types.Multiaddr{Multiaddr: addr})
	}
	return result
}

// Update the peer cache with information about a failed connection
// This should be called when a connection attempt to a peer fails
func (cab *cachedAddrBook) RecordFailedConnection(p peer.ID) {
	pState, exists := cab.peerCache.Peek(p)
	if !exists {
		pState = peerState{}
	}
	now := time.Now()
	// once probing of offline peer reached MaxBackoffDuration and still failed,
	// we opportunistically remove the dead peer from cache to save time on probing it further
	if exists && pState.connectFailures > 1 && now.Sub(pState.lastFailedConnTime) > MaxBackoffDuration {
		cab.peerCache.Remove(p)
		peerStateSize.Set(float64(cab.peerCache.Len())) // update metric
		// remove the peer from the addr book. Otherwise it will be probed again in the probe loop
		cab.addrBook.ClearAddrs(p)
		return
	}
	pState.lastFailedConnTime = now
	pState.connectFailures++
	cab.peerCache.Add(p, pState)
}

// Returns true if we should probe a peer (either by dialing known addresses or by dispatching a FindPeer)
// based on the last failed connection time and connection failures
func (cab *cachedAddrBook) ShouldProbePeer(p peer.ID) bool {
	pState, exists := cab.peerCache.Peek(p)
	if !exists {
		return true // default to probing if the peer is not in the cache
	}

	var backoffDuration time.Duration
	if pState.connectFailures > 0 {
		// Calculate backoff only if we have failures
		// this is effectively 2^(connectFailures - 1) * PeerProbeThreshold
		// A single failure results in a 1 hour backoff and each additional failure doubles the backoff
		backoffDuration = PeerProbeThreshold * time.Duration(1<<(pState.connectFailures-1))
		backoffDuration = min(backoffDuration, MaxBackoffDuration) // clamp to max backoff duration
	} else {
		backoffDuration = PeerProbeThreshold
	}

	// Only dispatch if we've waited long enough based on the backoff
	return time.Since(pState.lastFailedConnTime) > backoffDuration
}

func hasValidConnectedness(connectedness network.Connectedness) bool {
	return connectedness == network.Connected || connectedness == network.Limited
}

func (cab *cachedAddrBook) getTTL(connectedness network.Connectedness) time.Duration {
	if hasValidConnectedness(connectedness) {
		return ConnectedAddrTTL
	}
	return cab.recentlyConnectedTTL
}
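Taken together, `newCachedAddrBook` and `background` are meant to run alongside the router's libp2p host: identify events populate the address book and peer cache, while the probe ticker re-dials stale entries. The real startup wiring lives in `main.go` and is not part of this excerpt, so the sketch below is an assumption about how a caller might hook it up; `startCachedAddrBook` is a hypothetical helper.

package main

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p"
)

// Hypothetical wiring, not part of commit d117b28: starts the cached address
// book next to a libp2p host so identify events and probe dials feed the cache.
func startCachedAddrBook(ctx context.Context) error {
	h, err := libp2p.New()
	if err != nil {
		return err
	}

	cab, err := newCachedAddrBook(
		WithActiveProbing(true),
		WithRecentlyConnectedTTL(48*time.Hour), // 48h default described in the changelog
	)
	if err != nil {
		return err
	}

	// background blocks until ctx is cancelled, so run it in its own goroutine.
	go cab.background(ctx, h)
	return nil
}

The backoff in `ShouldProbePeer` means a peer that fails its first probe is left alone for `PeerProbeThreshold` (1 hour), with the wait doubling on each further failure and clamped at `MaxBackoffDuration` (48 hours, the Amino DHT provider-record validity); once a peer has been failing for longer than that, `RecordFailedConnection` drops it from both the peer cache and the address book so it is no longer probed.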
