
Commit 4fbff64

connectd: fix race when we supply a new address.
This shows up as a flake in test_route_by_old_scid:

```
        # Now restart l2, make sure it remembers the original!
        l2.restart()
>       l2.rpc.connect(l1.info['id'], 'localhost', l1.port)

tests/test_splicing.py:554:
...
>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: connect, payload: {'id': '0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518', 'host': 'localhost', 'port': 33837}, error: {'code': 400, 'message': 'Unable to connect, no address known for peer'}
```

This is because it's already (auto)connecting, and fails.  This failure is
reported before we've added the new address (once we add the new address,
we connect fine, but it's too late!):

```
lightningd-2 2025-12-08T02:39:18.241Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_NEW_BLOCKHEIGHT_REPLY with 0 fds
lightningd-2 2025-12-08T02:39:18.320Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Initializing important peer with 0 addresses
lightningd-2 2025-12-08T02:39:18.320Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Failed connected out: Unable to connect, no address known for peer
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Will try reconnect in 1 seconds
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Initializing important peer with 1 addresses
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Connected out, starting crypto
lightningd-2 2025-12-08T02:39:18.344Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Adding 1 addresses to important peer
lightningd-2 2025-12-08T02:39:18.345Z DEBUG   0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Connected out, starting crypto
{'run_id': 256236335046680576, 'github_repository': 'ElementsProject/lightning', 'github_sha': '555f1ac461d151064aad6fc83b94a0290e2e9d5d', 'github_ref': 'refs/pull/8767/merge', 'github_ref_name': 'HEAD', 'github_run_id': 20013689862, 'github_head_ref': 'fixup-backfill-bug', 'github_run_number': 14774, 'github_base_ref': 'master', 'github_run_attempt': '1', 'testname': 'test_route_by_old_scid', 'start_time': 1765161493, 'end_time': 1765161558, 'outcome': 'fail'}
lightningd-2 2025-12-08T02:39:18.421Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
lightningd-2 2025-12-08T02:39:18.421Z DEBUG   hsmd: Client: Received message 1 from client
lightningd-2 2025-12-08T02:39:18.453Z DEBUG   022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-hsmd: Got WIRE_HSMD_ECDH_REQ
lightningd-2 2025-12-08T02:39:18.453Z DEBUG   hsmd: Client: Received message 1 from client
--------------------------- Captured stdout teardown ---------------------------
```
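To make the race concrete, here is a minimal standalone sketch (not the actual lightningd code) of the gating decision the fix introduces: a failure only fails pending connect commands when the failed attempt is one the user explicitly asked for. The `strstarts()` helper is modelled on the CCAN utility the codebase uses; the "important peer autoreconnect" reason string is a hypothetical stand-in, as only the "connect command" prefix is confirmed by the diff below.

```
/* Sketch of the reason-string gating; strstarts() mimics CCAN's,
 * and the autoreconnect reason string is a hypothetical example. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool strstarts(const char *str, const char *prefix)
{
	return strncmp(str, prefix, strlen(prefix)) == 0;
}

static void connect_failed(const char *connect_reason, const char *errmsg)
{
	/* Only an explicitly-requested attempt should fail the user's
	 * pending connect commands: an autoreconnect may have failed
	 * before the command supplied connectd with a new address. */
	if (strstarts(connect_reason, "connect command"))
		printf("fail pending commands: %s\n", errmsg);
	else
		printf("ignore '%s' failure; explicit attempt may still be in flight\n",
		       connect_reason);
}

int main(void)
{
	/* The flake: autoreconnect fails first, with no address known... */
	connect_failed("important peer autoreconnect",
		       "Unable to connect, no address known for peer");
	/* ...but only the attempt we asked for decides the command's fate. */
	connect_failed("connect command",
		       "Unable to connect, no address known for peer");
	return 0;
}
```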
Parent: 9e6da60

2 files changed (+15, -5 lines)
lightningd/connect_control.c

Lines changed: 15 additions & 4 deletions
```
@@ -277,10 +277,21 @@ static void connect_failed(struct lightningd *ld,
 			    connect_nsec,
 			    connect_attempted);
 
-	/* We can have multiple connect commands: fail them all */
-	while ((c = find_connect(ld, id)) != NULL) {
-		/* They delete themselves from list */
-		was_pending(command_fail(c->cmd, errcode, "%s", errmsg));
+	/* There's a race between autoreconnect and connect commands.  This
+	 * matters because the autoreconnect might have failed, but that was before
+	 * the connect_to_peer command gave connectd a new address.  Thus we wait for
+	 * one we explicitly asked for before failing.
+	 *
+	 * A similar pattern could occur with multiple connect commands, however connectd
+	 * simply combines those, so we don't get a response per request, and it's a
+	 * very rare corner case (which, unlike the above, doesn't happen in CI!).
+	 */
+	if (strstarts(connect_reason, "connect command")) {
+		/* We can have multiple connect commands: fail them all */
+		while ((c = find_connect(ld, id)) != NULL) {
+			/* They delete themselves from list */
+			was_pending(command_fail(c->cmd, errcode, "%s", errmsg));
+		}
 	}
 }
 
```
tests/test_splicing.py

Lines changed: 0 additions & 1 deletion
```
@@ -499,7 +499,6 @@ def test_splice_stuck_htlc(node_factory, bitcoind, executor):
     assert l1.db_query("SELECT count(*) as c FROM channeltxs;")[0]['c'] == 0
 
 
-@pytest.mark.flaky(reruns=5)
 @unittest.skipIf(TEST_NETWORK != 'regtest', 'elementsd doesnt yet support PSBT features we need')
 def test_route_by_old_scid(node_factory, bitcoind):
     l1, l2, l3 = node_factory.line_graph(3, wait_for_announce=True, opts={'experimental-splicing': None, 'may_reconnect': True})
```