
Commit d34a9ab

Author: Marko Petzold (committed)
feat: Implement cluster IP auto-refresh during node authentication to prevent stale connections
1 parent 6220823 commit d34a9ab

9 files changed: +524 -33 lines changed

BUGFIX-CLUSTER-IP-REFRESH.md

Lines changed: 156 additions & 0 deletions

# Cluster IP Auto-Refresh Fix

## Problem Summary

When router nodes in a Crossbar.io cluster restart (e.g., Kubernetes pod rescheduling), they receive new IP addresses. However, proxy workers were connecting to stale IP addresses stored in the master's database, causing connection failures.

**Symptoms:**

- WAMP clients unable to connect to realms via proxy
- Proxy workers timing out with "Connection refused" errors
- Router nodes running with new IPs (e.g., 10.108.3.17) but database containing old IPs (e.g., 10.108.1.12)

## Root Cause

The `cluster_ip` field in the master database was not being updated when router nodes reconnected with new IP addresses. The system had several issues:

1. **WAMP meta events not enabled**: The management realm didn't have `wamp.session.on_join` meta events enabled, so session join handlers never fired
2. **Wrong update location**: Initial attempts to update `cluster_ip` in session join handlers (`_on_session_startup`) failed because meta events weren't being published
3. **Stale data returned**: The authenticator returned the old `node.authextra` from the database, overwriting the current `cluster_ip` sent by the node

## Solution

Update the `cluster_ip` during the authentication phase, before the session joins. This approach doesn't rely on WAMP meta events and executes on every node connection.

### Implementation

**Key Changes:**

1. **Node sends cluster_ip in authextra** (`crossbar/edge/node/management.py`)
   - Reads `CROSSBAR_NODE_CLUSTER_IP` from environment variable
   - Falls back to `127.0.0.1` if not set
   - Sends `cluster_ip` in authextra during authentication (see the sketch after this list)

2. **Authenticator updates database** (`crossbar/master/node/authenticator.py`)
   - Extracts `cluster_ip` from incoming `details['authextra']`
   - Compares with database `node.cluster_ip`
   - Updates database if different (with a write transaction)
   - Updates both `node.cluster_ip` and `node.authextra['cluster_ip']`
   - Logs IP changes for observability

3. **Cleanup**
   - Removed `cluster_ip` from key file generation (`crossbar/common/key.py`)
   - Removed `cluster_ip` from auto-pairing logic (`crossbar/master/node/controller.py`)
   - Removed redundant database update from session join handler (`crossbar/master/mrealm/controller.py`)
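
As a rough illustration of the node-side change, the sketch below shows how the cluster IP could be read from the environment and attached to the authextra a node presents to the master. The helper name and surrounding session setup are simplified placeholders, not the actual code in `crossbar/edge/node/management.py`:

```python
import os


def build_node_authextra(pubkey: str) -> dict:
    """Assemble the authextra a router node sends to the master (simplified sketch)."""
    # Reachable address of this node inside the cluster; falls back to
    # loopback when the variable is not set, as described above.
    cluster_ip = os.environ.get('CROSSBAR_NODE_CLUSTER_IP', '127.0.0.1')

    return {
        'pubkey': pubkey,          # node public key used during authentication
        'cluster_ip': cluster_ip,  # current (possibly new) pod IP or hostname
    }
```
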
### Code Flow

```
Router Node Restart

Read CROSSBAR_NODE_CLUSTER_IP from environment (pod IP)

Connect to master with cluster_ip in authextra

Authenticator._auth_node() extracts incoming cluster_ip

Compare incoming_cluster_ip vs database node.cluster_ip

If different: Update database + log change

Return updated authextra to node

ApplicationRealmMonitor reads node.cluster_ip from database

Configure proxy backend connections with current IP

Proxy workers connect to correct router IP
```
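
The compare-and-update step in the middle of this flow can be sketched roughly as follows. The `node` record, the `save_node` helper and the logger are illustrative placeholders; the real authenticator in `crossbar/master/node/authenticator.py` persists the change inside a database write transaction, as noted above:

```python
def refresh_cluster_ip(node, authextra, save_node, log):
    """Update the stored cluster IP if the node reported a new one (simplified sketch)."""
    incoming_cluster_ip = (authextra or {}).get('cluster_ip')

    if incoming_cluster_ip and incoming_cluster_ip != node.cluster_ip:
        log.info(f'Node {node.authid} cluster IP changed from {node.cluster_ip} '
                 f'to {incoming_cluster_ip} - updating database during authentication')

        # Update both the dedicated field and the copy kept in authextra, so
        # later readers (e.g. ApplicationRealmMonitor) see the new address.
        node.cluster_ip = incoming_cluster_ip
        node.authextra = dict(node.authextra or {}, cluster_ip=incoming_cluster_ip)
        save_node(node)  # placeholder for the write transaction

    # Return authextra reflecting the current cluster IP, not the stale database copy.
    return node.authextra
```
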
## Testing

### Local Docker Verification

```bash
# Build and deploy
just build_amd
docker-compose up -d

# Check router environment
docker exec crossbar_router_realm1 env | grep CROSSBAR_NODE_CLUSTER_IP

# Verify authenticator receives cluster_ip
docker logs crossbar_master 2>&1 | grep "Node authentication received"

# Check for IP changes (on pod restart)
docker logs crossbar_master 2>&1 | grep "cluster IP changed"

# Verify proxy connections succeed
docker logs crossbar_proxy1 2>&1 | grep "proxy backend session joined"
```

### Kubernetes/GKE Verification

```bash
# Check router pod IP
kubectl get pod crossbar-router-realm1-sfs-0 -o wide

# Verify environment variable
kubectl exec crossbar-router-realm1-sfs-0 -- env | grep CROSSBAR_NODE_CLUSTER_IP

# Check master logs for IP updates
kubectl logs crossbar-master-0 | grep "cluster IP changed"

# Verify proxy connections
kubectl logs crossbar-proxy-realm1-0 | grep "proxy backend session joined"

# Test pod restart
kubectl delete pod crossbar-router-realm1-sfs-0
# Wait for pod to restart with new IP
kubectl logs crossbar-master-0 | grep "cluster IP changed"
```

## Environment Configuration

Router nodes must set `CROSSBAR_NODE_CLUSTER_IP` to their reachable IP address or hostname:

### Kubernetes StatefulSet

```yaml
env:
  - name: CROSSBAR_NODE_CLUSTER_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```

### Docker Compose

```yaml
environment:
  CROSSBAR_NODE_CLUSTER_IP: crossbar_router_realm1  # hostname or IP
```

## Important Notes

1. **Hostnames supported**: The `cluster_ip` can be either an IP address or a resolvable hostname. Twisted's TCP client automatically resolves DNS names.

2. **Authentication-time update**: The `cluster_ip` update happens during authentication, not during session join. This is the critical difference: it doesn't depend on WAMP meta events.

3. **Backward compatibility**: Old key files with `cluster_ip` are still supported (`cluster_ip` in allowed_tags), but new key generation doesn't include it.

4. **No restart required**: When a router pod restarts with a new IP, the master database updates automatically on the next authentication. Proxy workers pick up the new IP from the database.

5. **Observability**: Log messages show when cluster IPs change:
   ```
   Node router_realm1 cluster IP changed from 10.108.1.12 to 10.108.3.17 - updating database during authentication
   ```

## Files Modified

- `crossbar/edge/node/management.py` - Send cluster_ip in authextra
- `crossbar/master/node/authenticator.py` - Update database during authentication
- `crossbar/master/mrealm/controller.py` - Removed redundant update logic
- `crossbar/common/key.py` - Removed cluster_ip from key generation
- `crossbar/master/node/controller.py` - Removed cluster_ip from auto-pairing

## Related Issues

- PR #2137: Resilient Proxy node and Router node management
- Kubernetes pod IP changes on rescheduling
- StatefulSet pod lifecycle management

BUGFIX-FORCE-REREGISTER.md

Lines changed: 220 additions & 0 deletions

# Force Re-registration Fix for Stale RLink Registrations

## Problem Summary

When router nodes disconnect and reconnect via RLink (router-to-router links), their previous procedure registrations become stale on the remote router. When the RLink reconnects and tries to re-register the same procedures, it receives `wamp.error.procedure_already_exists` errors, preventing the procedures from being available in the cluster.

**Symptoms:**

- RLink connections succeed but procedures don't get registered
- `procedure_already_exists` errors in logs during RLink registration
- Procedures unavailable on remote routers after RLink reconnection
- Manual cleanup required to restore functionality

## Root Cause

When an RLink session disconnects unexpectedly (network issue, pod restart, etc.), the remote router doesn't immediately clean up the registrations made by that RLink. When the RLink reconnects:

1. **Stale registrations remain**: The old registrations from the previous RLink session are still active
2. **Standard registration fails**: The new registration attempt gets a `procedure_already_exists` error
3. **No automatic cleanup**: Without `force_reregister`, there's no mechanism to replace stale registrations

## Solution

Implement an automatic retry with `force_reregister=True` when RLink encounters `procedure_already_exists` errors. This allows the new RLink session to forcefully replace stale registrations from previous sessions.

### Implementation

**Key Changes:**

1. **Dealer supports force_reregister** (`crossbar/router/dealer.py`)
   - Added `force_reregister` option to `Register` message handling
   - When `force_reregister=True`, kicks out all other observers before registering
   - Sends `UNREGISTERED` messages to kicked observers
   - Deletes and recreates observation if all observers were kicked

2. **RLink automatic retry logic** (`crossbar/worker/rlink.py`)
   - Preserves original `force_reregister` setting from registration details
   - First tries registration with original settings
   - On `procedure_already_exists` error:
     - If original didn't use `force_reregister`, retries with `force_reregister=True`
     - If original already used `force_reregister`, logs error (possible race condition)
   - Handles stale registrations from previous RLink connections

### Code Flow

```
RLink Session Connects

Forwards registrations from local router to remote router

First attempt: register(force_reregister=False)  # or original setting

├─ Success → Registration complete

└─ procedure_already_exists error

   Check if original used force_reregister

   If not, retry: register(force_reregister=True)

   Dealer kicks out stale observers (previous RLink session)

   Sends UNREGISTERED to stale sessions

   Deletes old observation, creates new one

   Registration succeeds with new RLink session
```

## Code Details

### Dealer Force Re-registration Logic

```python
if register.force_reregister and registration:
    # Kick out all other observers, but not the session doing the re-registration
    observers_to_kick = [obs for obs in registration.observers if obs != session]

    for obs in observers_to_kick:
        self._registration_map.drop_observer(obs, registration)
        kicked = message.Unregistered(
            0,
            registration=registration.id,
            reason="wamp.error.unregistered",
        )
        self._router.send(obs, kicked)

    # If we kicked out all observers, delete the observation so it can be recreated
    if observers_to_kick and len(registration.observers) == len(observers_to_kick):
        self._registration_map.delete_observation(registration)
```

### RLink Retry Logic

```python
# First try with original settings
try:
    reg = yield other.register(on_call,
                               uri,
                               options=RegisterOptions(
                                   details_arg='details',
                                   invoke=invoke,
                                   match=match,
                                   force_reregister=original_force_reregister,
                                   forward_for=forward_for,
                               ))
except ApplicationError as e:
    if e.error == 'wamp.error.procedure_already_exists':
        # If procedure already exists AND original didn't use force_reregister,
        # retry with force_reregister=True to replace stale registration.
        if not original_force_reregister:
            other_leg = 'local' if self.IS_REMOTE_LEG else 'remote'
            self.log.debug(
                f"procedure {uri} already exists on {other_leg}, "
                f"retrying with force_reregister=True")
            try:
                reg = yield other.register(on_call,
                                           uri,
                                           options=RegisterOptions(
                                               details_arg='details',
                                               invoke=invoke,
                                               match=match,
                                               force_reregister=True,
                                               forward_for=forward_for,
                                           ))
            except Exception as retry_e:
                self.log.error(f"failed to force-reregister {uri}: {retry_e}")
                return
```
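
For reference, the `force_reregister` flag used in the retry above is ordinary Autobahn client API and can be exercised from any component. A minimal, self-contained sketch (the router URL, realm and procedure URI are made up for illustration):

```python
from twisted.internet.defer import inlineCallbacks
from autobahn.twisted.wamp import ApplicationSession, ApplicationRunner
from autobahn.wamp.types import RegisterOptions


class ExampleComponent(ApplicationSession):

    @inlineCallbacks
    def onJoin(self, details):
        def ping():
            return 'pong'

        # force_reregister=True asks the dealer to replace any existing
        # registration for this URI instead of failing with
        # wamp.error.procedure_already_exists.
        yield self.register(ping, 'com.example.ping',
                            options=RegisterOptions(force_reregister=True))


if __name__ == '__main__':
    ApplicationRunner('ws://localhost:8080/ws', 'realm1').run(ExampleComponent)
```
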
## Testing

### Local Testing

```bash
# Start cluster with router and RLink
docker-compose up -d

# Check RLink connection
docker logs crossbar_router_realm1 2>&1 | grep -i rlink

# Verify procedure registrations
docker logs crossbar_router_realm1 2>&1 | grep "forward-register"

# Simulate disconnect/reconnect
docker restart crossbar_router_realm1

# Check for force_reregister retry messages
docker logs crossbar_router_realm1 2>&1 | grep "retrying with force_reregister=True"

# Verify procedures are available
# Test RPC calls to procedures
```

### Kubernetes Testing

```bash
# Check RLink status
kubectl logs crossbar-router-realm1-sfs-0 | grep rlink

# Delete pod to simulate reconnection
kubectl delete pod crossbar-router-realm1-sfs-0

# Watch for reconnection and registration
kubectl logs -f crossbar-router-realm1-sfs-0 | grep -E "rlink|force_reregister|procedure_already_exists"

# Verify procedures registered successfully
kubectl logs crossbar-router-realm1-sfs-0 | grep "forward-register.*success"
```

## Edge Cases Handled

1. **Session already registered**: If the current session is already registered for the procedure, it won't kick itself out
2. **Original force_reregister=True**: If the original registration already used `force_reregister`, a conflict indicates a race condition or multiple RLinks
3. **All observers kicked**: If all observers are removed, the observation is deleted and recreated cleanly
4. **Retry failure**: If the retry with `force_reregister=True` also fails, the error is logged and the registration is abandoned

## Important Notes

1. **Automatic cleanup**: Stale registrations are automatically replaced without manual intervention
2. **Session preservation**: The current session won't kick itself out if it's already registered
3. **Non-destructive**: If the original registration used `force_reregister=True`, we don't retry to avoid loops
4. **Backward compatible**: Existing code without `force_reregister` continues to work normally
5. **RLink-specific**: This primarily benefits RLink (router-to-router) connections where stale registrations are common

## Observability

Log messages to watch for:

### Successful force re-registration:
```
procedure com.example.procedure already exists on remote, retrying with force_reregister=True
```

### Force re-registration conflict (race condition):
```
procedure com.example.procedure already exists even though we used force_reregister=True.
Race condition or multiple rlinks?
```

### Observer kicked out:
```
UNREGISTERED message sent to session (kicked by force_reregister)
```

## Files Modified

- `crossbar/router/dealer.py` - Added force_reregister handling in `processRegister`
- `crossbar/worker/rlink.py` - Added automatic retry with force_reregister on conflict

## Related Issues

- PR #2137: Resilient Proxy node and Router node management
- RLink session lifecycle management
- Stale registration cleanup
- Router cluster resilience

## Migration Notes

No migration required. The fix is backward compatible:

- Existing registrations continue to work normally
- Only activates on `procedure_already_exists` errors
- Original registration behavior preserved for non-RLink sessions
