
Commit d34a9ab

Author: Marko Petzold (committed)
feat: Implement cluster IP auto-refresh during node authentication to prevent stale connections
1 parent 6220823 commit d34a9ab

9 files changed: +524 -33 lines changed

BUGFIX-CLUSTER-IP-REFRESH.md

Lines changed: 156 additions & 0 deletions

# Cluster IP Auto-Refresh Fix

## Problem Summary

When router nodes in a Crossbar.io cluster restart (e.g., Kubernetes pod rescheduling), they receive new IP addresses. However, proxy workers were connecting to stale IP addresses stored in the master's database, causing connection failures.

**Symptoms:**

- WAMP clients unable to connect to realms via proxy
- Proxy workers timing out with "Connection refused" errors
- Router nodes running with new IPs (e.g., 10.108.3.17) but database containing old IPs (e.g., 10.108.1.12)

## Root Cause

The `cluster_ip` field in the master database was not being updated when router nodes reconnected with new IP addresses. The system had several issues:

1. **WAMP meta events not enabled**: The management realm didn't have `wamp.session.on_join` meta events enabled, so session join handlers never fired
2. **Wrong update location**: Initial attempts to update `cluster_ip` in session join handlers (`_on_session_startup`) failed because meta events weren't being published
3. **Stale data returned**: The authenticator returned the old `node.authextra` from the database, overwriting the current `cluster_ip` sent by the node

## Solution

Update the `cluster_ip` during the authentication phase, before the session joins. This approach doesn't rely on WAMP meta events and executes on every node connection.

### Implementation

**Key Changes:**

1. **Node sends cluster_ip in authextra** (`crossbar/edge/node/management.py`)
   - Reads `CROSSBAR_NODE_CLUSTER_IP` from environment variable
   - Falls back to `127.0.0.1` if not set
   - Sends `cluster_ip` in authextra during authentication (see the sketch after this list)

2. **Authenticator updates database** (`crossbar/master/node/authenticator.py`)
   - Extracts `cluster_ip` from incoming `details['authextra']`
   - Compares with database `node.cluster_ip`
   - Updates database if different (with a write transaction)
   - Updates both `node.cluster_ip` and `node.authextra['cluster_ip']`
   - Logs IP changes for observability

3. **Cleanup**
   - Removed `cluster_ip` from key file generation (`crossbar/common/key.py`)
   - Removed `cluster_ip` from auto-pairing logic (`crossbar/master/node/controller.py`)
   - Removed redundant database update from session join handler (`crossbar/master/mrealm/controller.py`)
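
As a rough illustration of the node-side change, the sketch below shows how the cluster IP could be read from the environment and attached to the authextra a node presents to the master. The helper name and surrounding session setup are simplified placeholders, not the actual code in `crossbar/edge/node/management.py`:

```python
import os


def build_node_authextra(pubkey: str) -> dict:
    """Assemble the authextra a router node sends to the master (simplified sketch)."""
    # Reachable address of this node inside the cluster; falls back to
    # loopback when the variable is not set, as described above.
    cluster_ip = os.environ.get('CROSSBAR_NODE_CLUSTER_IP', '127.0.0.1')

    return {
        'pubkey': pubkey,          # node public key used during authentication
        'cluster_ip': cluster_ip,  # current (possibly new) pod IP or hostname
    }
```
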
### Code Flow

```
Router Node Restart

Read CROSSBAR_NODE_CLUSTER_IP from environment (pod IP)

Connect to master with cluster_ip in authextra

Authenticator._auth_node() extracts incoming cluster_ip

Compare incoming_cluster_ip vs database node.cluster_ip

If different: Update database + log change

Return updated authextra to node

ApplicationRealmMonitor reads node.cluster_ip from database

Configure proxy backend connections with current IP

Proxy workers connect to correct router IP
```
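
The compare-and-update step in the middle of this flow can be sketched roughly as follows. The `node` record, the `save_node` helper and the logger are illustrative placeholders; the real authenticator in `crossbar/master/node/authenticator.py` persists the change inside a database write transaction, as noted above:

```python
def refresh_cluster_ip(node, authextra, save_node, log):
    """Update the stored cluster IP if the node reported a new one (simplified sketch)."""
    incoming_cluster_ip = (authextra or {}).get('cluster_ip')

    if incoming_cluster_ip and incoming_cluster_ip != node.cluster_ip:
        log.info(f'Node {node.authid} cluster IP changed from {node.cluster_ip} '
                 f'to {incoming_cluster_ip} - updating database during authentication')

        # Update both the dedicated field and the copy kept in authextra, so
        # later readers (e.g. ApplicationRealmMonitor) see the new address.
        node.cluster_ip = incoming_cluster_ip
        node.authextra = dict(node.authextra or {}, cluster_ip=incoming_cluster_ip)
        save_node(node)  # placeholder for the write transaction

    # Return authextra reflecting the current cluster IP, not the stale database copy.
    return node.authextra
```
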
## Testing

### Local Docker Verification

```bash
# Build and deploy
just build_amd
docker-compose up -d

# Check router environment
docker exec crossbar_router_realm1 env | grep CROSSBAR_NODE_CLUSTER_IP

# Verify authenticator receives cluster_ip
docker logs crossbar_master 2>&1 | grep "Node authentication received"

# Check for IP changes (on pod restart)
docker logs crossbar_master 2>&1 | grep "cluster IP changed"

# Verify proxy connections succeed
docker logs crossbar_proxy1 2>&1 | grep "proxy backend session joined"
```

### Kubernetes/GKE Verification

```bash
# Check router pod IP
kubectl get pod crossbar-router-realm1-sfs-0 -o wide

# Verify environment variable
kubectl exec crossbar-router-realm1-sfs-0 -- env | grep CROSSBAR_NODE_CLUSTER_IP

# Check master logs for IP updates
kubectl logs crossbar-master-0 | grep "cluster IP changed"

# Verify proxy connections
kubectl logs crossbar-proxy-realm1-0 | grep "proxy backend session joined"

# Test pod restart
kubectl delete pod crossbar-router-realm1-sfs-0
# Wait for pod to restart with new IP
kubectl logs crossbar-master-0 | grep "cluster IP changed"
```

## Environment Configuration

Router nodes must set `CROSSBAR_NODE_CLUSTER_IP` to their reachable IP address or hostname:

### Kubernetes StatefulSet

```yaml
env:
  - name: CROSSBAR_NODE_CLUSTER_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```

### Docker Compose

```yaml
environment:
  CROSSBAR_NODE_CLUSTER_IP: crossbar_router_realm1  # hostname or IP
```

## Important Notes

1. **Hostnames supported**: The `cluster_ip` can be either an IP address or a resolvable hostname. Twisted's TCP client automatically resolves DNS names.

2. **Authentication-time update**: The `cluster_ip` update happens during authentication, not during session join. This is the critical difference: it doesn't depend on WAMP meta events.

3. **Backward compatibility**: Old key files with `cluster_ip` are still supported (`cluster_ip` in allowed_tags), but new key generation doesn't include it.

4. **No restart required**: When a router pod restarts with a new IP, the master database updates automatically on the next authentication. Proxy workers pick up the new IP from the database.

5. **Observability**: Log messages show when cluster IPs change:
   ```
   Node router_realm1 cluster IP changed from 10.108.1.12 to 10.108.3.17 - updating database during authentication
   ```

## Files Modified

- `crossbar/edge/node/management.py` - Send cluster_ip in authextra
- `crossbar/master/node/authenticator.py` - Update database during authentication
- `crossbar/master/mrealm/controller.py` - Removed redundant update logic
- `crossbar/common/key.py` - Removed cluster_ip from key generation
- `crossbar/master/node/controller.py` - Removed cluster_ip from auto-pairing

## Related Issues

- PR #2137: Resilient Proxy node and Router node management
- Kubernetes pod IP changes on rescheduling
- StatefulSet pod lifecycle management

BUGFIX-FORCE-REREGISTER.md

Lines changed: 220 additions & 0 deletions

# Force Re-registration Fix for Stale RLink Registrations

## Problem Summary

When router nodes disconnect and reconnect via RLink (router-to-router links), their previous procedure registrations become stale on the remote router. When the RLink reconnects and tries to re-register the same procedures, it receives `wamp.error.procedure_already_exists` errors, preventing the procedures from being available in the cluster.

**Symptoms:**

- RLink connections succeed but procedures don't get registered
- `procedure_already_exists` errors in logs during RLink registration
- Procedures unavailable on remote routers after RLink reconnection
- Manual cleanup required to restore functionality

## Root Cause

When an RLink session disconnects unexpectedly (network issue, pod restart, etc.), the remote router doesn't immediately clean up the registrations made by that RLink. When the RLink reconnects:

1. **Stale registrations remain**: The old registrations from the previous RLink session are still active
2. **Standard registration fails**: The new registration attempt gets a `procedure_already_exists` error
3. **No automatic cleanup**: Without `force_reregister`, there's no mechanism to replace stale registrations

## Solution

Implement an automatic retry with `force_reregister=True` when RLink encounters `procedure_already_exists` errors. This allows the new RLink session to forcefully replace stale registrations from previous sessions.

### Implementation

**Key Changes:**

1. **Dealer supports force_reregister** (`crossbar/router/dealer.py`)
   - Added `force_reregister` option to `Register` message handling
   - When `force_reregister=True`, kicks out all other observers before registering
   - Sends `UNREGISTERED` messages to kicked observers
   - Deletes and recreates observation if all observers were kicked

2. **RLink automatic retry logic** (`crossbar/worker/rlink.py`)
   - Preserves original `force_reregister` setting from registration details
   - First tries registration with original settings
   - On `procedure_already_exists` error:
     - If original didn't use `force_reregister`, retries with `force_reregister=True`
     - If original already used `force_reregister`, logs error (possible race condition)
   - Handles stale registrations from previous RLink connections

### Code Flow

```
RLink Session Connects

Forwards registrations from local router to remote router

First attempt: register(force_reregister=False)  # or original setting

├─ Success → Registration complete

└─ procedure_already_exists error

   Check if original used force_reregister

   If not, retry: register(force_reregister=True)

   Dealer kicks out stale observers (previous RLink session)

   Sends UNREGISTERED to stale sessions

   Deletes old observation, creates new one

   Registration succeeds with new RLink session
```

## Code Details

### Dealer Force Re-registration Logic

```python
if register.force_reregister and registration:
    # Kick out all other observers, but not the session doing the re-registration
    observers_to_kick = [obs for obs in registration.observers if obs != session]

    for obs in observers_to_kick:
        self._registration_map.drop_observer(obs, registration)
        kicked = message.Unregistered(
            0,
            registration=registration.id,
            reason="wamp.error.unregistered",
        )
        self._router.send(obs, kicked)

    # If we kicked out all observers, delete the observation so it can be recreated
    if observers_to_kick and len(registration.observers) == len(observers_to_kick):
        self._registration_map.delete_observation(registration)
```

### RLink Retry Logic

```python
# First try with original settings
try:
    reg = yield other.register(on_call,
                               uri,
                               options=RegisterOptions(
                                   details_arg='details',
                                   invoke=invoke,
                                   match=match,
                                   force_reregister=original_force_reregister,
                                   forward_for=forward_for,
                               ))
except ApplicationError as e:
    if e.error == 'wamp.error.procedure_already_exists':
        # If procedure already exists AND original didn't use force_reregister,
        # retry with force_reregister=True to replace stale registration.
        if not original_force_reregister:
            other_leg = 'local' if self.IS_REMOTE_LEG else 'remote'
            self.log.debug(
                f"procedure {uri} already exists on {other_leg}, "
                f"retrying with force_reregister=True")
            try:
                reg = yield other.register(on_call,
                                           uri,
                                           options=RegisterOptions(
                                               details_arg='details',
                                               invoke=invoke,
                                               match=match,
                                               force_reregister=True,
                                               forward_for=forward_for,
                                           ))
            except Exception as retry_e:
                self.log.error(f"failed to force-reregister {uri}: {retry_e}")
                return
```
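
For reference, the `force_reregister` flag used in the retry above is ordinary Autobahn client API and can be exercised from any component. A minimal, self-contained sketch (the router URL, realm and procedure URI are made up for illustration):

```python
from twisted.internet.defer import inlineCallbacks
from autobahn.twisted.wamp import ApplicationSession, ApplicationRunner
from autobahn.wamp.types import RegisterOptions


class ExampleComponent(ApplicationSession):

    @inlineCallbacks
    def onJoin(self, details):
        def ping():
            return 'pong'

        # force_reregister=True asks the dealer to replace any existing
        # registration for this URI instead of failing with
        # wamp.error.procedure_already_exists.
        yield self.register(ping, 'com.example.ping',
                            options=RegisterOptions(force_reregister=True))


if __name__ == '__main__':
    ApplicationRunner('ws://localhost:8080/ws', 'realm1').run(ExampleComponent)
```
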
## Testing

### Local Testing

```bash
# Start cluster with router and RLink
docker-compose up -d

# Check RLink connection
docker logs crossbar_router_realm1 2>&1 | grep -i rlink

# Verify procedure registrations
docker logs crossbar_router_realm1 2>&1 | grep "forward-register"

# Simulate disconnect/reconnect
docker restart crossbar_router_realm1

# Check for force_reregister retry messages
docker logs crossbar_router_realm1 2>&1 | grep "retrying with force_reregister=True"

# Verify procedures are available
# Test RPC calls to procedures
```

### Kubernetes Testing

```bash
# Check RLink status
kubectl logs crossbar-router-realm1-sfs-0 | grep rlink

# Delete pod to simulate reconnection
kubectl delete pod crossbar-router-realm1-sfs-0

# Watch for reconnection and registration
kubectl logs -f crossbar-router-realm1-sfs-0 | grep -E "rlink|force_reregister|procedure_already_exists"

# Verify procedures registered successfully
kubectl logs crossbar-router-realm1-sfs-0 | grep "forward-register.*success"
```

## Edge Cases Handled

1. **Session already registered**: If the current session is already registered for the procedure, it won't kick itself out
2. **Original force_reregister=True**: If the original registration already used `force_reregister`, a conflict indicates a race condition or multiple RLinks
3. **All observers kicked**: If all observers are removed, the observation is deleted and recreated cleanly
4. **Retry failure**: If the retry with `force_reregister=True` also fails, the error is logged and the registration is abandoned

## Important Notes

1. **Automatic cleanup**: Stale registrations are automatically replaced without manual intervention
2. **Session preservation**: The current session won't kick itself out if it's already registered
3. **Non-destructive**: If the original registration used `force_reregister=True`, we don't retry to avoid loops
4. **Backward compatible**: Existing code without `force_reregister` continues to work normally
5. **RLink-specific**: This primarily benefits RLink (router-to-router) connections where stale registrations are common

## Observability

Log messages to watch for:

### Successful force re-registration:
```
procedure com.example.procedure already exists on remote, retrying with force_reregister=True
```

### Force re-registration conflict (race condition):
```
procedure com.example.procedure already exists even though we used force_reregister=True.
Race condition or multiple rlinks?
```

### Observer kicked out:
```
UNREGISTERED message sent to session (kicked by force_reregister)
```

## Files Modified

- `crossbar/router/dealer.py` - Added force_reregister handling in `processRegister`
- `crossbar/worker/rlink.py` - Added automatic retry with force_reregister on conflict

## Related Issues

- PR #2137: Resilient Proxy node and Router node management
- RLink session lifecycle management
- Stale registration cleanup
- Router cluster resilience

## Migration Notes

No migration required. The fix is backward compatible:

- Existing registrations continue to work normally
- Only activates on `procedure_already_exists` errors
- Original registration behavior preserved for non-RLink sessions
