Multihomed support for client socket in snet by juagargi · Pull Request #53 · juagargi/scion

juagargi · 2025-09-04T13:30:15Z

This PR adds a basic multihomed support for sockets created using snet.
It allows to create SCION snet sockets without binding them to a local address.
The changes should not affect any previous code that created sockets binding them to a local address.
Two use cases are relevant, and used as examples:

A client opens a socket to send a message to a given remote, not needing to bind the socket to any interface/address in particular.
A server opens a socket to listen to any connection, not bound to any interface/address in particular.

In the first case, the local address is found via (a syscall to the kernel) creating a regular UDP socket with the specified destination, letting the kernel check the routing table and deciding what local IP address the socket will have.

In the second case, the packet received contains the last hop, which is used to create a regular UDP socket, similar to the case above. This time, the socket is not bound to the local address, and the local address is used only to allow typical constructs in the server such as

// Read data from the client.
n, remoteAddr, err := conn.ReadFrom(buff)
...
// Reply to the client.
n, err = conn.WriteTo([]byte(buff), remoteAddr)

In both cases no network traffic is created, and only the syscalls are needed to communicate with the kernel.

Additionally, there is a mechanism that reduces the amount of syscalls needed to open an unbound snet socket, and it is based on caching previous syscalls to the kernel.
If we denote outbound(remote) the function that returns the local IP address for a given remote address (the next hop of a SCION path), we keep a map of these results. If the cached map doesn't contain an entry for a given remote, the regular outbound(remote) is executed, thus issuing syscalls.

The cache is valid only for as long as the local addresses do not change, or there is a change in the routing table. We assume that a change of the routing table will be always accompanied by a change on the local interfaces' configuration, be it in the form of their local addresses, or creation/removal of some of these interfaces. If such an event is detected, the whole cache is cleared.

The detection of the changes should be done via e.g. netlink, but this PR is not using it, instead relying on the much simpler approach of querying the local interface state every T seconds, where T is defined here as 1 second.
This implies that in the case of a change of interface local address, or creation/deletion of an interface, there could be traffic being sent using a wrong local address (for at most T seconds).

This PR enables binding of SCION sockets to unspecified IP, with the following uncovered cases (i.e., clients/servers that rely on this feature may lose connectivity, drop packets):

Routing changes on the endhost that do not imply interface changes.
In case of a change of interface local address, or creation/deletion of an interface, there could be traffic being sent using a wrong local address (for at most T seconds).

In order to tackle this shortcoming one should implement a different solution using e.g. netlink.

In the case of using a dispatcher to obtain or send the packets, although the snet socket would NOT require to be unbound to any local address, this PR should not introduce any visible effects.

This change is

Via the creation of a UDP socket with destination the next hop, those snet sockets that were created without binding them to a local address, can now be used regularly. There is a additional cache that allows skipping the UDP socket creation if the address to dial to has been previously seen. There is a recurrent refresh function running in the background that checks that the host interfaces' local addresses didn't change, and if they did, clears the cache.

…ate getter.

JordiSubira

Currently this change tackles routing changes that imply a change in the interface/address list, i.e., the 2 cases you describe at interfaces.go. You should explicitly mention that this change does not cover routing changes on the endhost that do not imply interface changes. In other words, the source address on the SCION packet may have a wrong value if this ever happens.

Just to clarify the scenario I am think of:

Interface 0 with IP address A.A.A.A
Interface 1 with IP address B.B.B.B
Destination X.X.X.X
Initially our routing table uses interface A.A.A.A:

default via A.A.A.A dev eth0 proto dhcp metric 100 
default via B.B.B.B dev eth1 proto dhcp metric 200

Due to some network change due to a dynamic protocol or manual configuration the endhost routing table gets updated to

default via A.A.A.A dev eth0 proto dhcp metric 100 
default via B.B.B.B dev eth1 proto dhcp metric 200
X.X.X.X/16 via B.B.B.B dev eth1

However, the internal cache doesn't get updated based on the description. Similarly, I haven't checked whether the code handles the fact that the interface is down although still appears on the list with the same assigned IP address.

Perhaps this is not relevant for the use case this PR is trying to initially support, but it is important to mention in case this PR is ever used as referenced to merge in scionproto.

@JordiSubira reviewed 8 of 14 files at r1.
Reviewable status: 8 of 14 files reviewed, 6 unresolved discussions

pkg/snet/multihomed/outbound_addr.go line 26 at r1 (raw file):

// It relies on a previously populated table that maps remote addresses to egress addresses.
// If the remote is not present, it is added.
func OutboundIP(raddr *net.UDPAddr) (net.IP, error) {

Nit: change parameter name to nextHop normally we use raddr as endhost remote address.

pkg/snet/multihomed/outbound_addr.go line 31 at r1 (raw file):

	defer muRemoteToEgress.RLocker().Unlock()

	remote, _ := netip.AddrFromSlice(raddr.IP)

It will panic if raddr is nil. I guess most on the inputs are constructed properly by previous library calls, but I am normally in favor of handling this returning an error than just panicking...

pkg/snet/multihomed/outbound_addr.go line 57 at r1 (raw file):

	}
	return eg, nil
}

The lock logic here seems a bit contrived, could we simplify for sth like this:

func OutboundIP(raddr *net.UDPAddr) (net.IP, error) {
    remote, _ := netip.AddrFromSlice(raddr.IP)
    
    muRemoteToEgress.RLock()
    egress, ok := remoteToEgress[remote]
    muRemoteToEgress.RUnlock()
    
    if ok {
        return net.IP(egress.AsSlice()), nil
    }

    eg, err := dialRemote(raddr)
    if err != nil {
        return nil, err
    }
    
    egress, _ = netip.AddrFromSlice(eg)
    muRemoteToEgress.Lock()
    if len(remoteToEgress) < MaxAllowedCacheSize {
        remoteToEgress[remote] = egress
    }
    muRemoteToEgress.Unlock()
    
    return eg, nil
}
``

Code quote:

	// Check if the table contains an entry.
	muRemoteToEgress.RLock()
	defer muRemoteToEgress.RLocker().Unlock()

	remote, _ := netip.AddrFromSlice(raddr.IP)
	egress, ok := remoteToEgress[remote]
	if ok {
		return net.IP(egress.AsSlice()), nil
	}

	// Not found, find it and add it. The dialing involves a syscall, but no network traffic.
	eg, err := dialRemote(raddr)
	if err != nil {
		return nil, err
	}
	egress, _ = netip.AddrFromSlice(eg)

	// Check if our cache is not too big already.
	if len(remoteToEgress) < MaxAllowedCacheSize {
		// Beware of the RWMutex, it's read-locked already.
		muRemoteToEgress.RUnlock()
		muRemoteToEgress.Lock()
		// Check again to avoid race conditions. Could skipped it if exact size is not important.
		if len(remoteToEgress) < MaxAllowedCacheSize {
			remoteToEgress[remote] = egress
		}
		muRemoteToEgress.Unlock()
		muRemoteToEgress.RLock() // because we always read-unlock in this function.
	}
	return eg, nil
}

pkg/snet/conn.go line 79 at r1 (raw file):

	// If acting as a client, find the local address of the interface used to reach
	// the NextHop toward the remote.
	if local.Host.IP.IsUnspecified() && o.remote != nil {

Keep the non-nil check on local.host as a safeguard to panic.

pkg/snet/conn.go line 87 at r1 (raw file):

				"remote", o.remote)
		}
		local.Host.IP = localIP

Perhaps I am not understanding this completely, but it seems that you fix the local address upon creation for clients. What happens for long-lived connections in which the nextHop address changes. One can use snet.WriteTo(msg, remote) por a remote that has changed the path, for instance what pan.Conn does. In this case, the local.Host.IP for the client would need to change, wouldn't it?

pkg/snet/reader.go line 96 at r1 (raw file):

	// we would like to replace this logic (e.g., using IP_PKTINFO, with its caveats).
	pktAddrPort := netip.AddrPortFrom(pkt.Destination.Host.IP(), udp.DstPort)
	if c.local.IA != pkt.Destination.IA ||

I am a bit reluctant on just removing, this check. The reason is that we decouple the underlay from the SCION layer completely. We should clearly document this for applications so that library users understand the implications. More concretely, that SCION_SRC_IP/PORT in the packet doesn't correspond to the underlay for which the packet was received.

juagargi

Correct, I thought it was clear in the PR description, but I found no documentation on the code itself.
Now I have added some sentences in the NewCookedConn documentation clarifying that the returned Conn reacts to routing changes only if they come with a change in the local network interface set.

Reviewable status: 6 of 14 files reviewed, 5 unresolved discussions (waiting on @JordiSubira)

pkg/snet/conn.go line 79 at r1 (raw file):

Previously, JordiSubira wrote…

Keep the non-nil check on local.host as a safeguard to panic.

We have a type assertion right above setting local.Host to a *net.UDPAddr, AFAIK it will already panic or set local.Host to a non nil value, is that correct? If so, do you want me to check pconn.LocalAddr before is type asserted?

pkg/snet/conn.go line 87 at r1 (raw file):

Previously, JordiSubira wrote…

Perhaps I am not understanding this completely, but it seems that you fix the local address upon creation for clients. What happens for long-lived connections in which the nextHop address changes. One can use snet.WriteTo(msg, remote) por a remote that has changed the path, for instance what pan.Conn does. In this case, the local.Host.IP for the client would need to change, wouldn't it?

What you say is correct, I think this code may not be very robust: when the connection is created via NewCookedConn the socket gets the local IP to that that reaches the next hop at this moment, but after some time, the next hop maybe different. In this case, the local IP maybe different too.
Since PacketConn offers WriteTo with the address of the next hop specified as a parameter, the construction of the socket in NewCookedConn should not attempt to get the local IP address at that moment in time, but always leave it to the WriteTo logic, where the next hop address is known.

pkg/snet/reader.go line 96 at r1 (raw file):

Previously, JordiSubira wrote…

I am a bit reluctant on just removing, this check. The reason is that we decouple the underlay from the SCION layer completely. We should clearly document this for applications so that library users understand the implications. More concretely, that SCION_SRC_IP/PORT in the packet doesn't correspond to the underlay for which the packet was received.

Understood. My rationale for removing it is to decouple routing/packet delivery from the snet library if possible, i.e. "if the packet arrived here, there must have been a reason for it to arrive". Please let me know if you think this is not appropriate at this stage of the scionproto development and on the other hand if this seems okay, where I should write some clarification on how UDP packets are used as underlay for SCION packets.

pkg/snet/multihomed/outbound_addr.go line 31 at r1 (raw file):

Previously, JordiSubira wrote…

It will panic if raddr is nil. I guess most on the inputs are constructed properly by previous library calls, but I am normally in favor of handling this returning an error than just panicking...

Done.

pkg/snet/multihomed/outbound_addr.go line 57 at r1 (raw file):

Previously, JordiSubira wrote…

The lock logic here seems a bit contrived, could we simplify for sth like this:

func OutboundIP(raddr *net.UDPAddr) (net.IP, error) {
    remote, _ := netip.AddrFromSlice(raddr.IP)
    
    muRemoteToEgress.RLock()
    egress, ok := remoteToEgress[remote]
    muRemoteToEgress.RUnlock()
    
    if ok {
        return net.IP(egress.AsSlice()), nil
    }

    eg, err := dialRemote(raddr)
    if err != nil {
        return nil, err
    }
    
    egress, _ = netip.AddrFromSlice(eg)
    muRemoteToEgress.Lock()
    if len(remoteToEgress) < MaxAllowedCacheSize {
        remoteToEgress[remote] = egress
    }
    muRemoteToEgress.Unlock()
    
    return eg, nil
}
``

Correct, your version is not only more elegant, is also locking less amount of time. I'm using it.

JordiSubira

To me it was not clear on the description. Aside from the in-line comment to the Conn constructor, can you add in the PR description:

This PR enables binding of SCION sockets to unspecified IP, with the following uncovered cases (i.e., clients/servers that rely on this feature may lose connectivity, drop packets):

Routing changes on the endhost that do not imply interface changes.

In case of a change of interface local address, or creation/deletion of an interface, there could be traffic being sent using a wrong local address (for at most T seconds).

In order to tackle this shortcoming one should implement a different solution using e.g. netlink.

Reviewable status: 6 of 14 files reviewed, 3 unresolved discussions (waiting on @juagargi)

pkg/snet/conn.go line 79 at r1 (raw file):

Previously, juagargi (Juan A. Garcia Pardo) wrote…

We have a type assertion right above setting local.Host to a *net.UDPAddr, AFAIK it will already panic or set local.Host to a non nil value, is that correct? If so, do you want me to check pconn.LocalAddr before is type asserted?

Yes, you are right I missed that one, don't think it needs to be checked then.

pkg/snet/reader.go line 96 at r1 (raw file):

Previously, juagargi (Juan A. Garcia Pardo) wrote…

Understood. My rationale for removing it is to decouple routing/packet delivery from the snet library if possible, i.e. "if the packet arrived here, there must have been a reason for it to arrive". Please let me know if you think this is not appropriate at this stage of the scionproto development and on the other hand if this seems okay, where I should write some clarification on how UDP packets are used as underlay for SCION packets.

If we are going to allow this, the check will even be better at snet.OpenRaw().

This is not about the packet arriving, is about the information on the packet being meaningful, i.e., the src address on the SCION layer being decoupled from the underlay. So far, this wasn't the case for snet (we didn't allow for Unspecified address, not even before the dispatcher-less change).

This may have implications, not all of them completely obvious perhaps at the moment, but I can try to come up with an "artifact" example: A SCION application that listens on 0.0.0.0 wants to display how many packets is receiving per interface. One could think that I can use pkt.Destination.Host (populated at snet.PacketConn.ReadFrom(pkt)) and infer from that address the interface for which I am receiving packets. This is not correct, because one could send a SCION packet to the underlay X.X.X.X with SCION destination address Y.Y.Y.Y. Of course in this model, "the attacker" is local (meaning within the AS).

I am not saying we have to care about this "misuse case" which as I stated is a bit of an artifact, what I am saying is that we ever have this change on production, this has to be documented. My suggestion is even having it documented at snet.OpenRaw() because the example would be relevant for applications that read SCIONPackets and not only the payload.

pkg/snet/multihomed/outbound_addr.go line 57 at r1 (raw file):

Previously, juagargi (Juan A. Garcia Pardo) wrote…

Correct, your version is not only more elegant, is also locking less amount of time. I'm using it.

Sorry, why is it necessary the size check outside? Now you don't have a read lock. In any case, the write lock that you acquire before the second check already locks for the right size before trying to write in it, doesn't it?

Code snippet:

	if len(remoteToEgress) < MaxAllowedCacheSize {
		// Beware of the RWMutex, it's read-locked already.
		muRemoteToEgress.Lock()
		// Check again to avoid race conditions. Could skipped it if exact size is not important.
		if len(remoteToEgress) < MaxAllowedCacheSize {
			remoteToEgress[remote] = egress
		}
		muRemoteToEgress.Unlock()
	}

pkg/snet/snet.go line 95 at r2 (raw file):

// OpenRaw returns a PacketConn which listens on the specified address.
// Nil or unspecified addresses are not supported.

Remove this comment

juagargi

Done. I've added your suggestion after the paragraph that starts with

The detection of the changes should be done ...

Reviewable status: 6 of 14 files reviewed, 3 unresolved discussions (waiting on @JordiSubira and @marcfrei)

pkg/snet/reader.go line 96 at r1 (raw file):

Previously, JordiSubira wrote…

If we are going to allow this, the check will even be better at snet.OpenRaw().

This is not about the packet arriving, is about the information on the packet being meaningful, i.e., the src address on the SCION layer being decoupled from the underlay. So far, this wasn't the case for snet (we didn't allow for Unspecified address, not even before the dispatcher-less change).

This may have implications, not all of them completely obvious perhaps at the moment, but I can try to come up with an "artifact" example: A SCION application that listens on 0.0.0.0 wants to display how many packets is receiving per interface. One could think that I can use pkt.Destination.Host (populated at snet.PacketConn.ReadFrom(pkt)) and infer from that address the interface for which I am receiving packets. This is not correct, because one could send a SCION packet to the underlay X.X.X.X with SCION destination address Y.Y.Y.Y. Of course in this model, "the attacker" is local (meaning within the AS).

I am not saying we have to care about this "misuse case" which as I stated is a bit of an artifact, what I am saying is that we ever have this change on production, this has to be documented. My suggestion is even having it documented at snet.OpenRaw() because the example would be relevant for applications that read SCIONPackets and not only the payload.

It makes sense to disallow cases like the one you mentioned.
I think the proper way would be t have a more solid mechanism of delivery (e.g. using the Linux kernel with a SCION socket, where I could bind my socket to my ISD-AS but no IP in particular for example, forcing reception of only SCION packets with destination my IA), instead of checking in snet the reasons of a correct delivery of the packet. But we don't have that delivery mechanism yet.
So after talking to you offline, we have doubts on whether we should always perform the check that the scion host address is one of our local interfaces or not.
Tagging @marcfrei as he has been working with NAT, and this behavior affects NAT on the receiver side.

pkg/snet/snet.go line 95 at r2 (raw file):

Previously, JordiSubira wrote…

Remove this comment

Done.

pkg/snet/multihomed/outbound_addr.go line 57 at r1 (raw file):

Previously, JordiSubira wrote…

Sorry, why is it necessary the size check outside? Now you don't have a read lock. In any case, the write lock that you acquire before the second check already locks for the right size before trying to write in it, doesn't it?

Done.
I left it unintendedly because I'm a donkey. Now it locks to write, but only if the size is correct. If we consider that there is a huge number of concurrent threads, mostly likely to hit this code when the cache is already at MaxAllowedCacheSize, I could change the code to do a double lock:

Get the read lock.
Check the size, if the cache is already big enough, bail.
Because the size of the cache was not too big, release the read lock, acquire a write lock.
Check again to avoid races between other threads possibly adding to the cache between steps 2 and 3.
Increment cache, release write lock.

Let me know if you think performance of the locking is critical under these conditions, and I'll change the code to the proposed steps.

juagargi

Question @JordiSubira you may know: how can I create a test that always runs in a docker container, with certain setup (multiple local interfaces, etc), even spawning more than one container and network?

Reviewable status: 6 of 14 files reviewed, 3 unresolved discussions (waiting on @JordiSubira and @marcfrei)

marcfrei · 2025-09-17T07:42:34Z

As far as I can see, there's no negative interaction between this PR and the planned NAT address discovery (https://docs.scion.org/en/latest/dev/design/NAT-address-discovery.html). With NAT support, we would learn the SCION underlay source address via STUN, though. There will thus be no need to look up the outbound IP via multihomed.OutboundIP if STUN is used: https://github.com/jdslab/scion/blob/03154d48691cfc8cfd6d2cd3bc9af8a6b0cc0e35/demo/stun/test-client/main.go#L125

marcfrei · 2025-09-18T08:55:57Z

pkg/snet/multihomed/interfaces.go

+	remoteToEgress   map[netip.Addr]netip.Addr = make(map[netip.Addr]netip.Addr)
+	muRemoteToEgress sync.RWMutex              = sync.RWMutex{}


Wouldn't it be better to use sync.Map for this?

I just reviewed the new sync.Map that uses atomic, it is quite fast modulo the type assertion needed to get back the netip.Addr. All in full considered, it's >50% slower than a plain map and a r/w lock. I've added some benchmarks to the code to check this, please take a look.
These are my results:

goos: linux goarch: amd64 pkg: github.com/scionproto/scion/pkg/snet/multihomed cpu: Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz BenchmarkSyncMapWrites-8 2841009 522.4 ns/op BenchmarkSyncMapReads-8 4197412 257.4 ns/op BenchmarkMuMapWrites-8 4080162 332.0 ns/op BenchmarkMuMapReads-8 6577942 215.5 ns/op

Whenever sync.Map supports generic types (which is planned according to golang/go#71076) I would definitely use sync.Map instead, as its semantics are exactly the ones we need here and it's much less verbose than locking/unlocking every time.

marcfrei · 2025-09-18T08:57:31Z

pkg/snet/multihomed/outbound_addr.go

+	// Check if the table contains an entry.
+	muRemoteToEgress.RLock()
+	egress, ok := remoteToEgress[remote]
+	muRemoteToEgress.RLocker().Unlock()


muRemoteToEgress.RUnlock(), right?

marcfrei · 2025-09-18T08:59:44Z

pkg/snet/multihomed/outbound_addr.go

+
+	// Check if our cache is not too big already.
+	if len(remoteToEgress) < MaxAllowedCacheSize {
+		// Beware of the RWMutex, it's read-locked already.


Is this correct, didn't we unlock on line 37?

Done. It was a stale comment.

marcfrei · 2025-09-18T09:01:12Z

pkg/snet/multihomed/outbound_addr.go

+	egress, _ = netip.AddrFromSlice(eg)
+
+	// Check if our cache is not too big already.
+	if len(remoteToEgress) < MaxAllowedCacheSize {


Is this safe in the Go memory model (https://go.dev/ref/mem)? If yes, it would help me to have a comment explaining why. Do we need the optimization at all?

The code has now changed due to the very small performance optimization vs the introduced reasoning complexity. IIRC go test -race didn't complain, but let me answer the question anyway to check my sanity.
Is this safe? I think so, the rationale being that the first greater-than condition may be evaluated with a stale value, but it's okay because:

It will be evaluated again with the synchronized value if it's less than the threshold.

The len(remoteToEgress) value transitions only to equal or bigger, sequentially as time moves forward, regardless of the execution order of the threads (monotonicity). With this and the point above, we ensure that stale values are always checked when growing. If already equal to MaxAllowedCacheSize they are not synchronized, but it is not needed as they will never decrease (monotonicity). This is similar to a double checked lock.
I hope this makes sense. Please check with me if there are mistakes here, as I keep on using these ideas nowadays.

Addendum: I'm leaving the text above to expose my train of thought. But the second point, monotonicity, is not true: there is the case where size decreases, namely when calling invalidateAll. This could make a thread skip the insertion of the IP address value into the map, which although not being extremely harmful, is not what I explained above.

marcfrei · 2025-09-18T09:03:11Z

pkg/snet/multihomed/interfaces.go

+	// The local addresses are stored in an atomic pointer to allow tests to inspect the
+	// internal value of it without data races.
+	localAddresses = atomic.Pointer[[]netip.Addr]{}
+	ticker         = time.NewTicker(CheckInterfacesPeriod)


Could be local to continuousCheckInterfaces, couldn't it?

marcfrei · 2025-09-18T09:10:09Z

pkg/snet/multihomed/interfaces.go

+	// The local addresses are stored in an atomic pointer to allow tests to inspect the
+	// internal value of it without data races.


Is this added testability really worth the increased structural complexity?

Granted, it's horrible. I'm removing it, with the new changes this is not necessary.

marcfrei · 2025-09-18T09:16:54Z

pkg/snet/multihomed/interfaces.go

+	return addrs
+}
+
+func equalAddressList(a, b []netip.Addr) bool {


Use slices.Equal instead?

The atomic.Pointer seems not needed anymore, tests with -race work alright without warnings.

juagargi

Thanks Marc. If/when I open a PR against scionproto I will make sure to mention the interaction with NAT, so that whichever one gets merged second, can reuse the logic of finding out the egress IP addresses.
I have also added a comment in the OutboundIP public function referring to the interaction between NAT-STUN and OutboundIP

Reviewable status: 6 of 14 files reviewed, 10 unresolved discussions (waiting on @JordiSubira and @marcfrei)

juagargi · 2025-09-25T10:34:15Z

pkg/snet/multihomed/interfaces.go

+	remoteToEgress   map[netip.Addr]netip.Addr = make(map[netip.Addr]netip.Addr)
+	muRemoteToEgress sync.RWMutex              = sync.RWMutex{}


I just reviewed the new sync.Map that uses atomic, it is quite fast modulo the type assertion needed to get back the netip.Addr. All in full considered, it's >50% slower than a plain map and a r/w lock. I've added some benchmarks to the code to check this, please take a look.
These are my results:

goos: linux goarch: amd64 pkg: github.com/scionproto/scion/pkg/snet/multihomed cpu: Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz BenchmarkSyncMapWrites-8 2841009 522.4 ns/op BenchmarkSyncMapReads-8 4197412 257.4 ns/op BenchmarkMuMapWrites-8 4080162 332.0 ns/op BenchmarkMuMapReads-8 6577942 215.5 ns/op

Whenever sync.Map supports generic types (which is planned according to golang/go#71076) I would definitely use sync.Map instead, as its semantics are exactly the ones we need here and it's much less verbose than locking/unlocking every time.

juagargi · 2025-09-25T10:34:16Z

pkg/snet/multihomed/interfaces.go

+	// The local addresses are stored in an atomic pointer to allow tests to inspect the
+	// internal value of it without data races.


Granted, it's horrible. I'm removing it, with the new changes this is not necessary.

juagargi · 2025-09-25T10:34:16Z

pkg/snet/multihomed/interfaces.go

+	// The local addresses are stored in an atomic pointer to allow tests to inspect the
+	// internal value of it without data races.
+	localAddresses = atomic.Pointer[[]netip.Addr]{}
+	ticker         = time.NewTicker(CheckInterfacesPeriod)


juagargi · 2025-09-25T10:34:16Z

pkg/snet/multihomed/interfaces.go

+	return addrs
+}
+
+func equalAddressList(a, b []netip.Addr) bool {


juagargi · 2025-09-25T10:34:17Z

pkg/snet/multihomed/outbound_addr.go

+	// Check if the table contains an entry.
+	muRemoteToEgress.RLock()
+	egress, ok := remoteToEgress[remote]
+	muRemoteToEgress.RLocker().Unlock()


juagargi · 2025-09-25T10:34:17Z

pkg/snet/multihomed/outbound_addr.go

+	egress, _ = netip.AddrFromSlice(eg)
+
+	// Check if our cache is not too big already.
+	if len(remoteToEgress) < MaxAllowedCacheSize {


The code has now changed due to the very small performance optimization vs the introduced reasoning complexity. IIRC go test -race didn't complain, but let me answer the question anyway to check my sanity.
Is this safe? I think so, the rationale being that the first greater-than condition may be evaluated with a stale value, but it's okay because:

It will be evaluated again with the synchronized value if it's less than the threshold.

The len(remoteToEgress) value transitions only to equal or bigger, sequentially as time moves forward, regardless of the execution order of the threads (monotonicity). With this and the point above, we ensure that stale values are always checked when growing. If already equal to MaxAllowedCacheSize they are not synchronized, but it is not needed as they will never decrease (monotonicity). This is similar to a double checked lock.
I hope this makes sense. Please check with me if there are mistakes here, as I keep on using these ideas nowadays.

Addendum: I'm leaving the text above to expose my train of thought. But the second point, monotonicity, is not true: there is the case where size decreases, namely when calling invalidateAll. This could make a thread skip the insertion of the IP address value into the map, which although not being extremely harmful, is not what I explained above.

juagargi · 2025-09-25T10:34:18Z

pkg/snet/multihomed/outbound_addr.go

+
+	// Check if our cache is not too big already.
+	if len(remoteToEgress) < MaxAllowedCacheSize {
+		// Beware of the RWMutex, it's read-locked already.


Done. It was a stale comment.

juagargi added 9 commits July 11, 2025 11:40

New test using tiny topo.

05f3e2e

Refactor test.

4130d2a

First phase of multihoming test.

22188e0

Temporary changes.

556ae83

Limit cache size.

24bcfff

Make multihomed tests more robust to concurrent running.

4c82100

Fix bazel files.

0e172fb

Fix gateway/dataplane TestNoLeak by allowing to stop the continous st…

e9c911b

…ate getter.

juagargi mentioned this pull request Sep 4, 2025

snet.Conn support for roaming use cases netsec-ethz/scion#175

Open

JordiSubira reviewed Sep 8, 2025

View reviewed changes

juagargi added 2 commits September 10, 2025 12:04

Do not obtain local address at Conn creation, do it at WriteTo.

0afc6b5

More efficient & clear mutex locking in OutboundIP.

0aced0a

juagargi commented Sep 10, 2025

View reviewed changes

JordiSubira reviewed Sep 10, 2025

View reviewed changes

juagargi added 2 commits September 10, 2025 16:25

Remove no longer valid comment.

a870b4c

Remove not needed double locking of the cache in OutboundIP.

4c97ec4

juagargi requested a review from JordiSubira September 10, 2025 15:30

juagargi commented Sep 10, 2025

View reviewed changes

marcfrei reviewed Sep 18, 2025

View reviewed changes

juagargi added 7 commits September 25, 2025 09:42

The ticker object is local to continuousCheckInterfaces.

7f7d90f

Add benchmarks measuring sync.Map and RWMutex+map.

79eb9f3

Use slices.Equal instead of custom function.

ee6080e

Use RUnlock() instead of RLocker().Unlock().

c5e99ee

Fix comment.

d1d931c

Simplify storage of localAddresses to just a slice.

d6fa8de

The atomic.Pointer seems not needed anymore, tests with -race work alright without warnings.

Add comment clarifying interactions with NAT+STUN.

ad12ede

juagargi commented Sep 25, 2025

View reviewed changes

		remoteToEgress map[netip.Addr]netip.Addr = make(map[netip.Addr]netip.Addr)
		muRemoteToEgress sync.RWMutex = sync.RWMutex{}

		// The local addresses are stored in an atomic pointer to allow tests to inspect the
		// internal value of it without data races.

Conversation

juagargi commented Sep 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JordiSubira left a comment

Choose a reason for hiding this comment

Uh oh!

juagargi left a comment

Choose a reason for hiding this comment

Uh oh!

JordiSubira left a comment

Choose a reason for hiding this comment

Uh oh!

juagargi left a comment

Choose a reason for hiding this comment

Uh oh!

juagargi left a comment

Choose a reason for hiding this comment

Uh oh!

marcfrei commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

marcfrei Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juagargi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

juagargi commented Sep 4, 2025 •

edited

Loading

marcfrei commented Sep 17, 2025 •

edited

Loading

marcfrei Sep 18, 2025 •

edited

Loading