Skip to content

Commit 03a000b

Browse files
committed
Merge branch 'nh-group-refcnt'
Nikolay Aleksandrov says: ==================== net: nexthop: fix refcount issues when replacing groups This set fixes a refcount bug when replacing nexthop groups and modifying routes. It is complex because the objects look valid when debugging memory dumps, but we end up having refcount dependency between unlinked objects which can never be released, so in turn they cannot free their resources and refcounts. The problem happens because we can have stale IPv6 per-cpu dsts in nexthops which were removed from a group. Even though the IPv6 gen is bumped, the dsts won't be released until traffic passes through them or the nexthop is freed, that can take arbitrarily long time, and even worse we can create a scenario[1] where it can never be released. The fix is to release the IPv6 per-cpu dsts of replaced nexthops after an RCU grace period so no new ones can be created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which is used by the nexthop code only when necessary. We can further optimize group replacement, but that is more suited for net-next as these patches would have to be backported to stable releases. v2: patch 02: update commit msg patch 03: check for mausezahn before testing and make a few comments more verbose [1] This info is also present in patch 02's commit message. Initial state: $ ip nexthop list id 200 via 2002:db8::2 dev bridge.10 scope link onlink id 201 via 2002:db8::3 dev bridge scope link onlink id 203 group 201/200 $ ip -6 route 2001:db8::10 nhid 203 metric 1024 pref medium nexthop via 2002:db8::3 dev bridge weight 1 onlink nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink Create rt6_info through one of the multipath legs, e.g.: $ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10 (pkt_inj is just a custom packet generator, nothing special) Then remove that leg from the group by replace (let's assume it is id 200 in this case): $ ip nexthop replace id 203 group 201 Now remove the IPv6 route: $ ip -6 route del 2001:db8::10/128 The route won't be really deleted due to the stale rt6_info holding 1 refcnt in nexthop id 200. At this point we have the following reference count dependency: (deleted) IPv6 route holds 1 reference over nhid 203 nh 203 holds 1 ref over id 201 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info Now to create circular dependency between nh 200 and the IPv6 route, and also to get a reference over nh 200, restore nhid 200 in the group: $ ip nexthop replace id 203 group 201/200 And now we have a permanent circular dependncy because nhid 203 holds a reference over nh 200 and 201, but the route holds a ref over nh 203 and is deleted. To trigger the bug just delete the group (nhid 203): $ ip nexthop del id 203 It won't really be deleted due to the IPv6 route dependency, and now we have 2 unlinked and deleted objects that reference each other: the group and the IPv6 route. Since the group drops the reference it holds over its entries at free time (i.e. its own refcount needs to drop to 0) that will never happen and we get a permanent ref on them, since one of the entries holds a reference over the IPv6 route it will also never be released. At this point the dependencies are: (deleted, only unlinked) IPv6 route holds reference over group nh 203 (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info This is the last point where it can be fixed by running traffic through nh 200, and specifically through the same CPU so the rt6_info (dst) will get released due to the IPv6 genid, that in turn will free the IPv6 route, which in turn will free the ref count over the group nh 203. If nh 200 is deleted at this point, it will never be released due to the ref from the unlinked group 203, it will only be unlinked: $ ip nexthop del id 200 $ ip nexthop $ Now we can never release that stale rt6_info, we have IPv6 route with ref over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with rt6_info (dst) with ref over the net device and the IPv6 route. All of these objects are only unlinked, and cannot be released, thus they can't release their ref counts. Message from syslogd@dev at Nov 19 14:04:10 ... kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 Message from syslogd@dev at Nov 19 14:04:20 ... kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 ==================== Signed-off-by: David S. Miller <[email protected]>
2 parents 4177d5b + 02ebe49 commit 03a000b

File tree

6 files changed

+108
-2
lines changed

6 files changed

+108
-2
lines changed

include/net/ip6_fib.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -485,6 +485,7 @@ int fib6_nh_init(struct net *net, struct fib6_nh *fib6_nh,
485485
struct fib6_config *cfg, gfp_t gfp_flags,
486486
struct netlink_ext_ack *extack);
487487
void fib6_nh_release(struct fib6_nh *fib6_nh);
488+
void fib6_nh_release_dsts(struct fib6_nh *fib6_nh);
488489

489490
int call_fib6_entry_notifiers(struct net *net,
490491
enum fib_event_type event_type,

include/net/ipv6_stubs.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,6 +47,7 @@ struct ipv6_stub {
4747
struct fib6_config *cfg, gfp_t gfp_flags,
4848
struct netlink_ext_ack *extack);
4949
void (*fib6_nh_release)(struct fib6_nh *fib6_nh);
50+
void (*fib6_nh_release_dsts)(struct fib6_nh *fib6_nh);
5051
void (*fib6_update_sernum)(struct net *net, struct fib6_info *rt);
5152
int (*ip6_del_rt)(struct net *net, struct fib6_info *rt, bool skip_notify);
5253
void (*fib6_rt_update)(struct net *net, struct fib6_info *rt,

net/ipv4/nexthop.c

Lines changed: 23 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1899,15 +1899,36 @@ static void remove_nexthop(struct net *net, struct nexthop *nh,
18991899
/* if any FIB entries reference this nexthop, any dst entries
19001900
* need to be regenerated
19011901
*/
1902-
static void nh_rt_cache_flush(struct net *net, struct nexthop *nh)
1902+
static void nh_rt_cache_flush(struct net *net, struct nexthop *nh,
1903+
struct nexthop *replaced_nh)
19031904
{
19041905
struct fib6_info *f6i;
1906+
struct nh_group *nhg;
1907+
int i;
19051908

19061909
if (!list_empty(&nh->fi_list))
19071910
rt_cache_flush(net);
19081911

19091912
list_for_each_entry(f6i, &nh->f6i_list, nh_list)
19101913
ipv6_stub->fib6_update_sernum(net, f6i);
1914+
1915+
/* if an IPv6 group was replaced, we have to release all old
1916+
* dsts to make sure all refcounts are released
1917+
*/
1918+
if (!replaced_nh->is_group)
1919+
return;
1920+
1921+
/* new dsts must use only the new nexthop group */
1922+
synchronize_net();
1923+
1924+
nhg = rtnl_dereference(replaced_nh->nh_grp);
1925+
for (i = 0; i < nhg->num_nh; i++) {
1926+
struct nh_grp_entry *nhge = &nhg->nh_entries[i];
1927+
struct nh_info *nhi = rtnl_dereference(nhge->nh->nh_info);
1928+
1929+
if (nhi->family == AF_INET6)
1930+
ipv6_stub->fib6_nh_release_dsts(&nhi->fib6_nh);
1931+
}
19111932
}
19121933

19131934
static int replace_nexthop_grp(struct net *net, struct nexthop *old,
@@ -2247,7 +2268,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
22472268
err = replace_nexthop_single(net, old, new, extack);
22482269

22492270
if (!err) {
2250-
nh_rt_cache_flush(net, old);
2271+
nh_rt_cache_flush(net, old, new);
22512272

22522273
__remove_nexthop(net, new, NULL);
22532274
nexthop_put(new);

net/ipv6/af_inet6.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1026,6 +1026,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
10261026
.ip6_mtu_from_fib6 = ip6_mtu_from_fib6,
10271027
.fib6_nh_init = fib6_nh_init,
10281028
.fib6_nh_release = fib6_nh_release,
1029+
.fib6_nh_release_dsts = fib6_nh_release_dsts,
10291030
.fib6_update_sernum = fib6_update_sernum_stub,
10301031
.fib6_rt_update = fib6_rt_update,
10311032
.ip6_del_rt = ip6_del_rt,

net/ipv6/route.c

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3680,6 +3680,25 @@ void fib6_nh_release(struct fib6_nh *fib6_nh)
36803680
fib_nh_common_release(&fib6_nh->nh_common);
36813681
}
36823682

3683+
void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
3684+
{
3685+
int cpu;
3686+
3687+
if (!fib6_nh->rt6i_pcpu)
3688+
return;
3689+
3690+
for_each_possible_cpu(cpu) {
3691+
struct rt6_info *pcpu_rt, **ppcpu_rt;
3692+
3693+
ppcpu_rt = per_cpu_ptr(fib6_nh->rt6i_pcpu, cpu);
3694+
pcpu_rt = xchg(ppcpu_rt, NULL);
3695+
if (pcpu_rt) {
3696+
dst_dev_put(&pcpu_rt->dst);
3697+
dst_release(&pcpu_rt->dst);
3698+
}
3699+
}
3700+
}
3701+
36833702
static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg,
36843703
gfp_t gfp_flags,
36853704
struct netlink_ext_ack *extack)

tools/testing/selftests/net/fib_nexthops.sh

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -629,6 +629,66 @@ ipv6_fcnal()
629629
log_test $? 0 "Nexthops removed on admin down"
630630
}
631631

632+
ipv6_grp_refs()
633+
{
634+
if [ ! -x "$(command -v mausezahn)" ]; then
635+
echo "SKIP: Could not run test; need mausezahn tool"
636+
return
637+
fi
638+
639+
run_cmd "$IP link set dev veth1 up"
640+
run_cmd "$IP link add veth1.10 link veth1 up type vlan id 10"
641+
run_cmd "$IP link add veth1.20 link veth1 up type vlan id 20"
642+
run_cmd "$IP -6 addr add 2001:db8:91::1/64 dev veth1.10"
643+
run_cmd "$IP -6 addr add 2001:db8:92::1/64 dev veth1.20"
644+
run_cmd "$IP -6 neigh add 2001:db8:91::2 lladdr 00:11:22:33:44:55 dev veth1.10"
645+
run_cmd "$IP -6 neigh add 2001:db8:92::2 lladdr 00:11:22:33:44:55 dev veth1.20"
646+
run_cmd "$IP nexthop add id 100 via 2001:db8:91::2 dev veth1.10"
647+
run_cmd "$IP nexthop add id 101 via 2001:db8:92::2 dev veth1.20"
648+
run_cmd "$IP nexthop add id 102 group 100"
649+
run_cmd "$IP route add 2001:db8:101::1/128 nhid 102"
650+
651+
# create per-cpu dsts through nh 100
652+
run_cmd "ip netns exec me mausezahn -6 veth1.10 -B 2001:db8:101::1 -A 2001:db8:91::1 -c 5 -t tcp "dp=1-1023, flags=syn" >/dev/null 2>&1"
653+
654+
# remove nh 100 from the group to delete the route potentially leaving
655+
# a stale per-cpu dst which holds a reference to the nexthop's net
656+
# device and to the IPv6 route
657+
run_cmd "$IP nexthop replace id 102 group 101"
658+
run_cmd "$IP route del 2001:db8:101::1/128"
659+
660+
# add both nexthops to the group so a reference is taken on them
661+
run_cmd "$IP nexthop replace id 102 group 100/101"
662+
663+
# if the bug described in commit "net: nexthop: release IPv6 per-cpu
664+
# dsts when replacing a nexthop group" exists at this point we have
665+
# an unlinked IPv6 route (but not freed due to stale dst) with a
666+
# reference over the group so we delete the group which will again
667+
# only unlink it due to the route reference
668+
run_cmd "$IP nexthop del id 102"
669+
670+
# delete the nexthop with stale dst, since we have an unlinked
671+
# group with a ref to it and an unlinked IPv6 route with ref to the
672+
# group, the nh will only be unlinked and not freed so the stale dst
673+
# remains forever and we get a net device refcount imbalance
674+
run_cmd "$IP nexthop del id 100"
675+
676+
# if a reference was lost this command will hang because the net device
677+
# cannot be removed
678+
timeout -s KILL 5 ip netns exec me ip link del veth1.10 >/dev/null 2>&1
679+
680+
# we can't cleanup if the command is hung trying to delete the netdev
681+
if [ $? -eq 137 ]; then
682+
return 1
683+
fi
684+
685+
# cleanup
686+
run_cmd "$IP link del veth1.20"
687+
run_cmd "$IP nexthop flush"
688+
689+
return 0
690+
}
691+
632692
ipv6_grp_fcnal()
633693
{
634694
local rc
@@ -734,6 +794,9 @@ ipv6_grp_fcnal()
734794

735795
run_cmd "$IP nexthop add id 108 group 31/24"
736796
log_test $? 2 "Nexthop group can not have a blackhole and another nexthop"
797+
798+
ipv6_grp_refs
799+
log_test $? 0 "Nexthop group replace refcounts"
737800
}
738801

739802
ipv6_res_grp_fcnal()

0 commit comments

Comments
 (0)