
Commit 61fad68

wdebruij authored and davem330 committed
net/packet: tpacket_rcv: avoid a producer race condition
PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This is particularly likely with few, large, slots (e.g., GSO packets).

Synchronize kernel thread ownership of rx ring slots with a bitmap.

Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL while holding the sk receive queue lock. They release this lock before copying and set tp_status to TP_STATUS_USER to release to userspace when done. During copying, another writer may take the lock, also see TP_STATUS_KERNEL, and start writing to the same slot.

Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a slot, test and set with the lock held. To release race-free, update tp_status and the owner bit as a transaction, so take the lock again.

This is one of a variety of discussed options (see Link below):

* instead of a shadow ring, embed the data in the slot itself, such as in tp_padding. But any test for this field may match a value left by userspace, causing deadlock.

* avoid the lock on release. This leaves a small race if releasing the shadow slot before setting TP_STATUS_USER. The reproducer below showed that this race is not academic. If releasing the slot after tp_status, the race is more subtle. See the first link for details.

* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store of two fields. But legacy applications may interpret all non-zero tp_status as owned by the user, as libpcap does. So this is possible only opt-in by newer processes. It can be added as an optional mode.

* embed the struct at the tail of pg_vec to avoid extra allocation. The implementation proved no less complex than a separate field.

The additional locking cost on release adds contention, no different than scaling on multicore or multiqueue h/w. In practice, neither the reproducer below nor small-packet tcpdump showed a noticeable change in cycles spent in spinlock in perf report.

Where contention is problematic, packet sockets support mitigation through PACKET_FANOUT. And we can consider adding the opt-in state TP_KERNEL_OWNED.

Easy to reproduce by running multiple netperf or similar TCP_STREAM flows concurrently with `tcpdump -B 129 -n greater 60000`.

Based on an earlier patchset by Jon Rosen. See links below.

I believe this issue goes back to the introduction of tpacket_rcv, which predates git history.

Link: https://www.mail-archive.com/[email protected]/msg237222.html
Suggested-by: Jon Rosen <[email protected]>
Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Jon Rosen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
1 parent e1f8f78 commit 61fad68

File tree

2 files changed (+25 −1 lines changed)


net/packet/af_packet.c

Lines changed: 21 additions & 0 deletions
@@ -2173,6 +2173,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	struct timespec64 ts;
 	__u32 ts_status;
 	bool is_drop_n_account = false;
+	unsigned int slot_id = 0;
 	bool do_vnet = false;
 
 	/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
@@ -2275,6 +2276,13 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (!h.raw)
 		goto drop_n_account;
 
+	if (po->tp_version <= TPACKET_V2) {
+		slot_id = po->rx_ring.head;
+		if (test_bit(slot_id, po->rx_ring.rx_owner_map))
+			goto drop_n_account;
+		__set_bit(slot_id, po->rx_ring.rx_owner_map);
+	}
+
 	if (do_vnet &&
 	    virtio_net_hdr_from_skb(skb, h.raw + macoff -
 				    sizeof(struct virtio_net_hdr),
@@ -2380,7 +2388,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 #endif
 
 	if (po->tp_version <= TPACKET_V2) {
+		spin_lock(&sk->sk_receive_queue.lock);
 		__packet_set_status(po, h.raw, status);
+		__clear_bit(slot_id, po->rx_ring.rx_owner_map);
+		spin_unlock(&sk->sk_receive_queue.lock);
 		sk->sk_data_ready(sk);
 	} else {
 		prb_clear_blk_fill_status(&po->rx_ring);
@@ -4277,6 +4288,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 {
 	struct pgv *pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
+	unsigned long *rx_owner_map = NULL;
 	int was_running, order = 0;
 	struct packet_ring_buffer *rb;
 	struct sk_buff_head *rb_queue;
@@ -4362,6 +4374,12 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 			}
 			break;
 		default:
+			if (!tx_ring) {
+				rx_owner_map = bitmap_alloc(req->tp_frame_nr,
+					GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO);
+				if (!rx_owner_map)
+					goto out_free_pg_vec;
+			}
 			break;
 		}
 	}
@@ -4391,6 +4409,8 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		err = 0;
 		spin_lock_bh(&rb_queue->lock);
 		swap(rb->pg_vec, pg_vec);
+		if (po->tp_version <= TPACKET_V2)
+			swap(rb->rx_owner_map, rx_owner_map);
 		rb->frame_max = (req->tp_frame_nr - 1);
 		rb->head = 0;
 		rb->frame_size = req->tp_frame_size;
@@ -4422,6 +4442,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	}
 
 out_free_pg_vec:
+	bitmap_free(rx_owner_map);
 	if (pg_vec)
 		free_pg_vec(pg_vec, order, req->tp_block_nr);
 out:

net/packet/internal.h

Lines changed: 4 additions & 1 deletion
@@ -70,7 +70,10 @@ struct packet_ring_buffer {
 
 	unsigned int __percpu *pending_refcnt;
 
-	struct tpacket_kbdq_core prb_bdqc;
+	union {
+		unsigned long *rx_owner_map;
+		struct tpacket_kbdq_core prb_bdqc;
+	};
 };
 
 extern struct mutex fanout_mutex;
