
CVE-2025-37797

Overview

First, we use a prefetch side channel to bypass KASLR. Then, we use the TOCTOU bug to activate an emptied hfsc class. This causes the hfsc qdisc to improperly track references to the class, allowing us to obtain a UAF on an hfsc class. We then perform a write-what-where to achieve ROP. For the Mitigation instance, we use the same exploit technique as in CVE-2025-37798.

Qdisc set-up

First, we set up the qdiscs as follows:

          1:0 (drr)
          /        \
         1:1        1:2
         /            \ 
      2:0 (hfsc)      10:0 (plug)
      /       \
     2:1      2:2
    /           \
3:0 (netem)     0:0 (default qdisc)
    |
   3:1
    |
4:0 (blackhole)

The drr qdisc exists merely so that we can attach a plug qdisc as a sibling of the hfsc subtree, ensuring the hfsc qdisc does not prematurely dequeue packets.

This is done in setup_tree(). Recall that the vulnerability arises when we add an FSC curve to an hfsc class. An hfsc class must be initialized with an RSC curve, an FSC curve, or both.

static int
hfsc_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
		  struct nlattr **tca, unsigned long *arg,
		  struct netlink_ext_ack *extack)
{
    // [...]
	if (rsc == NULL && fsc == NULL)
		return -EINVAL;

Our victim class will be hfsc class 2:1. Since we will subsequently add an FSC curve to it to trigger the vulnerability, it must initially be created with only an RSC curve. The other hfsc class 2:2 can be created with an FSC curve from the start.

Of particular interest is the netem - blackhole set-up. Our goal with this set-up is to successfully enqueue a packet into the netem qdisc, but for the dequeue routine to drop that same packet. To do so, we utilize the netem qdisc's delay functionality.

static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
			 struct sk_buff **to_free)
{
    // [...]
    	cb = netem_skb_cb(skb);
    	if (q->gap == 0 ||		/* not doing reordering */
	    q->counter < q->gap - 1 ||	/* inside last reordering gap */
	    q->reorder < get_crandom(&q->reorder_cor, &q->prng)) {
		u64 now;
		s64 delay;

		delay = tabledist(q->latency, q->jitter,
				  &q->delay_cor, &q->prng, q->delay_dist);                   // [1]

		now = ktime_get_ns();

        // [...]

		cb->time_to_send = now + delay;                                      // [2]
		++q->counter;
		tfifo_enqueue(skb, sch);                                             // [3]
        // [...]

At [1], a delay is calculated for the incoming packet. This delay is based on multiple properties, including the q->latency value supplied when creating the netem qdisc. At [2], the time to send is written to the socket buffer's control buffer (cb). At [3], the skb is enqueued into the netem qdisc's internal queue.

When the netem dequeue routine is called, the packet is only dequeued if its time_to_send has passed.

static struct sk_buff *netem_dequeue(struct Qdisc *sch)
{
	struct netem_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb;

tfifo_dequeue:
	skb = __qdisc_dequeue_head(&sch->q);
	if (skb) {
deliver:
		qdisc_qstats_backlog_dec(sch, skb);
		qdisc_bstats_update(sch, skb);
		return skb;
	}
	skb = netem_peek(q);                                                     // [4]
	if (skb) {
		u64 time_to_send;
		u64 now = ktime_get_ns();

		time_to_send = netem_skb_cb(skb)->time_to_send;                      // [5]
		if (q->slot.slot_next && q->slot.slot_next < time_to_send)
			get_slot_next(q, now);

		if (time_to_send <= now && q->slot.slot_next <= now) {               // [6]
			netem_erase_head(q, skb);
			q->t_len--;
			skb->next = NULL;
			skb->prev = NULL;
			skb->dev = qdisc_dev(sch);

			if (q->slot.slot_next) {
				q->slot.packets_left--;
				q->slot.bytes_left -= qdisc_pkt_len(skb);
				if (q->slot.packets_left <= 0 ||
				    q->slot.bytes_left <= 0)
					get_slot_next(q, now);
			}

			if (q->qdisc) {
				unsigned int pkt_len = qdisc_pkt_len(skb);
				struct sk_buff *to_free = NULL;
				int err;

				err = qdisc_enqueue(skb, q->qdisc, &to_free);                // [7]
				kfree_skb_list(to_free);
				if (err != NET_XMIT_SUCCESS) {
					if (net_xmit_drop_count(err))
						qdisc_qstats_drop(sch);
					sch->qstats.backlog -= pkt_len;
					sch->q.qlen--;
					qdisc_tree_reduce_backlog(sch, 1, pkt_len);
				}
				goto tfifo_dequeue;                                          // [8]
			}
			sch->q.qlen--;
			goto deliver;
		}
        // [...]

At [4], the packet at the front of the netem's internal queue is retrieved. At [5], its time_to_send is obtained, and it is checked against the current time at [6]. The dequeue proceeds only once time_to_send has passed. At [7], the packet is enqueued into the netem's child qdisc (in this case, the blackhole qdisc 4:0). The routine then jumps back at [8] to process the next queued packet, if any.
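As a sanity check on the timing logic, here is a minimal Python model of the enqueue/dequeue behaviour described above. The class and method names are ours, and the time units are arbitrary; the kernel keeps this state in netem_sched_data and the skb control buffer.

```python
import heapq

class NetemModel:
    """Toy model of netem's delayed queue: each packet carries a
    time_to_send deadline (cf. [2]) and is only released once that
    deadline has passed (cf. [6])."""
    def __init__(self, latency):
        self.latency = latency
        self.queue = []  # min-heap ordered by time_to_send

    def enqueue(self, pkt, now):
        # Models [1]/[2]/[3]: compute the deadline and queue the packet.
        heapq.heappush(self.queue, (now + self.latency, pkt))

    def dequeue(self, now):
        # Models [4]/[5]/[6]: peek at the head and release it only if
        # its time_to_send has passed; otherwise the packet stays queued.
        if self.queue and self.queue[0][0] <= now:
            return heapq.heappop(self.queue)[1]
        return None

q = NetemModel(latency=100)
q.enqueue("pkt", now=0)
assert q.dequeue(now=50) is None    # deadline not reached: packet stays
assert q.dequeue(now=150) == "pkt"  # deadline passed: packet is released
```

This is the property the exploit relies on: an early peek leaves the packet in place, while a later one drains it.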

This behaviour is important because of how netem_enqueue() is called from its parent qdisc, the hfsc qdisc 2:0.

static int
hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
{
    // [...]
	first = !cl->qdisc->q.qlen;
	err = qdisc_enqueue(skb, cl->qdisc, to_free);                            // [9]
	if (unlikely(err != NET_XMIT_SUCCESS)) {
		if (net_xmit_drop_count(err)) {
			cl->qstats.drops++;
			qdisc_qstats_drop(sch);
		}
		return err;
	}

	if (first) {
		if (cl->cl_flags & HFSC_RSC)
			init_ed(cl, len);
		if (cl->cl_flags & HFSC_FSC)
			init_vf(cl, len);
		if (cl->cl_flags & HFSC_RSC)
			cl->qdisc->ops->peek(cl->qdisc);                                 // [10]
	}
    // [...]

The parent enqueue routine enqueues the packet into the child qdisc at [9], which calls netem_enqueue(). Because this is the first packet inserted into the child qdisc, first is true. Subsequently, peek() is called on the netem qdisc at [10], because the hfsc class was created with the RSC flag. Recall from the discussion in vulnerability.md that it is exactly this peek call that triggers the dequeue call on the child qdisc. If this peek dequeues the packet, we cannot trigger the vulnerability in hfsc_change_class() later. This is why we add a delay to the netem packet: the first peek call must not dequeue it.

This also demonstrates why we need a plug qdisc 10:0 as a sibling to the target hfsc qdisc. Without the plug, our enqueued packet may be dequeued from the hfsc qdisc by network traffic that we do not control.

Resuming from [7], we continue examining the netem - blackhole relationship. After time_to_send has passed, the packet can be enqueued into the child blackhole qdisc, which drops every packet we attempt to enqueue into it. It is not the only qdisc with this property, but it is the most straightforward to use.

static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch,
			     struct sk_buff **to_free)
{
	qdisc_drop(skb, sch, to_free);
	return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
}

Triggering the bug

After setting up the qdiscs (and sending a packet to the plug qdisc), we are ready to trigger the bug. We first send a packet to the netem qdisc 3:0. The packet passes through drr -> hfsc -> netem, where a delay is added to the packet and stored in time_to_send. Then, hfsc_enqueue() calls peek() on the netem qdisc, triggering netem_dequeue(). Since time_to_send has not yet passed (due to the delay), the packet is not dequeued and remains in the netem's internal queue.

Then, we wait for the delay to be over. During this time, the plug qdisc 10:0 prevents other network traffic from dequeueing the packet. Once time_to_send has passed, we trigger the vulnerability by calling hfsc_change_class() with the HFSC_FSC flag; in other words, we add an FSC curve to the target hfsc class 2:1. As explained in vulnerability.md, this calls qdisc_peek_len() ([11]), which triggers a call to netem_dequeue().

static int
hfsc_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
		  struct nlattr **tca, unsigned long *arg,
		  struct netlink_ext_ack *extack)
{
    // [...]
		if (cl->qdisc->q.qlen != 0) {
			int len = qdisc_peek_len(cl->qdisc);                             // [11]

			if (cl->cl_flags & HFSC_RSC) {
				if (old_flags & HFSC_RSC)
					update_ed(cl, len);
				else
					init_ed(cl, len);
			}

			if (cl->cl_flags & HFSC_FSC) {
				if (old_flags & HFSC_FSC)
					update_vf(cl, 0, cur_time);
				else
					init_vf(cl, len);                                        // [12]
			}
		}
        // [...]

Because time_to_send has now passed, the packet is enqueued into the blackhole qdisc at [7], which promptly drops it. Because the packet is dropped, the return value at [7] has the __NET_XMIT_BYPASS flag set. This causes netem_dequeue() to propagate the drop up the qdisc hierarchy using qdisc_tree_reduce_backlog(), which results in the hfsc class being deactivated. At this point, the netem - blackhole subtree is empty. The change-class routine then continues execution: it detects that we are adding the HFSC_FSC flag, so it initializes (activates) the class with init_vf() at [12].
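The check-then-drain sequence just described can be sketched with a minimal Python model. All names are ours and the state is heavily simplified; the point is only the ordering of the qlen check, the draining peek, and the activation.

```python
class HfscClassModel:
    """Toy model of the TOCTOU in hfsc_change_class(): qlen is sampled
    before qdisc_peek_len(), but the peek itself can drain the queue."""
    def __init__(self):
        self.qlen = 1        # one delayed packet sits in the netem subtree
        self.active = True

    def peek_len(self):
        # Models [11]: netem_dequeue() hands the packet to blackhole,
        # which drops it; the drop propagates up and deactivates the class.
        self.qlen = 0
        self.active = False
        return 0

    def change_class(self):
        if self.qlen != 0:       # check: the queue looks non-empty...
            self.peek_len()      # ...but the peek empties it,
            self.active = True   # yet [12] (init_vf) activates anyway

cl = HfscClassModel()
cl.change_class()
assert cl.active and cl.qlen == 0  # active class with an empty subtree
```

The end state, an active class whose subtree is empty, is exactly what the real code produces.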

static void
init_vf(struct hfsc_class *cl, unsigned int len)
{
	struct hfsc_class *max_cl;
	struct rb_node *n;
	u64 vt, f, cur_time;
	int go_active;

	cur_time = 0;
	go_active = 1;
	for (; cl->cl_parent != NULL; cl = cl->cl_parent) {
		if (go_active && cl->cl_nactive++ == 0)
			go_active = 1;
		else
			go_active = 0;

		if (go_active) {
            // [...]
			vttree_insert(cl);
			cftree_insert(cl);

This marks the class as active and inserts it into the hfsc qdisc's internal trees. Normally, update_vf() is used to deactivate the class and remove it from these trees. However, since class 2:1 is already empty, update_vf() never gets a chance to run for it. When the hfsc class is deleted, hfsc_delete_class() assumes that vt tree removal was already handled earlier and does not check for it. The class is eventually freed in hfsc_destroy_class(). This leaves a dangling reference to the class in the vt tree, giving us our UAF.
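A toy model of the resulting dangling tree entry, with all names ours and the rb-tree reduced to a list:

```python
class VtTreeModel:
    """Toy model of the dangling vt-tree entry: deletion frees the class
    without removing it from the tree it was inserted into by init_vf()."""
    def __init__(self):
        self.vttree = []

    def init_vf(self, cl):
        self.vttree.append(cl)  # the class becomes active in the tree

    def delete_class(self, cl):
        # Models hfsc_delete_class()/hfsc_destroy_class(): the object is
        # freed on the assumption it already left the vt tree, so the
        # tree entry is never removed.
        cl["freed"] = True

tree = VtTreeModel()
victim = {"classid": "2:1", "freed": False}
tree.init_vf(victim)
tree.delete_class(victim)
assert tree.vttree[0]["freed"]  # the tree still points at a freed class
```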

The bug-triggering logic is encapsulated in trigger_bug(), and we delete the hfsc class 2:1 in setup_uaf().

LPE

From this point on, there are many documented strategies to achieve LPE.

For LTS and COS, we use the strategy outlined in CVE-2023-4623. There are two differences in our exploit: we reclaim with struct user_key_payload instead of simple_xattr, and we use a different ROP chain. First, in spray_keyring(), we reclaim the UAF class with struct user_key_payload, whose contents we can control. This object has an elastic size and is allocated with GFP_KERNEL, so it lands in the same cache as the hfsc_class (in fact, all qdisc classes are allocated with GFP_KERNEL).

static int
hfsc_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
		  struct nlattr **tca, unsigned long *arg,
		  struct netlink_ext_ack *extack)
{
	// [...]
	cl = kzalloc(sizeof(struct hfsc_class), GFP_KERNEL);

int user_preparse(struct key_preparsed_payload *prep)
{
	struct user_key_payload *upayload;
	size_t datalen = prep->datalen;

	if (datalen <= 0 || datalen > 32767 || !prep->data)
		return -EINVAL;

	upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
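The snippets above show that both objects come from GFP_KERNEL allocations of compatible size. The reclaim itself can be illustrated with a toy free-list model; this is a deliberate simplification of the SLUB per-cache free list, and all names are ours.

```python
class SlabModel:
    """Toy same-cache reclaim: freeing the victim returns its slot to the
    cache, and a later same-size allocation (the key payload) reuses it."""
    def __init__(self):
        self.freelist = []
        self.next_addr = 0x1000

    def alloc(self, tag):
        # tag is only for readability; a real cache does not care.
        if self.freelist:
            return self.freelist.pop()  # reuse a freed slot first
        addr = self.next_addr
        self.next_addr += 0x100
        return addr

    def free(self, addr):
        self.freelist.append(addr)

cache = SlabModel()
victim = cache.alloc("hfsc_class")
cache.free(victim)                       # class 2:1 is destroyed
spray = cache.alloc("user_key_payload")  # add_key() spray reclaims the slot
assert spray == victim                   # dangling pointer aliases our data
```

Once the slot is reclaimed, the stale vt-tree pointer now reads attacker-controlled bytes.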

This method manipulates internal hfsc_class pointers to obtain an 8-byte write-what-where (the pointers are set in prep_key_desc()). We use the write-what-where to overwrite the qfq_qdisc_ops.change() pointer and perform ROP. Using the available ROP gadgets, we overwrite core_pattern with our program path via copy_from_user(), then simply call msleep(). Another thread of our exploit notices that /proc/sys/kernel/core_pattern has changed and deliberately crashes itself, so our exploit is re-executed with root privileges, giving us a root shell to read the flag.

We use a call to hfsc_dequeue() to trigger the use of the freed hfsc class 2:1. Since the plug qdisc 10:0 is still plugged, we have to unplug it first so that the drr qdisc can dequeue from its other children. Then, we send a packet into the other hfsc class 2:2, which has a pfifo qdisc attached to it by default. This triggers the __qdisc_run routine which, after enqueueing the packet, tries to dequeue a packet from the hfsc qdisc, accessing the freed hfsc class 2:1.

Mitigation exploit

The exploit method is similar to CVE-2025-37798 and CVE-2025-37890. We will use the following qdisc setup:

          f000:0 (drr)
          /        \
       f000:1      f000:2
         /            \ 
      1:0 (hfsc)     e000:0 (plug)
      /       \
     1:1      1:2
    /           \
2:0 (multiq)    0:0 (default qdisc)
 /   /  ... \
2:1 2:2 ... 2:x
             |
         3:0 (netem) 
             |
            3:1
             |
         4:0 (blackhole)

The only functional difference is the addition of a multiq qdisc between the hfsc qdisc and the netem setup. We also re-number the earlier qdiscs so that they do not conflict with the determine_band() routine. When the hfsc qdisc 1:0 calls peek() on its child multiq qdisc 2:0, the multiq peek handler is called:

static struct sk_buff *multiq_peek(struct Qdisc *sch)
{
	struct multiq_sched_data *q = qdisc_priv(sch);
	unsigned int curband = q->curband;
	struct Qdisc *qdisc;
	struct sk_buff *skb;
	int band;

	for (band = 0; band < q->bands; band++) {
		curband++;
		if (curband >= q->bands)
			curband = 0;
		if (!netif_xmit_stopped(
		    netdev_get_tx_queue(qdisc_dev(sch), curband))) {
			qdisc = q->queues[curband];
			skb = qdisc->ops->peek(qdisc);                                   // [13]
			if (skb)
				return skb;
		}
	}
	return NULL;

}

This translates to a call to the multiq's child's peek handler at [13], which continues along the same codepath as before, calling qdisc_peek_dequeued(), which calls netem_dequeue(). So, the addition of the multiq qdisc does not disrupt the exploitation flow.
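The band-rotation logic can be modelled in a few lines of Python. This is a simplification of the loop above that keeps only the rotation and the first-non-empty check, dropping the netif_xmit_stopped() test.

```python
def multiq_peek(queues, curband):
    """Toy model of multiq_peek(): rotate through the bands starting
    after curband and return the head packet of the first band whose
    child queue is non-empty (cf. [13])."""
    bands = len(queues)
    for _ in range(bands):
        curband = (curband + 1) % bands
        if queues[curband]:
            return queues[curband][0]
    return None

# Only the band holding the netem subtree has a packet queued; the peek
# finds it regardless of which band the rotation starts from.
queues = [[], [], ["delayed-pkt"], []]
assert multiq_peek(queues, curband=0) == "delayed-pkt"
assert multiq_peek(queues, curband=3) == "delayed-pkt"
```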

Let's review the hfsc_change_class() logic to see how the hfsc-multiq relationship is corrupted. After waiting for time_to_send to pass, the multiq-netem-hfsc subtree has a single packet enqueued. The call to hfsc_change_class() triggers the chain of calls: qdisc_peek_len() [11] -> multiq_peek() -> qdisc_peek_dequeued() [13] -> netem_dequeue() -> qdisc_enqueue() [7] -> blackhole_enqueue(). Like before, the dropped packet at [7] is propagated up the qdisc hierarchy, deactivating the hfsc class 1:1. However, the subsequent init_vf() call at [12] reactivates the class, creating a pointer to the emptied subtree.

Then, we simply delete the hfsc class 1:1 as before, leaving a dangling pointer, and continue the exploitation strategy from CVE-2025-37798. We reclaim the large multiq q->queues chunk with sendmsg() and forge qdisc pointers in kernfs_pr_cont_buf. Then, we overwrite the qdisc->ops table to gain RIP control. Finally, we use a stack pivot gadget to run our ROP chain and overwrite core_pattern, achieving LPE.