## Introduction

VPP graph nodes make extensive use of explicit prefetching to cover dependent read latency. In the simplest dual-loop case, we prefetch buffer headers and (typically) one cache line's worth of packet data. The rest of this page shows what happens if we disable the prefetch block.

### Baseline

Single core, 13 MPPS offered load, i40e NICs, ~13 MPPS in+out:

```
vpp# show run
Name                              Clocks     Vectors/Call
FortyGigabitEthernet84/0/1-out    9.08e0     50.09
FortyGigabitEthernet84/0/1-tx     3.84e1     50.09
dpdk-input                        7.45e1     50.09
interface-output                  1.08e1     50.09
ip4-input-no-checksum             3.92e1     50.09
ip4-lookup                        3.88e1     50.09
ip4-rewrite-transit               3.43e1     50.09
```

The key statistic to note here: ip4-input-no-checksum costs 39 clocks per packet.

Baseline "perf top" function-level profile:

```
14.21%  libvnet.so.0.0.0  [.] ip4_input_no_checksum_avx2
14.14%  libvnet.so.0.0.0  [.] ip4_lookup_avx2
14.10%  vpp               [.] i40e_recv_scattered_pkts_vec
12.64%  libvnet.so.0.0.0  [.] ip4_rewrite_transit_avx2
10.60%  libvnet.so.0.0.0  [.] dpdk_input_avx2
 9.70%  vpp               [.] i40e_xmit_pkts_vec
 4.88%  libvnet.so.0.0.0  [.] dpdk_interface_tx_avx2
 3.67%  libvlib.so.0.0.0  [.] dispatch_node
 3.25%  libvnet.so.0.0.0  [.] vnet_per_buffer_interface_output_avx2
 2.96%  libvnet.so.0.0.0  [.] vnet_interface_output_node_no_flatten
 1.85%  libvlib.so.0.0.0  [.] vlib_put_next_frame
 1.80%  libvlib.so.0.0.0  [.] vlib_get_next_frame_internal
 1.12%  vpp               [.] rte_delay_us_block
```

### Turn off the dual-loop prefetch block in ip4_input_inline(...)

```
  /* Prefetch next iteration. */
  if (0)
    {
      vlib_buffer_t * p2, * p3;

      p2 = vlib_get_buffer (vm, from[2]);
      p3 = vlib_get_buffer (vm, from[3]);

      vlib_prefetch_buffer_header (p2, LOAD);
      vlib_prefetch_buffer_header (p3, LOAD);

      CLIB_PREFETCH (p2->data, sizeof (ip0[0]), LOAD);
      CLIB_PREFETCH (p3->data, sizeof (ip1[0]), LOAD);
    }
```

This is a fairly harsh demonstration, but it clearly shows the "missing prefetch, fix me" signature:

```
Name                              Clocks     Vectors/Call
FortyGigabitEthernet84/0/1-out    7.91e0     76.97
FortyGigabitEthernet84/0/1-tx     3.76e1     76.97
dpdk-input                        6.62e1     76.97
interface-output                  9.91e0     76.97
ip4-input-no-checksum             5.53e1     76.97
ip4-lookup                        3.49e1     76.97
ip4-rewrite-transit               3.32e1     76.97
```

This single change causes ip4-input-no-checksum to increase to 55 clocks/pkt (from 39 clocks/pkt). ip4-input-no-checksum jumps to the top of the "perf top" summary:

```
21.47%  libvnet.so.0.0.0  [.] ip4_input_no_checksum_avx2
13.73%  vpp               [.] i40e_recv_scattered_pkts_vec
13.42%  libvnet.so.0.0.0  [.] ip4_lookup_avx2
12.53%  libvnet.so.0.0.0  [.] ip4_rewrite_transit_avx2
```

The "perf top" detailed function profile shows a gross stall (32% of the function runtime) at the first use of packet data:

```
       │    /* Check bounds. */
       │    ASSERT ((signed) b->current_data >= (signed) -VLIB_BUFFER_PRE_DAT
       │    return b->data + b->current_data;
  0.77 │      movswq (%rbx),%rax
       │    p1 = vlib_get_buffer (vm, pi1);
       │
       │    ip0 = vlib_buffer_get_current (p0);
       │    ip1 = vlib_buffer_get_current (p1);
       │
       │    sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
  0.06 │      mov    0x20(%rbx),%r11d
       │    sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
  0.20 │      mov    0x20(%rbp),%r10d
  0.03 │      lea    0x100(%rbx,%rax,1),%rdx
  0.80 │      movswq 0x0(%rbp),%rax
       │
       │    arc0 = ip4_address_is_multicast (&ip0->dst_address) ? lm-
  0.23 │      movzbl 0x10(%rdx),%edi
 32.64 │      lea    0x100(%rbp,%rax,1),%rax
       │      and    $0xfffffff0,%edi
  0.84 │      cmp    $0xe0,%dil
       │    arc1 = ip4_address_is_multicast (&ip1->dst_address) ? lm-
  0.81 │      movzbl 0x10(%rax),%edi
       │
       │    vnet_buffer (p0)->ip.adj_index[VLIB_RX] = ~0;
  5.32 │      movl   $0xffffffff,0x28(%rbx)
       │    ip1 = vlib_buffer_get_current (p1);
       │
       │    sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
       │    sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
```