## Introduction

VPP graph nodes make extensive use of explicit prefetching to cover dependent read latency. In the simplest dual-loop case, we prefetch buffer headers and (typically) one cache line's worth of packet data. The rest of this page shows what happens if we disable the prefetch block.

### Baseline

Single core, 13 MPPS offered load, i40e NICs, ~13 MPPS in+out:

```
vpp# show run
Name                              Clocks     Vectors/Call
FortyGigabitEthernet84/0/1-out    9.08e0     50.09
FortyGigabitEthernet84/0/1-tx     3.84e1     50.09
dpdk-input                        7.45e1     50.09
interface-output                  1.08e1     50.09
ip4-input-no-checksum             3.92e1     50.09
ip4-lookup                        3.88e1     50.09
ip4-rewrite-transit               3.43e1     50.09
```

The key statistic to note here: ip4-input-no-checksum costs 39 clocks per packet.

Baseline "perf top" function-level profile:

```
14.21%  libvnet.so.0.0.0  [.] ip4_input_no_checksum_avx2
14.14%  libvnet.so.0.0.0  [.] ip4_lookup_avx2
14.10%  vpp               [.] i40e_recv_scattered_pkts_vec
12.64%  libvnet.so.0.0.0  [.] ip4_rewrite_transit_avx2
10.60%  libvnet.so.0.0.0  [.] dpdk_input_avx2
 9.70%  vpp               [.] i40e_xmit_pkts_vec
 4.88%  libvnet.so.0.0.0  [.] dpdk_interface_tx_avx2
 3.67%  libvlib.so.0.0.0  [.] dispatch_node
 3.25%  libvnet.so.0.0.0  [.] vnet_per_buffer_interface_output_avx2
 2.96%  libvnet.so.0.0.0  [.] vnet_interface_output_node_no_flatten
 1.85%  libvlib.so.0.0.0  [.] vlib_put_next_frame
 1.80%  libvlib.so.0.0.0  [.] vlib_get_next_frame_internal
 1.12%  vpp               [.] rte_delay_us_block
```

### Turn off the dual-loop prefetch block in ip4_input_inline(...)

```
  /* Prefetch next iteration. */
  if (0)
    {
      vlib_buffer_t * p2, * p3;

      p2 = vlib_get_buffer (vm, from[2]);
      p3 = vlib_get_buffer (vm, from[3]);

      vlib_prefetch_buffer_header (p2, LOAD);
      vlib_prefetch_buffer_header (p3, LOAD);

      CLIB_PREFETCH (p2->data, sizeof (ip0[0]), LOAD);
      CLIB_PREFETCH (p3->data, sizeof (ip1[0]), LOAD);
    }
```

This is a fairly harsh demonstration, but it clearly shows the "missing prefetch, fix me" signature:

```
Name                              Clocks     Vectors/Call
FortyGigabitEthernet84/0/1-out    7.91e0     76.97
FortyGigabitEthernet84/0/1-tx     3.76e1     76.97
dpdk-input                        6.62e1     76.97
interface-output                  9.91e0     76.97
ip4-input-no-checksum             5.53e1     76.97
ip4-lookup                        3.49e1     76.97
ip4-rewrite-transit               3.32e1     76.97
```

This single change causes ip4-input-no-checksum to increase to 55 clocks/pkt (from 39 clocks/pkt). ip4-input-no-checksum jumps to the top of the "perf top" summary:

```
21.47%  libvnet.so.0.0.0  [.] ip4_input_no_checksum_avx2
13.73%  vpp               [.] i40e_recv_scattered_pkts_vec
13.42%  libvnet.so.0.0.0  [.] ip4_lookup_avx2
12.53%  libvnet.so.0.0.0  [.] ip4_rewrite_transit_avx2
```

The "perf top" detailed function profile shows a gross stall (32% of the function runtime) at the first use of packet data:

```
       │    /* Check bounds. */
       │    ASSERT ((signed) b->current_data >= (signed) -VLIB_BUFFER_PRE_DAT
       │    return b->data + b->current_data;
  0.77 │      movswq (%rbx),%rax
       │    p1 = vlib_get_buffer (vm, pi1);
       │
       │    ip0 = vlib_buffer_get_current (p0);
       │    ip1 = vlib_buffer_get_current (p1);
       │
       │    sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
  0.06 │      mov    0x20(%rbx),%r11d
       │    sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
  0.20 │      mov    0x20(%rbp),%r10d
  0.03 │      lea    0x100(%rbx,%rax,1),%rdx
  0.80 │      movswq 0x0(%rbp),%rax
       │
       │    arc0 = ip4_address_is_multicast (&ip0->dst_address) ? lm-
  0.23 │      movzbl 0x10(%rdx),%edi
 32.64 │      lea    0x100(%rbp,%rax,1),%rax
       │      and    $0xfffffff0,%edi
  0.84 │      cmp    $0xe0,%dil
       │    arc1 = ip4_address_is_multicast (&ip1->dst_address) ? lm-
  0.81 │      movzbl 0x10(%rax),%edi
       │
       │    vnet_buffer (p0)->ip.adj_index[VLIB_RX] = ~0;
  5.32 │      movl   $0xffffffff,0x28(%rbx)
       │    ip1 = vlib_buffer_get_current (p1);
       │
       │    sw_if_index0 = vnet_buffer (p0)->sw_if_index[VLIB_RX];
       │    sw_if_index1 = vnet_buffer (p1)->sw_if_index[VLIB_RX];
```