VPP How To Optimize Performance (System Tuning)
This page describes system configuration tweaks that can help maximize the packet processing performance of VPP applications.
- 1 General Considerations
- 2 BIOS settings
  - 2.1 Power Management
  - 2.2 Turboboost / Speedstep
  - 2.3 Virtualization Extensions
  - 2.4 CMDLINE configuration parameters
    - 2.4.1 Kernel command line during startup
    - 2.4.2 Tickless Kernel
    - 2.4.3 In a VM: CPU Isolation on Host (isolcpus)
- 3 Running VPP in a KVM VM
  - 3.1 Disable Interrupt Balancing (irqbalance)
  - 3.2 In a VM: Disable Kernel Samepage Merging (KSM)
  - 3.3 In a VM: Configure KVM Parameters
  - 3.4 In a VM: Remove VirtIO Balloon Driver
  - 3.5 In a VM: Set CPU Affinity and NUMA Memory Policy for the VPP VM threads
  - 3.6 Set CPU Affinity for VPP in the VM
  - 3.7 In a VM: Don't run anything else in the VM!
  - 3.8 Hyperthreading
- 4 Other
- 5 VPP configuration
- 6 References
WARNING: The suggestions on this page have been validated on Intel CPUs ONLY. The applicability of these suggestions to other CPU architectures (such as arm64) has not been verified. Please consider any adjustments that might be appropriate for non-Intel CPUs.
Most of the suggestions on this page apply to both virtual machines and Bare Metal OS instances (by "Bare Metal" we mean an instance of an operating system running directly on hardware and not on a virtual machine). Section titles that contain the words "In a VM" are suggestions that apply only to an OS running on a virtual machine.
Intel processors have a power management feature that puts the system into a power-saving mode when it is underutilized. This feature should be turned off to avoid variance in VPP application performance; configure the system for maximum performance in the BIOS. The downside is that power consumption does not drop even when the host system is idle.
For maximum performance, low-power processor states (C6, C1 enhanced) should be disabled.
Speedstep is a CPU feature that dynamically adjusts the processor frequency to meet processing needs, decreasing the frequency under low CPU load. Turboboost overclocks a core when the demand for CPU is high. Turboboost requires Speedstep to be enabled.
While these two features are good for power saving, they can introduce variance in dataplane performance when there is a burst of packets. For consistent behavior, disable both features.
If peak performance matters more than consistency, Speedstep and Turboboost can both be enabled. BIOS changes alone are likely not sufficient to enable Turboboost; the host OS may also need changes to support running at higher clock speeds. The specific configuration changes differ between Ubuntu, CentOS, RedHat, etc. Please see this link for details: Avoiding CPU speed scaling
On Ubuntu, "performance" mode should be set for all CPU cores. The current setting can be read from these files:
root# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance
<etc>
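On a machine with many cores it can be easier to list only the cores that are not in "performance" mode. The following is a minimal sketch; the function name `non_performance_cpus` and its optional directory argument are illustrative assumptions, not part of VPP or any distribution tooling.

```shell
# List every CPU whose cpufreq governor is not "performance".
# $1: sysfs CPU directory (defaults to /sys/devices/system/cpu).
# Prints nothing when every core is already in "performance" mode.
non_performance_cpus() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for gov_file in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$gov_file" ] || continue    # glob did not match, or file unreadable
        gov=$(cat "$gov_file")
        if [ "$gov" != "performance" ]; then
            cpu_dir=${gov_file%/cpufreq/scaling_governor}
            echo "${cpu_dir##*/}: $gov"
        fi
    done
}
```

Calling `non_performance_cpus` with no argument inspects the running system.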
The following output is from a system with an "Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz" with Turboboost enabled, showing the cores running at 2.90 GHz:
root # grep MHz /proc/cpuinfo
cpu MHz : 2900.292
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2899.902
cpu MHz : 2899.902
cpu MHz : 2899.902
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
cpu MHz : 2900.000
Intel virtualization extensions VT-x (often labelled simply "VT" in the BIOS) and VT-d (for direct I/O), together with DMA remapping (DMAR), must be turned on. VT-d enables the IOMMU virtualization capabilities that are required for PCIe passthrough. Interrupt remapping should also be enabled so that hardware interrupts can be remapped to a VM for PCIe passthrough.
After the host boots, the output of the following command should look like:
$ dmesg | grep -e DMAR -e IOMMU
[ 0.000000] ACPI: DMAR 000000008f6dd000 001A8 (v01 Cisco0 CiscoUCS 00000001 MSFT 0100000D)
[ 0.056527] dmar: IOMMU 0: reg_base_addr fe710000 ver 1:0 cap c90780106f0462 ecap f020ff
[ 0.056637] IOAPIC id 8 under DRHD base 0xfe710000 IOMMU 0
[ 0.056638] IOAPIC id 9 under DRHD base 0xfe710000 IOMMU 0
We recommend disabling VT-d Coherency Support for higher performance.
Note that Intel Sandy Bridge CPUs have a limitation with their VT-d IOTLB that limits PCIe passthrough throughput. Sandy Bridge (and earlier) CPUs are not recommended if high performance is required.
Set the grub configuration for those kernel parameters which can only be set via the kernel command line during startup:
GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"
In the example command above:
- intel_iommu=on - must be set for PCIe-passthrough interfaces to work in a VM.
- isolcpus=1-13 nohz_full=1-13 - multi-core scheduler / placement configuration parameters
- hugepagesz=1GB hugepages=64 default_hugepagesz=1GB - setting 1GB hugepages will drastically improve VPP initialization times.
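After a reboot it is worth confirming that the parameters actually reached the kernel. The sketch below compares a command line string against a list of expected parameter names; the function name `missing_boot_params` is a hypothetical helper introduced here for illustration.

```shell
# Print each expected parameter name that is missing from a kernel command line.
# $1: the command line string, for example "$(cat /proc/cmdline)"
# remaining args: parameter names to look for, without the "=value" part.
missing_boot_params() {
    cmdline=$1
    shift
    for name in "$@"; do
        found=no
        # Unquoted on purpose: split the command line into words.
        for word in $cmdline; do
            case $word in
                "$name"|"$name"=*) found=yes ;;
            esac
        done
        [ "$found" = yes ] || echo "$name"
    done
}
```

For example, `missing_boot_params "$(cat /proc/cmdline)" intel_iommu isolcpus nohz_full hugepagesz hugepages default_hugepagesz` prints nothing when all six parameters from the example above are present.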
For high performance applications, using a tickless kernel can result in improved performance. The host kernel must have the cores operating in tickless mode and the same cores should be dedicated to the vpp application.
You can check if local timer interrupts are occurring on each core from the output of:
grep LOC /proc/interrupts
or dynamically with:
watch -n1 -d "grep -E 'LOC|CPU' /proc/interrupts"
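The wide LOC row can be hard to read on systems with many cores. As a rough sketch, the row can be turned into one line per CPU; `loc_counts` is a hypothetical helper name, and the only assumption about the file format is that the LOC row lists one numeric counter per CPU before its description text.

```shell
# Print the local timer interrupt (LOC) count of each CPU, one per line.
# $1: interrupts file to read (defaults to /proc/interrupts).
loc_counts() {
    awk '$1 == "LOC:" {
        # Counters come first; stop at the first non-numeric field (the description).
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            print "cpu" (i - 2), $i
    }' "${1:-/proc/interrupts}"
}
```

Running `loc_counts` twice a few seconds apart and comparing the values shows which cores are still receiving timer ticks.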
The host kernel may have been built with the CONFIG_NO_HZ_FULL_ALL option. If so, tickless operation happens automatically on any core on which the Linux scheduler has only one thread to run. To check for this, look for that string in your Linux kernel config file. This file may be at /boot/ (determine your kernel version with "uname -a") or at /proc/config.gz.
If the kernel was built with CONFIG_NO_HZ_FULL but without CONFIG_NO_HZ_FULL_ALL, it may still be possible to run tickless by setting nohz_full in the grub file (see the Grub File section). Specify the same set of CPUs for both nohz_full and isolcpus.
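The config file check can be scripted. This is a minimal sketch assuming an uncompressed config file is available; `tickless_options` is a hypothetical helper name.

```shell
# Print the tickless build options that a kernel config file enables.
# $1: path to an uncompressed kernel config file.
# Returns non-zero (grep's exit status) when neither option is enabled.
tickless_options() {
    grep -E '^CONFIG_NO_HZ_FULL(_ALL)?=y$' "$1"
}
```

If only /proc/config.gz is available, it can be decompressed first, for example with `zcat /proc/config.gz > /tmp/kernel-config`, and the resulting file passed to the function.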
For optimal performance of a virtual machine, specifically for the dataplane/forwarding features, the CPUs assigned to the virtual machine should be used exclusively by the VM. One reasonable way to configure this is via cgroup configuration, where the CPUs assigned to the cgroup node for the VM are not shared with other tasks on the system. Kernel threads will still run on all the cores, so this does not give complete isolation. This configuration ensures that the host does not schedule other tasks on the same physical CPU and thus lets the qemu thread (and, by extension, the guest) run on that core (almost) exclusively.
As mentioned above, host kernel threads can still be scheduled on the physical CPU. To isolate the host CPUs completely, even from kernel threads, isolcpus can be configured. The qemu threads can then be pinned to the isolated CPUs. This requires grub configuration on the host and prevents the isolated cores from running any load other than the load that is explicitly pinned to them.
Most deployments may not need this configuration, as it requires customized workload scheduling on the host system. This information also needs to be propagated to the virtual router/virtual machine, which needs to use the same isolated CPUs. Our recommendation is to use this mechanism only if the operator has the need, and the systems in place, to configure and manage the host at this level of detail.
Be aware that cpu cores on a socket may not be numbered contiguously. This can be checked with:
grep "physical id" /proc/cpuinfo
For example, on an HP ProLiant DL380 Gen9 with two 12-core E5-2680 v3 CPUs, the cores from different sockets are interleaved like this:
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
physical id : 1
There must always be one core that is not isolated. Commonly this is cpu 0. isolcpus is configured in the grub file. (See the Grub File section.) The cpu list can be a combination of ranges and/or comma-separated values, such as isolcpus=1-13 or isolcpus=11,12,13,14 or isolcpus=0-5,12-17.
For example:
GRUB_CMDLINE_LINUX="intel_iommu=on isolcpus=1-13 nohz_full=1-13 hugepagesz=1GB hugepages=64 default_hugepagesz=1GB"
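To double-check which cores a given list covers, the list can be expanded into individual CPU numbers. This is a sketch that handles only the plain ranges and comma-separated values described above; `expand_cpu_list` is a hypothetical helper name.

```shell
# Expand a CPU list such as "0-5,12-17" into individual CPU numbers.
# $1: CPU list made of single values and first-last ranges, separated by commas.
expand_cpu_list() {
    cpus=$(echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        # A single value has no "last" part, so the range is first..first.
        seq "$first" "${last:-$first}"
    done)
    echo $cpus   # unquoted on purpose: joins the lines with single spaces
}
```

For example, `expand_cpu_list 1-13` lists the thirteen cores isolated by the grub line above.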
The CONFIG_NO_HZ_FULL Linux kernel build option is used to configure a tickless kernel. The idea is to configure certain processor cores to operate in tickless mode so that these cores do not receive any periodic interrupts. These cores run dedicated tasks (no other tasks are scheduled on them, which removes the need for a scheduling tick). A CONFIG_HZ based timer interrupt can evict dataplane data from the L1 cache of the core, and this can degrade dataplane performance by a few percentage points (to be quantified, but estimated to be 1-3%). Running tickless typically means getting 1 timer interrupt/sec instead of 1000/sec.
The following configuration tweaks have been used to demonstrate a 98-100% Max bandwidth zero packet drop rate forwarding 1000 byte ipv4 packets from one 10GigE interface to another bi-directionally using 2 10GigE ports on an Ixia traffic generator.
The irqbalance daemon is enabled by default. It is designed to distribute hardware interrupts across CPUs in a multi-core system in order to increase performance. However, it can cause the CPU running the VPP VM to stall, causing dropped Rx packets. When irqbalance is disabled, all interrupts are handled by cpu0, so the VPP VM (or any other service VM) should NOT run on cpu0.
Disable irqbalance by setting ENABLED="0" in the default configuration file (/etc/default/irqbalance):
#Configuration for the irqbalance daemon
#Should irqbalance be enabled?
ENABLED="0"
#Balance the IRQs only once?
ONESHOT="0"
Man page: http://manpages.ubuntu.com/manpages/precise/man1/irqbalance.1.html
KSM is a memory-saving de-duplication feature that merges anonymous (private) pages (not pagecache ones).
While diagnosing the vpp Rx zero packet drop issue, we noticed a correlation between the /sys/kernel/debug/kvm/pf_fixed counter being incremented and the periodic Rx packet drops. We observed that disabling KSM eliminated the incrementing of these counters. KSM is enabled in Ubuntu 14.04 server on the host OS only. It is disabled when Ubuntu 14.04 server is run in a VM.
Disable KSM by writing "0" to /sys/kernel/mm/ksm/run in the host OS:
sudo bash
echo 0 > /sys/kernel/mm/ksm/run
exit
For more information, see: http://www.linux-kvm.org/page/KSM
In order to run VPP in a VM, the following parameters must be configured on the command line invocation or in the libvirt / virsh xml domain configuration:
-cpu host : This parameter causes the VM to inherit the host OS flags.
Note: libvirt 0.9.11 or greater is required for this to be included in the xml configuration.
-m 8192 : 8 GB of ram is required for optimal zero packet drop rates.
TBD: Need to investigate why this is true. 4GB has Rx pkt drops even though there is only 2.2GB allocated!
-smp 2,sockets=1,cores=4,threads=2
To disable PXE boot delays, add the ",rombar=0" option to the end of each "-device" option list, or add "<rom bar='off'/>" to the device xml configuration.
Use of the VirtIO Balloon driver in the vpp VM causes Rx packet drops when the balloon driver calls mmap().
Remove the VirtIO Balloon Driver from the VM configuration:
If editing the xml configuration, remove the memballoon driver by setting the model='none':
<memballoon model='none'/>
or delete the device definition from the command line parameter list:
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
CPU Affinity and NUMA Memory Policy can be configured with libvirt.
For more information, see: https://libvirt.org/formatdomain.html#elementsNUMATuning
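As one possible way to apply a pinning at runtime, each vCPU can be pinned with `virsh vcpupin`. The sketch below only prints the commands so they can be reviewed before being run; the function name `print_vcpupin_cmds`, the domain name `vpp-vm` and the host CPU numbers are all hypothetical.

```shell
# Print one "virsh vcpupin" command per vCPU, pinning vCPU n to the n-th host CPU given.
# $1: libvirt domain name; remaining args: host CPU numbers, one per vCPU.
# Nothing is executed: the output is meant to be reviewed and then run by hand.
print_vcpupin_cmds() {
    domain=$1
    shift
    vcpu=0
    for host_cpu in "$@"; do
        echo "virsh vcpupin $domain $vcpu $host_cpu"
        vcpu=$((vcpu + 1))
    done
}
```

For example, `print_vcpupin_cmds vpp-vm 2 3` prints the commands that pin vCPU 0 to host CPU 2 and vCPU 1 to host CPU 3.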
In order to prevent the linux scheduler from relocating the vpe application to a different CPU and in order to prevent interrupt handlers from running on the same cpu as vpe, the qn application cpu affinity shall be set to cpu0 and the vpe application cpu affinity set to cpu1.
If occasional packet drops are acceptable (e.g. a few hundred packets / 10s of minutes), this configuration step may be omitted.
Note: Given the KVM -smp options, there is only one NUMA node, thus no need to set NUMA memory affinity in the VM for the vpe application.
A VPP application should configure the correct cpu affinity during application initialization.
As noted in the previous section, setting the CPU affinity for the vpe and qn applications in the VM is important to prevent Rx packet drops. Running other applications (e.g. htop) in the vpp VM may also cause Rx packet drops.
When hyperthreading is enabled, each physical CPU core appears as two logical cores. The logical cores share the resources of the physical core (such as the L1 and L2 caches and the execution units). This is controlled by a setting in the BIOS.
In general, dataplane performance suffers when hyperthreading is enabled, so the recommendation is to disable it.
Since HT is a BIOS setting and changing it requires a reboot, a deployment will in practice choose one setting and keep it, rather than enabling or disabling HT based on the workload being run on the machine.
If HT is enabled, it is still possible to obtain the same performance as with HT disabled. To do this, isolate the extra logical cores (see CPU isolation) and do not assign any threads to them.
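To find which logical cores share a physical core, the kernel exposes a sibling list per CPU in sysfs. The following is a sketch; `thread_siblings` is a hypothetical helper name.

```shell
# Print the hyperthread sibling list of every CPU, for example "cpu0: 0,12".
# $1: sysfs CPU directory (defaults to /sys/devices/system/cpu).
thread_siblings() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue    # glob did not match, or file unreadable
        cpu_dir=${f%/topology/thread_siblings_list}
        echo "${cpu_dir##*/}: $(cat "$f")"
    done
}
```

CPUs that report the same sibling list are two logical cores of one physical core; one of each pair would be left without any assigned threads.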
The transparent hugepage (THP) feature automates the task of creating and managing hugepages. A kernel daemon process (khugepaged) runs in the background and stitches free pages together to form/free hugepages.
We recommend turning this feature off and instead allocating hugepages explicitly (this is not a strong recommendation). It is possible to preallocate hugepages and still have the THP daemon running on the host system.
To turn off THP:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
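The kernel reports the active THP mode by enclosing it in square brackets, for example "always madvise [never]". A small sketch to extract it follows; `thp_mode` is a hypothetical helper name.

```shell
# Print the active transparent hugepage mode (the bracketed word).
# $1: file to read (defaults to /sys/kernel/mm/transparent_hugepage/enabled).
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}
```

After turning THP off, `thp_mode` should print "never".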
On a heavily loaded host system, Linux will evict a process's pages to free memory. This can happen to text pages, which are backed by a physical store. If swapping is enabled, data segments can also be swapped out to the swap area on disk when the system is running low on memory. This typically happens when the system is overprovisioned, which is the typical setup on a server but uncommon on embedded systems. Swapping leads to slow and non-deterministic response times, because accessing a page that is no longer in memory adds latency.
Our recommendation for running NFV applications is not to overprovision the system and specifically to avoid swapping (turn swap off). For deterministic response times, we recommend pinning qemu memory for VPP applications. Pinning/locking qemu memory ensures that the qemu process pages are always memory resident. This provides consistent response times.
The parameter to turn on locking of qemu process memory is: -realtime mlock=on (newer QEMU releases replace this option with -overcommit mem-lock=on)
A few things need to be considered to turn on page locking. The calling process must have the process limits (prlimit) set appropriately to lock the appropriate amount/size of memory. If using virsh to start the virtual router, the process limits of libvirtd must be set appropriately.
To validate that the process memory is locked, check the value of the VmLck field in the /proc/<pid>/status file. The <pid> needs to be the pid of the qemu process (or the pid of any of the qemu threads for the virtual router).
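The VmLck check can be scripted as follows. This is a sketch; `locked_kb` is a hypothetical helper name and the caller supplies the status file of the qemu process.

```shell
# Print the amount of locked memory, in kB, recorded in a process status file.
# $1: path to the status file of the qemu process under /proc.
locked_kb() {
    awk '$1 == "VmLck:" { print $2 }' "$1"
}
```

A value of 0 means that no memory of that process is locked.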
Kernel Same-page Merging (also known as kernel shared memory and memory merging) is a kernel feature that makes it possible for a hypervisor system to share identical memory pages amongst different processes or among multiple virtual machines. While not directly linked, Kernel-based Virtual Machine (KVM) can use KSM to merge memory pages occupied by virtual machines.
KSM is a Linux kernel feature (at the time of writing, qemu is its only client application). KSM consumes non-trivial CPU resources on the host system in trying to optimize memory utilization. KSM attempts to merge pages at periodic intervals (typically 200 ms, configurable through /sys/kernel/mm/ksm/sleep_millisecs).
We recommend turning this function off when running a single vpp instance.
If there are multiple vpp instances running on a system, turning on this feature will save memory at the expense of some cpu cycles.
To turn off this feature, execute:
echo 0 > /sys/kernel/mm/ksm/run
If it's not practical to turn off ksm, we recommend turning off ksm across numa nodes:
echo 0 > /sys/kernel/mm/ksm/merge_across_nodes
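To confirm the resulting state, the KSM settings mentioned above can be read back in one pass. This is a sketch; `ksm_status` is a hypothetical helper name.

```shell
# Print the current value of the KSM settings discussed above.
# $1: KSM sysfs directory (defaults to /sys/kernel/mm/ksm).
ksm_status() {
    ksm_dir="${1:-/sys/kernel/mm/ksm}"
    for knob in run merge_across_nodes sleep_millisecs; do
        # Skip settings that this kernel does not expose.
        if [ -r "$ksm_dir/$knob" ]; then
            echo "$knob=$(cat "$ksm_dir/$knob")"
        fi
    done
}
```

With KSM turned off, the output should include `run=0`.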
Pass the host CPU configuration to the sunstone virtual router. This is especially important so that the guest can see whether the host CPU supports 1 GB huge pages (pdpe1gb flag in /proc/cpuinfo). This is done with the -cpu host option on the qemu command line.
In any environment where high throughput performance is a requirement, it is suggested to run VPP in multithreaded mode.
If running in the default single-threaded configuration, the same thread that handles packet forwarding also performs administrative tasks such as responding to API calls or collecting statistics. The time these tasks consume may vary with NIC make and model, NIC placement, and the number of NICs configured for use in VPP, which allows external factors to impact forwarding performance. Therefore, even if the required performance target can be achieved by a single CPU core, running VPP in a "one main thread plus one worker thread" configuration helps to reduce the impact of external factors and allows the worker thread to deliver better and more consistent forwarding performance.
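In the VPP startup configuration file, this layout is expressed in the cpu section with the main-core and corelist-workers options. The sketch below prints such a section for a given pair of cores; the function name `vpp_cpu_stanza` and the core numbers in the example are illustrative assumptions and should be matched to the isolated cores of the actual system.

```shell
# Print a "cpu" section for the VPP startup configuration file.
# $1: core for the main thread; $2: core list for the worker threads.
vpp_cpu_stanza() {
    printf 'cpu {\n  main-core %s\n  corelist-workers %s\n}\n' "$1" "$2"
}
```

For example, `vpp_cpu_stanza 1 2` describes one main thread on core 1 and one worker thread on core 2.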
- kernel.org hugetlbpage doc
- hugetlbfs man page
- kernel.org kernel-per-CPU-kthreads doc
- kernel.org cgroup memory.txt
- kernel.org cgroups.txt
- kernel.org cgroup cpusets.txt
- kernel.org cgroup hugetlb.txt
- kernel.org cgroup devices.txt
- kvm VFIO (.pdf format)
- kernel.org vfio.txt
- red hat cpu/irq.html
- tickless kernel
- NO_HZ kernel operation
- kernel.org sched-domains.txt
- dpdk overview
- NO_HZ "full god mode"
- lwn.net transparent hugepages issue
- kernel.org kernel-per-CPU-kthreads.txt
- lwn.net hugetlbfs
- greenhost.nl multi-queue NICs with SMP on Linux
- irqbalance man page
- kernel.org IRQ-affinity.txt
- kernel.org network scaling.txt