Commit 11afffd

Content dev
1 parent c83029d commit 11afffd

6 files changed: +51 -32 lines changed

content/learning-paths/servers-and-cloud-computing/tune-network-workloads-on-bare-metal/1_setup.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ layout: learningpathall
88

99
## Overview
1010

11-
Tomcat is a common client–server web workload that serves HTTP/HTTPS requests. In this section, you will set up a benchmarking environment using **Apache Tomcat** (server) and **wrk2** (client) to generate load and measure performance on an Arm-based bare‑metal instance. This guide was validated on an AWS **c8g.metal‑48xl** running Ubuntu 24.04.
11+
Tomcat is a common client–server web workload that serves HTTP/HTTPS requests. In this section, you will set up a benchmarking environment using Apache Tomcat (server) and `wrk2` (client) to generate load and measure performance on an Arm-based bare‑metal instance. This guide was validated on an AWS `c8g.metal‑48xl` instance running Ubuntu 24.04.
1212

1313
## Set up the Tomcat benchmark server
1414

@@ -36,7 +36,7 @@ Alternatively, you can build Tomcat [from source](https://github.com/apache/tomc
3636

3737
## Enable access to Tomcat examples
3838

39-
To access the built‑in examples from your local network or external IP, modify the `context.xml` file and update the `RemoteAddrValve` to allow your clients.
39+
To access the built‑in examples from your local network or external IP, modify the `context.xml` file and update `RemoteAddrValve` to allow your clients.
4040

4141
The file is located at:
4242

content/learning-paths/servers-and-cloud-computing/tune-network-workloads-on-bare-metal/2_baseline.md

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -26,7 +26,7 @@ This baseline includes:
2626
- Disabling access logging
2727
- Setting optimal thread counts
2828

29-
### Align IOMMU settings with Ubuntu defaults
29+
## Align IOMMU settings with Ubuntu defaults
3030

3131
{{% notice Note %}}
3232
If you are using a cloud image (for example, AWS) with non-default kernel parameters, align IOMMU settings with the Ubuntu defaults: `iommu.strict=1` and `iommu.passthrough=0`.
@@ -60,7 +60,7 @@ You should see that under the default configuration, `iommu.strict` is enabled,
6060
...
6161
```
6262

63-
### Establish a baseline on Arm Neoverse bare-metal instances
63+
## Establish a baseline on Arm Neoverse bare-metal instances
6464

6565
{{% notice Note %}}
6666
To mirror a typical Tomcat deployment and simplify tuning, keep **8 CPU cores online** and set the remaining cores offline. Adjust the CPU range to match your instance. The example below assumes 192 CPUs (as on AWS `c8g.metal-48xl`).
@@ -115,7 +115,7 @@ To mirror a typical Tomcat deployment and simplify tuning, keep **8 CPU cores on
115115
Transfer/sec: 129.90MB
116116
```
117117

118-
### Disable access logging
118+
## Disable access logging
119119

120120
Disabling access logs removes I/O overhead during benchmarking.
121121

@@ -157,7 +157,7 @@ Disabling access logs removes I/O overhead during benchmarking.
157157
Transfer/sec: 144.36MB
158158
```
159159

160-
### Set optimal thread counts
160+
## Set optimal thread counts
161161

162162
To minimize contention and context switching, align Tomcat’s CPU‑intensive thread count with available CPU cores.
163163

content/learning-paths/servers-and-cloud-computing/tune-network-workloads-on-bare-metal/4_local-numa.md

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,5 @@
11
---
2-
title: NUMA-based Tuning
2+
title: NUMA-based tuning
33
weight: 5
44

55
### FIXED, DO NOT MODIFY
@@ -10,7 +10,7 @@ layout: learningpathall
1010

1111
In this section, you configure local NUMA and assess the performance uplift achieved through tuning. Cross‑NUMA data transfers generally incur higher latency than intra‑NUMA transfers, so Tomcat should be deployed on the NUMA node where the network interface resides to reduce cross‑node memory traffic and improve throughput and latency.
1212

13-
### Configure local NUMA
13+
## Configure local NUMA
1414

1515
Check NUMA topology and relative latencies:
1616

@@ -78,7 +78,7 @@ NUMA:
7878
...
7979
```
8080

81-
### Validate performance after NUMA tuning
81+
## Validate performance after NUMA tuning
8282

8383
Restart Tomcat on the Arm Neoverse bare‑metal instance:
8484

content/learning-paths/servers-and-cloud-computing/tune-network-workloads-on-bare-metal/5_iommu.md

Lines changed: 21 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -6,28 +6,30 @@ weight: 6
66
layout: learningpathall
77
---
88

9-
## IOMMU-based tuning
10-
IOMMU (Input-Output Memory Management Unit) is a hardware feature that manages how I/O devices access memory.
11-
In cloud environments, SmartNICs are typically used to offload the IOMMU workload. On bare-metal systems, to align performance with the cloud, you should disable `iommu.strict` and enable `iommu.passthrough` settings to achieve better performance.
9+
## Tune with IOMMU
1210

13-
### Setting IOMMU
11+
IOMMU (Input–Output Memory Management Unit) controls how I/O devices access memory. In many cloud environments, SmartNICs offload IOMMU-related work. On Arm Neoverse bare‑metal systems, you can often improve Tomcat networking performance by **disabling strict mode** and **enabling passthrough** (setting `iommu.strict=0` and `iommu.passthrough=1`).
1412

15-
1. To configure the IOMMU setting, use a text editor to modify the `grub` file by adding or updating the `GRUB_CMDLINE_LINUX` configuration.
13+
## Configure IOMMU settings
14+
15+
Edit the GRUB configuration to set IOMMU to passthrough and disable strict invalidations:
1616

1717
```bash
1818
sudo vi /etc/default/grub
1919
```
20-
then add or update:
20+
Add or update the kernel command line:
2121
```bash
2222
GRUB_CMDLINE_LINUX="iommu.strict=0 iommu.passthrough=1"
2323
```
2424

25-
2. Update GRUB and reboot to apply the settings.
25+
Update GRUB and reboot to apply the settings:
26+
2627
```bash
2728
sudo update-grub && sudo reboot
2829
```
2930

30-
3. Verify if the settings have been successfully applied:
31+
Verify that IOMMU is in passthrough mode after reboot:
32+
3133
```bash
3234
sudo dmesg | grep iommu
3335
```
@@ -38,24 +40,30 @@ You will notice that the IOMMU is already in passthrough mode:
3840
[ 0.855658] iommu: Default domain type: Passthrough (set via kernel command line)
3941
```
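
Editor's sketch, not part of the commit: besides reading `dmesg`, you can confirm that both parameters reached the running kernel by inspecting `/proc/cmdline`. The helper name `check_iommu_cmdline` is hypothetical, and the sketch assumes the parameters appear as separate space-delimited tokens, as in the `GRUB_CMDLINE_LINUX` line shown earlier.

```bash
# Hypothetical helper: succeed only if the given kernel command line
# contains both IOMMU settings used in this section.
check_iommu_cmdline() {
  case " $1 " in
    *" iommu.strict=0 "*) ;;
    *) return 1 ;;
  esac
  case " $1 " in
    *" iommu.passthrough=1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# On the target machine, check the live kernel command line.
if [ -r /proc/cmdline ]; then
  if check_iommu_cmdline "$(cat /proc/cmdline)"; then
    echo "IOMMU: strict mode disabled and passthrough enabled"
  else
    echo "IOMMU: expected parameters not found on the kernel command line"
  fi
fi
```

If the second message appears after a reboot, the GRUB change was probably not applied; re-check `/etc/default/grub` and rerun `sudo update-grub`.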
4042

41-
### The result after configuring IOMMU
43+
## Validate performance after IOMMU tuning
44+
45+
Prepare the Arm Neoverse bare‑metal server (ensure your `${net}` interface variable is set; if not, set it to your NIC name, for example `net=enP11p4s0`), align queues, and restart Tomcat:
4246

43-
1. Run the following command on the Arm Neoverse bare-metal where `Tomcat` is on:
4447
```bash
4548
for no in {96..103}; do sudo bash -c "echo 1 > /sys/devices/system/cpu/cpu${no}/online"; done
4649
for no in {0..95} {104..191}; do sudo bash -c "echo 0 > /sys/devices/system/cpu/cpu${no}/online"; done
47-
net=$(ls /sys/class/net/ | grep 'en')
50+
51+
# Ensure NIC queue count matches the number of online CPUs (example: 8)
4852
sudo ethtool -L ${net} combined 8
53+
54+
# Restart Tomcat with a higher file‑descriptor limit
4955
~/apache-tomcat-11.0.10/bin/shutdown.sh 2>/dev/null
5056
ulimit -n 65535 && ~/apache-tomcat-11.0.10/bin/startup.sh
5157
```
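
Editor's sketch, not part of the commit: the commands above keep CPUs 96 to 103 online and set 8 combined NIC queues. To confirm that the two numbers agree on your own instance, you can count the CPUs listed in `/sys/devices/system/cpu/online`. The helper name `count_cpus` is hypothetical; it assumes the kernel's usual list format, such as `96-103` or `0-3,8`.

```bash
# Hypothetical helper: count the CPUs in a kernel CPU list
# such as "96-103" or "0-3,8,10-11".
count_cpus() {
  local list="$1" total=0 part start end
  local IFS=','
  for part in $list; do
    start="${part%-*}"   # text before the dash (or the whole entry)
    end="${part#*-}"     # text after the dash (or the whole entry)
    total=$(( total + end - start + 1 ))
  done
  echo "$total"
}

# On the target machine, report how many CPUs are currently online.
if [ -r /sys/devices/system/cpu/online ]; then
  echo "Online CPUs: $(count_cpus "$(cat /sys/devices/system/cpu/online)")"
fi
```

The reported count should match the value passed to `ethtool -L ${net} combined`, which is 8 in this example.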
5258

53-
2. Run run `wrk2` on the `x86_64` bare-metal instance as shown:
59+
Run `wrk2` on the `x86_64` benchmarking client to measure throughput and latency:
60+
5461
```bash
5562
ulimit -n 65535 && wrk -c1280 -t128 -R500000 -d60 http://${tomcat_ip}:8080/examples/servlets/servlet/HelloWorldExample
5663
```
5764

58-
The result after iommu tuning should look like:
65+
Sample results after IOMMU tuning:
66+
5967
```output
6068
Thread Stats Avg Stdev Max +/- Stdev
6169
Latency 4.92s 2.49s 10.08s 62.27%

content/learning-paths/servers-and-cloud-computing/tune-network-workloads-on-bare-metal/6_summary.md

Lines changed: 20 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -6,15 +6,26 @@ weight: 7
66
layout: learningpathall
77
---
88

9-
## Summary
10-
You will observe that each tuning method can bring significant performance improvements while running Tomcat as shown in the results summary below:
9+
## Review the results: Tomcat performance tuning on Arm Neoverse
1110

12-
| Method | Requests/sec | Latency-Avg |
13-
|:----------------|:-------------|:------------|
14-
| Baseline | 357835.75 | 10.26s |
15-
| NIC-Queue | 378782.37 | 8.35s |
16-
| NUMA-Local | 363744.39 | 9.41s |
17-
| IOMMU | 428628.50 | 4.92s |
11+
Each tuning technique delivered measurable gains for the Tomcat HTTP benchmark on an Arm Neoverse bare‑metal server (workload generated with **wrk2**). The table summarizes requests per second and average latency at each stage.
1812

13+
| Method | Requests/sec | Avg latency (s) |
14+
|:-------------|-------------:|----------------:|
15+
| Baseline | 357,835.75 | 10.26 |
16+
| NIC queues | 378,782.37 | 8.35 |
17+
| NUMA-local | 363,744.39 | 9.41 |
18+
| IOMMU | 428,628.50 | 4.92 |
1919

20-
The same tuning methods can be applied as general guidance to help optimize and tune other network-based workloads.
20+
### Key takeaways
21+
22+
- **IOMMU passthrough** produced the largest throughput gain: **+19.8%** vs. baseline, with a **52.0%** drop in average latency.
23+
- **NIC queue count alignment** improved throughput by **+5.9%** and reduced average latency by **18.6%**.
24+
- **NUMA locality** yielded a smaller but consistent benefit: **+1.7%** throughput and **8.3%** lower average latency.
25+
- Together, these techniques (IOMMU tuning, NIC queue optimization, and NUMA-aware placement) form a practical checklist for improving network workload performance on Arm Neoverse.
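
Editor's sketch, not part of the commit: the percentages in the takeaways follow from the table as (tuned - baseline) / baseline. The helper name `pct_change` is hypothetical; the input values are copied from the table.

```bash
# Hypothetical helper: percentage change from a baseline value to a tuned value.
pct_change() {
  LC_ALL=C awk -v base="$1" -v tuned="$2" \
    'BEGIN { printf "%+.1f\n", (tuned - base) / base * 100 }'
}

pct_change 357835.75 428628.50   # IOMMU requests/sec vs. baseline
pct_change 10.26 4.92            # IOMMU average latency vs. baseline
pct_change 357835.75 378782.37   # NIC queues requests/sec vs. baseline
pct_change 357835.75 363744.39   # NUMA-local requests/sec vs. baseline
```

A negative result for latency means the tuned run was faster than the baseline.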
26+
27+
### Next steps
28+
29+
- Apply the same tuning pattern to other HTTP services and microservices (for example, NGINX, Envoy, or custom Jetty/Tomcat apps).
30+
- Re‑evaluate queue counts, CPU pinning, and IOMMU mode as you scale cores, update kernels, or change NIC drivers/firmware.
31+
- Track end‑to‑end SLOs (p95/p99 latency and error rates) in addition to average metrics to ensure sustained gains under real traffic.

content/learning-paths/servers-and-cloud-computing/tune-network-workloads-on-bare-metal/_index.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -14,7 +14,7 @@ learning_objectives:
1414

1515
prerequisites:
1616
- An Arm Neoverse-based bare-metal server running Ubuntu 24.04 to run Apache Tomcat (this Learning Path was tested with an AWS c8g.metal-48xl instance)
17-
- Access to an x86_64 bare-metal server running Ubuntu 24.04 to run wrk2
17+
- Access to an x86_64 bare-metal server running Ubuntu 24.04 to run `wrk2`
1818
- Basic familiarity with Java applications
1919

2020
author: Ying Yu, Ker Liu, Rui Chang
