Commit 7740452

Merge pull request #1840 from kieranhejmadi01/disk-io-benchmark
microbenchmarking-disk-performance-with-FIO-LP
2 parents a26349d + ac65f88 commit 7740452

File tree: 9 files changed, +394 −0 lines changed
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
---
title: Microbenchmark Storage Performance with Fio

minutes_to_complete: 30

who_is_this_for: Cloud developers who want to optimise the storage cost or performance of an application, and developers who want to uncover potential storage-bound bottlenecks when migrating an application to a different platform.

learning_objectives:
    - Understand the flow of data to and from storage devices
    - Use basic observability utilities such as iostat, iotop and pidstat
    - Understand how to run fio to microbenchmark a block storage device

prerequisites:
    - Access to an Arm-based server
    - A basic understanding of Linux

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
    - Neoverse
tools_software_languages:
    - bash
    - Runbook
operatingsystems:
    - Linux


further_reading:
    - resource:
        title: Fio documentation
        link: https://fio.readthedocs.io/en/latest/fio_doc.html#running-fio
        type: documentation


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps"         # Always the same, html page title.
layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
---
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
---
title: Characterising a Workload
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Basic Characteristics

The basic attributes of a given workload are the following:

- IOPS
- I/O size
- Throughput
- Read-to-write ratio
- Random vs sequential access

There are many more characteristics to observe, such as latency, but since this is an introductory topic we will mostly stick to the high-level metrics listed above.

## Running an Example Workload

Connect to an Arm-based cloud instance. As an example workload, we will use the media manipulation tool FFmpeg on an AWS `t4g.medium` instance.

First install the prerequisite tools.

```bash
sudo apt update
sudo apt install ffmpeg iotop -y
```

Download the popular reference video for transcoding, `BigBuckBunny.mp4`, which is available under the [Creative Commons 3.0 License](https://creativecommons.org/licenses/by/3.0/).

```bash
cd ~
mkdir src
cd src
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
```

Run the following command to begin transcoding the video and audio using the `H.264` and `AAC` codecs respectively. We use the `-flush_packets 1` flag to write each chunk of video back to storage from memory.

```bash
ffmpeg -i BigBuckBunny.mp4 -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 128k -flush_packets 1 output_video.mp4
```

### Observing Disk Usage

Whilst the transcoding is running, we can use the `pidstat` command to see the disk statistics of that specific process.

```bash
pidstat -d -p $(pgrep ffmpeg) 1
```

Since this example `151 MB` video fits within memory, we observe no `kB_rd/s` for the storage device after the initial read. However, since we are flushing to storage, we observe a steady stream of writes averaging roughly 275 `kB_wr/s`.

```output
Linux 6.8.0-1024-aws (ip-10-248-213-118)        04/15/25        _aarch64_       (2 CPU)

10:01:24      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
10:01:25     1000     24250      0.00    276.00      0.00       0  ffmpeg
10:01:26     1000     24250      0.00    256.00      0.00       0  ffmpeg
10:01:27     1000     24250      0.00    216.00      0.00       0  ffmpeg
10:01:28     1000     24250      0.00    184.00      0.00       0  ffmpeg
10:01:29     1000     24250      0.00    424.00      0.00       0  ffmpeg
10:01:30     1000     24250      0.00    312.00      0.00       0  ffmpeg
10:01:31     1000     24250      0.00    372.00      0.00       0  ffmpeg
10:01:32     1000     24250      0.00    344.00      0.00       0  ffmpeg
```
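
If you prefer not to target a single PID, a simple variation (a sketch using the same `pidstat` tool) is to report every task that performed disk I/O in each one-second interval:

```bash
# Report disk statistics every second for all tasks that performed I/O
pidstat -d 1
```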

{{% notice Please Note%}}
In this simple example, since we are interacting with a file on the mounted filesystem, we are also observing the behaviour of the filesystem.
{{% /notice %}}

Of course, there may be other processes or background services writing to this disk. We can use the `iotop` command to inspect them. As per the output below, the `ffmpeg` process has the greatest disk utilisation.

```bash
sudo iotop
```

```output
Total DISK READ:         0.00 B/s | Total DISK WRITE:       332.11 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE>    COMMAND
  24891 be/4  ubuntu     0.00 B/s  332.11 K/s ffmpeg -i BigBuckBunny.mp4 -c:v ~ts 1 output_video.mp4 [mux0:mp4]
      1 be/4  root       0.00 B/s    0.00 B/s systemd --system --deserialize=74
      2 be/4  root       0.00 B/s    0.00 B/s [kthreadd]
```

Using the input/output statistics command (`iostat`), we can observe the system-wide metrics for the `nvme0n1` drive. Please note that we are using a snapshot of this workload; more accurate characteristics can be obtained by measuring the distribution over the lifetime of a workload.

```bash
watch -n 0.1 iostat -z nvme0n1
```

You should see output similar to that below.

```output
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme0n1           3.81        31.63       217.08         0.00     831846    5709210          0
```

To observe more detailed metrics, we can run `iostat` with the `-x` option.

```bash
iostat -xz nvme0n1
```

```output
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1          0.66     29.64     0.24  26.27    0.73    44.80    2.92    203.88     3.17  52.01    2.16    69.70    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.01   0.15
```

### Basic Characteristics of our Example Workload

This is a simple transcoding workload with flushed writes, where most data is processed and held in memory. Disk I/O is minimal, with an IOPS of just 3.81, low throughput (248.71 kB/s), and an average I/O depth of 0.01, all reflected in very low disk utilisation. The 52% write merge rate and low latencies further suggest sequential, infrequent disk access, reinforcing that the workload is primarily memory-bound.

| Metric             | Calculation Explanation                                                                                       | Value         |
|--------------------|---------------------------------------------------------------------------------------------------------------|---------------|
| IOPS               | Taken directly from the `tps` (transfers per second) field                                                     | 3.81          |
| Throughput (Read)  | From monitoring tool output                                                                                     | 31.63 kB/s    |
| Throughput (Write) | From monitoring tool output                                                                                     | 217.08 kB/s   |
| Throughput (Total) | Sum of read and write throughput                                                                                | 248.71 kB/s   |
| Avg I/O Size       | Total throughput divided by IOPS: 248.71 / 3.81                                                                 | ≈ 65.3 kB     |
| Read Ratio         | Read throughput ÷ total throughput: 31.63 / 248.71                                                              | ~13%          |
| Write Ratio        | Write throughput ÷ total throughput: 217.08 / 248.71                                                            | ~87%          |
| I/O Depth          | Taken directly from `aqu-sz` (average number of in-flight I/Os)                                                 | 0.01          |
| Access Pattern     | Based on merge rates and low wait times. 52% of writes were merged (`wrqm/s` = 3.17, `w/s` = 2.92) → suggests mostly sequential access | Sequential-ish (52.01% merged) |
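
The derived values in the table can be recomputed directly from the `iostat` fields. The snippet below is a small sketch using the example numbers from this run; substitute your own `kB_read/s`, `kB_wrtn/s` and `tps` values.

```bash
# Recompute the derived metrics from the iostat snapshot above
awk 'BEGIN {
  read_kbs = 31.63; write_kbs = 217.08; tps = 3.81   # example values from this run
  total = read_kbs + write_kbs
  printf "Total throughput: %.2f kB/s\n", total
  printf "Avg I/O size:     %.1f kB\n",   total / tps
  printf "Read ratio:       %.0f%%\n",    100 * read_kbs / total
  printf "Write ratio:      %.0f%%\n",    100 * write_kbs / total
}'
```
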
{{% notice Please Note%}}
If you have access to the workload's source code, the expected access patterns can be observed more easily.
{{% /notice %}}
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
---
title: Fundamentals of Storage Systems
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

The ideal amount of storage activity for your system is zero: all of your application's data and instructions are available in memory or caches, with no reads or writes to a spinning hard-disk drive (HDD) or solid-state drive (SSD) required. However, due to physical capacity limitations, data volatility and the need to store large amounts of data, many applications require frequent access to storage media.

## High-Level Flow of Data

The diagram below is a high-level overview of how data is written to or read from a storage device. It illustrates a multi-disk I/O architecture in which each disk (Disk 1 to Disk N) has an I/O queue and an optional disk cache, communicating with a central CPU via a disk controller. Memory is not explicitly shown but sits between the CPU and storage, offering fast access times at the cost of volatility. File systems, though not depicted, operate at the OS/kernel level to handle file access and metadata, and offer a friendly way to interact with storage through files and directories.

![disk i/o](./diskio.jpeg)


## Key Terms

#### Sectors and Blocks

Sectors are the basic physical units on a storage device. For instance, traditional hard drives typically use a sector size of 512 bytes, while many modern disks use 4096 bytes (4K sectors) to improve error correction and efficiency.

Blocks are the logical grouping of one or more sectors used by filesystems for data organisation. A common filesystem block size is 4096 bytes, meaning that each block might consist of 8 of the 512-byte sectors, or simply map directly to a 4096-byte physical sector if the disk supports it.
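
If you want to check these values on your own system, the kernel exposes them through standard utilities. The commands below are a sketch; the device and partition names (for example `/dev/nvme0n1p1`) will differ on your instance.

```bash
# Physical and logical sector sizes for each block device
lsblk -o NAME,PHY-SEC,LOG-SEC

# Filesystem block size used on the root partition
sudo blockdev --getbsz /dev/nvme0n1p1
```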

#### Input/Output Operations Per Second (IOPS)

IOPS is a measure of how many random read or write requests your storage system can service per second. It is worth noting that IOPS can vary with block size, depending on the storage medium (e.g., flash drives). Importantly, traditional hard disk drives (HDDs) often don't have a specified IOPS figure; for example, the IOPS value for an HDD volume on AWS is not shown.

![iops_hdd](./IOPS.png)

#### Throughput / Bandwidth

Throughput is the achieved data transfer rate, normally in MB/s, while bandwidth is the maximum amount of data that a connection can transfer. IOPS multiplied by block size can be used to estimate the storage throughput of your application.
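
As a quick worked check using the numbers from the previous section (illustrative values only), roughly 3.81 IOPS at an average I/O size of about 65.3 kB gives the observed throughput:

```bash
# Throughput ≈ IOPS x average I/O size
awk 'BEGIN { printf "%.1f kB/s\n", 3.81 * 65.3 }'   # ≈ 248.8 kB/s
```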

#### Queue Depth

Queue depth refers to the number of simultaneous I/O operations that can be pending on a device. Consumer SSDs typically have a queue depth in the range of 32 to 64, whereas enterprise-class NVMe drives can support hundreds or even thousands of concurrent requests per queue. This parameter affects how much the device can parallelise operations and therefore influences overall I/O performance.
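
On Linux you can inspect the request queue limits that the kernel exposes for a block device through sysfs. This is a sketch; the device name will vary and the values depend on the driver:

```bash
# Maximum number of requests the block layer will queue for the device
cat /sys/block/nvme0n1/queue/nr_requests

# Largest I/O size the kernel will issue to the device, in KiB
cat /sys/block/nvme0n1/queue/max_sectors_kb
```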

#### I/O Scheduler and Engine

The I/O engine is the software component responsible for managing I/O requests between applications and the storage subsystem. For example, in Linux, the kernel's block I/O scheduler acts as an I/O engine by queuing and dispatching requests to device drivers. Schedulers use multiple queues to reorder requests for optimal disk access.

In benchmarking tools like fio, you might select I/O engines such as `sync` (synchronous I/O), `libaio` (the Linux native asynchronous I/O library), or `io_uring` (which leverages newer Linux kernel capabilities for asynchronous I/O).
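
To see which kernel scheduler is currently active for a device, you can read its sysfs queue settings (a sketch; the device name will vary, and the scheduler shown in brackets differs between kernels):

```bash
# The scheduler in square brackets is the active one, e.g. [none] mq-deadline
cat /sys/block/nvme0n1/queue/scheduler
```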

#### I/O Wait

I/O wait is the time a CPU core spends idle while waiting for outstanding I/O requests to complete, in other words the cost of storage accesses as perceived from the CPU's perspective.
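
A quick way to observe this system-wide is the `%iowait` column reported by `iostat` (from the `sysstat` package used earlier); the sketch below refreshes the CPU utilisation report every second:

```bash
# CPU utilisation report, including the %iowait column, refreshed every second
iostat -c 1
```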
Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
---
title: Using FIO
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup and Install Fio

I will be using the same `t4g.medium` instance from the previous section with 2 different types of SSD-based block storage device, as shown in the console screenshot below. Both block devices have the same 8 GiB capacity, but `io2` is a provisioned IOPS volume geared towards sustained performance, as opposed to the general-purpose SSD `gp2`. In this section we want to observe the real-world performance of each volume for our workload so that it can inform our selection.

![EBS](./EBS.png)

Flexible I/O (fio) is a command-line tool to generate a synthetic workload with specific I/O characteristics. This serves as a simpler alternative to full record-and-replay testing. Fio is available through most Linux distribution package managers; please refer to the [documentation](https://github.com/axboe/fio) for binary package availability.

```bash
sudo apt update
sudo apt install fio -y
```

Confirm the installation with the following command.

```bash
fio --version
```

```output
fio-3.36
```

## Locate Device

`Fio` allows us to microbenchmark either a raw block device or a mounted filesystem. Use the disk free (`df`) command to confirm that the EBS volumes are not mounted. Writing to drives that hold critical information may cause data loss; hence we write to blank, unmounted block storage devices.
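
For example (a sketch; your filesystems and mount points will differ), the mounted filesystems can be listed with:

```bash
# List mounted filesystems and their usage; the blank EBS volumes should not appear
df -h
```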

Use the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`). The number immediately following `nvme`, e.g. `nvme0`, indicates a physically separate device. `nvme1n1` corresponds to the faster `io2` block device and `nvme2n1` corresponds to the slower `gp2` block device.

```bash
lsblk -e 7
```

```output
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme1n1      259:0    0    8G  0 disk
nvme0n1      259:1    0    8G  0 disk
├─nvme0n1p1  259:3    0    7G  0 part /
├─nvme0n1p15 259:4    0   99M  0 part /boot/efi
└─nvme0n1p16 259:5    0  923M  0 part /boot
nvme2n1      259:2    0    8G  0 disk
```

{{% notice Please Note%}}
If you have more than one block volume attached to an instance, the `sudo nvme list` command from the `nvme-cli` package can be used to differentiate between volumes.
{{% /notice %}}
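
As a sketch, installing and running that tool on Ubuntu looks like the following; the model, serial and size columns in the listing will differ on your instance.

```bash
sudo apt install nvme-cli -y
sudo nvme list
```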

## Generating a Synthetic Workload

Let's say we want to simulate a fictional logging application with the following characteristics, observed using the tools from the previous section.

{{% notice Workload%}}
The logging workload has light sequential read and write characteristics. The steady write throughput per thread is 5 MB/s, with 83% of operations being writes. There are infrequent bursts of reads lasting approximately 5 seconds, operating at up to 16 MB/s per thread. The workload can scale the infrequent reads and steady writes to use up to 16 threads each. The block sizes for the writes and reads are 64 KiB and 256 KiB respectively (as opposed to the standard 4 KiB page size).

Further, the application is latency sensitive and, given that it holds critical information, needs to write directly to non-volatile storage through direct I/O.
{{% /notice %}}

The fio tool uses simple configuration files, called job files, to describe the characteristics of your synthetic workload. Parameters under the `[global]` section are shared among jobs. In the example below, we have created 2 jobs to represent the steady writes and the infrequent read bursts. Please refer to the official [documentation](https://fio.readthedocs.io/en/latest/fio_doc.html#job-file-format) for more details.

Copy and paste the configuration below into 2 files named `nvme<x>.fio`. Replace `<x>` with the block devices we are comparing and adjust the `filename` parameter accordingly.

```ini
; -- start job file --
[global]
ioengine=libaio
direct=1                ; write directly to the drive, bypassing the page cache
time_based
runtime=30
group_reporting=1
log_avg_msec=1000
rate=16m,5m             ; limit to 16 MB/s read and 5 MB/s write per job
numjobs=${NUM_JOBS}     ; set at the command line
iodepth=${IO_DEPTH}     ; set at the command line
filename=/dev/nvme1n1   ; or /dev/nvme2n1

[steady_write]
name=steady_write
rw=write                ; sequential write
bs=64k                  ; block size of 64 KiB (default is 4 KiB)

[burst_read]
name=burst_read
rw=read                 ; sequential read
bs=256k                 ; block size of 256 KiB (default is 4 KiB)
startdelay=10           ; simulate infrequent reads (5 seconds out of 30)
runtime=5
; -- end job file --
```
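
Before targeting a real device, it can be worth checking that fio parses the job file and expands the environment variables as expected. A minimal sketch, assuming the file is named `nvme1.fio`:

```bash
# Parse and validate the job file without submitting any I/O
sudo NUM_JOBS=16 IO_DEPTH=64 fio --parse-only nvme1.fio && echo "job file OK"
```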

Run the following commands to run each test back to back.

```bash
sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme1.fio
```

Then

```bash
sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme2.fio
```

### Interpreting Results

The final terminal output from both runs is shown below.

```output
nvme1:

Run status group 0 (all jobs):
   READ: bw=118MiB/s (124MB/s), 118MiB/s-118MiB/s (124MB/s-124MB/s), io=629MiB (660MB), run=5324-5324msec
  WRITE: bw=80.0MiB/s (83.9MB/s), 80.0MiB/s-80.0MiB/s (83.9MB/s-83.9MB/s), io=2400MiB (2517MB), run=30006-30006msec

Disk stats (read/write):
  nvme1n1: ios=2663/38225, sectors=1294480/4892800, merge=0/0, ticks=148524/454840, in_queue=603364, util=62.19%

nvme2:

Run status group 0 (all jobs):
   READ: bw=85.6MiB/s (89.8MB/s), 85.6MiB/s-85.6MiB/s (89.8MB/s-89.8MB/s), io=456MiB (478MB), run=5322-5322msec
  WRITE: bw=60.3MiB/s (63.2MB/s), 60.3MiB/s-60.3MiB/s (63.2MB/s-63.2MB/s), io=1816MiB (1904MB), run=30119-30119msec

Disk stats (read/write):
  nvme2n1: ios=1872/28855, sectors=935472/3693440, merge=0/0, ticks=159753/1025104, in_queue=1184857, util=89.83%
```

Here we can see that the faster `io2` block storage (`nvme1`) is able to meet the throughput requirement of 80 MB/s for steady writes when all 16 write threads are running (5 MB/s per thread). However, `gp2` saturates at 60.3 MiB/s with the SSD at nearly 90% utilisation.

We are told that the fictional logging application is sensitive to operation latency. The output below highlights that over a quarter (~26%) of operations have a latency above 1 second on `nvme2`, compared to around 6% on `nvme1`.

```output
nvme2:

  lat (usec)   : 10=0.01%, 500=1.53%, 750=5.13%, 1000=7.55%
  lat (msec)   : 2=29.49%, 4=0.89%, 10=0.09%, 20=0.02%, 50=0.21%
  lat (msec)   : 100=0.56%, 250=1.84%, 500=6.39%, 750=9.76%, 1000=10.17%
  lat (msec)   : 2000=19.59%, >=2000=6.77%

nvme1:

  lat (usec)   : 750=0.44%, 1000=0.41%
  lat (msec)   : 2=62.63%, 4=1.12%, 10=0.34%, 20=1.61%, 50=3.91%
  lat (msec)   : 100=2.34%, 250=5.91%, 500=8.46%, 750=4.33%, 1000=2.50%
  lat (msec)   : 2000=3.62%, >=2000=2.38%
```

The insights above suggest that the provisioned IOPS `io2` volume is more suitable than the general-purpose `gp2` storage for meeting the requirements of our logging application.

{{% notice Tip%}}
If the text output is hard to follow, you can use the `fio2gnuplot` tool to plot the data graphically, or use the visualisations available from your cloud service provider's dashboard. See the image below for an example.

![plot](./visualisations.png)
{{% /notice %}}
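
As a sketch of the plotting route (assuming gnuplot and the `fio2gnuplot` helper that ships with fio are installed), you can ask fio to emit per-interval bandwidth logs and then render them; check `fio2gnuplot --help` for the exact flags in your version.

```bash
# Re-run the job while writing *_bw.log bandwidth logs, then plot them
sudo NUM_JOBS=16 IO_DEPTH=64 fio --write_bw_log=nvme1 nvme1.fio
fio2gnuplot -b -g    # -b selects the *_bw.log files, -g generates the gnuplot graphs
```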

The insights gathered by microbenchmarking with fio above can lead to more informed decisions about which block storage to connect to your Arm-based instance.