---
title: Microbenchmark Storage Performance with Fio

minutes_to_complete: 30

who_is_this_for: Cloud developers who want to optimise the storage cost or performance of their application, and developers who want to uncover potential storage-bound bottlenecks or behaviour changes when migrating an application to a different platform.

learning_objectives:
- Understand the flow of data for storage devices
- Use basic observability utilities such as iostat, iotop and pidstat
- Understand how to run fio for microbenchmarking a block storage device

prerequisites:
- Access to an Arm-based server
- Basic understanding of Linux

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- bash
- Runbook
operatingsystems:
- Linux


further_reading:
- resource:
title: Fio documentation
link: https://fio.readthedocs.io/en/latest/fio_doc.html#running-fio
type: documentation

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Characterising a Workload
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Basic Characteristics

The basic attributes of a given workload are the following:

- IOPS
- I/O Size
- Throughput
- Read to Write Ratio
- Random vs Sequential access

There are many more characteristics to observe, such as latency, but since this is an introductory topic we will mostly stick to the high-level metrics listed above.

## Running an Example Workload

Connect to an Arm-based cloud instance. As an example workload, we will use the media manipulation tool FFmpeg on an AWS `t4g.medium` instance.

First, install the prerequisite tools.

```bash
sudo apt update
sudo apt install ffmpeg iotop -y
```

Download the popular transcoding reference video `BigBuckBunny.mp4`, which is available under the [Creative Commons 3.0 License](https://creativecommons.org/licenses/by/3.0/).

```bash
cd ~
mkdir src
cd src
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
```

Run the following command to begin transcoding the video and audio using the `H.264` and `AAC` codecs respectively. We use the `-flush_packets 1` flag so that each chunk of video is written back to storage from memory.

```bash
ffmpeg -i BigBuckBunny.mp4 -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 128k -flush_packets 1 output_video.mp4
```

### Observing Disk Usage

Whilst the transcoding is running, we can use the `pidstat` command to see the disk statistics of that specific process.

```bash
pidstat -d -p $(pgrep ffmpeg) 1
```
Since this 151 MB example video fits within memory, we observe no `kB_rd/s` for the storage device after the initial read. However, since we are flushing to storage, we observe a periodic ~275 `kB_wr/s`.

```output
Linux 6.8.0-1024-aws (ip-10-248-213-118) 04/15/25 _aarch64_ (2 CPU)

10:01:24 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
10:01:25 1000 24250 0.00 276.00 0.00 0 ffmpeg
10:01:26 1000 24250 0.00 256.00 0.00 0 ffmpeg
10:01:27 1000 24250 0.00 216.00 0.00 0 ffmpeg
10:01:28 1000 24250 0.00 184.00 0.00 0 ffmpeg
10:01:29 1000 24250 0.00 424.00 0.00 0 ffmpeg
10:01:30 1000 24250 0.00 312.00 0.00 0 ffmpeg
10:01:31 1000 24250 0.00 372.00 0.00 0 ffmpeg
10:01:32 1000 24250 0.00 344.00 0.00 0 ffmpeg
```

{{% notice Note %}}
In this simple example, since we are interacting with a file on the mounted filesystem, we are also observing the behaviour of the filesystem.
{{% /notice %}}

Of course, there may be other processes or background services writing to this disk. We can use the `iotop` command to inspect per-process disk usage. As the output below shows, the `ffmpeg` process has the greatest disk utilisation.

```bash
sudo iotop
```

```output
Total DISK READ: 0.00 B/s | Total DISK WRITE: 332.11 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE> COMMAND
24891 be/4 ubuntu 0.00 B/s 332.11 K/s ffmpeg -i BigBuckBunny.mp4 -c:v ~ts 1 output_video.mp4 [mux0:mp4]
1 be/4 root 0.00 B/s 0.00 B/s systemd --system --deserialize=74
2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
```

Using the input/output statistics command (`iostat`), we can observe the system-wide metrics for the `nvme0n1` drive. Note that we are looking at a snapshot of this workload; more accurate characteristics can be obtained by measuring the distribution over the workload's full run.

```bash
watch -n 0.1 iostat -z nvme0n1
```
You should see output similar to that below.

```output
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
nvme0n1 3.81 31.63 217.08 0.00 831846 5709210 0
```

To observe more detailed metrics, we can run `iostat` with the `-x` option.

```bash
iostat -xz nvme0n1
```

```output
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 0.66 29.64 0.24 26.27 0.73 44.80 2.92 203.88 3.17 52.01 2.16 69.70 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.15
```

### Basic Characteristics of our Example Workload

This is a simple transcoding workload with flushed writes, where most data is processed and held in memory. Disk I/O is minimal, with an IOPS of just 3.81, low throughput (248.71 kB/s), and an average I/O depth of 0.01, all reflected in very low disk utilisation. The 52% write merge rate and low latencies further suggest sequential, infrequent disk access, reinforcing that the workload is primarily memory-bound.


| Metric | Calculation Explanation | Value |
|--------------------|-------------------------------------------------------------------------------------------------------------|---------------|
| IOPS | Taken directly from the `tps` (transfers per second) field | 3.81 |
| Throughput (Read)  | Taken from the `kB_read/s` field of `iostat`                                                                  | 31.63 kB/s    |
| Throughput (Write) | Taken from the `kB_wrtn/s` field of `iostat`                                                                  | 217.08 kB/s   |
| Throughput (Total) | Sum of read and write throughput | 248.71 kB/s |
| Avg I/O Size | Total throughput divided by IOPS: 248.71 / 3.81 | ≈ 65.3 KB |
| Read Ratio | Read throughput ÷ total throughput: 31.63 / 248.71 | ~13% |
| Write Ratio | Write throughput ÷ total throughput: 217.08 / 248.71 | ~87% |
| IO Depth | Taken directly from `aqu-sz` (average number of in-flight I/Os) | 0.01 |
| Access Pattern | Based on cache hits, merge rates, and low wait times. 52% of writes were merged (`wrqm/s` = 3.17, `w/s` = 2.92) → suggests mostly sequential access | Sequential-ish (52.01% merged) |
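
As a quick sanity check, the derived rows in this table can be reproduced with shell arithmetic. Below is a minimal sketch using `awk`, with the `iostat` figures from this run hard-coded; substitute your own measurements.

```bash
# Derive the summary metrics from the iostat snapshot above
awk 'BEGIN {
  tps  = 3.81;    # IOPS (transfers per second)
  rkbs = 31.63;   # read throughput, kB/s
  wkbs = 217.08;  # write throughput, kB/s
  total = rkbs + wkbs;
  printf "Total throughput: %.2f kB/s\n", total;
  printf "Avg I/O size:     %.1f kB\n",  total / tps;
  printf "Read ratio:       %.0f%%\n",   100 * rkbs / total;
  printf "Write ratio:      %.0f%%\n",   100 * wkbs / total;
}'
```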


{{% notice Note %}}
If you have access to the workload's source code, the expected access patterns can be observed more easily.
{{% /notice %}}
---
title: Fundamentals of Storage Systems
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

The ideal amount of storage activity for a system is zero: all application data and instructions are available in memory or caches, with no reads or writes to a spinning hard-disk drive or solid-state drive (SSD) required. However, due to physical capacity limitations, data volatility, and the need to store large amounts of data, many applications require frequent access to storage media.

## High-Level Flow of Data

The diagram below is a high-level overview of how data can be written to or read from a storage device. It illustrates a multi-disk I/O architecture where each disk (Disk 1 to Disk N) has an I/O queue and an optional disk cache, communicating with a central CPU via a disk controller. Memory is not explicitly shown but sits between the CPU and storage, offering fast access times at the cost of volatility. Filesystems, though not depicted, operate at the OS/kernel level, handling file access metadata and offering a friendly way to interact with storage through files and directories.

![disk i/o](./diskio.jpeg)


## Key Terms

#### Sectors and Blocks

Sectors are the basic physical units on a storage device. For instance, traditional hard drives typically use a sector size of 512 bytes, while many modern disks use 4096 bytes (or 4K sectors) to improve error correction and efficiency.

Blocks are the logical grouping of one or more sectors used by filesystems for data organization. A common filesystem block size is 4096 bytes, meaning that each block might consist of 8 of the 512-byte sectors, or simply map directly to a 4096-byte physical sector layout if the disk supports it.
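
You can check the sector and block sizes on a running system with standard utilities. Here is a quick sketch; the device name follows this learning path's example instance, so adjust it for your own system.

```bash
# Physical and logical sector sizes reported by the block layer
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/nvme0n1

# Block size the kernel uses when addressing this device
sudo blockdev --getbsz /dev/nvme0n1
```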

#### Input Output Operations per second (IOPS)
IOPS is a measure of how many random read or write requests your storage system can service per second. It is worth noting that IOPS can vary with block size depending on the storage medium (e.g., flash drives). Importantly, traditional hard disk drives (HDDs) often don't specify IOPS at all; for example, the IOPS value for an HDD volume on AWS is not shown.

![iops_hdd](./IOPS.png)

#### Throughput / Bandwidth
Throughput is the data transfer rate, normally quoted in MB/s, with bandwidth specifying the maximum rate a connection can transfer. Multiplying IOPS by block size gives an estimate of the storage throughput of your application.
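
For example, a device sustaining 1,000 IOPS at a 4 KiB block size delivers roughly 1,000 × 4 KiB ≈ 4 MiB/s, while the same 1,000 IOPS at a 64 KiB block size delivers around 64 MiB/s: the same operation rate, but sixteen times the throughput.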

#### Queue Depth
Queue depth refers to the number of simultaneous I/O operations that can be pending on a device. Consumer SSDs might typically have a queue depth in the range of 32 to 64, whereas enterprise-class NVMe drives can support hundreds or even thousands of concurrent requests per queue. This parameter affects how much the device can parallelize operations and therefore influences overall I/O performance.
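
On Linux you can inspect a device's queue limits through sysfs. Below is a quick sketch, assuming an NVMe device named as in this learning path's examples:

```bash
# Maximum number of requests the block layer will queue for the device
cat /sys/block/nvme0n1/queue/nr_requests

# Number of hardware submission queues the NVMe drive exposes
ls /sys/block/nvme0n1/mq/ | wc -l
```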

#### I/O Engine and Scheduler

The I/O engine is the software component responsible for managing I/O requests between applications and the storage subsystem. For example, in Linux the kernel's block I/O scheduler acts in this role by queuing and dispatching requests to device drivers; schedulers use multiple queues and reorder requests for optimal disk access.
In benchmarking tools like fio, you can select I/O engines such as `sync` (synchronous I/O), `libaio` (the Linux native asynchronous I/O library), or `io_uring` (which leverages newer Linux kernel capabilities for asynchronous I/O).
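
As a minimal illustration, once fio is installed (covered later in this learning path), the engine can be selected directly on the command line. This sketch writes to a throwaway file under `/tmp`, so it is safe to run:

```bash
# The same 4 KiB random-read job under two different I/O engines
fio --name=engine_test --filename=/tmp/fio.test --size=64M \
    --rw=randread --bs=4k --runtime=10 --time_based \
    --ioengine=libaio --iodepth=16

fio --name=engine_test --filename=/tmp/fio.test --size=64M \
    --rw=randread --bs=4k --runtime=10 --time_based \
    --ioengine=io_uring --iodepth=16
```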

#### I/O Wait

I/O wait is the time a CPU core spends idle while waiting for an outstanding I/O request to complete, as perceived from the core's perspective.
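
You can observe I/O wait system-wide with standard tools, for example:

```bash
# CPU utilisation breakdown every second; watch the %iowait column
iostat -c 1

# Equivalent view with vmstat; the 'wa' column is I/O wait
vmstat 1
```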
---
title: Using FIO
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup and Install Fio

I will be using the same `t4g.medium` instance from the previous section, with 2 different types of SSD-based block storage device attached, as per the console screenshot below. Both block devices have the same 8 GiB capacity, but `io2` is geared towards throughput, as opposed to the general-purpose SSD `gp2`. In this section we want to observe the real-world performance for our workload so that it can inform our selection.

![EBS](./EBS.png)

Flexible I/O tester (fio) is a command-line tool for generating synthetic workloads with specific I/O characteristics. It serves as a simpler alternative to full record-and-replay testing. Fio is available through most Linux distributions' package managers; please refer to the [documentation](https://github.com/axboe/fio) for binary package availability.

```bash
sudo apt update
sudo apt install fio -y
```

Confirm installation with the following commands.

```bash
fio --version
```

```output
fio-3.36
```

## Locate Device

`Fio` allows us to microbenchmark either a block device or a mounted filesystem. Use the disk free (`df`) command to confirm that our EBS volumes are not mounted. Writing to drives that hold critical information can destroy data, so we write only to blank, unmounted block storage devices.
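
For example, the following lists the mounted filesystems; the attached blank EBS volumes should not appear in the output:

```bash
df -h
```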

Use the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`); the `-e 7` option excludes loop devices (major device number 7). The number immediately after `nvme`, e.g., the `0` in `nvme0`, identifies a physically separate device. `nvme1n1` corresponds to the faster `io2` block device and `nvme2n1` corresponds to the slower `gp2` block device.

```bash
lsblk -e 7
```

```output
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme1n1 259:0 0 8G 0 disk
nvme0n1 259:1 0 8G 0 disk
├─nvme0n1p1 259:3 0 7G 0 part /
├─nvme0n1p15 259:4 0 99M 0 part /boot/efi
└─nvme0n1p16 259:5 0 923M 0 part /boot
nvme2n1 259:2 0 8G 0 disk
```

{{% notice Note %}}
If you have more than one block volume attached to an instance, the `sudo nvme list` command from the `nvme-cli` package can be used to differentiate between volumes.
{{% /notice %}}

## Generating a Synthetic Workload

Let us say we want to simulate a fictional logging application with the following characteristics, observed using the tools from the previous section.

{{% notice Workload %}}
The logging workload has light sequential read and write characteristics. The steady write throughput per thread is 5 MB/s, with 83% of operations being writes. There are infrequent bursts of reads lasting approximately 5 seconds, operating at up to 16 MB/s per thread. The workload can scale the infrequent reads and the writes to use up to 16 threads each. The block sizes for the writes and reads are 64 KiB and 256 KiB respectively (as opposed to the standard 4 KiB page size).

Further, the application is latency sensitive and, given that it holds critical information, needs to write directly to non-volatile storage through direct I/O.
{{% /notice %}}

The fio tool uses simple configuration files called `jobfiles` to describe the characteristics of your synthetic workload. Parameters under the `[global]` section are shared among jobs. In the example below, we create 2 jobs to represent the steady writes and the infrequent reads. Please refer to the official [documentation](https://fio.readthedocs.io/en/latest/fio_doc.html#job-file-format) for more details.

Copy and paste the configuration below into 2 files named `nvme<x>.fio`. Replace `<x>` with each of the block devices we are comparing and adjust the `filename` parameter accordingly.

```ini
; -- start job file --
[global]
ioengine=libaio
direct=1 ; write directly to the drive
time_based
runtime=30
group_reporting=1
log_avg_msec=1000
rate=16m,5m ; limit reads to 16 MB/s and writes to 5 MB/s per job
numjobs=${NUM_JOBS} ; set at the command line
iodepth=${IO_DEPTH} ; set at the command line
filename=/dev/nvme1n1 ; or nvme2n1

[steady_write]
name=steady_write
rw=write ; sequential write
bs=64k ; 64 KiB write block size (default is 4 KiB)

[burst_read]
name=burst_read
rw=read
bs=256k ; 256 KiB read block size (default is 4 KiB)
startdelay=10 ; simulate infrequent reads (5 seconds out of 30)
runtime=5
; -- end job file --
```

Run the following commands to execute each test back to back.

```bash
sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme1.fio
```

Then

```bash
sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme2.fio
```

### Interpreting Results

The final terminal output from both runs is shown below.

```output
nvme1:

Run status group 0 (all jobs):
READ: bw=118MiB/s (124MB/s), 118MiB/s-118MiB/s (124MB/s-124MB/s), io=629MiB (660MB), run=5324-5324msec
WRITE: bw=80.0MiB/s (83.9MB/s), 80.0MiB/s-80.0MiB/s (83.9MB/s-83.9MB/s), io=2400MiB (2517MB), run=30006-30006msec

Disk stats (read/write):
nvme1n1: ios=2663/38225, sectors=1294480/4892800, merge=0/0, ticks=148524/454840, in_queue=603364, util=62.19%

nvme2:

Run status group 0 (all jobs):
READ: bw=85.6MiB/s (89.8MB/s), 85.6MiB/s-85.6MiB/s (89.8MB/s-89.8MB/s), io=456MiB (478MB), run=5322-5322msec
WRITE: bw=60.3MiB/s (63.2MB/s), 60.3MiB/s-60.3MiB/s (63.2MB/s-63.2MB/s), io=1816MiB (1904MB), run=30119-30119msec

Disk stats (read/write):
nvme2n1: ios=1872/28855, sectors=935472/3693440, merge=0/0, ticks=159753/1025104, in_queue=1184857, util=89.83%
```

Here we can see that the faster `io2` block storage (`nvme1`) is able to meet the throughput requirement of 80 MB/s for steady writes when all 16 write threads are running (5 MB/s per thread). However, `gp2` saturates at 60.3 MiB/s with nearly 90% SSD utilisation.

We are told the fictional logging application is sensitive to operation latency. The output below highlights that ~35% of operations have a latency approaching or above 1 s on nvme2, compared to ~7% on nvme1.


```output

nvme2:

lat (usec) : 10=0.01%, 500=1.53%, 750=5.13%, 1000=7.55%
lat (msec) : 2=29.49%, 4=0.89%, 10=0.09%, 20=0.02%, 50=0.21%
lat (msec) : 100=0.56%, 250=1.84%, 500=6.39%, 750=9.76%, 1000=10.17%
lat (msec) : 2000=19.59%, >=2000=6.77%

nvme1:

lat (usec) : 750=0.44%, 1000=0.41%
lat (msec) : 2=62.63%, 4=1.12%, 10=0.34%, 20=1.61%, 50=3.91%
lat (msec) : 100=2.34%, 250=5.91%, 500=8.46%, 750=4.33%, 1000=2.50%
lat (msec) : 2000=3.62%, >=2000=2.38%
```

The insights above suggest that `io2`, the SSD geared towards throughput, is more suitable than the general-purpose `gp2` storage for meeting the requirements of our logging application.

{{% notice Tip %}}
If the text output is hard to follow, you can use the `fio2gnuplot` tool to plot the data graphically, or use the visualisations available from your cloud service provider's dashboard. See the image below for an example.

![plot](./visualisations.png)
{{% /notice %}}
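
To generate data for plotting, fio can write per-second log files during a run. As a sketch, you could add the following to the `[global]` section of the jobfile (the `nvme_test` log name prefix here is illustrative):

```ini
write_bw_log=nvme_test   ; writes nvme_test_bw.*.log bandwidth samples
write_lat_log=nvme_test  ; writes nvme_test_lat.*.log latency samples
```

Running `fio2gnuplot -b -g` in the same directory should then pick up the `*_bw.log` files and render the graphs; check `fio2gnuplot --help` for the exact options shipped with your fio version.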

The insights gathered by microbenchmarking with fio can lead to more informed decisions about which block storage to attach to your Arm-based instance.

