---
title: Microbenchmark Storage Performance with Fio

minutes_to_complete: 30

who_is_this_for: Cloud developers who want to optimise the storage cost or performance of their application, and developers who want to uncover potential storage-bound bottlenecks or behaviour changes when migrating an application to a different platform.

learning_objectives:
- Understand the flow of data for storage devices
- Use basic observability utilities such as iostat, iotop and pidstat
- Understand how to run fio for microbenchmarking a block storage device

prerequisites:
- Access to an Arm-based server
- Basic understanding of Linux

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Performance and Architecture
armips:
- Neoverse
tools_software_languages:
- bash
- Runbook
operatingsystems:
- Linux


further_reading:
- resource:
title: Fio documentation
link: https://fio.readthedocs.io/en/latest/fio_doc.html#running-fio
type: documentation

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Characterising a Workload
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Basic Characteristics

The basic attributes of a given workload are the following:

- IOPS
- I/O Size
- Throughput
- Read to Write Ratio
- Random vs Sequential access

There are many more characteristics to observe, such as latency, but since this is an introductory topic we will mostly stick to the high-level metrics listed above.

## Running an Example Workload

Connect to an Arm-based cloud instance. As an example workload, we will use the media manipulation tool FFmpeg on an AWS `t4g.medium` instance.

First, install the prerequisite tools.

```bash
sudo apt update
sudo apt install ffmpeg iotop -y
```

Download the popular transcoding reference video `BigBuckBunny.mp4`, which is available under the [Creative Commons 3.0 License](https://creativecommons.org/licenses/by/3.0/).

```bash
cd ~
mkdir src
cd src
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
```

Run the following command to begin transcoding the video and audio using the `H.264` and `AAC` codecs respectively. We use the `-flush_packets 1` flag so that each chunk of video is written back to storage from memory.

```bash
ffmpeg -i BigBuckBunny.mp4 -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 128k -flush_packets 1 output_video.mp4
```

### Observing Disk Usage

Whilst the transcoding is running, we can use the `pidstat` command to see the disk statistics of that specific process.

```bash
pidstat -d -p $(pgrep ffmpeg) 1
```
Since this 151 MB example video fits within memory, we observe no `kB_rd/s` for the storage device after the initial read. However, since we are flushing to storage, we observe a periodic ~275 `kB_wr/s`.

```output
Linux 6.8.0-1024-aws (ip-10-248-213-118) 04/15/25 _aarch64_ (2 CPU)

10:01:24 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
10:01:25 1000 24250 0.00 276.00 0.00 0 ffmpeg
10:01:26 1000 24250 0.00 256.00 0.00 0 ffmpeg
10:01:27 1000 24250 0.00 216.00 0.00 0 ffmpeg
10:01:28 1000 24250 0.00 184.00 0.00 0 ffmpeg
10:01:29 1000 24250 0.00 424.00 0.00 0 ffmpeg
10:01:30 1000 24250 0.00 312.00 0.00 0 ffmpeg
10:01:31 1000 24250 0.00 372.00 0.00 0 ffmpeg
10:01:32 1000 24250 0.00 344.00 0.00 0 ffmpeg
```

{{% notice Note %}}
In this simple example, since we are interacting with a file on the mounted filesystem, we are also observing the behaviour of the filesystem.
{{% /notice %}}

Of course, there may be other processes or background services writing to this disk. We can use the `iotop` command to inspect per-process disk usage. As the output below shows, the `ffmpeg` process has the greatest disk utilisation.

```bash
sudo iotop
```

```output
Total DISK READ: 0.00 B/s | Total DISK WRITE: 332.11 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE> COMMAND
24891 be/4 ubuntu 0.00 B/s 332.11 K/s ffmpeg -i BigBuckBunny.mp4 -c:v ~ts 1 output_video.mp4 [mux0:mp4]
1 be/4 root 0.00 B/s 0.00 B/s systemd --system --deserialize=74
2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
```

Using the input/output statistics command (`iostat`), we can observe the system-wide metrics for the `nvme0n1` drive. Note that we are looking at a snapshot of this workload; more accurate characteristics can be obtained by measuring the distribution over the workload's full run.

```bash
watch -n 0.1 iostat -z nvme0n1
```
You should see output similar to that below.

```output
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
nvme0n1 3.81 31.63 217.08 0.00 831846 5709210 0
```

To observe more detailed metrics, we can run `iostat` with the `-x` option.

```bash
iostat -xz nvme0n1
```

```output
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme0n1 0.66 29.64 0.24 26.27 0.73 44.80 2.92 203.88 3.17 52.01 2.16 69.70 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.15
```

### Basic Characteristics of our Example Workload

This is a simple transcoding workload with flushed writes, where most data is processed and held in memory. Disk I/O is minimal, with an IOPS of just 3.81, low throughput (248.71 kB/s), and an average I/O depth of 0.01, all reflected in very low disk utilisation. The 52% write merge rate and low latencies further suggest sequential, infrequent disk access, reinforcing that the workload is primarily memory-bound.


| Metric | Calculation Explanation | Value |
|--------------------|-------------------------------------------------------------------------------------------------------------|---------------|
| IOPS | Taken directly from the `tps` (transfers per second) field | 3.81 |
| Throughput (Read)  | Taken from the `kB_read/s` field of `iostat`                                                                  | 31.63 kB/s    |
| Throughput (Write) | Taken from the `kB_wrtn/s` field of `iostat`                                                                  | 217.08 kB/s   |
| Throughput (Total) | Sum of read and write throughput | 248.71 kB/s |
| Avg I/O Size | Total throughput divided by IOPS: 248.71 / 3.81 | ≈ 65.3 KB |
| Read Ratio | Read throughput ÷ total throughput: 31.63 / 248.71 | ~13% |
| Write Ratio | Write throughput ÷ total throughput: 217.08 / 248.71 | ~87% |
| IO Depth | Taken directly from `aqu-sz` (average number of in-flight I/Os) | 0.01 |
| Access Pattern | Based on cache hits, merge rates, and low wait times. 52% of writes were merged (`wrqm/s` = 3.17, `w/s` = 2.92) → suggests mostly sequential access | Sequential-ish (52.01% merged) |
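
As a quick sanity check, the derived rows in this table can be reproduced with shell arithmetic. Below is a minimal sketch using `awk`, with the `iostat` figures from this run hard-coded; substitute your own measurements.

```bash
# Derive the summary metrics from the iostat snapshot above
awk 'BEGIN {
  tps  = 3.81;    # IOPS (transfers per second)
  rkbs = 31.63;   # read throughput, kB/s
  wkbs = 217.08;  # write throughput, kB/s
  total = rkbs + wkbs;
  printf "Total throughput: %.2f kB/s\n", total;
  printf "Avg I/O size:     %.1f kB\n",  total / tps;
  printf "Read ratio:       %.0f%%\n",   100 * rkbs / total;
  printf "Write ratio:      %.0f%%\n",   100 * wkbs / total;
}'
```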


{{% notice Note %}}
If you have access to the workload's source code, the expected access patterns can be observed more easily.
{{% /notice %}}
---
title: Fundamentals of Storage Systems
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction

The ideal amount of storage activity for a system is zero: all application data and instructions are available in memory or caches, with no reads or writes to a spinning hard-disk drive or solid-state drive (SSD) required. However, due to physical capacity limitations, data volatility, and the need to store large amounts of data, many applications require frequent access to storage media.

## High-Level Flow of Data

The diagram below is a high-level overview of how data can be written to or read from a storage device. It illustrates a multi-disk I/O architecture where each disk (Disk 1 to Disk N) has an I/O queue and an optional disk cache, communicating with a central CPU via a disk controller. Memory is not explicitly shown but sits between the CPU and storage, offering fast access times at the cost of volatility. Filesystems, though not depicted, operate at the OS/kernel level, handling file access metadata and offering a friendly way to interact with storage through files and directories.

![disk i/o](./diskio.jpeg)


## Key Terms

#### Sectors and Blocks

Sectors are the basic physical units on a storage device. For instance, traditional hard drives typically use a sector size of 512 bytes, while many modern disks use 4096 bytes (or 4K sectors) to improve error correction and efficiency.

Blocks are the logical grouping of one or more sectors used by filesystems for data organization. A common filesystem block size is 4096 bytes, meaning that each block might consist of 8 of the 512-byte sectors, or simply map directly to a 4096-byte physical sector layout if the disk supports it.
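
You can check the sector and block sizes on a running system with standard utilities. Here is a quick sketch; the device name follows this learning path's example instance, so adjust it for your own system.

```bash
# Physical and logical sector sizes reported by the block layer
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/nvme0n1

# Block size the kernel uses when addressing this device
sudo blockdev --getbsz /dev/nvme0n1
```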

#### Input Output Operations per second (IOPS)
IOPS is a measure of how many random read or write requests your storage system can service per second. It is worth noting that IOPS can vary with block size depending on the storage medium (e.g., flash drives). Importantly, traditional hard disk drives (HDDs) often don't specify IOPS at all; for example, the IOPS value for an HDD volume on AWS is not shown.

![iops_hdd](./IOPS.png)

#### Throughput / Bandwidth
Throughput is the data transfer rate, normally quoted in MB/s, with bandwidth specifying the maximum rate a connection can transfer. Multiplying IOPS by block size gives an estimate of the storage throughput of your application.
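
For example, a device sustaining 1,000 IOPS at a 4 KiB block size delivers roughly 1,000 × 4 KiB ≈ 4 MiB/s, while the same 1,000 IOPS at a 64 KiB block size delivers around 64 MiB/s: the same operation rate, but sixteen times the throughput.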

#### Queue Depth
Queue depth refers to the number of simultaneous I/O operations that can be pending on a device. Consumer SSDs might typically have a queue depth in the range of 32 to 64, whereas enterprise-class NVMe drives can support hundreds or even thousands of concurrent requests per queue. This parameter affects how much the device can parallelize operations and therefore influences overall I/O performance.
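
On Linux you can inspect a device's queue limits through sysfs. Below is a quick sketch, assuming an NVMe device named as in this learning path's examples:

```bash
# Maximum number of requests the block layer will queue for the device
cat /sys/block/nvme0n1/queue/nr_requests

# Number of hardware submission queues the NVMe drive exposes
ls /sys/block/nvme0n1/mq/ | wc -l
```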

#### I/O Engine and Scheduler

The I/O engine is the software component responsible for managing I/O requests between applications and the storage subsystem. For example, in Linux the kernel's block I/O scheduler acts in this role by queuing and dispatching requests to device drivers; schedulers use multiple queues and reorder requests for optimal disk access.
In benchmarking tools like fio, you can select I/O engines such as `sync` (synchronous I/O), `libaio` (the Linux native asynchronous I/O library), or `io_uring` (which leverages newer Linux kernel capabilities for asynchronous I/O).
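
As a minimal illustration, once fio is installed (covered later in this learning path), the engine can be selected directly on the command line. This sketch writes to a throwaway file under `/tmp`, so it is safe to run:

```bash
# The same 4 KiB random-read job under two different I/O engines
fio --name=engine_test --filename=/tmp/fio.test --size=64M \
    --rw=randread --bs=4k --runtime=10 --time_based \
    --ioengine=libaio --iodepth=16

fio --name=engine_test --filename=/tmp/fio.test --size=64M \
    --rw=randread --bs=4k --runtime=10 --time_based \
    --ioengine=io_uring --iodepth=16
```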

#### I/O Wait

I/O wait is the time a CPU core spends idle while waiting for an outstanding I/O request to complete, as perceived from the core's perspective.
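
You can observe I/O wait system-wide with standard tools, for example:

```bash
# CPU utilisation breakdown every second; watch the %iowait column
iostat -c 1

# Equivalent view with vmstat; the 'wa' column is I/O wait
vmstat 1
```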
---
title: Using FIO
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup and Install Fio

I will be using the same `t4g.medium` instance from the previous section, with 2 different types of SSD-based block storage device attached, as per the console screenshot below. Both block devices have the same 8 GiB capacity, but `io2` is geared towards throughput, as opposed to the general-purpose SSD `gp2`. In this section we want to observe the real-world performance for our workload so that it can inform our selection.

![EBS](./EBS.png)

Flexible I/O tester (fio) is a command-line tool for generating synthetic workloads with specific I/O characteristics. It serves as a simpler alternative to full record-and-replay testing. Fio is available through most Linux distributions' package managers; please refer to the [documentation](https://github.com/axboe/fio) for binary package availability.

```bash
sudo apt update
sudo apt install fio -y
```

Confirm installation with the following commands.

```bash
fio --version
```

```output
fio-3.36
```

## Locate Device

`Fio` allows us to microbenchmark either a block device or a mounted filesystem. Use the disk free (`df`) command to confirm that our EBS volumes are not mounted. Writing to drives that hold critical information can destroy data, so we write only to blank, unmounted block storage devices.
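
For example, the following lists the mounted filesystems; the attached blank EBS volumes should not appear in the output:

```bash
df -h
```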

Use the `lsblk` command to view the EBS volumes attached to the server (`nvme1n1` and `nvme2n1`); the `-e 7` option excludes loop devices (major device number 7). The number immediately after `nvme`, e.g., the `0` in `nvme0`, identifies a physically separate device. `nvme1n1` corresponds to the faster `io2` block device and `nvme2n1` corresponds to the slower `gp2` block device.

```bash
lsblk -e 7
```

```output
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme1n1 259:0 0 8G 0 disk
nvme0n1 259:1 0 8G 0 disk
├─nvme0n1p1 259:3 0 7G 0 part /
├─nvme0n1p15 259:4 0 99M 0 part /boot/efi
└─nvme0n1p16 259:5 0 923M 0 part /boot
nvme2n1 259:2 0 8G 0 disk
```

{{% notice Note %}}
If you have more than one block volume attached to an instance, the `sudo nvme list` command from the `nvme-cli` package can be used to differentiate between volumes.
{{% /notice %}}

## Generating a Synthetic Workload

Let us say we want to simulate a fictional logging application with the following characteristics, observed using the tools from the previous section.

{{% notice Workload %}}
The logging workload has light sequential read and write characteristics. The steady write throughput per thread is 5 MB/s, with 83% of operations being writes. There are infrequent bursts of reads lasting approximately 5 seconds, operating at up to 16 MB/s per thread. The workload can scale the infrequent reads and the writes to use up to 16 threads each. The block sizes for the writes and reads are 64 KiB and 256 KiB respectively (as opposed to the standard 4 KiB page size).

Further, the application is latency sensitive and, given that it holds critical information, needs to write directly to non-volatile storage through direct I/O.
{{% /notice %}}

The fio tool uses simple configuration files called `jobfiles` to describe the characteristics of your synthetic workload. Parameters under the `[global]` section are shared among jobs. In the example below, we create 2 jobs to represent the steady writes and the infrequent reads. Please refer to the official [documentation](https://fio.readthedocs.io/en/latest/fio_doc.html#job-file-format) for more details.

Copy and paste the configuration below into 2 files named `nvme<x>.fio`. Replace `<x>` with each of the block devices we are comparing and adjust the `filename` parameter accordingly.

```ini
; -- start job file --
[global]
ioengine=libaio
direct=1 ; write directly to the drive
time_based
runtime=30
group_reporting=1
log_avg_msec=1000
rate=16m,5m ; limit reads to 16 MB/s and writes to 5 MB/s per job
numjobs=${NUM_JOBS} ; set at the command line
iodepth=${IO_DEPTH} ; set at the command line
filename=/dev/nvme1n1 ; or nvme2n1

[steady_write]
name=steady_write
rw=write ; sequential write
bs=64k ; 64 KiB write block size (default is 4 KiB)

[burst_read]
name=burst_read
rw=read
bs=256k ; 256 KiB read block size (default is 4 KiB)
startdelay=10 ; simulate infrequent reads (5 seconds out of 30)
runtime=5
; -- end job file --
```

Run the following commands to execute each test back to back.

```bash
sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme1.fio
```

Then

```bash
sudo NUM_JOBS=16 IO_DEPTH=64 fio nvme2.fio
```

### Interpreting Results

The final terminal output from both runs is shown below.

```output
nvme1:

Run status group 0 (all jobs):
READ: bw=118MiB/s (124MB/s), 118MiB/s-118MiB/s (124MB/s-124MB/s), io=629MiB (660MB), run=5324-5324msec
WRITE: bw=80.0MiB/s (83.9MB/s), 80.0MiB/s-80.0MiB/s (83.9MB/s-83.9MB/s), io=2400MiB (2517MB), run=30006-30006msec

Disk stats (read/write):
nvme1n1: ios=2663/38225, sectors=1294480/4892800, merge=0/0, ticks=148524/454840, in_queue=603364, util=62.19%

nvme2:

Run status group 0 (all jobs):
READ: bw=85.6MiB/s (89.8MB/s), 85.6MiB/s-85.6MiB/s (89.8MB/s-89.8MB/s), io=456MiB (478MB), run=5322-5322msec
WRITE: bw=60.3MiB/s (63.2MB/s), 60.3MiB/s-60.3MiB/s (63.2MB/s-63.2MB/s), io=1816MiB (1904MB), run=30119-30119msec

Disk stats (read/write):
nvme2n1: ios=1872/28855, sectors=935472/3693440, merge=0/0, ticks=159753/1025104, in_queue=1184857, util=89.83%
```

Here we can see that the faster `io2` block storage (`nvme1`) is able to meet the throughput requirement of 80 MB/s for steady writes when all 16 write threads are running (5 MB/s per thread). However, `gp2` saturates at 60.3 MiB/s with nearly 90% SSD utilisation.

We are told the fictional logging application is sensitive to operation latency. The output below highlights that ~35% of operations have a latency approaching or above 1 s on nvme2, compared to ~7% on nvme1.


```output

nvme2:

lat (usec) : 10=0.01%, 500=1.53%, 750=5.13%, 1000=7.55%
lat (msec) : 2=29.49%, 4=0.89%, 10=0.09%, 20=0.02%, 50=0.21%
lat (msec) : 100=0.56%, 250=1.84%, 500=6.39%, 750=9.76%, 1000=10.17%
lat (msec) : 2000=19.59%, >=2000=6.77%

nvme1:

lat (usec) : 750=0.44%, 1000=0.41%
lat (msec) : 2=62.63%, 4=1.12%, 10=0.34%, 20=1.61%, 50=3.91%
lat (msec) : 100=2.34%, 250=5.91%, 500=8.46%, 750=4.33%, 1000=2.50%
lat (msec) : 2000=3.62%, >=2000=2.38%
```

The insights above suggest that `io2`, the SSD geared towards throughput, is more suitable than the general-purpose `gp2` storage for meeting the requirements of our logging application.

{{% notice Tip %}}
If the text output is hard to follow, you can use the `fio2gnuplot` tool to plot the data graphically, or use the visualisations available from your cloud service provider's dashboard. See the image below for an example.

![plot](./visualisations.png)
{{% /notice %}}
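
To generate data for plotting, fio can write per-second log files during a run. As a sketch, you could add the following to the `[global]` section of the jobfile (the `nvme_test` log name prefix here is illustrative):

```ini
write_bw_log=nvme_test   ; writes nvme_test_bw.*.log bandwidth samples
write_lat_log=nvme_test  ; writes nvme_test_lat.*.log latency samples
```

Running `fio2gnuplot -b -g` in the same directory should then pick up the `*_bw.log` files and render the graphs; check `fio2gnuplot --help` for the exact options shipped with your fio version.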

The insights gathered by microbenchmarking with fio can lead to more informed decisions about which block storage to attach to your Arm-based instance.

