@@ -0,0 +1,42 @@
---
title: Improve File System Cache hit rate with the posix_fadvise function

minutes_to_complete: 15

who_is_this_for: Developers who want to boost performance of applications limited by file system cache misses.

learning_objectives:
- Basic understanding of memory usage in a system
- Learn how to measure cache miss rates
- Learn how to use the posix_fadvise() function to provide hints to the kernel about file access patterns

prerequisites:
- Basic understanding of C++ and Linux
- Understanding of File Systems and Memory Usage

author: Kieran Hejmadi

### Tags
skilllevels: Introductory
subjects: Runbook
armips:
- Neoverse
tools_software_languages:
- C++
operatingsystems:
- Linux

further_reading:
- resource:
title: posix_fadvise documentation
link: https://man7.org/linux/man-pages/man2/posix_fadvise.2.html
type: documentation



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,8 @@
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
@@ -0,0 +1,54 @@
---
title: Recap of File System Caching and Memory
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Recap of File Systems

A file system is the method an operating system uses to store, organize, and manage data on storage devices like hard drives or SSDs. Linux uses a virtual file system (VFS) layer to provide a uniform interface to different file system types (e.g., `ext4`, `xfs`, `btrfs`), allowing programs to interact with them in a consistent way.

Developers typically interact with the file system through system calls or standard library functions, for example `open()`, `read()`, and `write()` in C. For instance, reading a configuration file at `/etc/myapp/config.json` involves navigating the file system hierarchy and accessing the file’s contents through these interfaces. To speed up access to such files, the operating system maintains a file system cache, which is managed by the kernel and resides in main memory (RAM). This cache temporarily stores recently accessed file data and metadata, which speeds up future reads and reduces disk I/O.


## Recap of Memory Usage

The hardware and operating system are responsible for managing the memory usage of the system.

The well-known command `free -wh` provides a snapshot of memory usage. The `cache` column includes the portion of memory used to store the file system cache.

```output
               total        used        free      shared     buffers       cache   available
Mem:           7.6Gi       1.1Gi       5.8Gi       960Ki        22Mi       832Mi       6.5Gi
Swap:             0B          0B          0B
```

The command summarises the memory usage into the following columns.

- `total`: Total installed memory
- `used`: Memory in use (excluding cache/buffers)
- `free`: Unused memory
- `shared`: Memory used by tmpfs and shared between processes
- `buffers`: Memory used by kernel buffers
- `cache`: Memory used by file system cache
- `available`: An estimate of memory available (both free memory and memory that can be reclaimed) when a new process starts.
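
On Linux, `free` takes its numbers from `/proc/meminfo`. The following is a minimal sketch, not part of the original learning path, of reading the same fields directly from C++. The helper names `parse_meminfo` and `print_cache_summary` are hypothetical, and the mapping to the `free` columns is approximate (for example, the `cache` column also counts reclaimable kernel slab memory).

```cpp
#include <iostream>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parse "Key:   value kB" lines into key -> value (KiB).
std::map<std::string, long long> parse_meminfo(std::istream& in) {
    std::map<std::string, long long> fields;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream row(line);
        std::string key;
        long long value = 0;
        if ((row >> key >> value) && !key.empty() && key.back() == ':') {
            key.pop_back();  // strip the trailing ':'
            fields[key] = value;
        }
    }
    return fields;
}

// Hypothetical helper: print the fields most relevant to the file system cache.
void print_cache_summary(const std::map<std::string, long long>& fields) {
    for (const char* key : {"MemTotal", "MemFree", "MemAvailable", "Buffers", "Cached"}) {
        auto it = fields.find(key);
        if (it != fields.end()) {
            std::cout << key << ": " << it->second / 1024 << " MiB\n";
        }
    }
}
```

Passing a `std::ifstream` opened on `/proc/meminfo` to `parse_meminfo()` and the result to `print_cache_summary()` gives a view comparable to the `free` output above.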


### What is the posix_fadvise syscall?

`posix_fadvise` is a Linux system call that allows a program to provide the kernel with hints about its expected file access patterns, such as sequential or random reads. These hints help the kernel optimize caching and I/O performance, but they are purely advisory and the system may choose to ignore them. It is particularly useful for tuning performance when working with large files or when bypassing unnecessary caching.

### When could the posix_fadvise syscall be of use?

You should consider using `posix_fadvise` when:

- You're reading or writing large files that won't fit entirely in RAM.
- You want to avoid polluting the cache with data you won't reuse.
- You know the access pattern ahead of time and can optimize accordingly.

The syscall doesn’t guarantee behavior, but it influences how the kernel allocates caching resources.



@@ -0,0 +1,95 @@
---
title: With Hint
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---


## Providing hints with posix_fadvise

The `posix_fadvise()` function is a wrapper around the `fadvise64` Linux system call (syscall). This syscall gives the Linux kernel hints about expected file access patterns so that it can optimize I/O, for example by preemptively reading ahead for sequential file access. The function takes the following arguments:
- `fd`: file descriptor of the file,
- `offset`: where the advice starts (0 = beginning),
- `len`: how many bytes the advice applies to (0 = to the end),
- `advice`: the expected access pattern (`POSIX_FADV_SEQUENTIAL` suggests sequential reading).

For more information on all available arguments, see the [official documentation](https://man7.org/linux/man-pages/man2/posix_fadvise.2.html).

Copy and paste the code sample below into a new file called `with_hint.cpp`. It includes a call to the `posix_fadvise()` function.

```cpp
#include <iostream>
#include <fcntl.h>
#include <unistd.h>
#include <vector>
#include <cstdlib>

// Flush dirty pages to disk, then drop the file system cache (requires root).
int drop_cache() {
    int result = system("sync && echo 3 > /proc/sys/vm/drop_caches");
    if (result != 0) {
        std::cerr << "Failed to drop caches. Are you running as root?\n";
    }
    return result;
}

int main() {
    drop_cache();  // Start from a cold file system cache
    int fd = open("./smallfile.bin", O_RDONLY);
    if (fd == -1) {
        std::cerr << "Error opening file\n";
        return 1;
    }

    // Hint that the whole file (offset 0, len 0) will be read sequentially
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    const size_t bufferSize = 4096; // 4 KiB
    std::vector<char> buffer(bufferSize);

    ssize_t bytesRead;
    while ((bytesRead = read(fd, buffer.data(), bufferSize)) > 0) {
        volatile char temp = buffer[0];  // Touch the data so the read is not optimized away
        (void)temp;
        usleep(10); // Simulate processing delay
    }

    close(fd);
    return 0;
}

```

Compile with the following command.

```bash
g++ with_hint.cpp -o with_hint
```

Again, run the `perf stat` command to observe the rate of cache misses.

```bash
sudo perf stat -e cache-references,cache-misses,minor-faults,major-faults -r 9 ./with_hint
```

```output
 Performance counter stats for './with_hint' (9 runs):

         108313825      cache-references                                     ( +- 0.28% )
           4189713      cache-misses           #  3.87% of all cache refs    ( +- 0.68% )
               227      minor-faults                                         ( +- 0.36% )
                 8      major-faults                                         ( +- 8.10% )

```

{{% notice Tip%}}
If you are using `posix_fadvise()` in your own application and want to observe which system calls are issued, consider using the system call tracer `strace`. A command such as `strace -ttT -e trace=<syscall of interest,fadvise64> ./<your workload>` can help you observe what is causing fewer cache misses.
{{% /notice %}}

### Results

Here we observe that, with a single line of code, the cache miss rate on this run falls from ~4.8% to ~3.9%. This can translate into more efficient and performant software, especially if your program is synchronous and has to wait for disk accesses.

{{% notice Please Note%}}
Since this is advice to the operating system, the operating system may not act on it. Real-world impact depends on other factors, such as memory pressure and memory usage. As such, the behaviour on your own system may be different.
{{% /notice %}}
@@ -0,0 +1,121 @@
---
title: Example without Hint
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Setup

For this demonstration, connect to an Arm-based AWS `c7g.xlarge` instance running Ubuntu 24.04. Results may vary depending on which instance and kernel version you are using. At the time of writing, kernel version `6.8.0-1024-aws` was used.

First, install the Linux performance measurement tool, `perf`. Follow the [installation guide](https://learn.arm.com/install-guides/perf/) for your system.

Additionally, install a `C++` compiler with the following command.

```bash
sudo apt update && sudo apt install g++ -y
```

Next, to understand the current cache and memory usage, run the following command to see the cache structure.

```bash
lscpu | grep cache
```

As the output below shows, `lscpu` reports cache sizes summed across all instances. Our 4 cores have a combined 256 KiB each of level 1 data and instruction cache (64 KiB per core), along with 4 MiB of slower level 2 cache (1 MiB per core) that holds both data and instructions. Finally, a 32 MiB level 3 cache is shared among the 4 CPU cores.

```output  
L1d cache: 256 KiB (4 instances)
L1i cache: 256 KiB (4 instances)
L2 cache: 4 MiB (4 instances)
L3 cache: 32 MiB (1 instance)
```

This information will be useful to ensure our working set size cannot all fit within on-CPU cache.

Next, check the current memory usage of an idle system with the `free -h` command.

```output
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       779Mi       6.4Gi       952Ki       597Mi       6.8Gi
Swap:             0B          0B          0B
```

As the output above shows, this instance has `7.6 GiB` of total memory, with `779 MiB` actively used by user and kernel processes. The `free` and `available` columns can be confusing because they show different values. `free` is memory that is completely unused (6.4 GiB), whereas `available` adds reclaimable cache and buffers to the free memory (6.8 GiB), showing what is ready for new processes if the file system cache is reclaimed.


## Example

First, we need to create a file on the file system. Run the command below to create a file of random bytes in the current working directory. The command writes 64 blocks of 1 MiB each to a file named `smallfile.bin`. Importantly, at 64 MiB this file is larger than the combined L1, L2, and L3 capacity shown above (under 37 MiB), so it cannot fit within on-CPU cache.

```bash
dd if=/dev/urandom of=smallfile.bin bs=1M count=64
```

Next, copy and paste the following `C++` code into a new file called `no_hints.cpp`. The sample below repeatedly reads the random binary file created above into a 4 KiB buffer. After each read, the first byte of the buffer is accessed and processing is simulated with a short delay.

Importantly, the `drop_cache()` function first drops the file system cache in memory each time this program is run, so that every run starts cold.

```cpp
#include <iostream>
#include <fstream>
#include <vector>
#include <unistd.h> // for usleep
#include <cstdlib>  // for system

// Flush dirty pages to disk, then drop the file system cache (requires root).
int drop_cache() {
    int result = system("sync && echo 3 > /proc/sys/vm/drop_caches");
    if (result != 0) {
        std::cerr << "Failed to drop caches. Are you running as root?\n";
    }
    return result;
}

int main() {
    drop_cache();  // Start from a cold file system cache
    std::ifstream file("./smallfile.bin", std::ios::binary);
    if (!file) {
        std::cerr << "Error opening file\n";
        return 1;
    }

    const size_t bufferSize = 4096; // 4 KiB
    std::vector<char> buffer(bufferSize);

    // Keep reading until the last (possibly partial) chunk has been consumed
    while (file.read(buffer.data(), bufferSize) || file.gcount()) {
        volatile char temp = buffer[0];  // Touch the data so the read is not optimized away
        (void)temp;
        usleep(10); // Simulate processing delay
    }

    file.close();
    return 0;
}

```

Compile without any optimisations with the following command.

```bash
g++ no_hints.cpp -o no_hints
```

To observe the cache miss ratio we can use the `perf stat` command that prints out the cache miss statistics. Repeating this multiple times allows us to observe the variation in performance.

```bash
sudo perf stat -e cache-references,cache-misses,minor-faults,major-faults -r 9 ./no_hints
```

```output
 Performance counter stats for './no_hints' (9 runs):

         113047762      cache-references                                     ( +- 0.37% )
           5407145      cache-misses           #  4.78% of all cache refs    ( +- 0.23% )
               229      minor-faults                                         ( +- 0.39% )
                 9      major-faults                                         ( +- 9.80% )

           1.70040 +- 0.00224 seconds time elapsed  ( +- 0.13% )
```