
Commit c9b674a

Merge pull request #161282 from b-juche/live-udpate-06-04-Linux-Best-Practices-DirecIO-Cache-Readahead-VMSkus
Add Linux best practice articles: Direct I/O, Filesystem Cache, Read,…
2 parents da64202 + 2102fce commit c9b674a

7 files changed: +369 -3 lines changed

articles/azure-netapp-files/TOC.yml

Lines changed: 9 additions & 1 deletion
@@ -37,12 +37,20 @@
     href: performance-oracle-single-volumes.md
 - name: Performance reference for Azure NetApp Files
   items:
+  - name: Linux direct I/O best practices
+    href: performance-linux-direct-io.md
+  - name: Linux filesystem cache best practices
+    href: performance-linux-filesystem-cache.md
   - name: Linux NFS mount options best practices
     href: performance-linux-mount-options.md
   - name: Linux concurrency best practices
     href: performance-linux-concurrency-session-slots.md
+  - name: Linux NFS read-ahead best practices
+    href: performance-linux-nfs-read-ahead.md
   - name: SMB performance best practices
-    href: azure-netapp-files-smb-performance.md
+    href: azure-netapp-files-smb-performance.md
+  - name: Azure virtual machine SKUs best practices
+    href: performance-virtual-machine-sku.md
   - name: Cost model for Azure NetApp Files
     href: azure-netapp-files-cost-model.md
   - name: Understand volume quota

articles/azure-netapp-files/performance-linux-concurrency-session-slots.md

Lines changed: 4 additions & 0 deletions
@@ -261,5 +261,9 @@ The following example shows Packet 14 (server maximum requests):
 
 ## Next steps
 
+* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
+* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
 * [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
+* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
+* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
 * [Performance benchmarks for Linux](performance-benchmarks-linux.md)
articles/azure-netapp-files/performance-linux-direct-io.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
---
title: Linux direct I/O best practices for Azure NetApp Files | Microsoft Docs
description: Describes Linux direct I/O and the best practices to follow for Azure NetApp Files.
services: azure-netapp-files
documentationcenter: ''
author: b-juche
manager: ''
editor: ''

ms.assetid:
ms.service: azure-netapp-files
ms.workload: storage
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 07/02/2021
ms.author: b-juche
---
# Linux direct I/O best practices for Azure NetApp Files

This article helps you understand direct I/O best practices for Azure NetApp Files.

## Direct I/O

The most common parameter used in storage performance benchmarking is direct I/O. It is supported by FIO and Vdbench, while DISKSPD offers support for the similar construct of memory-mapped I/O. With direct I/O, the filesystem cache is bypassed, operations for direct memory access (DMA) copy are avoided, and storage tests are made fast and simple.

Using the direct I/O parameter makes storage testing easy. No data is read from the filesystem cache on the client. As such, the test truly stresses the storage protocol and service itself rather than memory access speeds. Also, without the DMA memory copies, read and write operations are efficient from a processing perspective.

Take the Linux `dd` command as an example workload. Without the optional direct I/O flags (`iflag=direct` for reads, `oflag=direct` for writes), all I/O generated by `dd` is served from the Linux buffer cache. Reads of blocks already in memory are not retrieved from storage. Reads resulting in a buffer-cache miss end up being read from storage using NFS read-ahead, with varying results depending on factors such as the mount `rsize` and client read-ahead tunables. When writes are sent through the buffer cache, they use a write-behind mechanism, which is untuned and uses a significant amount of parallelism to send the data to the storage device. You might attempt to run two independent streams of I/O, one `dd` for reads and one `dd` for writes. But in fact, the operating system, untuned, favors writes over reads and applies more parallelism to them.
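
For illustration, the following `dd` commands contrast buffered and direct I/O. The mount point `/mnt/anf` and the file name are placeholders, not values from this article:

```
# Buffered I/O: reads and writes pass through the Linux buffer cache.
dd if=/dev/zero of=/mnt/anf/testfile bs=1M count=4096
dd if=/mnt/anf/testfile of=/dev/null bs=1M count=4096

# Direct I/O: bypass the client filesystem cache entirely.
dd if=/dev/zero of=/mnt/anf/testfile bs=1M count=4096 oflag=direct
dd if=/mnt/anf/testfile of=/dev/null bs=1M count=4096 iflag=direct
```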

Aside from databases, few applications use direct I/O. Instead, they choose to leverage the advantages of a large memory cache for repeated reads and a write-behind cache for asynchronous writes. In short, using direct I/O turns the test into a micro-benchmark *if* the application being synthesized uses the filesystem cache.

The following are some databases that support direct I/O:

* Oracle
* SAP HANA
* MySQL (InnoDB storage engine)
* RocksDB
* PostgreSQL
* Teradata

## Best practices

Testing with direct I/O is an excellent way to understand the limits of the storage service and the client. To get a better understanding of how the application itself will behave (if the application doesn't use direct I/O), you should also run tests through the filesystem cache.
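
As a minimal sketch, the same FIO job can be run with and without direct I/O to compare the two approaches. The mount point `/mnt/anf`, job names, and sizes here are illustrative assumptions only:

```
# Sequential read test using direct I/O (bypasses the client filesystem cache).
fio --name=direct-read --directory=/mnt/anf --rw=read --bs=256k --size=4G \
    --numjobs=4 --direct=1 --group_reporting

# The same job through the filesystem cache for comparison.
fio --name=buffered-read --directory=/mnt/anf --rw=read --bs=256k --size=4G \
    --numjobs=4 --direct=0 --group_reporting
```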

## Next steps

* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
* [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
* [Linux concurrency best practices for Azure NetApp Files](performance-linux-concurrency-session-slots.md)
* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
* [Performance benchmarks for Linux](performance-benchmarks-linux.md)
articles/azure-netapp-files/performance-linux-filesystem-cache.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
---
title: Linux filesystem cache best practices for Azure NetApp Files | Microsoft Docs
description: Describes Linux filesystem cache best practices to follow for Azure NetApp Files.
services: azure-netapp-files
documentationcenter: ''
author: b-juche
manager: ''
editor: ''

ms.assetid:
ms.service: azure-netapp-files
ms.workload: storage
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 07/02/2021
ms.author: b-juche
---
# Linux filesystem cache best practices for Azure NetApp Files

This article helps you understand filesystem cache best practices for Azure NetApp Files.

## Filesystem cache tunables

You need to understand the following factors about filesystem cache tunables:

* Flushing a dirty buffer leaves the data in a clean state, usable for future reads until memory pressure leads to eviction.
* There are three triggers for an asynchronous flush operation:
    * Time based: When a buffer reaches the age defined by this tunable, it must be marked for cleaning (that is, flushing, or writing to storage).
    * Memory pressure: See [`vm.dirty_ratio | vm.dirty_bytes`](#vmdirty_ratio--vmdirty_bytes) for details.
    * Close: When a file handle is closed, all dirty buffers are asynchronously flushed to storage.

These factors are controlled by four tunables. Each tunable can be applied dynamically and persistently using `tuned` or `sysctl` in the `/etc/sysctl.conf` file. Tuning these variables improves performance for applications.

> [!NOTE]
> Information discussed in this article was uncovered during SAS GRID and SAS Viya validation exercises. As such, the tunables are based on lessons learned from the validation exercises. Many applications will similarly benefit from tuning these parameters.

### `vm.dirty_ratio | vm.dirty_bytes`

These two tunables define the amount of RAM made usable for data modified but not yet written to stable storage. Whichever tunable is set automatically sets the other tunable to zero; Red Hat advises against manually setting either of the two tunables to zero. The option `vm.dirty_ratio` (the default of the two) is set by Red Hat to either 20% or 30% of physical memory depending on the OS, which is a significant amount considering the memory footprint of modern systems. Consideration should be given to setting `vm.dirty_bytes` instead of `vm.dirty_ratio` for a more consistent experience regardless of memory size. For example, ongoing work with SAS GRID determined that 30 MiB is an appropriate setting for best overall mixed-workload performance.

### `vm.dirty_background_ratio | vm.dirty_background_bytes`

These tunables define the starting point where the Linux write-back mechanism begins flushing dirty blocks to stable storage. Red Hat defaults to 10% of physical memory, which, on a large-memory system, is a significant amount of data to start flushing. Taking SAS GRID as an example, historically the recommendation has been to set `vm.dirty_background` to 1/5 the size of `vm.dirty_ratio` or `vm.dirty_bytes`. Considering how aggressively the `vm.dirty_bytes` setting is set for SAS GRID, no specific value is being set here.

### `vm.dirty_expire_centisecs`

This tunable defines how old a dirty buffer can be before it must be tagged for asynchronous write-out. Take SAS Viya's CAS workload as an example. For this ephemeral, write-dominant workload, setting this value to 300 centiseconds (3 seconds) proved optimal, with 3000 centiseconds (30 seconds) being the default.

SAS Viya shards CAS data into multiple small chunks of a few megabytes each. Rather than closing these file handles after writing data to each shard, the handles are left open and the buffers within are memory-mapped by the application. Without a close, there is no flush until either memory pressure builds or 30 seconds passes. Waiting for memory pressure proved suboptimal, as did waiting for a long timer to expire. Unlike SAS GRID, which looked for the best overall throughput, SAS Viya looked to optimize write bandwidth.

### `vm.dirty_writeback_centisecs`

The kernel flusher thread flushes dirty buffers asynchronously, sleeping between flush operations. This tunable defines the amount of time spent sleeping between flushes. Considering the 3-second `vm.dirty_expire_centisecs` value used by SAS Viya, SAS set this tunable to 100 centiseconds (1 second) rather than the 500-centisecond (5 seconds) default to achieve the best overall performance.
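
As a minimal sketch of applying these tunables with `sysctl`, the following commands use the SAS-derived example values discussed above; the `vm.dirty_background_bytes` value is illustrative only (1/5 of `vm.dirty_bytes`), and all values should be validated against your own workload:

```
# Apply dynamically (takes effect immediately, lost at reboot).
sudo sysctl -w vm.dirty_bytes=31457280              # 30 MiB
sudo sysctl -w vm.dirty_background_bytes=6291456    # illustrative: 1/5 of vm.dirty_bytes
sudo sysctl -w vm.dirty_expire_centisecs=300        # 3 seconds
sudo sysctl -w vm.dirty_writeback_centisecs=100     # 1 second

# Persist by adding the same key=value pairs to /etc/sysctl.conf (or a file
# under /etc/sysctl.d/), then reload:
sudo sysctl -p
```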

## Impact of an untuned filesystem cache

Considering the default virtual memory tunables and the amount of RAM in modern systems, write-back potentially slows down other storage-bound operations from the perspective of the specific client driving this mixed workload. The following symptoms may be expected from an untuned, write-heavy, cache-laden Linux machine:

* Directory listings (`ls`) take long enough to appear hung.
* Read throughput against the filesystem decreases significantly in comparison to write throughput.
* `nfsiostat` reports write latencies **in seconds or higher**.

You might experience this behavior only on *the Linux machine* performing the mixed write-heavy workload. Further, the experience is degraded against all NFS volumes mounted against a single storage endpoint. If the mounts come from two or more endpoints, only the volumes sharing an endpoint exhibit this behavior.

Setting the filesystem cache parameters as described in this section has been shown to address the issues.
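
If you suspect this condition, `nfsiostat` (shipped with the NFS client utilities) is a quick way to watch per-mount latencies; the 5-second interval and the mount point `/mnt/anf` below are placeholders:

```
# Report NFS client statistics for the mount every 5 seconds; watch the write
# RTT and exe columns for latencies creeping into the seconds.
nfsiostat 5 /mnt/anf
```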

## Monitoring virtual memory

To understand what is going on with virtual memory and the write-back, consider the following code snippet and output. *Dirty* represents the amount of dirty memory in the system, and *Writeback* represents the amount of memory actively being written to storage.

`# while true; do echo "###" ;date ; egrep "^Cached:|^Dirty:|^Writeback:|file" /proc/meminfo; sleep 5; done`

The following output comes from an experiment where `vm.dirty_ratio` and `vm.dirty_background_ratio` were set to 2% and 1% of physical memory, respectively. In this case, flushing began at 3.8 GiB, 1% of the 384-GiB memory system. Writeback closely resembled the write throughput to NFS.

```
###
Dirty:      1174836 kB
Writeback:  4 kB
###
Dirty:      3319540 kB
Writeback:  4 kB
###
Dirty:      3902916 kB   <-- Writes to stable storage begin here
Writeback:  72232 kB
###
Dirty:      3131480 kB
Writeback:  1298772 kB
```

## Next steps

* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
* [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
* [Linux concurrency best practices for Azure NetApp Files](performance-linux-concurrency-session-slots.md)
* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
* [Performance benchmarks for Linux](performance-benchmarks-linux.md)

articles/azure-netapp-files/performance-linux-mount-options.md

Lines changed: 6 additions & 2 deletions
@@ -31,7 +31,7 @@ When preparing a multi-node SAS GRID environment for production, you might notic
 | No `nconnect` | 8 hours |
 | `nconnect=8` | 5.5 hours |
 
-Both sets of tests used the same E32-8_v4 virtual machine and RHEL8.3, with readahead set to 15 MiB.
+Both sets of tests used the same E32-8_v4 virtual machine and RHEL8.3, with read-ahead set to 15 MiB.
 
 When you use `nconnect`, keep the following rules in mind:
 
@@ -85,7 +85,7 @@ sudo vi /etc/fstab
 10.23.1.4:/HN1-shared/shared /hana/shared nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
 ```
 
-Also for example, SAS Viya recommends a 256-KiB read and write sizes, and [SAS GRID](https://communities.sas.com/t5/Administration-and-Deployment/Azure-NetApp-Files-A-shared-file-system-to-use-with-SAS-Grid-on/m-p/606973/highlight/true#M17740) limits the `r/wsize` to 64 KiB while augmenting read performance with increased readahead for the NFS mounts. <!-- For more information on readahead, see the article “NFS Readahead”. -->
+Also for example, SAS Viya recommends 256-KiB read and write sizes, and [SAS GRID](https://communities.sas.com/t5/Administration-and-Deployment/Azure-NetApp-Files-A-shared-file-system-to-use-with-SAS-Grid-on/m-p/606973/highlight/true#M17740) limits the `r/wsize` to 64 KiB while augmenting read performance with increased read-ahead for the NFS mounts. See [NFS read-ahead best practices for Azure NetApp Files](performance-linux-nfs-read-ahead.md) for details.
 
 The following considerations apply to the use of `rsize` and `wsize`:
 
@@ -133,5 +133,9 @@ When no close-to-open consistency (`nocto`) is used, the client will trust the f
 
 ## Next steps
 
+* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
+* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
 * [Linux concurrency best practices for Azure NetApp Files](performance-linux-concurrency-session-slots.md)
+* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
+* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
 * [Performance benchmarks for Linux](performance-benchmarks-linux.md)
articles/azure-netapp-files/performance-linux-nfs-read-ahead.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
---
title: Linux NFS read-ahead best practices for Azure NetApp Files | Microsoft Docs
description: Describes filesystem cache and Linux NFS read-ahead best practices for Azure NetApp Files.
services: azure-netapp-files
documentationcenter: ''
author: b-juche
manager: ''
editor: ''

ms.assetid:
ms.service: azure-netapp-files
ms.workload: storage
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 07/02/2021
ms.author: b-juche
---
# Linux NFS read-ahead best practices for Azure NetApp Files

This article helps you understand NFS read-ahead best practices for Azure NetApp Files.

NFS read-ahead predictively requests blocks from a file in advance of I/O requests by the application. It is designed to improve client sequential read throughput. Until recently, all modern Linux distributions set the read-ahead value to the equivalent of 15 times the mounted filesystem's `rsize`.

The following table shows the default read-ahead values for each given `rsize` mount option.

| Mounted filesystem `rsize` | Blocks read-ahead |
|-|-|
| 64 KiB | 960 KiB |
| 256 KiB | 3,840 KiB |
| 1024 KiB | 15,360 KiB |

RHEL 8.3 and Ubuntu 18.04 introduced changes that might negatively impact client sequential read performance. Unlike earlier releases, these distributions set read-ahead to a default of 128 KiB regardless of the `rsize` mount option used. Workloads upgraded from releases with the larger read-ahead value to releases with the 128-KiB default have experienced decreases in sequential read performance. However, read-ahead values may be tuned upward both dynamically and persistently. For example, testing with SAS GRID found the 15,360-KiB read-ahead value optimal compared to 3,840 KiB, 960 KiB, and 128 KiB. Not enough tests have been run beyond 15,360 KiB to determine positive or negative impact.

The following table shows the default read-ahead values for each currently available distribution.

| Distribution | Release | Blocks read-ahead |
|-|-|-|
| RHEL | 8.3 | 128 KiB |
| RHEL | 7.X, 8.0, 8.1, 8.2 | 15 x `rsize` |
| SLES | 12.X – at least 15SP2 | 15 x `rsize` |
| Ubuntu | 18.04 – at least 20.04 | 128 KiB |
| Ubuntu | 16.04 | 15 x `rsize` |
| Debian | Up to at least 10 | 15 x `rsize` |

## How to work with per-NFS filesystem read-ahead

NFS read-ahead is defined at the mount point for an NFS filesystem. The default setting can be viewed and set both dynamically and persistently. For convenience, the following bash script written by Red Hat has been provided for viewing or dynamically setting read-ahead for a mounted NFS filesystem.

Read-ahead can be defined either dynamically per NFS mount using the following script or persistently using `udev` rules as shown in this section. To display or set read-ahead for a mounted NFS filesystem, you can save the following script as a bash file, modify the file's permissions to make it an executable (`chmod 544 readahead.sh`), and run it as shown.

## How to show or set read-ahead values

To show the current read-ahead value (the returned value is in KiB), run the following command:

`$ ./readahead.sh show <mount-point>`

To set a new value for read-ahead, run the following command:

`$ ./readahead.sh set <mount-point> [read-ahead-kb]`

### Example

```
#!/bin/bash
# set | show readahead for a specific mount point
# Useful for things like NFS and if you do not know / care about the backing device
#
# To the extent possible under law, Red Hat, Inc. has dedicated all copyright
# to this software to the public domain worldwide, pursuant to the
# CC0 Public Domain Dedication. This software is distributed without any warranty.
# See <http://creativecommons.org/publicdomain/zero/1.0/>.
#

E_BADARGS=22
function myusage() {
    echo "Usage: `basename $0` set|show <mount-point> [read-ahead-kb]"
}

if [ $# -gt 3 -o $# -lt 2 ]; then
    myusage
    exit $E_BADARGS
fi

MNT=${2%/}
BDEV=$(grep $MNT /proc/self/mountinfo | awk '{ print $3 }')

if [ $# -eq 3 -a $1 == "set" ]; then
    echo $3 > /sys/class/bdi/$BDEV/read_ahead_kb
elif [ $# -eq 2 -a $1 == "show" ]; then
    echo "$MNT $BDEV /sys/class/bdi/$BDEV/read_ahead_kb = "$(cat /sys/class/bdi/$BDEV/read_ahead_kb)
else
    myusage
    exit $E_BADARGS
fi
```
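
For example, assuming a hypothetical NFS mount at `/mnt/anf`, the script can be invoked as follows:

```
sudo ./readahead.sh show /mnt/anf         # print the mount, backing bdi device, and its current read_ahead_kb
sudo ./readahead.sh set /mnt/anf 15360    # set read-ahead for this mount to 15,360 KiB
```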

## How to persistently set read-ahead for NFS mounts

To persistently set read-ahead for NFS mounts, `udev` rules can be written as follows:

1. Create and test `/etc/udev/rules.d/99-nfs.rules`:

    `SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes", ATTR{read_ahead_kb}="15360"`

2. Apply the `udev` rule:

    `$ udevadm control --reload`

## Next steps

* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
* [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
* [Linux concurrency best practices](performance-linux-concurrency-session-slots.md)
* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
* [Performance benchmarks for Linux](performance-benchmarks-linux.md)
