
Commit c9b674a

Merge pull request #161282 from b-juche/live-udpate-06-04-Linux-Best-Practices-DirecIO-Cache-Readahead-VMSkus
Add Linux best practice articles: Direct I/O, Filesystem Cache, Read,…
2 parents da64202 + 2102fce commit c9b674a

7 files changed: +369 -3 lines changed

articles/azure-netapp-files/TOC.yml

Lines changed: 9 additions & 1 deletion
@@ -37,12 +37,20 @@
     href: performance-oracle-single-volumes.md
 - name: Performance reference for Azure NetApp Files
   items:
+  - name: Linux direct I/O best practices
+    href: performance-linux-direct-io.md
+  - name: Linux filesystem cache best practices
+    href: performance-linux-filesystem-cache.md
   - name: Linux NFS mount options best practices
     href: performance-linux-mount-options.md
   - name: Linux concurrency best practices
     href: performance-linux-concurrency-session-slots.md
+  - name: Linux NFS read-ahead best practices
+    href: performance-linux-nfs-read-ahead.md
   - name: SMB performance best practices
-    href: azure-netapp-files-smb-performance.md
+    href: azure-netapp-files-smb-performance.md
+  - name: Azure virtual machine SKUs best practices
+    href: performance-virtual-machine-sku.md
   - name: Cost model for Azure NetApp Files
     href: azure-netapp-files-cost-model.md
   - name: Understand volume quota

articles/azure-netapp-files/performance-linux-concurrency-session-slots.md

Lines changed: 4 additions & 0 deletions
@@ -261,5 +261,9 @@ The following example shows Packet 14 (server maximum requests):
 
 ## Next steps
 
+* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
+* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
 * [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
+* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
+* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
 * [Performance benchmarks for Linux](performance-benchmarks-linux.md)
articles/azure-netapp-files/performance-linux-direct-io.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
---
title: Linux direct I/O best practices for Azure NetApp Files | Microsoft Docs
description: Describes Linux direct I/O and the best practices to follow for Azure NetApp Files.
services: azure-netapp-files
documentationcenter: ''
author: b-juche
manager: ''
editor: ''

ms.assetid:
ms.service: azure-netapp-files
ms.workload: storage
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 07/02/2021
ms.author: b-juche
---
# Linux direct I/O best practices for Azure NetApp Files

This article helps you understand direct I/O best practices for Azure NetApp Files.

## Direct I/O

The most common parameter used in storage performance benchmarking is direct I/O. It is supported by FIO and Vdbench, while DISKSPD offers support for the similar construct of memory-mapped I/O. With direct I/O, the filesystem cache is bypassed, operations for direct memory access (DMA) copy are avoided, and storage tests are made fast and simple.

Using the direct I/O parameter makes storage testing easy. No data is read from the filesystem cache on the client. As such, the test truly stresses the storage protocol and service itself rather than memory access speeds. Also, without the DMA memory copies, read and write operations are efficient from a processing perspective.

Take the Linux `dd` command as an example workload. Without the optional direct I/O flags (`iflag=direct` for reads, `oflag=direct` for writes), all I/O generated by `dd` is served from the Linux buffer cache. Reads of blocks already in memory are not retrieved from storage. Reads resulting in a buffer-cache miss end up being read from storage using NFS read-ahead, with varying results depending on factors such as the mount `rsize` and client read-ahead tunables. When writes are sent through the buffer cache, they use a write-behind mechanism, which is untuned and uses a significant amount of parallelism to send the data to the storage device. You might attempt to run two independent streams of I/O, one `dd` for reads and one `dd` for writes. But in fact, the operating system, untuned, favors writes over reads and applies more parallelism to them.
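
For illustration, the following `dd` commands contrast buffered and direct I/O. The mount point `/mnt/anf` and the file name are placeholders, not values from this article:

```
# Buffered I/O: reads and writes pass through the Linux buffer cache.
dd if=/dev/zero of=/mnt/anf/testfile bs=1M count=4096
dd if=/mnt/anf/testfile of=/dev/null bs=1M count=4096

# Direct I/O: bypass the client filesystem cache entirely.
dd if=/dev/zero of=/mnt/anf/testfile bs=1M count=4096 oflag=direct
dd if=/mnt/anf/testfile of=/dev/null bs=1M count=4096 iflag=direct
```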

Aside from databases, few applications use direct I/O. Instead, they choose to leverage the advantages of a large memory cache for repeated reads and a write-behind cache for asynchronous writes. In short, using direct I/O turns the test into a micro-benchmark *if* the application being synthesized uses the filesystem cache.

The following are some databases that support direct I/O:

* Oracle
* SAP HANA
* MySQL (InnoDB storage engine)
* RocksDB
* PostgreSQL
* Teradata

## Best practices

Testing with direct I/O is an excellent way to understand the limits of the storage service and the client. To get a better understanding of how the application itself will behave (if the application doesn't use direct I/O), you should also run tests through the filesystem cache.
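
As a minimal sketch, the same FIO job can be run with and without direct I/O to compare the two approaches. The mount point `/mnt/anf`, job names, and sizes here are illustrative assumptions only:

```
# Sequential read test using direct I/O (bypasses the client filesystem cache).
fio --name=direct-read --directory=/mnt/anf --rw=read --bs=256k --size=4G \
    --numjobs=4 --direct=1 --group_reporting

# The same job through the filesystem cache for comparison.
fio --name=buffered-read --directory=/mnt/anf --rw=read --bs=256k --size=4G \
    --numjobs=4 --direct=0 --group_reporting
```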

## Next steps

* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
* [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
* [Linux concurrency best practices for Azure NetApp Files](performance-linux-concurrency-session-slots.md)
* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
* [Performance benchmarks for Linux](performance-benchmarks-linux.md)
articles/azure-netapp-files/performance-linux-filesystem-cache.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
---
title: Linux filesystem cache best practices for Azure NetApp Files | Microsoft Docs
description: Describes Linux filesystem cache best practices to follow for Azure NetApp Files.
services: azure-netapp-files
documentationcenter: ''
author: b-juche
manager: ''
editor: ''

ms.assetid:
ms.service: azure-netapp-files
ms.workload: storage
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 07/02/2021
ms.author: b-juche
---
# Linux filesystem cache best practices for Azure NetApp Files

This article helps you understand filesystem cache best practices for Azure NetApp Files.

## Filesystem cache tunables

You need to understand the following factors about filesystem cache tunables:

* Flushing a dirty buffer leaves the data in a clean state, usable for future reads until memory pressure leads to eviction.
* There are three triggers for an asynchronous flush operation:
    * Time based: When a buffer reaches the age defined by this tunable, it must be marked for cleaning (that is, flushing, or writing to storage).
    * Memory pressure: See [`vm.dirty_ratio | vm.dirty_bytes`](#vmdirty_ratio--vmdirty_bytes) for details.
    * Close: When a file handle is closed, all dirty buffers are asynchronously flushed to storage.

These factors are controlled by four tunables. Each tunable can be applied dynamically and persistently using `tuned` or `sysctl` in the `/etc/sysctl.conf` file. Tuning these variables improves performance for applications.

> [!NOTE]
> Information discussed in this article was uncovered during SAS GRID and SAS Viya validation exercises. As such, the tunables are based on lessons learned from the validation exercises. Many applications will similarly benefit from tuning these parameters.

### `vm.dirty_ratio | vm.dirty_bytes`

These two tunables define the amount of RAM made usable for data modified but not yet written to stable storage. Whichever tunable is set automatically sets the other tunable to zero; Red Hat advises against manually setting either of the two tunables to zero. The option `vm.dirty_ratio` (the default of the two) is set by Red Hat to either 20% or 30% of physical memory depending on the OS, which is a significant amount considering the memory footprint of modern systems. Consideration should be given to setting `vm.dirty_bytes` instead of `vm.dirty_ratio` for a more consistent experience regardless of memory size. For example, ongoing work with SAS GRID determined that 30 MiB is an appropriate setting for best overall mixed-workload performance.

### `vm.dirty_background_ratio | vm.dirty_background_bytes`

These tunables define the starting point where the Linux write-back mechanism begins flushing dirty blocks to stable storage. Red Hat defaults to 10% of physical memory, which, on a large-memory system, is a significant amount of data to start flushing. Taking SAS GRID as an example, historically the recommendation has been to set `vm.dirty_background` to 1/5 the size of `vm.dirty_ratio` or `vm.dirty_bytes`. Considering how aggressively the `vm.dirty_bytes` setting is set for SAS GRID, no specific value is being set here.

### `vm.dirty_expire_centisecs`

This tunable defines how old a dirty buffer can be before it must be tagged for asynchronous write-out. Take SAS Viya's CAS workload as an example. For this ephemeral, write-dominant workload, setting this value to 300 centiseconds (3 seconds) proved optimal, with 3000 centiseconds (30 seconds) being the default.

SAS Viya shards CAS data into multiple small chunks of a few megabytes each. Rather than closing these file handles after writing data to each shard, the handles are left open and the buffers within are memory-mapped by the application. Without a close, there is no flush until either memory pressure builds or 30 seconds passes. Waiting for memory pressure proved suboptimal, as did waiting for a long timer to expire. Unlike SAS GRID, which looked for the best overall throughput, SAS Viya looked to optimize write bandwidth.

### `vm.dirty_writeback_centisecs`

The kernel flusher thread flushes dirty buffers asynchronously, sleeping between flush operations. This tunable defines the amount of time spent sleeping between flushes. Considering the 3-second `vm.dirty_expire_centisecs` value used by SAS Viya, SAS set this tunable to 100 centiseconds (1 second) rather than the 500-centisecond (5 seconds) default to achieve the best overall performance.
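
As a minimal sketch of applying these tunables with `sysctl`, the following commands use the SAS-derived example values discussed above; the `vm.dirty_background_bytes` value is illustrative only (1/5 of `vm.dirty_bytes`), and all values should be validated against your own workload:

```
# Apply dynamically (takes effect immediately, lost at reboot).
sudo sysctl -w vm.dirty_bytes=31457280              # 30 MiB
sudo sysctl -w vm.dirty_background_bytes=6291456    # illustrative: 1/5 of vm.dirty_bytes
sudo sysctl -w vm.dirty_expire_centisecs=300        # 3 seconds
sudo sysctl -w vm.dirty_writeback_centisecs=100     # 1 second

# Persist by adding the same key=value pairs to /etc/sysctl.conf (or a file
# under /etc/sysctl.d/), then reload:
sudo sysctl -p
```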

## Impact of an untuned filesystem cache

Considering the default virtual memory tunables and the amount of RAM in modern systems, write-back potentially slows down other storage-bound operations from the perspective of the specific client driving this mixed workload. The following symptoms may be expected from an untuned, write-heavy, cache-laden Linux machine:

* Directory listings (`ls`) take long enough to appear hung.
* Read throughput against the filesystem decreases significantly in comparison to write throughput.
* `nfsiostat` reports write latencies **in seconds or higher**.

You might experience this behavior only on *the Linux machine* performing the mixed write-heavy workload. Further, the experience is degraded against all NFS volumes mounted against a single storage endpoint. If the mounts come from two or more endpoints, only the volumes sharing an endpoint exhibit this behavior.

Setting the filesystem cache parameters as described in this section has been shown to address the issues.
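
If you suspect this condition, `nfsiostat` (shipped with the NFS client utilities) is a quick way to watch per-mount latencies; the 5-second interval and the mount point `/mnt/anf` below are placeholders:

```
# Report NFS client statistics for the mount every 5 seconds; watch the write
# RTT and exe columns for latencies creeping into the seconds.
nfsiostat 5 /mnt/anf
```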

## Monitoring virtual memory

To understand what is going on with virtual memory and the write-back, consider the following code snippet and output. *Dirty* represents the amount of dirty memory in the system, and *Writeback* represents the amount of memory actively being written to storage.

`# while true; do echo "###" ;date ; egrep "^Cached:|^Dirty:|^Writeback:|file" /proc/meminfo; sleep 5; done`

The following output comes from an experiment where `vm.dirty_ratio` and `vm.dirty_background_ratio` were set to 2% and 1% of physical memory, respectively. In this case, flushing began at 3.8 GiB, 1% of the 384-GiB memory system. Writeback closely resembled the write throughput to NFS.

```
###
Dirty:      1174836 kB
Writeback:  4 kB
###
Dirty:      3319540 kB
Writeback:  4 kB
###
Dirty:      3902916 kB   <-- Writes to stable storage begin here
Writeback:  72232 kB
###
Dirty:      3131480 kB
Writeback:  1298772 kB
```

## Next steps

* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
* [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
* [Linux concurrency best practices for Azure NetApp Files](performance-linux-concurrency-session-slots.md)
* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
* [Performance benchmarks for Linux](performance-benchmarks-linux.md)

articles/azure-netapp-files/performance-linux-mount-options.md

Lines changed: 6 additions & 2 deletions
@@ -31,7 +31,7 @@ When preparing a multi-node SAS GRID environment for production, you might notic
 | No `nconnect` | 8 hours |
 | `nconnect=8` | 5.5 hours |
 
-Both sets of tests used the same E32-8_v4 virtual machine and RHEL8.3, with readahead set to 15 MiB.
+Both sets of tests used the same E32-8_v4 virtual machine and RHEL8.3, with read-ahead set to 15 MiB.
 
 When you use `nconnect`, keep the following rules in mind:
 
@@ -85,7 +85,7 @@ sudo vi /etc/fstab
 10.23.1.4:/HN1-shared/shared /hana/shared nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
 ```
 
-Also for example, SAS Viya recommends a 256-KiB read and write sizes, and [SAS GRID](https://communities.sas.com/t5/Administration-and-Deployment/Azure-NetApp-Files-A-shared-file-system-to-use-with-SAS-Grid-on/m-p/606973/highlight/true#M17740) limits the `r/wsize` to 64 KiB while augmenting read performance with increased readahead for the NFS mounts. <!-- For more information on readahead, see the article “NFS Readahead”. -->
+Also for example, SAS Viya recommends 256-KiB read and write sizes, and [SAS GRID](https://communities.sas.com/t5/Administration-and-Deployment/Azure-NetApp-Files-A-shared-file-system-to-use-with-SAS-Grid-on/m-p/606973/highlight/true#M17740) limits the `r/wsize` to 64 KiB while augmenting read performance with increased read-ahead for the NFS mounts. See [NFS read-ahead best practices for Azure NetApp Files](performance-linux-nfs-read-ahead.md) for details.
 
 The following considerations apply to the use of `rsize` and `wsize`:
 
@@ -133,5 +133,9 @@ When no close-to-open consistency (`nocto`) is used, the client will trust the f
 
 ## Next steps
 
+* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
+* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
 * [Linux concurrency best practices for Azure NetApp Files](performance-linux-concurrency-session-slots.md)
+* [Linux NFS read-ahead best practices](performance-linux-nfs-read-ahead.md)
+* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
 * [Performance benchmarks for Linux](performance-benchmarks-linux.md)
articles/azure-netapp-files/performance-linux-nfs-read-ahead.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
---
title: Linux NFS read-ahead best practices for Azure NetApp Files | Microsoft Docs
description: Describes filesystem cache and Linux NFS read-ahead best practices for Azure NetApp Files.
services: azure-netapp-files
documentationcenter: ''
author: b-juche
manager: ''
editor: ''

ms.assetid:
ms.service: azure-netapp-files
ms.workload: storage
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 07/02/2021
ms.author: b-juche
---
# Linux NFS read-ahead best practices for Azure NetApp Files

This article helps you understand NFS read-ahead best practices for Azure NetApp Files.

NFS read-ahead predictively requests blocks from a file in advance of I/O requests by the application. It is designed to improve client sequential read throughput. Until recently, all modern Linux distributions set the read-ahead value to the equivalent of 15 times the mounted filesystem's `rsize`.

The following table shows the default read-ahead values for each given `rsize` mount option.

| Mounted filesystem `rsize` | Blocks read-ahead |
|-|-|
| 64 KiB | 960 KiB |
| 256 KiB | 3,840 KiB |
| 1024 KiB | 15,360 KiB |

RHEL 8.3 and Ubuntu 18.04 introduced changes that might negatively impact client sequential read performance. Unlike earlier releases, these distributions set read-ahead to a default of 128 KiB regardless of the `rsize` mount option used. Workloads upgraded from releases with the larger read-ahead value to releases with the 128-KiB default have experienced decreases in sequential read performance. However, read-ahead values may be tuned upward both dynamically and persistently. For example, testing with SAS GRID found the 15,360-KiB read-ahead value optimal compared to 3,840 KiB, 960 KiB, and 128 KiB. Not enough tests have been run beyond 15,360 KiB to determine positive or negative impact.

The following table shows the default read-ahead values for each currently available distribution.

| Distribution | Release | Blocks read-ahead |
|-|-|-|
| RHEL | 8.3 | 128 KiB |
| RHEL | 7.X, 8.0, 8.1, 8.2 | 15 x `rsize` |
| SLES | 12.X – at least 15SP2 | 15 x `rsize` |
| Ubuntu | 18.04 – at least 20.04 | 128 KiB |
| Ubuntu | 16.04 | 15 x `rsize` |
| Debian | Up to at least 10 | 15 x `rsize` |

## How to work with per-NFS filesystem read-ahead

NFS read-ahead is defined at the mount point for an NFS filesystem. The default setting can be viewed and set both dynamically and persistently. For convenience, the following bash script written by Red Hat has been provided for viewing or dynamically setting read-ahead for a mounted NFS filesystem.

Read-ahead can be defined either dynamically per NFS mount using the following script or persistently using `udev` rules as shown in this section. To display or set read-ahead for a mounted NFS filesystem, you can save the following script as a bash file, modify the file's permissions to make it an executable (`chmod 544 readahead.sh`), and run it as shown.

## How to show or set read-ahead values

To show the current read-ahead value (the returned value is in KiB), run the following command:

`$ ./readahead.sh show <mount-point>`

To set a new value for read-ahead, run the following command:

`$ ./readahead.sh set <mount-point> [read-ahead-kb]`

### Example

```
#!/bin/bash
# set | show readahead for a specific mount point
# Useful for things like NFS and if you do not know / care about the backing device
#
# To the extent possible under law, Red Hat, Inc. has dedicated all copyright
# to this software to the public domain worldwide, pursuant to the
# CC0 Public Domain Dedication. This software is distributed without any warranty.
# See <http://creativecommons.org/publicdomain/zero/1.0/>.
#

E_BADARGS=22
function myusage() {
    echo "Usage: `basename $0` set|show <mount-point> [read-ahead-kb]"
}

if [ $# -gt 3 -o $# -lt 2 ]; then
    myusage
    exit $E_BADARGS
fi

MNT=${2%/}
BDEV=$(grep $MNT /proc/self/mountinfo | awk '{ print $3 }')

if [ $# -eq 3 -a $1 == "set" ]; then
    echo $3 > /sys/class/bdi/$BDEV/read_ahead_kb
elif [ $# -eq 2 -a $1 == "show" ]; then
    echo "$MNT $BDEV /sys/class/bdi/$BDEV/read_ahead_kb = "$(cat /sys/class/bdi/$BDEV/read_ahead_kb)
else
    myusage
    exit $E_BADARGS
fi
```
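
For example, assuming a hypothetical NFS mount at `/mnt/anf`, the script can be invoked as follows:

```
sudo ./readahead.sh show /mnt/anf         # print the mount, backing bdi device, and its current read_ahead_kb
sudo ./readahead.sh set /mnt/anf 15360    # set read-ahead for this mount to 15,360 KiB
```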

## How to persistently set read-ahead for NFS mounts

To persistently set read-ahead for NFS mounts, `udev` rules can be written as follows:

1. Create and test `/etc/udev/rules.d/99-nfs.rules`:

    `SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes", ATTR{read_ahead_kb}="15360"`

2. Apply the `udev` rule:

    `$ udevadm control --reload`

## Next steps

* [Linux direct I/O best practices for Azure NetApp Files](performance-linux-direct-io.md)
* [Linux filesystem cache best practices for Azure NetApp Files](performance-linux-filesystem-cache.md)
* [Linux NFS mount options best practices for Azure NetApp Files](performance-linux-mount-options.md)
* [Linux concurrency best practices](performance-linux-concurrency-session-slots.md)
* [Azure virtual machine SKUs best practices](performance-virtual-machine-sku.md)
* [Performance benchmarks for Linux](performance-benchmarks-linux.md)
