| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +==================================================== |
| 4 | +pin_user_pages() and related calls |
| 5 | +==================================================== |
| 6 | + |
| 7 | +.. contents:: :local: |
| 8 | + |
| 9 | +Overview |
| 10 | +======== |
| 11 | + |
| 12 | +This document describes the following functions:: |
| 13 | + |
| 14 | + pin_user_pages() |
| 15 | + pin_user_pages_fast() |
| 16 | + pin_user_pages_remote() |
| 17 | + |
| 18 | +Basic description of FOLL_PIN |
| 19 | +============================= |
| 20 | + |
| 21 | +FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*() |
| 22 | +("gup") family of functions. FOLL_PIN has significant interactions and |
| 23 | +interdependencies with FOLL_LONGTERM, so both are covered here. |
| 24 | + |
| 25 | +FOLL_PIN is internal to gup, meaning that it should not appear at the gup call |
| 26 | +sites. This allows the associated wrapper functions (pin_user_pages*() and |
| 27 | +others) to set the correct combination of these flags, and to check for problems |
| 28 | +as well. |
| 29 | + |
| 30 | +FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites. |
| 31 | +This is in order to avoid creating a large number of wrapper functions to cover |
| 32 | +all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the |
| 33 | +pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so |
| 34 | +that's a natural dividing line, and a good point to make separate wrapper calls. |
| 35 | +In other words, use pin_user_pages*() for DMA-pinned pages, and |
| 36 | +get_user_pages*() for other cases. There are four cases described later on in |
| 37 | +this document, to further clarify that concept. |
| 38 | + |
| 39 | +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, |
| 40 | +multiple threads and call sites are free to pin the same struct pages, via both |
| 41 | +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the |
| 42 | +other, not the struct page(s). |
| 43 | + |
| 44 | +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN |
| 45 | +uses a different reference counting technique. |
| 46 | + |
| 47 | +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, |
| 48 | +FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN. |
| 49 | + |
| 50 | +Which flags are set by each wrapper |
| 51 | +=================================== |
| 52 | + |
| 53 | +For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup |
| 54 | +flags the caller provides. The caller is required to pass in a non-null struct |
| 55 | +pages* array, and the function then pins pages by incrementing the refcount |
| 56 | +of each by a special value. For now, that value is +1, just like get_user_pages*().:: |
| 57 | + |
| 58 | + Function |
| 59 | + -------- |
| 60 | + pin_user_pages FOLL_PIN is always set internally by this function. |
| 61 | + pin_user_pages_fast FOLL_PIN is always set internally by this function. |
| 62 | + pin_user_pages_remote FOLL_PIN is always set internally by this function. |
| 63 | + |
| 64 | +For these get_user_pages*() functions, FOLL_GET might not even be specified. |
| 65 | +Behavior is a little more complex than above. If FOLL_GET was *not* specified, |
| 66 | +but the caller passed in a non-null struct pages* array, then the function |
| 67 | +sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount |
| 68 | +of each page by +1.:: |
| 69 | + |
| 70 | + Function |
| 71 | + -------- |
| 72 | + get_user_pages FOLL_GET is sometimes set internally by this function. |
| 73 | + get_user_pages_fast FOLL_GET is sometimes set internally by this function. |
| 74 | + get_user_pages_remote FOLL_GET is sometimes set internally by this function. |
| 75 | + |
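The two flag-setting behaviors described above can be sketched as a small userspace model. This is only an illustration, not the kernel implementation: the `MODEL_FOLL_*` values and both function names are hypothetical, and rejecting a caller-supplied FOLL_PIN or FOLL_GET with -EINVAL is an assumed policy chosen to reflect the "mutually exclusive" and "internal to gup" rules stated earlier.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values for this model; the kernel's real values differ. */
#define MODEL_FOLL_GET      0x1u
#define MODEL_FOLL_PIN      0x2u
#define MODEL_FOLL_LONGTERM 0x4u

/*
 * Model of a pin_user_pages*() wrapper: FOLL_PIN is always OR'd in.
 * Assumption: a caller passing FOLL_PIN or FOLL_GET is treated as an error.
 */
static int model_pin_user_pages_flags(unsigned int caller_flags,
                                      unsigned int *effective)
{
    assert(effective != NULL);
    if (caller_flags & (MODEL_FOLL_GET | MODEL_FOLL_PIN))
        return -EINVAL;
    *effective = caller_flags | MODEL_FOLL_PIN;
    return 0;
}

/*
 * Model of a get_user_pages*() call: FOLL_GET is added only when the
 * caller supplied a pages array.
 */
static int model_get_user_pages_flags(unsigned int caller_flags,
                                      int have_pages_array,
                                      unsigned int *effective)
{
    assert(effective != NULL);
    if (caller_flags & MODEL_FOLL_PIN)
        return -EINVAL;
    if (have_pages_array)
        caller_flags |= MODEL_FOLL_GET;
    *effective = caller_flags;
    return 0;
}
```

In this model, a caller that passes only `MODEL_FOLL_LONGTERM` to the pin wrapper ends up with `MODEL_FOLL_PIN | MODEL_FOLL_LONGTERM`, while the get variant with no pages array leaves the flags untouched.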
| 76 | +Tracking dma-pinned pages |
| 77 | +========================= |
| 78 | + |
| 79 | +Some of the key design constraints, and solutions, for tracking dma-pinned |
| 80 | +pages: |
| 81 | + |
| 82 | +* An actual reference count, per struct page, is required. This is because |
| 83 | + multiple processes may pin and unpin a page. |
| 84 | + |
| 85 | +* False positives (reporting that a page is dma-pinned, when in fact it is not) |
| 86 | + are acceptable, but false negatives are not. |
| 87 | + |
| 88 | +* struct page may not be increased in size for this, and all fields are already |
| 89 | + used. |
| 90 | + |
| 91 | +* Given the above, we can overload the page->_refcount field by using, sort of, |
| 92 | + the upper bits in that field for a dma-pinned count. "Sort of", means that, |
| 93 | + rather than dividing page->_refcount into bit fields, we simply add a |
| 94 | + medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to |
| 95 | + page->_refcount. This provides fuzzy behavior: if a page has get_page() called |
| 96 | + on it 1024 times, then it will appear to have a single dma-pinned count. |
| 97 | + And again, that's acceptable. |
| 98 | + |
| 99 | +This also leads to limitations: there are only 31-10==21 bits available for a |
| 100 | +counter that increments 10 bits at a time. |
| 101 | + |
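The biased-counting idea can be illustrated with a minimal userspace model. It assumes the 1024 bias named above; `struct model_page` and the `model_*` helpers are hypothetical stand-ins, not kernel APIs, and the pinned check (refcount at or above the bias) is one plausible reading of the scheme rather than the confirmed implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024 /* value named in the text above */

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct model_page {
    int refcount;
};

/* An ordinary reference adds 1. */
static void model_get_page(struct model_page *p)
{
    p->refcount += 1;
}

/* A dma-pin adds the whole bias, so it lands in the "upper bits". */
static void model_pin_page(struct model_page *p)
{
    p->refcount += GUP_PIN_COUNTING_BIAS;
}

static void model_unpin_page(struct model_page *p)
{
    /* Unpinning without a prior pin would corrupt the count. */
    assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
    p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* Fuzzy query: false positives are possible, false negatives are not. */
static bool model_page_dma_pinned(const struct model_page *p)
{
    return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

A page that starts at refcount 1 reports as pinned after one `model_pin_page()` call, and also after 1023 further `model_get_page()` calls with no pin at all, which is the acceptable false positive described above.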
| 102 | +TODO: for 1GB and larger huge pages, this is cutting it close. That's because |
| 103 | +when pin_user_pages() follows such pages, it increments the head page by "1" |
| 104 | +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for |
| 105 | +pin_user_pages()) for each tail page. So if you have a 1GB huge page: |
| 106 | + |
| 107 | +* There are 256K (18 bits) worth of 4 KB tail pages. |
| 108 | +* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, |
| 109 | + 10 bits at a time) |
| 110 | +* There are 21 - 18 == 3 bits available to count. Except that there aren't, |
| 111 | + because you need to allow for a few normal get_page() calls on the head page, |
| 112 | + as well. Fortunately, the approach of using addition, rather than "hard" |
| 113 | + bitfields, within page->_refcount, allows for sharing these bits gracefully. |
| 114 | + But we're still looking at about 8 references. |
| 115 | + |
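The arithmetic in the list above can be checked with a short sketch. The constant and function names are hypothetical; the 31 usable refcount bits and the 10-bit bias come from the text, and the result ignores the ordinary get_page() references that share the same counter.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_REFCOUNT_BITS 31 /* usable bits in the refcount, per the text */
#define MODEL_BIAS_BITS     10 /* GUP_PIN_COUNTING_BIAS == 1 << 10 */

/* Number of base pages covered by one huge page. */
static uint64_t model_tail_pages(uint64_t huge_size, uint64_t base_size)
{
    assert(base_size != 0);
    return huge_size / base_size;
}

/* Total pins the counter can hold: 2^(31 - 10) == 2^21. */
static uint64_t model_max_pins(void)
{
    return UINT64_C(1) << (MODEL_REFCOUNT_BITS - MODEL_BIAS_BITS);
}

/*
 * Pinning a whole huge page costs one biased increment per tail page,
 * all charged to the head page, so divide the pin budget by that count.
 */
static uint64_t model_max_full_huge_pins(uint64_t huge_size, uint64_t base_size)
{
    return model_max_pins() / model_tail_pages(huge_size, base_size);
}
```

For a 1 GB huge page made of 4 KB pages, `model_tail_pages()` yields 262144 (2^18) and `model_max_full_huge_pins()` yields 8, matching the "about 8 references" estimate.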
| 116 | +This, however, is a missing feature more than anything else, because it's easily |
| 117 | +solved by addressing an obvious inefficiency in the original get_user_pages() |
| 118 | +approach of retrieving pages: stop treating all the pages as if they were |
| 119 | +PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of |
| 120 | +this, so some work is required. Once that's in place, this limitation mostly |
| 121 | +disappears from view, because there will be ample refcounting range available. |
| 122 | + |
| 123 | +* Callers must specifically request "dma-pinned tracking of pages". In other |
| 124 | + words, just calling get_user_pages() will not suffice; a new set of functions, |
| 125 | + pin_user_pages() and related, must be used. |
| 126 | + |
| 127 | +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags |
| 128 | +========================================================== |
| 129 | + |
| 130 | +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing |
| 131 | +these categories: |
| 132 | + |
| 133 | +CASE 1: Direct IO (DIO) |
| 134 | +----------------------- |
| 135 | +There are GUP references to pages that are serving |
| 136 | +as DIO buffers. These buffers are needed for a relatively short time (so they |
| 137 | +are not "long term"). No special synchronization with page_mkclean() or |
| 138 | +munmap() is provided. Therefore, flags to set at the call site are: :: |
| 139 | + |
| 140 | + FOLL_PIN |
| 141 | + |
| 142 | +...but rather than setting FOLL_PIN directly, call sites should use one of |
| 143 | +the pin_user_pages*() routines that set FOLL_PIN. |
| 144 | + |
| 145 | +CASE 2: RDMA |
| 146 | +------------ |
| 147 | +There are GUP references to pages that are serving as DMA |
| 148 | +buffers. These buffers are needed for a long time ("long term"). No special |
| 149 | +synchronization with page_mkclean() or munmap() is provided. Therefore, flags |
| 150 | +to set at the call site are: :: |
| 151 | + |
| 152 | + FOLL_PIN | FOLL_LONGTERM |
| 153 | + |
| 154 | +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's |
| 155 | +because DAX pages do not have a separate page cache, and so "pinning" implies |
| 156 | +locking down file system blocks, which is not (yet) supported in that way. |
| 157 | + |
| 158 | +CASE 3: Hardware with page faulting support |
| 159 | +------------------------------------------- |
| 160 | +Here, a well-written driver doesn't normally need to pin pages at all. However, |
| 161 | +if the driver does choose to do so, it can register MMU notifiers for the range, |
| 162 | +and will be called back upon invalidation. Either way (avoiding page pinning, or |
| 163 | +using MMU notifiers to unpin upon request), there is proper synchronization with |
| 164 | +both filesystem and mm (page_mkclean(), munmap(), etc). |
| 165 | + |
| 166 | +Therefore, neither flag needs to be set. |
| 167 | + |
| 168 | +In this case, ideally, neither get_user_pages() nor pin_user_pages() should be |
| 169 | +called. Instead, the software should be written so that it does not pin pages. |
| 170 | +This allows mm and filesystems to operate more efficiently and reliably. |
| 171 | + |
| 172 | +CASE 4: Pinning for struct page manipulation only |
| 173 | +------------------------------------------------- |
| 174 | +Here, normal GUP calls are sufficient, so neither flag needs to be set. |
| 175 | + |
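The four cases can be summarized as a lookup from use case to flags. This is a sketch for illustration only: the enum, the `MODEL_FOLL_*` values, and the function name are hypothetical, and in real code CASE 1 and CASE 2 would go through a pin_user_pages*() wrapper rather than setting FOLL_PIN at the call site.

```c
#include <assert.h>

/* Hypothetical flag values for this model; the kernel's real values differ. */
#define MODEL_FOLL_PIN      0x1u
#define MODEL_FOLL_LONGTERM 0x2u

enum model_gup_use_case {
    MODEL_CASE_DIRECT_IO = 1,    /* CASE 1: short-term DIO buffers */
    MODEL_CASE_RDMA,             /* CASE 2: long-term DMA buffers */
    MODEL_CASE_HW_PAGE_FAULT,    /* CASE 3: hardware with page faulting */
    MODEL_CASE_STRUCT_PAGE_ONLY  /* CASE 4: struct page manipulation only */
};

/* Returns the flags each case ends up with, as described in the text. */
static unsigned int model_flags_for_case(enum model_gup_use_case c)
{
    switch (c) {
    case MODEL_CASE_DIRECT_IO:
        return MODEL_FOLL_PIN;
    case MODEL_CASE_RDMA:
        return MODEL_FOLL_PIN | MODEL_FOLL_LONGTERM;
    case MODEL_CASE_HW_PAGE_FAULT:
    case MODEL_CASE_STRUCT_PAGE_ONLY:
        return 0; /* neither flag needs to be set */
    }
    assert(!"unknown use case");
    return 0;
}
```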
| 176 | +page_dma_pinned(): the whole point of pinning |
| 177 | +============================================= |
| 178 | + |
| 179 | +The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able |
| 180 | +to query, "is this page DMA-pinned?" That allows code such as page_mkclean() |
| 181 | +(and file system writeback code in general) to make informed decisions about |
| 182 | +what to do when a page cannot be unmapped due to such pins. |
| 183 | + |
| 184 | +What to do in those cases is the subject of a years-long series of discussions |
| 185 | +and debates (see the References at the end of this document). It's a TODO item |
| 186 | +here: fill in the details once that's worked out. Meanwhile, it's safe to say |
| 187 | +that having this available: :: |
| 188 | + |
| 189 | + static inline bool page_dma_pinned(struct page *page) |
| 190 | + |
| 191 | +...is a prerequisite to solving the long-running gup+DMA problem. |
| 192 | + |
| 193 | +Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM |
| 194 | +=================================================================== |
| 195 | + |
| 196 | +Another way of thinking about these flags is as a progression of restrictions: |
| 197 | +FOLL_GET is for struct page manipulation, without affecting the data that the |
| 198 | +struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for |
| 199 | +short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is |
| 200 | +a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more |
| 201 | +restrictive case that has FOLL_PIN as a prerequisite: this is for pages that |
| 202 | +will be pinned longterm, and whose data will be accessed. |
| 203 | + |
| 204 | +Unit testing |
| 205 | +============ |
| 206 | +This file:: |
| 207 | + |
| 208 | + tools/testing/selftests/vm/gup_benchmark.c |
| 209 | + |
| 210 | +has the following new calls to exercise the new pin*() wrapper functions: |
| 211 | + |
| 212 | +* PIN_FAST_BENCHMARK (./gup_benchmark -a) |
| 213 | +* PIN_BENCHMARK (./gup_benchmark -b) |
| 214 | + |
| 215 | +You can monitor how many total dma-pinned pages have been acquired and released |
| 216 | +since the system was booted, via two new /proc/vmstat entries: :: |
| 217 | + |
| 218 | + /proc/vmstat/nr_foll_pin_requested |
| 219 | + /proc/vmstat/nr_foll_pin_returned |
| 220 | + |
| 221 | +Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is |
| 222 | +because there is a noticeable performance drop in unpin_user_page(), when they |
| 223 | +are activated. |
| 224 | + |
| 225 | +References |
| 226 | +========== |
| 227 | + |
| 228 | +* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_ |
| 229 | +* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_ |
| 230 | +* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_ |
| 231 | + |
| 232 | +John Hubbard, October, 2019 |