Buffer lifetime tracking and zero-copy XShm buffers on GNU/Linux #539
jholveck wants to merge 5 commits into
Summary
=======

I’ve spoken at length about the importance of avoiding copies. This PR eliminates the remaining (CPU-side) copy: copying from the OS-supplied buffer to a Python bytes / bytearray object.

We introduce buffer lifetime tracking in MSS so backends can safely reclaim or reuse screenshot buffers when downstream consumers are truly done with them. On GNU/Linux, the XShmGetImage backend now uses that mechanism to enable zero-copy screenshot buffers on Python 3.12 and newer.

Benchmark
---------

I ran a benchmark on my home computer (Ryzen 7 2700X, B-450 chipset, DDR4-2133, RTX 3090). I captured 1000 iterations of 3840×2160 screenshots as quickly as possible while forcing all pixel data to be read (using a NumPy sum). I ran A/B tests with the new feature enabled and disabled, taking the best of three runs.

Capture time decreased from 22.64 ms to 18.59 ms per frame. Put differently, this is from 44 FPS to 54 FPS. That is approximately 18% less time per frame, or about 22% more frames per second.

Why
===

Previously, backends had to copy screenshot data into fresh Python-owned buffers to avoid reusing memory that might still be referenced by NumPy, Pillow, or other buffer consumers. That copy cost is significant for large captures and high frame-rate use cases.

What changed
============

Internal infrastructure
-----------------------

This change keeps MSS user-facing behavior the same while improving backend memory handling and performance.

- Added new buffer-finalization plumbing that lets backends attach a finalizer to screenshot buffer ownership.
- Updated core typing/contracts so grab can return generic buffer-compatible objects. (The user-facing contracts were updated in BoboTiG#521 and others, but this updates the internal contracts.)
- Expanded documentation and release notes to explain direct buffer behavior and platform/version scope.
- Updated packaging tests and test dependencies to include new buffer and integration test coverage.

XShmGetImage backend
--------------------

- Reworked the GNU/Linux XShmGetImage backend to use a reusable SHM slot pool with dynamic growth and finalizer-driven slot return.
- Added shutdown/cleanup safeguards so slot destruction and connection shutdown are coordinated safely, including finalizer interactions.
- Kept fallback behavior intact when MIT-SHM is unavailable or unsuitable.

Behavior by runtime
===================

- Python 3.12+ on GNU/Linux XShm backend:
  - zero-copy buffer exposure from SHM-backed storage
  - SHM slot is released when downstream buffer users release it
- Pre-3.12:
  - copy-based behavior is retained
  - finalization happens immediately after copy

Testing
=======

- Added focused unit tests for buffer-finalizer semantics, including fast and slow paths and downstream memoryview trees.
- Added GNU/Linux backend lifecycle tests covering:
  - release on normal finalization
  - failure while wrapping finalizing buffers
  - finalization after close
  - pre-3.12 immediate-finalization behavior
  - dynamic SHM pool growth failure behavior
  - threaded release during close to validate shutdown race protections

Notes for maintainers
=====================

- This PR is intentionally backend-agnostic at the plumbing layer, with initial zero-copy adoption in GNU/Linux XShmGetImage.
- The design keeps the existing user-facing API, while making buffer lifetime explicit for backend resource management.
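The description does not say why zero-copy is limited to Python 3.12 and newer. One plausible reason (an assumption here, not taken from the PR's code) is that Python 3.12 added PEP 688, which lets a Python class implement the buffer protocol through `__buffer__` and `__release_buffer__`. The sketch below shows how a finalizer could be tied to buffer release under that assumption, with the pre-3.12 branch mirroring the "copy, then finalize immediately" behavior listed above. `FinalizingBuffer` and `wrap_for_consumers` are hypothetical names, not MSS APIs.

```python
import sys


class FinalizingBuffer:
    """Hypothetical wrapper: exposes `storage` without copying and calls
    `finalizer` once the last exported view is released (Python 3.12+)."""

    def __init__(self, storage, finalizer):
        self._storage = storage
        self._finalizer = finalizer
        self._exports = 0

    def __buffer__(self, flags):
        # Called when a consumer asks for a view, e.g. memoryview(obj).
        if self._finalizer is None:
            raise BufferError("storage was already returned to its owner")
        self._exports += 1
        return memoryview(self._storage)

    def __release_buffer__(self, view):
        # Called when a consumer releases the view obtained above.
        view.release()
        self._exports -= 1
        if self._exports == 0:
            finalizer, self._finalizer = self._finalizer, None
            finalizer()  # e.g. hand the SHM slot back to the pool


def wrap_for_consumers(storage, finalizer):
    """Return a buffer object for downstream users (hypothetical helper)."""
    if sys.version_info >= (3, 12):
        return FinalizingBuffer(storage, finalizer)
    # Older interpreters: copy out, then the storage is free immediately.
    data = bytes(storage)
    finalizer()
    return data
```

This simplified version only finalizes after at least one view has been exported and released; a real implementation would also have to handle a wrapper that is dropped without ever being exported.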
|
That's so good! 🚀
|
Thank you! @halldorfannar has been unavailable, but is back in action now. I'd like him to take a look at this before you commit it, but I think it's in good shape. I think it's going to be pretty easy to use this for Windows GDI. I haven't yet looked at the macOS side; it's probably not hard to speed it up with this too, but I haven't done macOS development in many years. Once we commit this part, we might open an issue to see if other contributors want to tackle it. |
halldorfannar
left a comment
Good job!
I had one minor nit about a docstring (see comments), but then I had a bigger issue that I feel we should document somewhere: the memory requirements of the new fast path. Since by default we allocate two color buffers of the virtual monitor size, this can be a substantial amount of memory. For example, I typically run a 4k monitor and then a laptop screen, and these are offset from each other, creating a rectangle that is much larger than the sum of these two screens. Depending on how people organize their code, we may end up with even more buffers than just two. I think we should mention this extra memory requirement somewhere, so users are aware.
I don't mind this approach, especially as a start and a way to do this with minimal changes to the existing API. I just feel we need to surface it. In the future we can look at ways for users to give us hints as they initialize the library, so we can use less memory. But that is future music.
I can see that. However, I don't think it represents extra memory usage with this code. This is mostly because of the transient need for the transfer buffer in the copying mode, which is obviously not needed in zero-copy mode. Indeed, I think that the zero-copy mode may use less memory in some cases.
In the copying case (the previous code, or this code prior to Python 3.12), you'd still end up with just as much memory consumed in the basic case: one framebuffer's worth to hold the shared memory buffer, and a second for the bytearray to hold the Python-usable copy.
If a program only takes a single screenshot, I actually question whether the second allocated framebuffer will even consume any memory pages. It's up to the X server how it allocates the shared memory region, but I suspect it will typically be zeroed pages. Linux will lazily back pages, so if the user only takes a single screenshot, those pages will never actually be allocated (backed). That's an implementation detail I'm uncertain of, and again, in the single screenshot case, the worst case is still no worse than the copying mode.
You mention that there may be more than two buffers allocated, depending on the way the user is using them. That's true, although it's still the same memory use as in the copying case (previous code): it only happens if the user retains references to the returned buffers in the zero-copy case, which would also hold the bytearrays in the copying case. The zero-copy mode uses less, actually, since there isn't the need for the transient buffer that's used during the copy: you end up needing only N buffers' worth, instead of N+1 transiently.
But let's return to the case where the user isn't doing any sort of pipelining, and is just using one screenshot at a time. I am mostly writing with this loop in mind:
There, the copying mode in the steady state actually ends up needing, transiently, three framebuffers' worth. That's one to hold the data transferred from the system, one for the bytearray referenced by img, and one for the bytearray being created in the grab. The zero-copy mode only needs two, for the buffers for each image that exists during the second grab (one still referenced from the previous loop, and one in the new grab). (The analysis for if the user does
Additionally, the zero-copy code has another potential advantage on memory-constrained systems, once the MSS object is closed and the buffers are unreferenced. It will unmap the memory at that time, returning it to the OS. In the copying mode, the bytearrays get garbage collected, but the memory still gets returned just to Python, not the OS.
As I finish editing this reply, I realize that my analyses have mostly been based on the assumption of full-screen grabs. I can repeat my analysis with the idea of small screen grabs cropped out of huge framebuffers. I decided to get this reply out to you rather than trying the new analysis. Note that, in that case, the question of lazily-backed pages becomes much more relevant.
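The counting argument in the reply above can be condensed into a small model. This is a sketch of the reasoning only, under the same full-screen assumption; `peak_framebuffers` is a hypothetical helper, not part of MSS, and it counts framebuffer-sized allocations that are live during a grab rather than measuring real memory.

```python
def peak_framebuffers(held_images: int, zero_copy: bool) -> int:
    """Framebuffer-sized allocations alive while one new grab is in progress.

    held_images: screenshots from earlier grabs that are still referenced.
    """
    if held_images < 0:
        raise ValueError("held_images must be non-negative")
    new_image = 1  # the buffer that will back the screenshot being taken
    if zero_copy:
        # Each held image pins one pool slot; the new grab takes one more.
        return held_images + new_image
    # Copying mode also needs the SHM transfer buffer the data is copied from.
    transfer_buffer = 1
    return transfer_buffer + held_images + new_image


# One-at-a-time loop: the previous image is still referenced during the grab.
print(peak_framebuffers(1, zero_copy=False))  # copying mode
print(peak_framebuffers(1, zero_copy=True))   # zero-copy mode
```

With one image still held from the previous iteration, the model gives three buffers for the copying mode and two for zero-copy, matching the counts in the reply; with no held images, the copying mode still needs two.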
|
I'll separately add some notes about the numbers, to get a sense of scale. A 4k resolution monitor, in BGRA, represents slightly less than 32 MB. For two 4k monitors, two pre-allocated buffers would represent just under 128 MB if they're stacked horizontally or vertically, or 256 MB in the unlikely event they're arranged diagonally. The most memory-limited platform that can handle two such displays is probably a Raspberry Pi 4. That's got 1 to 8 GB of RAM. The Raspberry Pi does take the GPU RAM out of the SoC RAM, though, so you might reasonably see, say, about 768 MB of system RAM on a system with two 4k monitors, with some extra VRAM set aside for textures. So it's not ludicrous to consider a case where the MSS memory consumption is a sizable portion of the available system RAM. Of course, anybody developing an application targeting such a setup is going to be accustomed to careful consideration of the memory usage. As I said above, I don't think this is new to, or worsened by, the zero-copy mode.
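For reference, the sizes quoted above can be reproduced with a few lines of arithmetic. This is a sketch that assumes 4 bytes per BGRA pixel and reports binary megabytes (MiB), which is the reading under which "slightly less than 32" holds; `framebuffer_mib` is a hypothetical helper name.

```python
BYTES_PER_PIXEL = 4  # BGRA, one byte per channel


def framebuffer_mib(width: int, height: int) -> float:
    """Size of one framebuffer of the given dimensions, in MiB."""
    return width * height * BYTES_PER_PIXEL / 2**20


single_4k = framebuffer_mib(3840, 2160)
# Two pre-allocated buffers covering two 4k monitors side by side.
two_buffers_side_by_side = 2 * framebuffer_mib(2 * 3840, 2160)
# Two pre-allocated buffers when the monitors are arranged diagonally.
two_buffers_diagonal = 2 * framebuffer_mib(2 * 3840, 2 * 2160)

print(f"{single_4k:.2f} MiB")
print(f"{two_buffers_side_by_side:.2f} MiB")
print(f"{two_buffers_diagonal:.2f} MiB")
```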
|
There's another weakness in my analysis that I failed to point out. I'm mostly looking at the peak memory usage, on the assumption that this is what the user is going to need to deal with. In the common loop I gave, if (and only if) the This is only applicable when taking multiple screenshots in a loop, one at a time, and explicitly releasing one before taking the next. In other cases, such as if pipelining or not explicitly releasing them, the zero-copy mode has a lower transient memory usage during the grab, and the same memory usage in the rest of the loop. (All still subject to my caveat about having only analyzed the full-screen use cases.)
|
I agree that there are scenarios where this new approach will actually save us memory. I just wanted to paint a scenario where it didn't. It seems I wasn't clear on that scenario. It's specifically when you plan to just capture a single monitor, let's say my laptop screen, and for this case let's have it running at 1920x1080. It's aligned with my 4k monitor so that the virtual combined rectangle is taller than 4k (because my laptop sits lower than my 4k monitor). The virtual width is just the sum of the regular screen resolutions, 1920 + 3840, but the virtual height is not the maximum of the two. The virtual height (due to the offset) is about 500 pixels taller, so we get max(2160,1080) + 500, which is about 2500 (using approximate math here). This gives us 5760 x 2500 pixels, and at 4 bytes per pixel that gives us almost 55 MiB. And we allocate two of them. So we are now at about 110 MiB before calling grab.
In the old (slow) path we wouldn't preallocate, and when I call grab we would capture my 1920x1080 screen, resulting in 7.9 MiB. In that path we copy, so we get double that, a total of 16 MiB.
Now, I understand that 110 MiB is not a lot of memory today. But going from 16 MiB to 110 MiB is an order of magnitude change. This is why I felt it was prudent to mention it in the notes, just so that users are aware. I had suggested that in the future we could allow users to provide a hint, so that in the case I outline above (which isn't a fabricated case, it is my setup) I could tell the system to just preallocate for my laptop screen dimensions.
If you feel that I'm worrying about peanuts, that's fine. I have, however, found that any such change to software that has existed for a long time should be called out. I also recognize that a user who has two monitors, one of them 4K, is more likely to have more memory, so perhaps we should just assume that my scenario will not be an issue for folks. But I still wanted us to make that an explicit decision, rather than an implicit one 😺
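The scenario above comes down to a bounding-rectangle calculation, sketched below. The monitor positions are hypothetical, chosen so that the virtual screen matches the approximate 5760 x 2500 figure quoted in the comment; `Rect`, `bounding_rect`, and `mib` are illustrative names, not MSS APIs.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def bounding_rect(rects):
    """Smallest rectangle that contains every monitor rectangle."""
    left = min(r.left for r in rects)
    top = min(r.top for r in rects)
    right = max(r.left + r.width for r in rects)
    bottom = max(r.top + r.height for r in rects)
    return Rect(left, top, right - left, bottom - top)


def mib(rect, bytes_per_pixel=4):
    """Size in MiB of a BGRA buffer covering `rect`."""
    return rect.width * rect.height * bytes_per_pixel / 2**20


monitor_4k = Rect(0, 0, 3840, 2160)
laptop = Rect(3840, 1420, 1920, 1080)  # hypothetical offset below the 4k screen
virtual = bounding_rect([monitor_4k, laptop])

print(virtual)
print(f"two virtual-screen buffers: {2 * mib(virtual):.2f} MiB")
print(f"laptop capture plus one copy: {2 * mib(laptop):.2f} MiB")
```

Under these assumed positions the two preallocated buffers come to roughly 110 MiB, against roughly 16 MiB for a single laptop-screen capture and its copy.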
|
Ok! You're correct: the single-monitor scenario is the "small part of the whole framebuffer" caveat I mentioned. I do agree that increased memory usage is something that we may want to note. I described the memory requirements on a Raspberry Pi to illustrate that this isn't necessarily negligible.
As an aside, prior to this code (in 10.2), we did preallocate a full framebuffer for the SHM region. We didn't do that in 10.1, since that didn't support SHM.
I did some tests, and verified that the memory doesn't get backed until it's written to. If we're only capturing a small region (such as a single monitor), then the unused region doesn't get backed; it still points to the zero page. For my test, I captured a 640x480 region from the middle of the screen, and verified that only the expected 1200 kB was backed in the shared memory region. (I did a few variations on this test too, which were all unsurprising.) So that might moot the whole concern, or at least greatly reduce the applicability.
I'm pushing back slightly because I'm having a hard time thinking about how to phrase this in the docs. It should be accurate, of course. It should give affected users actionable information. But the vast majority of users, many of whom are newer programmers, are not affected, and I also wouldn't want to cause undue concern among them. Premature optimization, and all that. That said, if you have a way to document this in mind, then I'm fine with including something.
I have thought that the number of buffers to preallocate would be a reasonable option to add to the MSS constructor. This would be useful for users of pipelining architectures, who want to preallocate enough buffers on startup instead of on demand. I've tried to avoid unnecessary timing jitter on the first few frames, before steady state is reached. (Lazy page backing is still a source of startup jitter, but is very minimal.)
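The "expected 1200 kB" figure follows from the capture size and the page size. The sketch below assumes 4 KiB pages and that the captured pixels are written as one packed run at the start of the shared memory segment; both are assumptions for illustration, and `backed_kib` is a hypothetical helper.

```python
import math

PAGE_SIZE = 4096     # assumed page size in bytes
BYTES_PER_PIXEL = 4  # BGRA


def backed_kib(width: int, height: int) -> int:
    """KiB of pages touched by writing a packed width x height capture."""
    pages = math.ceil(width * height * BYTES_PER_PIXEL / PAGE_SIZE)
    return pages * PAGE_SIZE // 1024


print(backed_kib(640, 480))    # the small region from the test
print(backed_kib(3840, 2160))  # a full 4k framebuffer, for comparison
```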
I didn't bother with this in the initial implementation, since scanning the buffer list is so cheap. But in the PR discussions, we realized that the buffers will be lazily backed, with no physical memory allocation until they're written to. This means that a user targeting a platform with heavily constrained memory might want to reuse the same buffer if possible, rather than ping-ponging between the two buffers with each capture. If they are careful to free one screenshot before allocating another, then we should prefer to not back the other buffer unnecessarily. In other words, this helps memory-conscious users save memory, where they previously didn't have a choice.
It might also be reasonable to, at the time a buffer is released (returned to the pool), madvise the now-obsolete data with MADV_DONTNEED (or similar flags, like MADV_REMOVE or MADV_FREE). That has some significant historical quirks on Linux that I didn't want to deal with. The implementation complexity seemed significant, and at first glance it seems likely to cause more overhead from page table churn than it would save.
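The reuse preference described above can be illustrated with a small pool that always hands out the lowest-numbered free slot and grows only when every slot is busy. This is a sketch under stated assumptions, not the backend's code: `SlotPool` is a hypothetical class, and plain `bytearray` objects stand in for SHM segments (unlike lazily backed shared memory, a `bytearray` is allocated eagerly).

```python
class SlotPool:
    """Hypothetical buffer pool that prefers reusing the first free slot."""

    def __init__(self, slot_size: int, initial_slots: int = 2):
        self._slot_size = slot_size
        self._slots = [bytearray(slot_size) for _ in range(initial_slots)]
        self._in_use = [False] * initial_slots

    def __len__(self) -> int:
        return len(self._slots)

    def acquire(self):
        """Return (index, buffer) for a free slot, growing the pool if needed."""
        # Scanning from index 0 means a caller that releases each screenshot
        # before the next grab keeps getting slot 0, so slot 1 is never written.
        for index, busy in enumerate(self._in_use):
            if not busy:
                self._in_use[index] = True
                return index, self._slots[index]
        # Every slot is still referenced downstream: grow dynamically.
        self._slots.append(bytearray(self._slot_size))
        self._in_use.append(True)
        return len(self._slots) - 1, self._slots[-1]

    def release(self, index: int) -> None:
        """Mark a slot free again; a buffer finalizer would call this."""
        self._in_use[index] = False
```

A caller that releases before each new grab stays on slot 0, while a caller that holds on to earlier screenshots makes the pool grow past its initial two slots.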
|
Makes sense. Thanks for going the extra mile to dig into this. I'm good with this. Let's merge! |
|
Ok! I've marked it as ready for BoboTiG's review. I looked at adding a flag to say how many buffers to preallocate, but I think the way we handle things like _choose_impl should get some improvements first. |
|
👍 for me. Let me know when ready to merge. |
|
I'm good! |
Changes proposed in this PR
Fixes #424 (I accidentally said this was a duplicate of #476, but it's actually separate)
May be relevant to #222, as it lays the groundwork to lower CPU usage. However, this PR doesn't affect Windows.
./check.sh passed