Buffer lifetime tracking and zero-copy XShm buffers on GNU/Linux #539

Open
jholveck wants to merge 5 commits into BoboTiG:main from jholveck:reusable-buffers

@jholveck jholveck commented Jun 2, 2026

Contributor

Summary

I’ve spoken at length about the importance of avoiding copies. This PR eliminates the remaining (CPU-side) copy: copying from the OS-supplied buffer to a Python bytes / bytearray object.

We introduce buffer lifetime tracking in MSS so backends can safely reclaim or reuse screenshot buffers when downstream consumers are truly done with them. On GNU/Linux, the XShmGetImage backend now uses that mechanism to enable zero-copy screenshot buffers on Python 3.12 and newer.

Benchmark

I ran a benchmark on my home computer (Ryzen 7 2700X, B-450 chipset, DDR4-2133, RTX 3090). I captured 1000 iterations of 3840×2160 screenshots as quickly as possible while forcing all pixel data to be read (using a NumPy sum). I ran A/B tests of enabling or disabling the new feature, taking a best-of-three test.

Capture time decreased from 22.64 ms to 18.59 ms per frame. Put differently, this is from 44 FPS to 54 FPS: approximately 18% less time per frame, or about 22% higher throughput.
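The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported per-frame times; the helper name `fps` is made up for the illustration and is not part of MSS.

```python
def fps(ms_per_frame: float) -> float:
    """Convert a per-frame time in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


# Per-frame times reported in the benchmark.
before_ms = 22.64
after_ms = 18.59

# Fraction of per-frame time saved by the zero-copy path.
time_reduction = (before_ms - after_ms) / before_ms

# Relative increase in frames per second.
throughput_gain = fps(after_ms) / fps(before_ms) - 1.0

print(f"{fps(before_ms):.1f} FPS -> {fps(after_ms):.1f} FPS")
print(f"time per frame: -{time_reduction:.1%}, throughput: +{throughput_gain:.1%}")
```

The two percentages differ because one is measured against the old frame time and the other against the old frame rate.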

Why

Previously, backends had to copy screenshot data into fresh Python-owned buffers to avoid reusing memory that might still be referenced by NumPy, Pillow, or other buffer consumers. That copy cost is significant for large captures and high frame-rate use cases.

What changed

Internal infrastructure

This change keeps MSS user-facing behavior the same while improving backend memory handling and performance.

  • Added new buffer-finalization plumbing that lets backends attach a finalizer to screenshot buffer ownership.
  • Updated core typing/contracts so grab can return generic buffer-compatible objects. (The user-facing contracts were updated in #521, "Change the ScreenShot API to be buffer-oriented", and others, but this updates the internal contracts.)
  • Expanded documentation and release notes to explain direct buffer behavior and platform/version scope.
  • Updated packaging tests and test dependencies to include new buffer and integration test coverage.

XShmGetImage backend

  • Reworked the GNU/Linux XShmGetImage backend to use a reusable SHM slot pool with dynamic growth and finalizer-driven slot return.
  • Added shutdown/cleanup safeguards so slot destruction and connection shutdown are coordinated safely, including finalizer interactions.
  • Kept fallback behavior intact when MIT-SHM is unavailable or unsuitable.

Behavior by runtime

  • Python 3.12+ on GNU/Linux XShm backend:
    • zero-copy buffer exposure from SHM-backed storage
    • SHM slot is released when downstream buffer users release it
  • Pre-3.12:
    • copy-based behavior is retained
    • finalization happens immediately after copy
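As a rough illustration of "finalizer-driven slot return" with dynamic growth, here is a minimal sketch. It is not the PR's implementation: `SlotPool` and `Lease` are hypothetical names, plain `bytearray` objects stand in for SHM segments, and `weakref.finalize` stands in for the buffer-release hook. In particular, this sketch only tracks the lifetime of the `Lease` object itself, not of memoryviews exported from it, which is the harder problem the PR addresses on Python 3.12+.

```python
import gc
import weakref


class Lease:
    """Hypothetical handle that a consumer holds while using a slot."""

    def __init__(self, storage: bytearray) -> None:
        self.data = storage


class SlotPool:
    """Hypothetical pool of reusable buffers (stand-ins for SHM slots)."""

    def __init__(self, slot_size: int, initial_slots: int = 2) -> None:
        self._slot_size = slot_size
        self._free = [bytearray(slot_size) for _ in range(initial_slots)]
        self.total_slots = initial_slots

    def free_count(self) -> int:
        return len(self._free)

    def acquire(self) -> Lease:
        if self._free:
            storage = self._free.pop()
        else:
            # Dynamic growth: no free slot, so allocate another one.
            storage = bytearray(self._slot_size)
            self.total_slots += 1
        lease = Lease(storage)
        # When the lease is garbage-collected, put its slot back in the pool.
        weakref.finalize(lease, self._free.append, storage)
        return lease


pool = SlotPool(slot_size=16)
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()   # pool was empty, so it grows to three slots
del first                # dropping the lease returns its slot
gc.collect()             # not needed on CPython, but harmless elsewhere
print(pool.total_slots, pool.free_count())
```

The point of the design is that the backend never has to guess when a consumer is done: the slot comes back only when the last reference goes away.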

Testing

  • Added focused unit tests for buffer-finalizer semantics, including fast and slow paths and downstream memoryview trees.
  • Added GNU/Linux backend lifecycle tests covering:
    • release on normal finalization
    • failure while wrapping finalizing buffers
    • finalization after close
    • pre-3.12 immediate-finalization behavior
    • dynamic SHM pool growth failure behavior
    • threaded release during close to validate shutdown race protections

Notes for maintainers

  • This PR is intentionally backend-agnostic at the plumbing layer, with initial zero-copy adoption in GNU/Linux XShmGetImage.
  • The design keeps the existing user-facing API, while making buffer lifetime explicit for backend resource management.

Changes proposed in this PR

Fixes #424 (I accidentally said this was a duplicate of #476, but it's actually separate)
May be relevant to #222, as it lays the groundwork to lower CPU usage. However, this PR doesn't affect Windows.

  • Tests added/updated
  • Documentation updated
  • Changelog entry added
  • ./check.sh passed

BoboTiG commented Jun 16, 2026

Owner

That is so good! 🚀

@jholveck

Contributor Author

Thank you!

@halldorfannar has been unavailable, but is back in action now. I'd like him to take a look at this before you commit it, but I think it's in good shape. I think it's going to be pretty easy to use this for Windows GDI.

I haven't yet looked at the macOS side; it's probably not hard to speed it up with this too, but I haven't done macOS development in many years. Once we commit this part, we might open an issue to see if other contributors want to tackle it.

@halldorfannar halldorfannar left a comment

Contributor

Good job!

I had one minor nit about a docstring (see comments), but I also have a bigger issue that I feel we should document somewhere: the memory requirements of the new fast path. Since by default we allocate two color buffers of the virtual monitor size, this can be a substantial amount of memory. For example, I typically run a 4k monitor and a laptop screen, and these are offset from each other, creating a rectangle that is much larger than the sum of these two screens. Depending on how people organize their code, we may end up with even more buffers than just two. I think we should mention this extra memory requirement somewhere, so users are aware.

I don't mind this approach, especially as a start and a way to do this with minimal changes to the existing API. I just feel we need to surface it. In the future we can look at ways for users to give us hints as they initialize the library, so we can use less memory. But that is future music.

Comment thread src/mss/linux/xshmgetimage.py Outdated
@jholveck

Contributor Author

> Since by default we allocate two color buffers of the virtual monitor size, this can be a substantial amount of memory (for example, I typically run a 4k monitor and then a laptop screen and these are offset from each other, creating a rectangle that is much larger than the sum of these two screens. Depending on how people organize their code, we may end up with even more buffers than just two. I think we should mention this extra memory requirement somewhere, so users are aware.

I can see that. However, I don't think this code represents extra memory usage. The copying mode transiently needs a transfer buffer, which the zero-copy mode does not. Indeed, I think the zero-copy mode may use less memory in some cases.

In the copying case (the previous code, or this code prior to Python 3.12), you'd still end up with just as much memory consumed in the basic case: one framebuffer's worth to hold the shared memory buffer, and a second for the bytearray to hold the Python-usable copy.

If a program only takes a single screenshot, I actually question whether the second allocated framebuffer will even consume any memory pages. It's up to the X server how it allocates the shared memory region, but I suspect it will typically be zeroed pages. Linux will lazily back pages, so if the user only takes a single screenshot, those pages will never actually be allocated (backed). That's an implementation detail I'm uncertain of, and again, in the single screenshot case, the worst case is still no worse than the copying mode.

You mention that there may be more than two buffers allocated, depending on the way the user is using them. That's true, although it's still the same memory use as in the copying case (previous code): it only happens if the user retains references to the returned buffers in the zero-copy case, which would also hold the bytearrays in the copying case. The zero-copy mode uses less, actually, since there isn't the need for the transient buffer that's used during the copy: you end up needing only N buffers' worth, instead of N+1 transiently.

But let's return to the case where the user isn't doing any sort of pipelining, and is just using one screenshot at a time. I am mostly writing with this loop in mind:

with mss.MSS() as sct:
    while True:
        img = sct.grab(...)
        do_something_with(img)
        # Note that img is not freed by the user before the next loop,
        # even though it would be more memory-efficient to do so.
        # This is just because the code's author doesn't think to.

There, the copying mode in the steady state actually ends up needing, transiently, three framebuffers' worth. That's one to hold the data transferred from the system, one for the bytearray referenced by img, and one for the bytearray being created in the grab. The zero-copy mode only needs two, for the buffers for each image that exists during the second grab (one still referenced from the previous loop, and one in the new grab).

(The analysis for if the user does del img before acquiring a new one ends up largely identical to the single-screenshot use case discussed above.)
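To make the counting above concrete, here is a toy model of the peak number of framebuffer-sized allocations during a grab. It is a sketch under stated assumptions, not a measurement: `peak_framebuffers` and `PREALLOCATED_SLOTS` are hypothetical names, the zero-copy pool is assumed to start with two slots, and the model counts allocated address space without accounting for whether pages are actually backed.

```python
# Assumption: the zero-copy pool starts with two preallocated slots.
PREALLOCATED_SLOTS = 2


def peak_framebuffers(mode: str, releases_before_grab: bool) -> int:
    """Peak framebuffer-sized allocations while a grab is in progress."""
    # Images alive during the grab: the new one, plus the previous
    # one if the caller has not released it yet.
    live_images = 1 if releases_before_grab else 2
    if mode == "copy":
        # One transfer buffer, plus one bytearray per live image.
        return 1 + live_images
    if mode == "zero-copy":
        # Each live image occupies one pool slot; the pool never
        # holds fewer slots than it preallocated.
        return max(PREALLOCATED_SLOTS, live_images)
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("copy", "zero-copy"):
    for releases in (False, True):
        print(mode, "del img" if releases else "no del", peak_framebuffers(mode, releases))
```

Under these assumptions the loop without `del img` peaks at three framebuffers when copying and two with zero-copy, matching the counts given above.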

The zero-copy code also has a potential advantage on memory-constrained systems, once the MSS object is closed and the buffers are unreferenced. It will unmap the memory at that time, returning it to the OS. In the copying mode, the bytearrays get garbage collected, but the memory is returned only to Python, not the OS.

As I finish editing this reply, I realize that my analyses have mostly been based on the assumption of full-screen grabs. I can repeat my analysis with the idea of small screen grabs cropped out of huge framebuffers. I decided to get this reply out to you rather than trying the new analysis. Note that, in that case, the question of lazily-backed pages becomes much more relevant.

@jholveck

Contributor Author

I'll separately add some notes about the numbers, to get a sense of scale.

A 4k resolution monitor, in BGRA, represents slightly less than 32 MiB. For two 4k monitors, two pre-allocated buffers would represent just under 128 MiB if they're stacked horizontally or vertically, or just under 256 MiB in the unlikely event they're arranged diagonally.
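For reference, these sizes follow directly from width × height × 4 bytes per BGRA pixel. A short sketch of the arithmetic (the helper name `frame_mib` is made up for the illustration; sizes are in MiB, that is, multiples of 2**20 bytes):

```python
BYTES_PER_PIXEL = 4  # BGRA: one byte per channel


def frame_mib(width: int, height: int) -> float:
    """Size of one BGRA framebuffer in MiB."""
    return width * height * BYTES_PER_PIXEL / 2**20


single_4k = frame_mib(3840, 2160)

# Two 4k monitors side by side: the virtual screen is 7680x2160.
# (Stacking them vertically gives 3840x4320, the same area.)
side_by_side_two_buffers = 2 * frame_mib(7680, 2160)

# Arranged diagonally: the bounding rectangle is 7680x4320.
diagonal_two_buffers = 2 * frame_mib(7680, 4320)

print(f"one 4k frame:             {single_4k:.1f} MiB")
print(f"two buffers, side by side: {side_by_side_two_buffers:.1f} MiB")
print(f"two buffers, diagonal:     {diagonal_two_buffers:.1f} MiB")
```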

The most memory-limited platform that can handle two such displays is probably a Raspberry Pi 4. That's got 1 to 8 GB of RAM.

The Raspberry Pi does take the GPU RAM out of the SoC RAM, though, so you might reasonably see, say, about 768 MB of system RAM on a system with two 4k monitors, with some extra VRAM set aside for textures.

So it's not ludicrous to consider a case where the MSS memory consumption is a sizable portion of the available system RAM. Of course, anybody developing an application targeting such a setup is going to be accustomed to careful consideration of the memory usage.

As I said above, I don't think this is new to, or worsened by, the zero-copy mode.

@jholveck

Contributor Author

There's another weakness in my analysis that I failed to point out: I'm mostly looking at the peak memory usage, on the assumption that this is what the user is going to need to deal with. In the common loop I gave, if (and only if) the del img is added, then the peak usage is the same in copying and zero-copy mode, although the copying mode does temporarily return the transfer buffer back to Python.

This is only applicable when taking multiple screenshots in a loop, one at a time, and explicitly releasing one before taking the next. In other cases, such as when pipelining or not explicitly releasing them, the zero-copy mode has a lower transient memory usage during the grab, and the same memory usage in the rest of the loop.

(All still subject to my caveat about having only analyzed the full-screen use cases.)

@halldorfannar

Contributor

I agree that there are scenarios where this new approach will actually save us memory. I just wanted to paint a scenario where it didn't. It seems I wasn't clear on that scenario. It's specifically when you plan to just capture a single monitor, let's say my laptop screen, and for this case let's have it running at 1920x1080. It's aligned with my 4k monitor so that the virtual combined rectangle is taller than 4k (because my laptop sits lower than my 4k monitor). The virtual width is just the sum of the regular screen resolutions, 1920 + 3840, but the virtual height is not the maximum of the two. The virtual height (due to the offset) is about 500 pixels taller, so we get max(2160, 1080) + 500, which is about 2500 (using approximate math here). This gives us 5760 x 2500 pixels, and with 4 bytes per pixel this gives us almost 55 MiB. And we allocate two of them. So we are now at about 110 MiB before calling grab. In the old (slow) path we wouldn't preallocate, and when I call grab we would capture my 1920x1080 screen, resulting in 7.9 MiB. In that path we copy, so we get double that, a total of 16 MiB.
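The virtual-rectangle arithmetic can be sketched as a bounding-box computation. The monitor coordinates below are hypothetical, chosen only so that the bounding height matches the rounded figure of roughly 2500 pixels used above; `bounding_size` and `mib` are illustrative helpers, not MSS functions.

```python
BYTES_PER_PIXEL = 4  # BGRA


def bounding_size(monitors):
    """Width and height of the smallest rectangle covering all monitors.

    Each monitor is an (x, y, width, height) tuple in virtual-screen pixels.
    """
    left = min(x for x, _, _, _ in monitors)
    top = min(y for _, y, _, _ in monitors)
    right = max(x + w for x, _, w, _ in monitors)
    bottom = max(y + h for _, y, _, h in monitors)
    return right - left, bottom - top


def mib(width: int, height: int) -> float:
    return width * height * BYTES_PER_PIXEL / 2**20


# Hypothetical layout: a 4k monitor, and a 1080p laptop panel to its
# right that sits lower, extending below the 4k monitor's bottom edge.
monitors = [
    (0, 0, 3840, 2160),
    (3840, 1420, 1920, 1080),
]

virtual_w, virtual_h = bounding_size(monitors)
preallocated = 2 * mib(virtual_w, virtual_h)   # two full-size slots
copy_path = 2 * mib(1920, 1080)                # laptop-only grab plus its copy

print(f"virtual screen: {virtual_w}x{virtual_h}")
print(f"two preallocated slots: {preallocated:.1f} MiB")
print(f"old path, laptop only:  {copy_path:.1f} MiB")
```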

Now, I understand that 110 MiB is not a lot of memory today. But going from 16 MiB to 110 MiB is nearly an order of magnitude change. This is why I felt it was prudent to mention it in the notes, just so that users are aware. I had suggested that in the future we could allow users to provide a hint, so that in the case I outline above (which isn't a fabricated case, it is my setup) I could tell the system to just preallocate for my laptop screen dimensions. If you feel that I'm worrying about peanuts, that's fine. I have, however, found that any such change to software that has existed for a long time should be called out. I also recognize that a user who has two monitors, one of them 4K, is more likely to have more memory, so perhaps we should just assume that my scenario will not be an issue for folks. But I still wanted us to make that an explicit decision, rather than an implicit one 😺

@jholveck

Contributor Author

Ok! You're correct: the single-monitor scenario is the "small part of the whole framebuffer" caveat I mentioned.

I do agree that increased memory usage is something that we may want to note. I described the memory requirements on a Raspberry Pi to illustrate that this isn't necessarily negligible.

As an aside, prior to this code (in 10.2), we did preallocate a full framebuffer for the SHM region. We didn't do that in 10.1, since that didn't support SHM.

I did some tests, and verified that the memory doesn't get backed until it's written to. If we're only capturing a small region (such as a single monitor), then the unused region doesn't get backed; it still points to the zero page. For my test, I captured a 640x480 region from the middle of the screen, and verified that only the expected 1200 kB was backed in the shared memory region. (I did a few variations on this test too, which were all unsurprising.)

So that might moot the whole concern, or at least greatly reduce the applicability.

I'm pushing back slightly because I'm having a hard time thinking about how to phrase this in the docs. It should be accurate, of course. It should give affected users actionable information. But the vast majority of users - many of whom are newer programmers - are not affected, and I also wouldn't want to cause undue concern among them. Premature optimization, and all that. That said, if you have a way to document this in mind, then I'm fine with including something.

I have thought that making the number of buffers to preallocate configurable would be a reasonable option to add to the MSS constructor. This would be useful for users of pipelining architectures, who want to preallocate enough buffers on startup instead of on demand. I've tried to avoid unnecessary timing jitter on the first few frames, before steady state is reached. (Lazy page backing is still a source of startup jitter, but it is very minimal.)

I didn't bother with this in the initial implementation, since
scanning the buffer list is so cheap.  But in the PR discussions, we
realized that the buffers will be lazily backed, with no physical
memory allocation until they're written to.

This means that a user targeting a platform with heavily constrained
memory might want to reuse the same buffer if possible, rather than
ping-ponging between the two buffers with each capture.  If they are
careful to free one screenshot before allocating another, then we
should prefer to not back the other buffer unnecessarily.

In other words, this helps memory-conscious users save memory, where
they previously didn't have a choice.

It might also be reasonable to, at the time a buffer is released
(returned to the pool), madvise the now-obsolete data with
MADV_DONTNEED (or similar flags, like MADV_REMOVE or MADV_FREE).  That
has some significant historical quirks on Linux that I didn't want to
deal with.  The implementation complexity seemed significant, and at
first glance, seems likely to cause more overhead with page table
churn than it would solve.
@halldorfannar

Contributor

Makes sense. Thanks for going the extra mile to dig into this. I'm good with this. Let's merge!

@jholveck jholveck marked this pull request as ready for review June 24, 2026 20:17
@jholveck

Contributor Author

Ok! I've marked it as ready for BoboTiG's review.

I looked at adding a flag to say how many buffers to preallocate, but I think the way we handle things like _choose_impl should get some improvements first.

@BoboTiG

BoboTiG commented Jun 24, 2026

Owner

👍 for me. Let me know when ready to merge.

@jholveck

Contributor Author

I'm good!

Development

Successfully merging this pull request may close these issues.

Improve performance with reusable buffers

3 participants