video_core: Readback optimizations #3404
Conversation
Avoids flushing GPU data on Rewind and enables driveclub to work properly without readbacks.
The scheduler wait can happen inside the stream buffer, which is used by DMA inside a deferred action.
Quickly checked Bloodborne; for me there is a 10-15 FPS improvement in this PR compared to master.
For your build error, you need to use std::bit_cast.
@raphaelthegreat, tell me how to test this PR correctly and which options should be enabled, because I don't quite understand how to do this and provide you the necessary information.
I'm not providing logs yet because I might be doing something wrong, but I will report bugs. I noticed some unusual behavior in 4 of my games:
The problem that exists on Intel processors has worsened (without readbacks enabled). If you enable the two options that I mentioned above, the picture looks like this (on main the game also looks like this for me in this place).
Without readbacks enabled I also get a crash when I try to pull the second spear out of Trico (the very beginning of the game); this might be worth testing further.
An error has returned (without readbacks enabled):
With the readbacks options I can get into the game, but I get this picture:
Regardless of whether readbacks are enabled or not, the game returns this error:
Force-pushed from a5a0c82 to 5ab3d8e
@GHU7924 Testing of this PR should be done with readbacks enabled (a lot of code currently assumes readbacks are on, so with readbacks off some things might break a little until I address it). The linear image readback option is not affected here and should only be enabled if it's needed. Make sure to compare only against a main build; don't just report every bug you see, that is not helpful. You need to report bugs that this PR causes on its own.
No regressions in my titles, and a small performance improvement across the board in the titles I tested. CUSA02320 PR.log
By how much?
@raphaelthegreat From the screenshots he posted it looks like 30 FPS on main and 42 FPS with your PR in driveclub. Not sure if that's valid because one is during the day and the other during the night. Anyway, here's my experience. At least on my PC, readbacks still seem CPU bottlenecked, even with a 5800X3D. RAM is at 3200 MHz. I see CPU usage ranging from 80 to 98% on cores, jumping between cores. (Ignore the dirty tag, I just multiply memory by 2 to run 4K.) Performance seems very slightly improved (I get 14-16 FPS in this area on main, 15-18 with this PR). If you have time, I am available to run Tracy on this with your guidance :-)
Running the game at 4K significantly increases the cost of readbacks, so it's not too surprising. Do you get a larger boost at 1080p?
Uncharted 3: Screen_Recording_20250809_144946_Moonlight.mp4 (main), Screen_Recording_20250809_145200_Moonlight.mp4 (PR). Gray character models in the last PR build; before 34892a2 it was the same as main.
I don't see the gray character models.
With Dma=true, readbacks=false, how does it look? Naughty Dog games technically don't abide by the fence promise because the flush is for the SRT buffer, which should work with DMA.
With dma=true I can't get into the menu.
I suspect this game might be incompatible with fence detection because it lies about the sizes of storage buffers (which causes the clamp-size spam), causing large areas to be marked as GPU modified and the emulator thinking it can't write yet. I might be able to think of something, but strict readbacks are the only sure way to ensure proper sync.
Hi |
@coolllman If you add
PS: I'm open to suggestions about what the setting should be called; that name might be confusing to users but I couldn't think of anything else. Basically it's a setting for how strict readbacks are (0 is most strict, like on main; 1 uses the fence detection, so it's less strict).
With readback accuracy set to 2 it behaves just like on main; there doesn't seem to be noticeable stutter, but the game freezes during loading, maybe because performance is too low. With readback accuracy set to 1 there is stutter too.
@rafael-57 (or anyone else familiar with Tracy) can you profile the stutters that occur with the newer accuracy?
@coolllman Please enable validation in the config and send a new log.
I played Bloodborne a bit more with readback accuracy at 0 and unfortunately the vertex explosion is still there, even if it's much less frequent.
Does this happen only with low accuracy or also with high?
I will try with high and get back to you.
People were reporting a similar color bug with LNDF's #3396 as well; maybe this is actually a regression in main?
I tried to replicate it on main with readbacks enabled, but I nearly always hang on the loading screen when going back and forth between the dream and cathedral wards (I also got the softlock a few times with high accuracy on this PR; I didn't get any with low). Also, for LNDF's PR it seems like things go black, while here the colors are just wrong, so I'd say this is a bit different. I also got vertex explosions (not sure if you can call them that, since they didn't happen on NPCs this time) with high accuracy, so not sure what's wrong there. Also, on the next warp after that vertex explosion (the one not created by an NPC), colors were bugged out, so perhaps it really is a regression from the GC PR; I can't tell.
The GC PR didn't use the Dirty flag in SafeToDownload, so small images (32x32 BC7 or less) that are marked MaybeCpuDirty could be downloaded incorrectly and cause corruption, but I'm not sure.
I've been testing main for the past 15 minutes without readbacks and the colors never changed, so that leaves us with a few possibilities:
- Regression from this PR.
- Regression from the GC PR, but only when readbacks are enabled.
The conclusion seems to be that it only happens with readbacks enabled. It would be nice to have more people testing this (and ideally without changing the emulator's code or using game patches). Useless log:
Probably best to try a build from right before GC got merged, or build locally and comment out the RunGarbageCollector calls here.
Tried a build prior to the GC merge and didn't have the color issue. I don't have VS set up, so if someone else could test commenting out what Turtle said above to see if GC is indeed the issue, it would help me out.
Infamous update: the last PR fixes it, thank you.
Yes! But do I need to log something in particular, or do I just run Tracy with everything set to default? Also, do I pull "fix readbacks" off or nah?
Can you keep reverting commits and test a bit to see if something broke it?
My main takeaways after losing my sanity testing for 2 hours:
Readbacks 2 is the same as on main. The garbage textures seem like something that could happen, because the texture cache relies on a hashing workaround for that, but readbacks on main having vertex explosions sounds weird to me; does that actually happen?
Ok, I was testing with readback accuracy 1 instead of 2. So that could be the cause of vertex explosions on objects (not faces, unlike main with readbacks OFF). I do remember making this comment in the original readbacks PR though: after fiddling a lot I just got this even on main, whatever it is. Anyway, I wouldn't go too far out of scope with this PR. Existing readbacks are already in main with all their pros and cons, and this improves performance a bit and adds more granular settings. It would be nice if readbacks were more stable in general though. Would it help if I collect a stack trace with Visual Studio when crashing? What about Tracy? Do I just run it with stock code and readback accuracy set to 0?
Yes
Yes, I suspect the reason for the stutter is the game overwriting some large GPU-modified region from the CPU, which would cause a page-by-page flush. I don't like that this PR has these bugs though, so I was thinking of shelving it until it's fixed.
Tracy capture, started right during the warp:
So, testing more with Bloodborne: readback accuracy 0 is the only one that doesn't freeze the game, but it has big stutters. Best performance by far. Readback accuracy 1 has better performance than the main build by 5-10 FPS, but it freezes the game. Readback accuracy 2 performance is the same as the main build with readbacks enabled, and it also freezes the game. The game freezes happen with readbacks enabled on main too.
Note there is still some cleanup to do; this is not final code. It can also cause freezes/bugs (I hope not, though).
General idea
If you were to write a Vulkan program that generates some data on the GPU and later wanted to access that data on the host, you would need a sync operation to ensure the GPU has finished its work. Said sync operation is called a fence, because it makes the CPU wait for the GPU.
In a similar fashion, the guest also has to use fences before reading GPU data on the host. Because of its unified memory, it doesn't have to copy said data to host-visible memory, but it still must sync with a fence operation before accessing it. The emulator can rely on that promise: the guest will not overwrite nor read any GPU-generated data before a fence operation has given it the opportunity to sync with the GPU.
So the main idea of the PR is to attempt to detect these fence operations in the PM4 command stream and defer read-protecting GPU-modified pages until right before them. If a page read/write then happens before a fence, it will pass through without a flush, because the emulator can be sure the guest cannot legitimately access the data yet.
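The deferral can be pictured with a minimal sketch. This is a hypothetical illustration, not the emulator's real API: class and method names (`DeferredProtector`, `MarkGpuModified`, `OnFence`, `NeedsFlush`) are invented for clarity. GPU-modified ranges are queued rather than protected immediately, and only become flush-triggering once a detected fence commits them:

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical sketch: queue GPU-modified ranges and commit the read
// protection only when the prepass reaches a packet classified as a fence.
class DeferredProtector {
public:
    // A draw/dispatch wrote this guest memory range.
    void MarkGpuModified(uint64_t addr, uint64_t size) {
        pending_.emplace_back(addr, size);
    }

    // A fence packet was reached: commit all pending ranges, so CPU
    // accesses after this point will trigger flushes as usual.
    void OnFence() {
        for (const auto& range : pending_)
            protected_.insert(range);
        pending_.clear();
    }

    // A CPU access forces a flush only if the range was committed at a
    // fence; accesses before the fence pass through without flushing.
    bool NeedsFlush(uint64_t addr, uint64_t size) const {
        for (const auto& [start, len] : protected_)
            if (addr < start + len && start < addr + size)
                return true;
        return false;
    }

private:
    std::vector<std::pair<uint64_t, uint64_t>> pending_;
    std::set<std::pair<uint64_t, uint64_t>> protected_;
};
```

The key property is visible in the order of operations: an access between `MarkGpuModified` and `OnFence` skips the flush entirely.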
The aforementioned detection is not trivial though, because there is little indication to the emulator about what sync operations are used for. AMD GCN uses labels, 4- or 8-byte memory addresses that "signal" packets write to and that "wait" packets can wait on, or that the host can poll in the case of a fence. All that means it's close to impossible to detect the actual wait of a fence. Instead, this PR implements a prepass which scans input command lists and tries to "guess" which packets act as fences and which do not.
Possible packets that can write labels are EventWriteEos, EventWriteEop, WriteData (GFX) and ReleaseMem (ACB). There is a simple heuristic: if the label of a signal packet is waited on by the GPU with a WaitRegMem packet, it is considered a GPU->GPU sync (something akin to a pipeline barrier). It is in fact possible for a label to act both as a fence and a pipeline barrier, so the heuristic can fail, but it's very unlikely.
Deferring read protections allows for some powerful optimizations, two of which are implemented here.
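The heuristic can be sketched roughly as follows. This is a deliberately simplified packet model, not the real PM4 parser: the `Packet` struct, `PacketType` enum, and `GuessFences` function are assumptions for illustration only. Signal packets whose label address is also the target of a WaitRegMem in the same stream are classified as GPU->GPU barriers; unmatched signals are guessed to be host-visible fences:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical simplified packet model for the prepass sketch.
enum class PacketType { SignalEop, WaitRegMem, Other };

struct Packet {
    PacketType type;
    uint64_t label_addr; // only meaningful for SignalEop / WaitRegMem
};

// Return indices of signal packets guessed to act as CPU fences.
std::vector<size_t> GuessFences(const std::vector<Packet>& stream) {
    // First pass: collect every label address the GPU itself waits on.
    std::unordered_set<uint64_t> waited;
    for (const Packet& p : stream)
        if (p.type == PacketType::WaitRegMem)
            waited.insert(p.label_addr);

    // Second pass: a signal nobody waits on is likely polled by the host,
    // i.e. a fence; a waited-on signal is a GPU->GPU barrier.
    std::vector<size_t> fences;
    for (size_t i = 0; i < stream.size(); ++i)
        if (stream[i].type == PacketType::SignalEop &&
            !waited.count(stream[i].label_addr))
            fences.push_back(i);
    return fences;
}
```

As the text notes, a label that serves as both fence and barrier would be misclassified by this scheme, which is the accepted (unlikely) failure mode.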
Rewind indirect patch
The rewind packet has a misleading name, as it implies execution going back somewhere, but what it actually does is tell the CP to drop all prefetched packets and reload them from memory. It is almost exclusively used for command list self-modification (from a compute shader, for example), which driveclub does for a dozen dispatches at the start of the frame. It uses a compute shader to patch the dimensions of the DispatchDirect PM4 packet before executing it. Why it didn't use an indirect dispatch, I'm not sure.
Before readbacks this led to launching a dispatch with garbage (often huge) dimensions, freezing the GPU. Readbacks, on the other hand, fixed it by read-protecting the memory and flushing the modified data. That works but is very expensive: around a dozen flushes, one per patched dispatch.
Deferring read protections allows the emulator to reach the rewind packet before a flush. Then the emulator can scan the pending GPU ranges inside the current command list, check that they are dispatch dimension patches, and convert the direct dispatch into an indirect dispatch. The latter reads dimensions from GPU buffers, avoiding the need to flush memory.
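The conversion step can be sketched like this. The `DispatchCmd` struct and `TryConvertToIndirect` helper are hypothetical names, and the layout assumption (three consecutive 32-bit dimension words) is an illustration, not the actual packet encoding used by the emulator:

```cpp
#include <cstdint>

// A DispatchDirect packet carries three 32-bit dimension words; if a pending
// GPU write covers exactly those 12 bytes, the write is a dimension patch.
constexpr uint64_t kDimWordsSize = 3 * sizeof(uint32_t);

// Hypothetical simplified dispatch representation.
struct DispatchCmd {
    bool indirect;
    uint32_t dim_x, dim_y, dim_z; // used when direct
    uint64_t args_addr;           // GPU address of the dims when indirect
};

// If the pending GPU write is exactly the dispatch's dimension words,
// rewrite the direct dispatch as an indirect one sourcing its dimensions
// from the GPU buffer, so no flush to guest memory is needed.
bool TryConvertToIndirect(DispatchCmd& cmd, uint64_t write_addr,
                          uint64_t write_size, uint64_t dim_addr) {
    if (write_addr != dim_addr || write_size != kDimWordsSize)
        return false;
    cmd.indirect = true;
    cmd.args_addr = dim_addr; // GPU reads x/y/z at execution time
    return true;
}
```

On the Vulkan side this corresponds to issuing `vkCmdDispatchIndirect` from that buffer instead of `vkCmdDispatch` with CPU-known dimensions.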
Preemptive buffer downloads
This optimization is a lot more general than the one above and should affect all games that rely on readbacks. It is possible to implement without CPU fence detection, but fence detection makes the implementation more efficient because the emulator can batch preemptive download copies upon reaching the fence.
The idea is to track how many times a page has been flushed, and if that number exceeds a threshold, any future GPU data inside it will be copied to the host asynchronously. If a flush is triggered, the GPU thread simply has to wait for the GPU to finish and copy the data to guest memory. The advantage here is the reduction or (in certain cases) elimination of the wait time, as the GPU has likely had time to catch up to the host. In addition, once the wait has been done, the rest of the preemptive downloads become "free" and don't need further stalls.
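The flush-count heuristic reduces to a small amount of bookkeeping. This sketch uses invented names (`PreemptiveDownloader`, `OnFlush`, `ShouldPreemptivelyDownload`) and an assumed threshold and page size; the real values and integration points in the emulator may differ:

```cpp
#include <cstdint>
#include <unordered_map>

// Assumed tuning values for the sketch, not the emulator's actual ones.
constexpr uint32_t kFlushThreshold = 3;
constexpr uint64_t kPageBits = 12; // 4 KiB pages

// Pages flushed more than kFlushThreshold times are considered "hot":
// new GPU data landing in them is downloaded asynchronously at the next
// fence, so a later CPU access finds the data (mostly) already copied.
class PreemptiveDownloader {
public:
    // Record a synchronous flush hitting this address.
    void OnFlush(uint64_t addr) {
        ++flush_count_[addr >> kPageBits];
    }

    // When new GPU data lands in this page, should an async download be
    // scheduled instead of read-protecting and flushing on demand?
    bool ShouldPreemptivelyDownload(uint64_t addr) const {
        auto it = flush_count_.find(addr >> kPageBits);
        return it != flush_count_.end() && it->second > kFlushThreshold;
    }

private:
    std::unordered_map<uint64_t, uint32_t> flush_count_;
};
```

The win described in the text follows from this: by the time the CPU actually reads the page, the async copy has usually completed, and once one wait has been paid, the remaining queued downloads for that fence cost nothing extra.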