raphaelthegreat
Contributor

@raphaelthegreat raphaelthegreat commented Aug 8, 2025

Note: there is still some cleanup to do; this is not final code. It may also cause freezes or bugs (hopefully not, though).

General idea

If you were to write a Vulkan program that generates some data on the GPU and later wanted to access that data on the host, you would need a sync operation to ensure the GPU has finished its work. That sync operation is called a fence because it makes the CPU wait for the GPU.

In a similar fashion, the guest also has to use fences before reading GPU data on the host. Because of its unified memory, it doesn't have to copy that data to host-visible memory, but it still must sync with a fence operation before accessing it. The emulator can rely on that promise: the guest will neither overwrite nor read any GPU-generated data before a fence operation has given it the opportunity to sync with the GPU.

So the main idea of the PR is to attempt to detect these fence operations in the PM4 command stream and defer read-protecting GPU-modified pages until right before them. If a page read/write then happens before a fence, it will pass through without a flush, because the emulator can be sure the guest cannot access the data yet.
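The deferral described above can be sketched as a small bookkeeping class. This is only an illustration under stated assumptions: `Range`, `DeferredProtector` and its methods are hypothetical names, not the emulator's actual code, and real page protection (for example via the OS memory-protection API) is replaced by a plain list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of deferred read protection; not the emulator's real API.
struct Range {
    std::uint64_t addr;
    std::uint64_t size;
};

class DeferredProtector {
public:
    // A GPU command wrote to guest memory: remember the range, protect nothing yet.
    void OnGpuWrite(std::uint64_t addr, std::uint64_t size) {
        assert(size != 0);
        pending_.push_back({addr, size});
    }

    // Called right before a packet that was guessed to be a fence.
    // Moves all pending ranges to the protected set and returns how many moved.
    std::size_t OnFence() {
        const std::size_t moved = pending_.size();
        protected_.insert(protected_.end(), pending_.begin(), pending_.end());
        pending_.clear();
        return moved;
    }

    // A CPU access needs a flush only if it hits an already protected range.
    bool AccessNeedsFlush(std::uint64_t addr) const {
        for (const Range& r : protected_) {
            if (addr >= r.addr && addr - r.addr < r.size) {
                return true;
            }
        }
        return false;
    }

private:
    std::vector<Range> pending_;    // GPU-modified, not yet read-protected
    std::vector<Range> protected_;  // read-protected, a CPU access triggers a flush
};
```

In this model, an access to a GPU-modified address before `OnFence()` passes through without a flush, and the same access after `OnFence()` requires one.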

The aforementioned detection is not trivial though, because the emulator gets little indication of what sync operations are used for. AMD GCN uses labels: 4- or 8-byte memory addresses that "signal" packets write to and that "wait" packets can wait on, or that the host can poll in the case of a fence. This means it's close to impossible to detect the actual wait on a fence. Instead, this PR implements a prepass which scans input command lists and tries to "guess" which packets act as fences and which do not.

The packets that can write labels are EventWriteEos, EventWriteEop, WriteData (GFX) and ReleaseMem (ACB). There is a simple heuristic: if the label of a signal packet is waited on by the GPU with a WaitRegMem packet, it is considered a GPU->GPU sync (something akin to a pipeline barrier). It is in fact possible for a label to act as both a fence and a pipeline barrier, so the heuristic can fail, but it's very unlikely.
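A minimal sketch of such a prepass follows. The `Packet` struct, the `Op` enum and `GuessFences` are hypothetical simplifications for illustration; real PM4 packets carry much more state, and this sketch assumes the matching WaitRegMem appears in the same command list.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified packet model; real PM4 packets carry far more state.
enum class Op { EventWriteEos, EventWriteEop, WriteData, ReleaseMem, WaitRegMem, Other };

struct Packet {
    Op op;
    std::uint64_t label;  // address written (signal) or polled (wait); unused for Other
};

// Packets that can write a label, per the description above.
inline bool IsSignal(Op op) {
    return op == Op::EventWriteEos || op == Op::EventWriteEop ||
           op == Op::WriteData || op == Op::ReleaseMem;
}

// Prepass: returns the indices of signal packets guessed to be CPU fences,
// i.e. those whose label is never waited on by a WaitRegMem in the same list.
std::vector<std::size_t> GuessFences(const std::vector<Packet>& list) {
    // First pass: collect every label the GPU itself waits on.
    std::unordered_set<std::uint64_t> gpu_waited;
    for (const Packet& p : list) {
        if (p.op == Op::WaitRegMem) {
            gpu_waited.insert(p.label);
        }
    }
    // Second pass: a signal whose label the GPU never waits on is a fence guess.
    std::vector<std::size_t> fences;
    for (std::size_t i = 0; i < list.size(); ++i) {
        if (IsSignal(list[i].op) && gpu_waited.count(list[i].label) == 0) {
            fences.push_back(i);
        }
    }
    return fences;
}
```

As noted above, a label that serves as both a fence and a pipeline barrier would be misclassified by this kind of check.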

Deferring read protections allows for some powerful optimizations, 2 of which are implemented here.

Rewind indirect patch

The rewind packet has a misleading name: it implies execution going back somewhere, but what it actually does is tell the CP to drop all prefetched packets and reload them from memory. It is almost exclusively used for command list self-modification (from a compute shader, for example), which Driveclub does for a dozen dispatches at the start of the frame. It uses a compute shader to patch the dimensions of the DispatchDirect PM4 packet before executing it. Why it didn't use an indirect dispatch, I'm not sure.

Before readbacks, this led to launching a dispatch with garbage (often huge) dimensions, freezing the GPU. Readbacks fixed it by read-protecting the memory and flushing the modified data. That works but is very expensive: around a dozen flushes, one per patched dispatch.

Deferring read protections allows the emulator to reach the rewind packet before a flush. The emulator can then scan the pending GPU ranges inside the current command list, check that they are dispatch dimension patches, and convert the direct dispatch into an indirect dispatch. The latter reads its dimensions from GPU buffers, avoiding the need to flush memory.
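The check on rewind could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Dispatch` struct, the idea that the three dimension dwords sit contiguously at `dims_addr`, and the `PatchOnRewind` helper are hypothetical and do not reflect the PR's actual data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model; the real packet layout and emulator types differ.
struct Range {
    std::uint64_t addr;
    std::uint64_t size;
};

struct Dispatch {
    std::uint64_t dims_addr;  // guest address of the x/y/z dwords inside the packet
    bool indirect;            // true once converted to an indirect dispatch
};

// Assumed size of the patched field: three 32-bit dimensions.
constexpr std::uint64_t kDimsSize = 3 * sizeof(std::uint32_t);

// True if the GPU-modified range is non-empty and lies entirely inside the dims field.
bool IsDimsPatch(const Range& r, const Dispatch& d) {
    return r.size != 0 && r.addr >= d.dims_addr &&
           r.addr + r.size <= d.dims_addr + kDimsSize;
}

// Called when a rewind packet is reached. Converts every direct dispatch whose
// dimensions were patched by the GPU into an indirect one, so its dimensions are
// read from GPU memory and no flush is needed. Returns the number converted.
std::size_t PatchOnRewind(std::vector<Dispatch>& dispatches,
                          const std::vector<Range>& pending) {
    std::size_t converted = 0;
    for (Dispatch& d : dispatches) {
        for (const Range& r : pending) {
            if (IsDimsPatch(r, d)) {
                d.indirect = true;
                ++converted;
                break;
            }
        }
    }
    return converted;
}
```

Pending ranges that do not match a dimension patch are left alone in this sketch; they would still need the regular flush path.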

Preemptive buffer downloads

This optimization is a lot more general than the one above and should affect all games that rely on readbacks. It is possible to implement without CPU fence detection, but fence detection makes the implementation more efficient because the emulator can batch preemptive download copies upon reaching the fence.

The idea is to track how many times a page has been flushed; if that number exceeds a threshold, any future GPU data inside it will be copied to the host asynchronously. If a flush is triggered, the GPU thread simply has to wait for the GPU to finish and copy the data to guest memory. The advantage here is the reduction or (in certain cases) elimination of the wait time, as the GPU has likely had time to catch up to the host. In addition, once the wait has been done, the rest of the preemptive downloads become "free" and don't need further stalls.
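The per-page flush counting can be sketched as follows. `PreemptTracker`, its method names and the threshold parameter are hypothetical; the asynchronous copy itself is not modeled, only the decision of which pages qualify.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of per-page flush tracking; not the emulator's real class.
class PreemptTracker {
public:
    explicit PreemptTracker(std::uint32_t threshold) : threshold_(threshold) {}

    // Record one flush of the given page. Returns true once the page's flush
    // count exceeds the threshold, i.e. it now qualifies for preemptive downloads.
    bool OnFlush(std::uint64_t page) {
        return ++flushes_[page] > threshold_;
    }

    // Whether future GPU writes to this page should be copied to the host
    // asynchronously instead of waiting for a flush.
    bool ShouldPreempt(std::uint64_t page) const {
        const auto it = flushes_.find(page);
        return it != flushes_.end() && it->second > threshold_;
    }

private:
    std::uint32_t threshold_;
    std::unordered_map<std::uint64_t, std::uint32_t> flushes_;  // page -> flush count
};
```

With a threshold of 2, for example, the third flush of a page is the first one to mark it for preemptive downloads; pages that are never flushed are unaffected.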

  • Add config option to control aggressiveness of fence detection
  • Revoke preempt status from pages that stop being flushed

@bigol83
Contributor

bigol83 commented Aug 8, 2025

Quickly checked Bloodborne, for me there is a 10-15 fps improvement in this PR compared to master

@squidbus
Collaborator

squidbus commented Aug 8, 2025

For your build error, need to use std::bit_cast

@GHU7924

GHU7924 commented Aug 8, 2025

@raphaelthegreat , tell me how to test this PR correctly and what options should be activated, because I don’t quite understand how to do this and give you the necessary information.

  1. Should all games be tested with options Enable Readbacks and Enable Readbacks Linear Images enabled?
  2. How can I even tell if a particular game requires this option? (I don't know for sure, but there may be games that don't care about this option)

I don't provide logs yet because I might be doing something wrong, but I will report bugs. I noticed some unusual behavior in 4 of my games:

  1. Bloodborne

The problem that exists on Intel processors has worsened (Without readbacks enabled).
image

If you enable the two options that I mentioned above, the picture becomes like this (in Main the game also looks like this for me in this place).
image

  2. TLG

Without readbacks enabled, the Continue and Options lines were not displayed in the menu; with the options enabled, they were.
image

I also get a crash when I try to pull the second spear out of Trico (the very beginning of the game), this might be worth testing further.

  3. inFAMOUS™ First Light

An error has returned (Without readbacks enabled):

[Debug] <Critical> vk_presenter.cpp:646 operator(): Assertion Failed!
Device lost during waiting for a frame

With Readbacks options I can get into the game, but I get this picture:
image
image
image

  4. The Order: 1886

Regardless of whether readbacks are enabled or not, the game returns this error:

[Debug] <Critical> vk_scheduler.cpp:166 operator(): Assertion Failed!
Device lost during submit

@raphaelthegreat
Contributor Author

raphaelthegreat commented Aug 8, 2025

@GHU7924 This PR should be tested with readbacks enabled (a lot of code currently assumes readbacks are on, so with readbacks off some things might break a little until I address it). The linear image readback option is not affected here and should only be enabled if it's needed. Make sure to compare only against a main build: don't just report every bug that you see, which is not helpful; report the bugs that this PR causes on its own.

@StevenMiller123
Collaborator

StevenMiller123 commented Aug 8, 2025

No improvements or regressions in my titles, with a small performance improvement across the board in titles I tested.
This was tested before the most recent update with the swap to bitcasts, not sure if that impacted performance in any meaningful way.

Main:
image
image

PR:
image
image

CUSA02320 PR.log
CUSA02320.log
CUSA00663.zip
CUSA00663 PR.zip

@Randomuser8219
Contributor

image image

Driveclub now works without readbacks.

@coolllman

coolllman commented Aug 9, 2025

Main
Screenshot_20250809_135240_Moonlight
Screenshot_20250809_141036_Moonlight

Pr
Screenshot_20250809_141913_Moonlight
Screenshot_20250809_140844_Moonlight

Driveclub, TLG: FPS boost with readbacks.
Specs: i7-11700, RTX 3060

@raphaelthegreat
Contributor Author

By how much?

@rafael-57
Contributor

@raphaelthegreat From the screenshots he posted, it looks like 30 FPS on main and 42 FPS with your PR in Driveclub. Not sure if that's a valid comparison, because one is during the day and the other during the night.

Anyway, here's my experience. At least on my PC, readbacks still seem CPU-bottlenecked, even with a 5800X3D. RAM is at 3200 MHz.

I see CPU usage ranging from 80 to 98% on cores and jumping between cores.
GPU usage is never close to full, even with Bloodborne at 4K, unlike main without readbacks, which easily pushes the GPU to 98%.

image image image

(ignore the dirty tag, I just doubled the memory to run 4K)

Performance seems very very slightly improved (I get 14-16fps in this area on main, 15-18 with this PR).

If you have time, I am available to run tracy on this with your guidance :-)

@raphaelthegreat
Contributor Author

Running the game at 4K significantly increases the cost of readbacks, so it's not too surprising. Do you get a larger boost at 1080p?

@rafael-57
Contributor

rafael-57 commented Aug 9, 2025

It's definitely more significant!

1080p main:
image

1080p PR:
image

image

Not strictly relevant to this PR, but looking back at the empty bridge I even get 37-41 fps, whereas at 4K it just stayed the same. I didn't realize the resolution would increase the readback cost.

@coolllman

coolllman commented Aug 9, 2025

Uncharted 3
Main

Screen_Recording_20250809_144946_Moonlight.mp4

Pr

Screen_Recording_20250809_145200_Moonlight.mp4

Gray character models with the latest PR commit; before 34892a2 it looked the same as main.

@raphaelthegreat
Contributor Author

raphaelthegreat commented Aug 9, 2025

I don't see the gray character models.
EDIT: Actually, yes, I can see them.

@coolllman

My settings: Dma=false, readbacks=true
Screenshot_20250809_155120_Samsung Internet

@raphaelthegreat
Contributor Author

With Dma=true, readbacks=false, how does it look? Naughty Dog games technically don't abide by the fence promise, because the flush is for an SRT buffer, which should work with DMA.

@coolllman

With dma=true I can't get into the menu.

@raphaelthegreat
Contributor Author

raphaelthegreat commented Aug 9, 2025

I suspect this game might be incompatible with fence detection because it lies about the sizes of storage buffers (which causes the clamp size spam). That causes large areas to be marked as GPU-modified, so the emulator thinks it can't write to them yet. I might be able to think of something, but strict readbacks are the only sure way to ensure proper sync.

@yaya54840

Hi
With this pr, Drive Club works for me at 15 fps without any bugs.
I'm attaching my log if it helps.
shad_log.zip

@raphaelthegreat
Contributor Author

@coolllman If you add fenceDetection = 0 under the [GPU] section in config.toml with the newest commit, does it fix the issue in Uncharted?

@raphaelthegreat
Contributor Author

raphaelthegreat commented Aug 9, 2025

PS: I'm open to suggestions about what the setting should be called. That name might be confusing to users, but I couldn't think of anything else. Basically it is a setting for how strict readbacks are (0 is the most strict, like on main; 1 uses fence detection and is less strict).

@bigol83
Contributor

bigol83 commented Aug 11, 2025

> Does the stutter exist with readbackAccuracy = 2?

With readback accuracy to 2 it behaves just like on main, doesn't seem to be a noticeable stutter, but the game freezes during loading, maybe because performance is too low.

With readback accuracy set to 1 there is stutter too.

@raphaelthegreat
Contributor Author

@rafael-57 (or anyone else familiar with tracy) can you profile the stutters that occur with the newer accuracy?

@raphaelthegreat
Contributor Author

@coolllman Please enable validation in config and send new log

@bigol83
Contributor

bigol83 commented Aug 11, 2025

I played Bloodborne a bit more with readback accuracy at 0 and unfortunately vertex explosion is still there even if it's much less frequent.

@Missake212
Contributor

After testing for a while I managed to find a big problem, not only did I get a vertex explosion but the colors were all weird (testing methodology was simply going to the hunter's dream and to cathedral ward back and forth hoping to see a vertex explosion, on one of those warps the colors looked like this).

image image

CUSA03173.log

@raphaelthegreat
Contributor Author

Does this happen only with low accuracy or also with high?

@Missake212
Contributor

I will try with high and get back to you.

@Missake212
Contributor

Missake212 commented Aug 11, 2025

Looks like it does also happen with high.

image

CUSA03173.log

@StevenMiller123
Collaborator

People were reporting a similar color bug with LNDF's #3396 as well, maybe this is actually a regression in main?

@Missake212
Contributor

> People were reporting a similar color bug with LNDF's #3396 as well, maybe this is actually a regression in main?

I tried to replicate it on main with Readbacks enabled but I nearly always hang in loading screen when going back and forth between the dream and cathedral wards (I also got the softlock a few times with high accuracy on this PR, didn't get any with low). Also for LNDF's PR it seems like things go black, here colors are just wrong so I'd say this is a bit different.

I also got vertex explosions (not sure if you can call them that since they didn't happen on NPCs this time) with High accuracy so not sure what's wrong there.

Screenshot 2025-08-11 153443 Screenshot 2025-08-11 153457

However, on the next warp after that vertex explosion (the one that didn't come from an NPC), the colors were bugged out, so perhaps it really is a regression from the GC PR. I can't tell.

Screenshot 2025-08-11 153649 Screenshot 2025-08-11 153634 Screenshot 2025-08-11 153641

@raphaelthegreat
Contributor Author

raphaelthegreat commented Aug 11, 2025

The GC PR didn't use the Dirty flag in SafeToDownload, so small images (32x32 BC7 or smaller) that are marked with MaybeCpuDirty could be downloaded incorrectly and cause corruption, but I'm not sure.

@Missake212
Contributor

Missake212 commented Aug 11, 2025

I've been testing main for the past 15 minutes without readbacks and colors never changed, so that leaves us with a few possibilities:

-Regression from this PR.

-Regression with Readbacks in general (less likely since they've been merged for a while and someone probably would've reported it, but since I'm getting so many hangs in loading screens, perhaps no one has tested this).

-Regression from the GC PR but only when Readbacks are enabled.

Conclusion seems to be that it only happens with readbacks enabled. It would be nice to have more people testing this (and ideally without changing the emulator's code or using game patches)

Useless log:

CUSA03173.log

@raphaelthegreat
Contributor Author

Probably best to try a build right before GC got merged or build locally and comment out the RunGarbageCollector calls here

@Missake212
Contributor

Tried a build prior to GC merge and didn't have the color issue, I don't have VS set up so if someone else could test commenting out what Turtle said above to see if GC is indeed the issue it would help me out.

CUSA03173.log

@GHU7924

GHU7924 commented Aug 11, 2025

I tested the latest Main build.

Test without Readbacks.
image
image

With Readbacks.
image
image
image

These are the problems that I personally saw. Apparently there are regressions in Main itself.

@coolllman

coolllman commented Aug 11, 2025

Infamous
shad_log.txt

Update: fix last pr, thank you

@rafael-57
Contributor

rafael-57 commented Aug 11, 2025

> @rafael-57 (or anyone else familiar with tracy) can you profile the stutters that occur with the newer accuracy?

yes! But do I need to log something in particular or do I just run tracy with everything set to default? Also do I pull fix readbacks off or nah?

@rafael-57
Contributor

rafael-57 commented Aug 11, 2025

image image image
PXL_20250811_180926681.mp4

After warping at least 100 times with readbacks on and garbage collector ON/OFF, I can say I'm seeing some weird stuff with readbacks on even with accuracy 1 and the garbage collector commented.

Not just character faces but models in the level like tombstones and even weapons (see video) are exploding now, even with readbacks 1 which I don't remember seeing before

This is without 4k patches and stock memory. I should say, this branch is very unstable and very prone to crashing when opening the loading screen/warping to an area

EDIT: I'm pretty sure it's not the garbage collector now but something is just off with this PR now

I did not manage to replicate the extreme color issues the others got, all I got were small corruptions even with the GC off

image

@rafael-57
Contributor

rafael-57 commented Aug 11, 2025

image

Readbacks-POC branch, readbacks accuracy 1, latest commit reverted, garbage collector commented out.

Still getting garbage data. Not sure what is going on anymore...

In any case, it looks like each time a single particular asset has a corrupt texture or explodes. Sometimes it's a particular wood plank texture, sometimes it's the sword exploding, sometimes it's every single column in the area. I'm guessing assets just get corrupted, and by chance it happens to only one asset at a time in my testing.

@raphaelthegreat
Contributor Author

Can you keep reverting commits and test a bit to see if something broke it

@Missake212
Contributor

I think I got an answer to this, this is an old screenshot I took when testing readbacks a while ago, I think the issue has always been there but it barely happened, and now changing the readbackAccuracy actually exposes it more, I could be wrong but I think this is it.

image

@rafael-57
Contributor

rafael-57 commented Aug 11, 2025

> Can you keep reverting commits and test a bit to see if something broke it

Well before this I just tested this morning right before "fix readbacks off" and everything was fine, and then I've also reverted it.

I've just got this in main too, with readbacks on:
image

image

I've done a lot of testing with main and readbacks off and I'm not finding any corrupt textures after many warps.

I'm beginning to believe we just tested a lot today and started exposing some inherent issues that were already present with readbacks.

Is readbacks 2 more accurate than current main readbacks? I could test that too

@rafael-57
Contributor

rafael-57 commented Aug 11, 2025

My main takeaways after losing my sanity testing for 2 hours:

  1. Garbage collector isn't broken
  2. Readbacks aren't perfect, even in main (prone to hang/crash the emu, can still have some corrupt textures and random vertex explosions, but they're rare)
  3. Readbacks set to 0 is still a great improvement for BB, greatly reducing the occurrence of vertex explosions with a much smaller performance hit

@raphaelthegreat
Contributor Author

Readbacks 2 is the same as on main. The garbage textures seem like something that could happen, because the texture cache relies on a hashing workaround for that, but readbacks on main having vertex explosions sounds weird to me. Does it actually happen?

@rafael-57
Contributor

rafael-57 commented Aug 11, 2025

> Readbacks 2 is the same as on main. The garbage textures seems like it could happen because texture cache relies on a hashing workaround for that, but readbacks on main having vertex explosions sounds weird to me, does it actually happen?

Ok, I was testing with readback accuracy 1 instead of 2. So that could be the cause for vertex explosions on objects (not faces unlike main with readbacks OFF).

I do remember making this comment in the original readbacks PR though:
#2668 (comment)

After fiddling a lot I just got this even in main, whatever it is:
https://github.com/user-attachments/assets/8abced79-ef88-493c-9316-e6296144faef

Anyway I wouldn't go too out of scope with this PR. Existing readbacks are already in main with all their pros and cons, and this improves performance a bit + adds more granular settings.

Would be nice if readbacks were more stable in general though. Would it help if I collect a stack trace with Visual Studio when it crashes?

What about tracy? Do I just run it with stock code + readbacks accuracy set to 0?

@raphaelthegreat
Contributor Author

> Would be nice if readbacks were more stable in general though. Would it help if I collect a stack with visualstudio when crashing?

Yes

> What about tracy? Do I just run it with stock code + readbacks accuracy set to 0?

Yes

I suspect the reason for the stutter is the game overwriting some large GPU-modified region from the CPU, which would cause a page-by-page flush. I don't like that this PR has these bugs though, so I was thinking of shelving it until it's fixed.

@rafael-57
Contributor

> > Would be nice if readbacks were more stable in general though. Would it help if I collect a stack with visualstudio when crashing?
>
> Yes
>
> > What about tracy? Do I just run it with stock code + readbacks accuracy set to 0?
>
> Yes
>
> I suspect the reason of the stutter is the game overwriting some large GPU modified region from CPU which would cause page-by-page flush. I don't like that this PR has these bugs though so I was thinking of shelving it until its fixed

Tracy, started right during warp:
https://drive.google.com/file/d/1Ss5ZT_cMSdsfidoXC0eOThcePE4sh29_/view?usp=sharing

@bigol83
Contributor

bigol83 commented Aug 11, 2025

So testing more with Bloodborne, readback accuracy 0 is the only one that doesn't freeze the game but has big stutters. Best performance by far.

Readback accuracy 1 has better performance compared to main build by 5-10 fps, but it freezes the game

Readback accuracy 2 performance is the same as main build with readbacks enabled and it also freezes the game

The game freezes happen with readback enabled on main too.
