@abaire
Member

@abaire abaire commented Feb 3, 2026

Paired MAC+ILU and duplicate output instructions can write to oPos while reading from R12. This leads to a case where xemu's serialized emulation erroneously uses the output of a previous instruction when calculating the value of a later one.

E.g., in

 /* 0x00000000 0x0080201A 0xC4002868 0x7CB0E800 */
 MAD oPos.xyz, R12.xyz, R1.x, C[1].xyz
   +  MAD R11.xy, R12.xyz, R1.x, C[1].xyz

the value of oPos prior to the first instruction should be used for both MAD calculations.
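
To make the hazard concrete, here is a minimal C sketch (not xemu code) that evaluates this token serially. The `vec4` struct, `run_token`, and the `latch_r12` switch are hypothetical names introduced for illustration. With the switch off, R12 is read through the live oPos, which is assumed to model the aliasing behaviour described above.

```c
#include <stdbool.h>

typedef struct { float x, y, z, w; } vec4;

/* One token: MAD oPos.xyz, R12.xyz, R1.x, C[1].xyz
 *          + MAD R11.xy,  R12.xyz, R1.x, C[1].xyz
 * latch_r12 == false: R12 aliases the live oPos (assumed old behaviour).
 * latch_r12 == true:  R12 is a snapshot taken before any write. */
void run_token(vec4 *oPos, vec4 *r11, float r1x, vec4 c1, bool latch_r12)
{
    vec4 latched = *oPos;                          /* value before the token */
    const vec4 *r12 = latch_r12 ? &latched : oPos;

    /* First write: oPos.xyz = R12.xyz * R1.x + C[1].xyz */
    vec4 a = *r12;
    oPos->x = a.x * r1x + c1.x;
    oPos->y = a.y * r1x + c1.y;
    oPos->z = a.z * r1x + c1.z;

    /* Second write: R11.xy = R12.xy * R1.x + C[1].xy.
     * R12 is re-read here, as serialized shader code would do. */
    vec4 b = *r12;
    r11->x = b.x * r1x + c1.x;
    r11->y = b.y * r1x + c1.y;
}
```

With oPos = (1, 2, 3, 1), R1.x = 2 and C[1] = (10, 20, 30, 0), the aliased path yields R11.xy = (34, 68), while the latched path yields the intended (12, 24). Both paths leave oPos.xyz = (12, 24, 36).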

This could alternatively be fixed by writing oPos to a temp vector and deferring the output vector update until after the token is fully processed.

Tests
HW results
PR results

Fixes #1864

@gemini-code-assist

Summary of Changes

Hello @abaire, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the NV2A vertex shader emulation concerning the R12 register, which acts as a mirror for oPos. Previously, in complex instructions involving both reading from R12 and writing to oPos, the serialized emulation could incorrectly use an updated oPos value for subsequent parts of the same instruction. The changes introduce a "latching" mechanism for R12 by making it an independent variable and explicitly assigning oPos to it at the beginning of an instruction if R12 is used as an input, ensuring consistent input values throughout the instruction's execution.

Highlights

  • R12 Input Correction: Modified the emulation of R12 in the NV2A vertex shader to correctly capture the oPos value at the start of an instruction, preventing erroneous reads in paired or multi-output operations where oPos is also being written.
  • R12 Definition Change: Changed R12 from a direct #define alias of oPos to an independent vec4 variable, allowing its value to be explicitly latched before instruction execution.
  • Shader Code Generation Refactoring: Refactored the decode_opcode function to append generated shader code to an existing string, streamlining the code generation process.

Changelog
  • hw/xbox/nv2a/pgraph/glsl/vsh-prog.c
    • Introduced a bool *uses_r12_latch parameter to decode_opcode_input to track R12 usage.
    • Modified decode_opcode_input to set uses_r12_latch to true if R12 is detected as an input.
    • Refactored decode_opcode to take an MString *ret argument and append to it, rather than returning a new MString.
    • Implemented logic in decode_token to assign R12 = oPos; at the start of an instruction if R12 is used as an input.
    • Changed the definition of R12 from a #define R12 oPos to a vec4 R12 = vec4(0.0,0.0,0.0,0.0); to enable the latching behavior.
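
The flag-and-latch flow listed above can be sketched roughly as follows. This is a simplified illustration, not the code in vsh-prog.c: `temp_input` and `emit_mul_to_opos` are hypothetical helpers, a plain `char` buffer stands in for `MString`, and only the `uses_r12_latch` flag name is taken from the changelog.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical input decoder: writes the GLSL name of temp register
 * `index` into buf and flags any read of R12. */
static void temp_input(char *buf, size_t cap, int index, bool *uses_r12_latch)
{
    if (index == 12) {
        *uses_r12_latch = true;
    }
    snprintf(buf, cap, "R%d", index);
}

/* Hypothetical token emitter: generates "oPos.xyz = R<a>.xyz * R<b>.x;"
 * and prepends "R12 = oPos;" when either input is R12, so that R12 holds
 * the pre-token position for the whole token. Returns snprintf's result. */
int emit_mul_to_opos(char *out, size_t cap, int a, int b)
{
    char ra[16], rb[16];
    bool uses_r12_latch = false;

    temp_input(ra, sizeof ra, a, &uses_r12_latch);
    temp_input(rb, sizeof rb, b, &uses_r12_latch);

    return snprintf(out, cap, "%soPos.xyz = %s.xyz * %s.x;\n",
                    uses_r12_latch ? "R12 = oPos;\n" : "", ra, rb);
}
```

Calling `emit_mul_to_opos(buf, sizeof buf, 12, 1)` produces a latch line followed by the multiply, while inputs that do not include register 12 produce only the multiply.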

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an issue where paired instructions reading from R12 could get an incorrect value if a previous instruction in the same token wrote to oPos. The fix involves latching the value of oPos into a new R12 variable at the start of the instruction token if R12 is used as an input. The changes look correct and effectively solve the problem. A refactoring of decode_opcode also makes the code cleaner. However, this refactoring has introduced a minor memory leak, which I've pointed out in a specific comment with a suggested fix.

@abaire abaire force-pushed the fix_1864_r12_input_parallel_with_opos_write branch from 32d2b5d to edc15c1 Compare February 3, 2026 17:54
Paired MAC+ILU and duplicate output instructions can write to oPos while reading
from R12. This leads to a case where xemu's serialized emulation erroneously
uses the output of a previous instruction when calculating the value of a later
one.

E.g., in

```
 /* 0x00000000 0x0080201A 0xC4002868 0x7CB0E800 */
 MAD oPos.xyz, R12.xyz, R1.x, C[1].xyz
   +  MAD R11.xy, R12.xyz, R1.x, C[1].xyz
```

the value of oPos prior to the first instruction should be used for both MAD
calculations.

This could alternatively be fixed by writing oPos to a temp vector and deferring
the output vector update until after the token is fully processed.

Note that we always process output register writes before temp register writes
so the only interesting case should be the oPos/R12 alias. An op that writes to
one of its temp inputs will always execute the non-modifying output register
write before updating the temp register.
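
A small C sketch of why that ordering matters, under the assumption that each write re-evaluates the op's expression from the current register state. `run_add` and the `output_first` switch are hypothetical and exist only to contrast the two orders; `oD0` is used as an example output register.

```c
typedef struct { float x, y, z, w; } vec4;

static vec4 add(vec4 a, vec4 b)
{
    return (vec4){ a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
}

/* One op, "ADD <dst>, R1, R2", with two destinations: output register oD0
 * and temp register R1 (which is also an input). */
void run_add(vec4 *oD0, vec4 *r1, vec4 r2, int output_first)
{
    if (output_first) {
        *oD0 = add(*r1, r2);   /* output write leaves every input intact */
        *r1  = add(*r1, r2);   /* temp write last, still sees the old R1 */
    } else {
        *r1  = add(*r1, r2);   /* clobbers an input first */
        *oD0 = add(*r1, r2);   /* now computed from the updated R1 */
    }
}
```

With R1 = (1, 1, 1, 1) and R2 = (2, 2, 2, 2), the output-first order gives oD0.x = 3 and R1.x = 3. The reversed order gives R1.x = 3 but oD0.x = 5. Output registers other than oPos cannot be read back as inputs, which is why only the oPos/R12 alias needs extra handling.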

[Tests](https://github.com/abaire/nxdk_pgraph_tests/blob/1745a45290a5d607f9582303d6882cbf62f003de/src/tests/vertex_shader_independence_tests.cpp#L125)
[HW results](https://abaire.github.io/nxdk_pgraph_tests_golden_results/results/Vertex_shader_independence_tests/index.html)

Fixes xemu-project#1864
@abaire abaire force-pushed the fix_1864_r12_input_parallel_with_opos_write branch from edc15c1 to 8ff499f Compare February 3, 2026 18:34
@abaire abaire marked this pull request as ready for review February 3, 2026 19:01
@viniciusol263

It's fixed indeed, thanks!

@Triticum0
Collaborator

Tested a few games; I don't think any known issues are fixed, and the behaviour is the same.

@mborgerson
Member

mborgerson commented Feb 5, 2026

Thanks for the patch!

> This could alternatively be fixed by writing oPos to a temp vector and deferring the output vector update until after the token is fully processed.

The extra register and additional latching/stale checks are adding more complexity. I think this alternative suggestion sounds simple and more maintainable--not limited to oPos, but analyze inputs/outputs to mitigate the hazard generally. We've encountered this issue already, so we should unify the approaches.

@abaire
Member Author

abaire commented Feb 5, 2026

> The extra register and additional latching/stale checks are adding more complexity. I think this alternative suggestion sounds simple and more maintainable--not limited to oPos, but analyze inputs/outputs to mitigate the hazard generally. We've encountered this issue already, so we should unify the approaches.

I don't think it can be trivially unified; the current temp register usage will only work for the MAC+ILU case.
The least complex approach is to detect writes to oPos and always defer them until after the instruction is fully processed with an additional special temp register. E.g.,

tempPos.mask = something;
oPos.mask = tempPos.mask;

Then we'd be covered for the most complicated case, which would be a MAC + ILU pairing where one or the other writes to oPos + a temp reg and either/both of them read from R12.

The downside is that we don't (currently) parse the ILU args until after the MAC is fully processed, so there may be a situation where we assign to the temp unnecessarily since a MAC write to oPos would have to cover the worst case of an R12 ILU read.
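
A rough C sketch of the deferred write described above, under simplifying assumptions: `masked_copy`, `run_token_deferred`, and the `MASK_*` constants are hypothetical, the MAC half is reduced to a scale of R12.xy, and the ILU half is reduced to a plain copy of R12 into R11. Because oPos is only updated at the end of the token, R12 can remain a plain alias here.

```c
typedef struct { float v[4]; } vec4;

enum { MASK_X = 8, MASK_Y = 4, MASK_Z = 2, MASK_W = 1 };

/* Copy only the components selected by mask from src into dst. */
static void masked_copy(vec4 *dst, const vec4 *src, unsigned mask)
{
    for (int i = 0; i < 4; i++) {
        if (mask & (8u >> i)) {
            dst->v[i] = src->v[i];
        }
    }
}

void run_token_deferred(vec4 *oPos, vec4 *r11, float scale)
{
    const vec4 *r12 = oPos;   /* R12 stays an alias of oPos */
    vec4 tempPos = *oPos;     /* hypothetical staging register */
    unsigned pending = 0;     /* components written during this token */

    /* MAC half: tempPos.xy = R12.xy * scale (instead of writing oPos) */
    vec4 mac = { { r12->v[0] * scale, r12->v[1] * scale, 0.0f, 0.0f } };
    masked_copy(&tempPos, &mac, MASK_X | MASK_Y);
    pending |= MASK_X | MASK_Y;

    /* ILU half (simplified to a copy): R11 = R12, still the old oPos */
    *r11 = *r12;

    /* End of token: oPos.mask = tempPos.mask */
    masked_copy(oPos, &tempPos, pending);
}
```

Starting from oPos = (1, 2, 3, 4) with scale = 10, R11 receives the untouched (1, 2, 3, 4) and oPos ends as (10, 20, 3, 4). The staging copy is made even when nothing reads R12, which mirrors the unnecessary-assignment downside mentioned above.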

@abaire
Member Author

abaire commented Feb 5, 2026

Committed a version that piggybacks on the existing suffix workaround to conservatively defer modifications of oPos.

I think ideally this code would be refactored to have a bit more state tracking so that all inputs could be resolved with special cases (like R12 use) flagged as they're expanded into strings (we can also drop a couple string copies by threading the buffer through). Then we could reserve this hackery for the cases where oPos is modified before a use of R12. For now I'm optimistic that this simple approach won't have much perf impact since it's only triggered in MAC+ILU or multi-write situations that output position.

Development

Successfully merging this pull request may close these issues.

Prince of Persia: The Sands of Time: Incorrect rendering of water surface

4 participants