5 changes: 4 additions & 1 deletion Info.txt
@@ -220,7 +220,10 @@ LuaObjectFinalize
007D2EA0 SetCartographicVariables
007E19D0 SetMeshVariables
007EDFE0 GenerateRingCylinders
007EF5A0 RenderRings
007EF5A0 RenderRings (func_RenderRings)
007F5DA0 CRenFrame::InitTransformedVerts(this@ebx, float width, float height) retn 8
007F6030 CRenFrame::Render(this@edi, int width, int height) retn 8
004059E0 std::string::string(this@ecx, char const* s, size_t n) retn 8
005779C0 CreateMapData
004783D0 CreateTerrainHeights
00577890 InitSTIMap
5 changes: 5 additions & 0 deletions changelog.md
@@ -45,6 +45,11 @@ These don't matter except for other assembly patches
- hooks/RangeRings.cpp
- section/RangeRings.cpp

- Fix range ring stencil overflow with 128+ overlapping units. The 7-bit stencil counter wraps around, causing individual circle outlines to appear instead of a merged fill. The batch size is reduced from 1000 to 30, with an intermediate RangeMask flush between batches; this also enables GPU stencil early-out for dense clusters.

- hooks/RenderRingsFlush.hook
- section/RenderRingsFlush.cpp

- Camera performance improvements

- hooks/CameraPerf.cpp
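
To make the stencil-overflow changelog entry above concrete, here is a minimal sketch that models a single pixel covered by every ring. It is not engine code: `drawRings`, `PixelResult`, and the assumption that the intermediate flush simply zeroes the counter are illustrative simplifications (the PR describes the real flush as also reapplying the RangeMask).

```cpp
#include <cstdint>

// Toy model of ONE pixel that every ring overlaps; not engine code.
constexpr unsigned kCounterMax = 0x7F;  // 7-bit counter holds 0..127

struct PixelResult {
    unsigned maxCount;  // highest counter value observed
    bool wrapped;       // true if an increment overflowed past 127
};

// Hypothetical helper: draw `rings` overlapping rings on the pixel and
// clear the counter after every `batchSize` rings. Zeroing the counter is
// an ASSUMED stand-in for the intermediate RangeMask flush.
constexpr PixelResult drawRings(unsigned rings, unsigned batchSize) {
    unsigned counter = 0;
    PixelResult r{0, false};
    for (unsigned i = 0; i < rings; ++i) {
        if (i != 0 && i % batchSize == 0) {
            counter = 0;  // flush between batches
        }
        if (counter == kCounterMax) {
            r.wrapped = true;  // the next increment wraps to 0
        }
        counter = (counter + 1) & kCounterMax;
        if (counter > r.maxCount) {
            r.maxCount = counter;
        }
    }
    return r;
}

// 127 overlapping rings still fit; the 128th increment wraps.
static_assert(!drawRings(127, 1000).wrapped, "127 layers fit in 7 bits");
static_assert(drawRings(128, 1000).wrapped, "128 layers wrap the counter");

// With batches of 30 the counter never gets near the limit.
static_assert(!drawRings(200, 30).wrapped, "batched drawing does not wrap");
static_assert(drawRings(200, 30).maxCount == 30, "peak equals batch size");
```

The checks run at compile time (C++14 or later), so compiling the file is enough to confirm the arithmetic. In this model any batch size up to 127 would avoid the wrap; the sketch does not say why 30 was chosen.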
33 changes: 33 additions & 0 deletions hooks/RenderRingsFlush.hook
@@ -0,0 +1,33 @@
// Fix range ring stencil overflow with 128+ overlapping units.
//
// Background: func_RenderRings (0x007EF5A0) renders range rings via two
// loops sharing the same "Cast" effect technique that increments the GPU
// stencil per overdraw layer. The stencil is 7-bit (max 127), so 128+
// overlapping rings make the counter wrap around and the early-out
// optimization fail -- individual circle outlines appear instead of a
// unified shape.
//
// Loop 1 (0x007EF774-0x007EF7FC) draws the FILL of the ring band (1 quad
// per ring, high per-pixel overdraw on overlap) -- this is the loop that
// overflows. Loop 2 draws only the thin EDGE strips (2 quads per ring) and
// has near-zero per-pixel overdraw, so it does not need patching.
//
// Fix: reduce Loop 1 batch size from 1000 to 30, and inject an intermediate
// RangeMask flush between batches that resets the stencil counter.

// Loop 1 batch cap: cmp eax, 0x3E8 -> cmp eax, 30
// (patches the imm32 of the 5-byte 'cmp eax, imm32' instruction at 0x007EF77D)
0x007EF77E:
.byte 0x1E, 0x00, 0x00, 0x00

// Loop 1 batch ceiling: mov eax, 0x3E8 -> mov eax, 30
// (patches the imm32 of the 5-byte 'mov eax, imm32' instruction at 0x007EF784)
0x007EF785:
.byte 0x1E, 0x00, 0x00, 0x00

// Inject intermediate RangeMask flush at the bottom of the Loop 1 body.
// Replaces the 7-byte 'mov eax, [esp+0x84]' with a 5-byte call + 2 NOPs.
// IntermediateRangeMask reproduces the replaced instruction after the flush.
0x007EF7EA:
call @IntermediateRangeMask
nop
nop
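
The byte values and addresses in the hook above can be sanity-checked with a short sketch. It assumes only what the hook comments state (both patched instructions are 5 bytes, the replaced load is 7 bytes) plus the standard x86 encodings: one opcode byte followed by a little-endian imm32 for `cmp eax, imm32` and `mov eax, imm32`, and a 5-byte near `call rel32`. The helper names `imm32Bytes` and `immAddress` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Little-endian byte layout of a 32-bit immediate, as x86 stores it.
constexpr std::array<std::uint8_t, 4> imm32Bytes(std::uint32_t v) {
    return {{
        static_cast<std::uint8_t>(v & 0xFF),
        static_cast<std::uint8_t>((v >> 8) & 0xFF),
        static_cast<std::uint8_t>((v >> 16) & 0xFF),
        static_cast<std::uint8_t>((v >> 24) & 0xFF),
    }};
}

// With a single opcode byte, the immediate starts one byte into the
// instruction.
constexpr std::uint32_t immAddress(std::uint32_t instructionAddress) {
    return instructionAddress + 1;
}

constexpr auto kOldCap = imm32Bytes(1000);  // 0x3E8 -> E8 03 00 00
constexpr auto kNewCap = imm32Bytes(30);    // 0x1E  -> 1E 00 00 00

static_assert(kOldCap[0] == 0xE8 && kOldCap[1] == 0x03 &&
              kOldCap[2] == 0x00 && kOldCap[3] == 0x00, "original imm32");
static_assert(kNewCap[0] == 0x1E && kNewCap[1] == 0x00 &&
              kNewCap[2] == 0x00 && kNewCap[3] == 0x00, "patched imm32");

// Patch targets named in the hook file.
static_assert(immAddress(0x007EF77D) == 0x007EF77E, "cmp eax, imm32");
static_assert(immAddress(0x007EF784) == 0x007EF785, "mov eax, imm32");

// The 7-byte load is replaced by a 5-byte call, leaving 2 bytes for NOPs.
constexpr unsigned kReplacedLength = 7;
constexpr unsigned kCallLength = 5;
static_assert(kReplacedLength - kCallLength == 2, "two NOPs of padding");
```

Everything here is checked at compile time (C++14 or later); it confirms the arithmetic in the comments, not the contents of the target binary.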
106 changes: 106 additions & 0 deletions section/RenderRingsFlush.cpp
@@ -0,0 +1,106 @@
// Intermediate RangeMask flush to prevent 7-bit stencil counter overflow
// in the first Cast loop of func_RenderRings (0x007EF5A0).
//
// === Why this exists ===
//
// func_RenderRings draws range rings via two loops, both calling
// func_Draw_Rings with the same "Cast" effect technique. That technique
// increments the GPU stencil once per overdraw layer; later passes then use
// the stencil count for early-out culling and edge masking. The stencil
// is 7-bit (max 127). With 128+ overlapping rings the counter wraps and
// the visual collapses to individual circle outlines instead of a merged
// shape.
//
// Loop 1 (0x007EF774-0x007EF7FC): draws the FILL band of each ring.
// sub_7EDC80 emits 1 quad per ring covering the area between
// (inner radius + thickness) and (outer radius - thickness). High
// per-pixel overdraw at dense build sites -- this is the loop that
// overflows.
//
// Loop 2 (0x007EF861-0x007EF8D5): draws the two thin EDGE strips per ring.
// sub_7EDC80 emits 2 quads per ring at the inner and outer borders.
// Near-zero per-pixel overdraw because edges of differently-sized rings
// almost never coincide -- does not need patching in practice.
//
// Fix: drop Loop 1 batch size 1000 -> 30 and inject this flush between
// batches. The flush replays the engine's own end-of-loop sequence
// (InitTransformedVerts -> RangeMask string -> Render) which clears the
// stencil and reapplies the mask before the next batch starts fresh.
//
// === Calling conventions ===
//
// Despite the IDA mangled-name display showing standard __thiscall, this
// build uses a custom register-based thiscall for the CRenFrame methods
// (verified by inspecting the function bodies and the engine's own call
// sites at 0x007EF820-0x007EF849). A direct C++ call is impossible
// because no C++ calling convention puts `this` in EBX or EDI, so the
// flush is implemented in raw asm.
//
// 0x007F5DA0 Moho::CRenFrame::InitTransformedVerts
// this @ EBX, (float w, float h), retn 8
// first usage at 0x007F5DD1: cmp [ebx+0x1C], 0
//
// 0x004059E0 std::string::string (the engine builds an std::string
// in-place at the technique slot to select "RangeMask")
// this @ ECX, (char const* s, size_t n), retn 8
// standard MSVC __thiscall
//
// 0x007F6030 Moho::CRenFrame::Render
// this @ EDI, (int w, int h), retn 8
// first usage at 0x007F6055: cmp [edi+0x18], 0x10
//
// === Stack layout ===
//
// At entry, esp = caller_esp - 4 (the injected call pushed a return address).
// After pushad, esp = caller_esp - 36 (pushad stores eight 4-byte registers).
// At the injection point 0x007EF7EA inside func_RenderRings, the values
// needed here sit at:
//
// [caller_esp + 0x78] = idxa (gpg::gal::Head*) -- needed for width/height
// [caller_esp + 0x74] = arg_0 (CRenFrame container, +0x4C = technique ptr)
//
// Adding 36 (0x24) for the pushad and the return address gives the offsets
// used below: 0x78 + 0x24 = 0x9C and 0x74 + 0x24 = 0x98.
//
// The instruction replaced at 0x007EF7EA is the 7-byte
// `mov eax, [esp+0x84]` (= load loop counter `i`). It is reproduced after
// popad as `mov eax, [esp+0x88]` because esp is still 4 bytes lower there:
// the return address pushed by the injected call has not been popped yet.

void IntermediateRangeMask()
{
asm(
"pushad;"

// esi = idxa (Head*), edi = technique ptr
"mov esi, dword ptr [esp+0x9C];" // [caller+0x78]
"mov edi, dword ptr [esp+0x98];" // [caller+0x74]
"add edi, 0x4C;"

// --- InitTransformedVerts(this=ebx, width, height) ---
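"cvtsi2ss xmm0, dword ptr [esi+0x14];" // height, int -> float
"sub esp, 8;" // room for two float args
"movss dword ptr [esp+4], xmm0;" // 2nd arg: height
"cvtsi2ss xmm0, dword ptr [esi+0x10];" // width, int -> float
"mov ebx, edi;" // this @ ebx
"movss dword ptr [esp], xmm0;" // 1st arg: width
"call 0x007F5DA0;" // callee pops both args (retn 8)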

// --- std::string::string(this=ecx, "RangeMask", 9) ---
"push 9;"
"push 0x00E3F8E8;" // address of "RangeMask" string
"mov ecx, edi;" // this @ ecx
"call 0x004059E0;"

// --- Render(this=edi, width, height) ---
"mov ecx, dword ptr [esi+0x14];" // height
"mov edx, dword ptr [esi+0x10];" // width
"push ecx;" // 2nd arg: height
"push edx;" // 1st arg: width
"call 0x007F6030;" // this @ edi (relies on edi surviving the two calls above)

"popad;"
// Reproduce the 7-byte instruction we overwrote at 0x007EF7EA.
// esp is still -4 from the call's return address, so [esp+0x88]
// here equals [original_esp+0x84] in the host function.
"mov eax, dword ptr [esp+0x88];"
"ret;"
);
}
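
The esp-relative offsets used in `IntermediateRangeMask` follow from the stack layout described in its header comment. The sketch below redoes that arithmetic as compile-time checks. The names `afterPushad` and `afterPopad` are hypothetical, and the sketch assumes only that the injected `call` pushes a 4-byte return address and that `pushad` stores eight 4-byte registers.

```cpp
#include <cstdint>

constexpr std::uint32_t kReturnAddress = 4;  // pushed by the injected call
constexpr std::uint32_t kPushad = 8 * 4;     // eax, ecx, edx, ebx, esp, ebp, esi, edi

// Offset from esp after pushad for a slot that the host function
// addresses as [esp + hostOffset] at the injection point.
constexpr std::uint32_t afterPushad(std::uint32_t hostOffset) {
    return hostOffset + kReturnAddress + kPushad;
}

// Offset from esp after popad, when only the return address is still
// on the stack.
constexpr std::uint32_t afterPopad(std::uint32_t hostOffset) {
    return hostOffset + kReturnAddress;
}

static_assert(afterPushad(0x78) == 0x9C, "idxa (Head*) load");
static_assert(afterPushad(0x74) == 0x98, "arg_0 load");
static_assert(afterPopad(0x84) == 0x88, "reproduced loop-counter load");
```

Both stack loads happen before the `sub esp, 8`, so they do not need a further adjustment; any load added after that point would need another 8 bytes until the callee's `retn 8` restores esp.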