Skip to content

decomp: Add CFG, Dominance, and Loop Analysis for AST structuring#139

Open
kumarak wants to merge 2 commits intomainfrom
ast_improvements
Open

decomp: Add CFG, Dominance, and Loop Analysis for AST structuring#139
kumarak wants to merge 2 commits intomainfrom
ast_improvements

Conversation

@kumarak
Copy link
Collaborator

@kumarak kumarak commented Feb 24, 2026

This PR introduces new AST infrastructure for CFG construction, dominance analysis, loop detection, and structured code emission. It adds components such as CfgBuilder, DomTree, LoopInfo, and a Ghidra-inspired CollapseStructure algorithm to transform unstructured CFGs into structured SNode trees. It also introduces ClangEmitter for AST emission and SNodeDebug for Graphviz visualization. Minor naming and style updates improve consistency across related headers.

@kumarak kumarak marked this pull request as draft February 24, 2026 20:37
@kumarak kumarak force-pushed the ast_improvements branch 4 times, most recently from 6fe46e3 to 24fba92 Compare March 5, 2026 23:10
@kumarak kumarak force-pushed the ast_improvements branch 4 times, most recently from 3fb4be9 to be02750 Compare March 10, 2026 00:19
@kumarak kumarak marked this pull request as ready for review March 10, 2026 03:19
@kumarak kumarak changed the title [wip]decomp: llvm passes to improve ast generated from pcode decomp: llvm passes to improve ast generated from pcode Mar 10, 2026
@kumarak kumarak requested a review from Copilot March 10, 2026 14:47
@kumarak kumarak changed the title decomp: llvm passes to improve ast generated from pcode decomp: Add CFG, Dominance, and Loop Analysis for AST structuring Mar 10, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a new structured AST normalization/structuring path for patchir-decomp, adding CFG extraction + Ghidra-style CollapseStructure + re-emission back into Clang AST, along with various type/operation handling tweaks and updated lit JSON tests for switch/loop structuring.

Changes:

  • Add new AST structuring infrastructure: SNode IR, CFG builder + dominator/loop utilities, CollapseStructure, and a Clang AST emitter with pretty-print cleanup.
  • Update decompiler operation/type handling (e.g., switch cases metadata, string-literal LOAD handling, STORE deref simplification, PIECE/SUBPIECE adjustments).
  • Add CLI flags and extend/adjust lit tests for switch/loop structuring scenarios.

Reviewed changes

Copilot reviewed 39 out of 41 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tools/patchir-decomp/main.cpp Adds CLI flags for structuring/goto-elimination and tightens codegen execution on diagnostics.
tools/CMakeLists.txt Fixes indentation/formatting for tool subdirectory inclusion.
test/patchir-decomp/switch_successor_blocks.json Updates switch test input/expectations for explicit switch case metadata and parameter usage.
test/patchir-decomp/switch_missing_target_blocks.json Adds/updates a test for missing switch target blocks and uses the new structuring flag in RUN lines.
test/patchir-decomp/switch_cases_switch_input_discriminant.json Updates discriminant handling expectations to use parameter-based selector and explicit cases.
test/patchir-decomp/ghidra_struct_while.json New test covering while structuring via Ghidra-style pipeline.
test/patchir-decomp/ghidra_struct_switch.json New test covering switch structuring via Ghidra-style pipeline (print-tu only).
test/patchir-decomp/ghidra_struct_sequential.json New test for sequential (non-loop/goto) structuring behavior.
test/patchir-decomp/ghidra_struct_nested_loop.json New test for nested loop structuring behavior.
test/patchir-decomp/ghidra_struct_irreducible.json New test for irreducible flow fallback behavior.
test/patchir-decomp/ghidra_struct_if_else.json New test for if/else structuring behavior.
test/patchir-decomp/ghidra_struct_dowhile.json New test for do/while structuring behavior.
test/lit.cfg.py Minor formatting change.
lib/patchestry/Passes/Utils.cpp Renames local constants to k... style for bit-width calculations.
lib/patchestry/Ghidra/JsonDeserialize.cpp Refactors bitfield/string deserialization and include ordering; adds warning on missing bitfield base type.
lib/patchestry/Dialect/Pcode/Deserialize.cpp Renames deserializer to Deserializer for consistency.
lib/patchestry/AST/TypeBuilder.cpp Updates to new TypeBuilder constant/member naming and error-message text.
lib/patchestry/AST/SNodeDebug.cpp New DOT emitter for visualizing SNode trees.
lib/patchestry/AST/SNode.cpp New SNode dump implementation utilities.
lib/patchestry/AST/OperationStmt.cpp Updates switch lowering, LOAD/STORE simplifications, call signature reconstruction, PIECE/SUBPIECE tweaks.
lib/patchestry/AST/OperationBuilder.cpp Adjusts operation-stmt caching/materialization logic and TypeBuilder API rename usage.
lib/patchestry/AST/LoopInfo.cpp New natural loop detection utilities.
lib/patchestry/AST/FunctionBuilder.cpp Updates for TypeBuilder API rename and documents calling convention limitations.
lib/patchestry/AST/DomTree.cpp New dominator/postdominator tree implementation.
lib/patchestry/AST/ClangEmitter.cpp New conversion from SNode back to Clang AST + cleanup passes (label/goto/decl fixups).
lib/patchestry/AST/CfgBuilder.cpp New CFG builder from Clang function bodies and switch metadata population from Ghidra JSON.
lib/patchestry/AST/CMakeLists.txt Adds new AST structuring sources to the build.
lib/patchestry/AST/ASTConsumer.cpp Integrates CFG+CollapseStructure+emitter pipeline into the consumer flow.
include/patchestry/Util/Options.hpp Adds goto-elimination-related fields to global options.
include/patchestry/Ghidra/PcodeTypes.hpp Encapsulates BitField base type member and clarifies StringType comment.
include/patchestry/Ghidra/Pcode.hpp Renames mnemonic mapper constants to k... style.
include/patchestry/Dialect/Pcode/Deserialize.hpp Renames deserializer to Deserializer in the public header.
include/patchestry/AST/TypeBuilder.hpp Renames constants/methods to k... / Get... / AstContext.
include/patchestry/AST/SNodeDebug.hpp Declares DOT emitter API for SNode.
include/patchestry/AST/SNode.hpp New SNode IR definition and arena factory.
include/patchestry/AST/LoopInfo.hpp Declares loop detection data structures/APIs.
include/patchestry/AST/DomTree.hpp Declares dominator tree API.
include/patchestry/AST/CollapseStructure.hpp Declares Ghidra-style structuring algorithm + internal graph types.
include/patchestry/AST/ClangEmitter.hpp Declares SNode→Clang emission and pretty-print cleanup APIs.
include/patchestry/AST/CfgBuilder.hpp Declares CFG structures/build + switch metadata population API.
Comments suppressed due to low confidence (1)

lib/patchestry/AST/OperationStmt.cpp:1658

  • In create_piece, low_width can remain 0 if the serialized type lookup fails for op.inputs[1].type_key, which makes the shift a no-op and produces an incorrect PIECE result (high and low parts overlap). Add a fallback to derive the bit-width from input1_expr->getType() (or another reliable source) when the type_key lookup fails.
        // Determine low-part bit width from input1's type.  The Varnode::size
        // field is not populated for operation inputs, so look up the type.
        unsigned low_width = 0;
        {
            auto low_type_it =
                type_builder().GetSerializedTypes().find(op.inputs[1].type_key);
            if (low_type_it != type_builder().GetSerializedTypes().end()) {
                low_width = static_cast< unsigned >(ctx.getTypeSize(low_type_it->second));
            }
        }

        auto merge_to_next = !op.output.has_value();

        // If Operation has type, convert expression to operation type and perform bit-shift and
        // or operation.
        if (op.type) {
            auto op_type           = type_it->second;
            auto *cast_expr_input0 = make_cast(ctx, input0_expr, op_type, location);
            if (!cast_expr_input0) {
                LOG(ERROR) << "Failed to create cast expression for PIECE input0. key: "
                           << op.key;
                return {};
            }
            input0_expr            = cast_expr_input0;
            auto *cast_expr_input1 = make_cast(ctx, input1_expr, op_type, location);
            if (!cast_expr_input1) {
                LOG(ERROR) << "Failed to create cast expression for PIECE input1. key: "
                           << op.key;
                return {};
            }
            input1_expr = cast_expr_input1;
        }

        // When low_width is 0, the shift is a no-op — skip it to avoid "<< 0".
        clang::Expr *high_expr = input0_expr;
        if (low_width > 0) {
            auto *shift_value = clang::IntegerLiteral::Create(
                ctx, llvm::APInt(ctx.getIntWidth(ctx.IntTy), low_width), ctx.IntTy, location
            );

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

decomp: update the ci pipeline to verify the function calls

decomp: add unit tests for decompilation

decomp: update option flags to disable structuring passes

decomp: update switch handling and remove fallback code entirely and add new tests
Copy link
Collaborator

@kyle-elliott-tob kyle-elliott-tob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a tests without a FileCheck and some general cleanup on naming too, lots of mixed cases/naming conventions

// References invalidated — use g.node() from here

// Redirect cond's edge from clause to copy
g.node(bl.id).succs[i] = copy_id;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment above suggests, references are invalidated but we still reference bl.id in two places after, which could be a use-after-free. Would recommend storing ID in a local

}

std::list<TraceDAG::BlockTrace *>::iterator
TraceDAG::openBranch(CGraph &g, BlockTrace *parent) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushBranches calls openBranch(g, bt) where bt = *current_activeriter. Inside openBranch, removeActive(parent) erases the list node that current_activeiter points to, causing the function to return an invalidated iterator which pushBranches then dereferences.

I think the fix is to grab the next iter from parent->activeiter before removeActive and return that, but this needs to be verified as I don't even know if I traced this logic correctly.

}

while (activecount > 0) {
int missedcount = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missedcount seems to always be zero, which makes this possibly infinitely loop. Guard condition for missedcount >= activecount will never be triggered.

std::vector< size_t > exits; // blocks outside the loop that are successors of body blocks
NaturalLoop *parent = nullptr; // enclosing loop (nullptr if outermost)
std::vector< NaturalLoop * > children; // nested loops
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Child pointers to other NaturalLoops can be invalidated by calls in detectLoops leading to dangling pointer references

return x == a;
}

std::vector< size_t > DomTree::DominanceFrontier(size_t block,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be totally wrong about this, my my understanding Is that Dominance Frontier analysis examines all edges. What's the use of the of the fg.blocks[b].succs.size() < 2 guard here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a guard to ignore any unconditional edges from the dominance frontier. Probably, we don't need this optimization here.

p.erase(std::remove(p.begin(), p.end(), nid), p.end());
}
}
for (size_t p : nodes[nid].preds) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If two collapsed nodes share a predecessor, doesn't this get duped?

}
}

// Fallback goto after the switch.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a switch block has case gotos but no post-switch fallback, fallthrough_succ stays at 0 which might produce a bogus edge. Idk if there's a way to have a sentinel value or an optional maybe.

// Add the sub-statement (what the label wraps)
clang::Stmt *sub = label_stmt->getSubStmt();
if (sub && !llvm::isa< clang::NullStmt >(sub)) {
// If sub is another label, recurse; otherwise add it
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of handling only two levels, maybe we can use a while loop to peel nested LabelStmt nodes.

worklist.push(be.tail);
}

// Build predecessor map
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Built ever iteration, can this be hoisted outside of the loop? Not sure as I can't trace all the logic here

}
}

void LoopBody::labelContainments(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments should probably be situated here

@kumarak
Copy link
Collaborator Author

kumarak commented Mar 12, 2026

There's quite a tests without a FileCheck and some general cleanup on naming too, lots of mixed cases/naming conventions

@kyle-elliott-tob, Thanks for the review.
I checked that most of the unit tests include FileCheck. The one which is not make any call does not check for a function call. I also updated the PR to address comments and fixed the naming-convention mix-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants