diff --git a/docs/README.md b/docs/README.md index 1bd3d0e9..3c9d6e8a 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,4 +1,4 @@ -# MLIR-based binary patching framework +# Patchestry: Multi-Layered Binary Lifting and Patching Framework Patchestry aims to make the same impact to binary patching as compilers and high level languages did to early software development. Its main goal is to enable @@ -17,29 +17,27 @@ program, but at a different level of abstraction. MLIR brings a notable advantage by enabling the creation of representations to streamline communication between diverse state-of-the-art tools. For instance, -one can create an MLIR dialect specifically for P-Code (a program representation -utilized by Ghidra) to optimize integration with the Ghidra decompiler. -Alternatively, an LLVM IR dialect can be employed to compile back to the -executable, and MLIR can support LLVM-based contract validation through a +ClangIR provides an MLIR dialect that closely mirrors the Clang AST, preserving +high-level C/C++ semantics and enabling precise source-level analysis and +transformation. From there, an LLVM IR dialect can be employed to compile back +to the executable, and MLIR can support LLVM-based contract validation through a symbolic executor such as KLEE. Moreover, MLIR provides flexibility to devise our own dialects for representing contracts in specialized logic, such as -SMT. Finally, our [high-level dialect](https://trailofbits.github.io/vast/), -developed under DARPA V-SPELLS, captures the intricacies of full-featured C. Our -compiler stacks empower us to compile C into any of the previously mentioned +SMT. Our compiler stacks empower us to compile C into any of the previously mentioned representations, promoting seamless interconnection between them. 
## Technical Rationale -Our recent experience on AMP, as well as our performance on other DARPA binary +Our experience on AMP, as well as our performance on other DARPA binary analysis programs (PACE, CFAR, LOGAN, CGC, Cyber Fast Track), have led us to four guiding principles that we believe patching solutions for legacy software must follow in order to be successful. -1. __Fully automated approaches are doomed to failure.__ In general, the process of +1. __Fully automated approaches are doomed to failure.__ The process of decompilation is an inherently intractable problem. However, developers are often capable of distinguishing between decompilation outcomes deemed 'good' or 'bad', but encoding that kind of heuristic logic into a system invariably yields -unpredictability and unsoundness. Hence, we assert that the involvement of +unpredictability and unsoundness. Hence, we assert that the involvement of semi-skilled or skilled human developers is essential in the process. The best-case scenario is that a developer can use an existing source code patch as a guide. Given this patch, they can locate the corresponding vulnerable machine @@ -48,12 +46,11 @@ involves the ad hoc application of tools (e.g. BinDiff, BSim) and reverse engineering skills to an opaque binary blob that is without symbols or debugging information. -2. __Developers must be able to leverage pre-existing software development experience__ -and not have to concern themselves with low level details. That is, they should -be able to operate as if the original source code and build process/environments -were available, and not be expected to have expert knowledge of every machine -code language that may be encountered. - +2. __Developers must be able to leverage pre-existing software development +experience__ and not have to concern themselves with low level details. 
They +should be able to operate as if the original source code and build +process/environments were available, and not be expected to have expert +knowledge of every machine code language that may be encountered. 3. __From-scratch development efforts do not scale.__ As much as possible, pre-existing tooling that already handles the inherent scalability challenges in @@ -62,7 +59,7 @@ example the Ghidra decompiler can decompile over 100 machine code languages to C, and the Clang compiler can generate machine code for over 20 machine code languages. Rolling new solutions from scratch is impractical. -4. __There is no one-size-fits-all way of representing code.__ A “complete” +4. __There is no one-size-fits-all way of representing code.__ A "complete" solution to machine code decompilation only exists at the end of a long tail of special cases. Patchestry aims to provide decompilation to a familiar, C-like language. Patchestry will not, however, decompile to C or a specific ISO dialect @@ -87,7 +84,7 @@ effectiveness and optimal outcomes across all desired functionalities. ### Incremental Decompilation -Patchestry’s innovative approach involves leveraging multiple program +Patchestry's innovative approach involves leveraging multiple program representations simultaneously across various layers of the Tower of IRs. While state-of-the-art decompilers already offer diverse representations, what sets the Tower of IRs apart is its capability to create custom user-defined @@ -99,7 +96,7 @@ abstractions tailored to unique platforms. ### Unifying Representations for Contracts, Patches, and Software The Tower of IRs also aligns with the fourth guiding principle: There is no -one-size-fits-all way of representing code. Maintaining multiple +one-size-fits-all way of representing code. Maintaining multiple representations simultaneously in the Tower of IRs allows us to establish meaningful relationships between them and innovate in how we connect tools and conduct analyses. 
Additionally, this approach allows us to consolidate all @@ -109,138 +106,304 @@ streamlines tooling for analysis and facilitates the recompilation of patched software, resulting in a single artifact that can undergo desired formal analyses, such as LLVM-based analysis. - ### Declarative Patching and Contracts Description To address our second guiding principle, which emphasizes the importance of allowing developers to leverage their existing software development experience, we mandate that all interactions with patching occur in a language commonly -understood by developers. Specifically, a C-like language. To facilitate this, -we propose a declarative library designed for describing patches, their -application. Following the same principle, Patchestry introduces contracts in -C-like DSL. These contracts serve as constraints guiding both decompilation and -recompilation, and they must hold at all relevant steps of each process. +understood by developers. Patches are written as C functions, and the +meta-programming layer for describing _where_ and _how_ patches and contracts +are applied is specified declaratively in YAML. This separates patch logic +(familiar C code) from patch orchestration (structured YAML configuration), +keeping both accessible to developers without requiring expertise in compiler +internals or custom DSLs. Contracts follow the same pattern: runtime contracts +are written in C, static contracts are expressed as YAML predicates, and +meta-contracts define where they are applied. + +## Why This Approach + +### Why Not Edit Ghidra's Output Directly? + +Patchestry's workflow does not allow the developer to modify Ghidra's +decompilation output and then re-compile it. There are two reasons: + +1. Ghidra's decompilation is not guaranteed to be syntactically correct or + compilable. The effort to fix it increases with the complexity of the target + function(s). +2. 
Ghidra's heuristic decompilation pipeline has been proven to be unfaithful + with respect to the execution semantics of the machine code. This could + result in a developer inadvertently introducing new vulnerabilities during the + patching process. + +Despite this, Ghidra's decompilation is good enough to be a productivity +multiplier for developers trying to locate functions that need patching. + +### Why Clang AST? + +Patchestry lifts Ghidra's P-Code representation into a Clang AST. This choice +is driven by pragmatic considerations: + +- __Recompilation for free.__ The Clang compiler can already target over 20 + machine code languages. By producing a valid Clang AST, Patchestry gets + recompilation to any Clang-supported target architecture without building a + custom compiler backend. +- __Function-level granularity.__ Functions are the smallest compilable unit of + code in compilers like Clang. Function granularity patches also enable + Patchestry to leverage stronger ABI guarantees: it is only at the entry and + exit points of a compiled function that higher-level, human-readable types can + be reliably mapped to low level machine locations (registers, memory). +- __Familiar output.__ The decompiled C output looks approximately similar + regardless of the platform/architecture, improving developer productivity. + Developers can read and modify the output using standard C knowledge. +- __Integration with MLIR.__ Clang's CIR (ClangIR) dialect provides a bridge + into the MLIR ecosystem, enabling instrumentation, patching, and contract + verification using MLIR-based passes before lowering to LLVM IR. + +### Why MLIR? + +Patchestry leverages MLIR as the foundational technology for its intermediate +representations instead of LLVM IR directly. MLIR allows for the specification, +transformation, and mixing of IR dialects. 
With MLIR, the decompilation and +patching process is stratified into a Tower of IRs (IR dialects), where each IR +represents the same program at a different level of abstraction. This enables: + +- Creation of an MLIR dialect specifically for P-Code to optimize integration + with the Ghidra decompiler. +- Use of an LLVM IR dialect to compile back to the executable. +- LLVM-based contract validation through symbolic executors such as KLEE or + SeaHorn. +- Custom dialects for representing contracts in specialized logic (e.g., SMT). + +## Decompilation Pipeline + +Patchestry's decompilation pipeline converts binary functions into editable, +recompilable C code through the following stages: + +``` +Binary --> Ghidra --> P-Code (JSON) --> Clang AST --> C Output + | + CIR (MLIR) --> Instrumentation --> LLVM IR --> Machine Code +``` + +1. __Ghidra P-Code serialization.__ A Ghidra plugin serializes the decompiled + P-Code representation of target function(s) to JSON format. This + serialization captures types, control flow, operations, and variable + information from Ghidra's analysis database. -## Decompilation Workflow +2. __Lifting to Clang AST.__ The `patchir-decomp` tool reads the serialized + P-Code JSON and constructs a Clang AST. This involves type reconstruction, + control flow structuring (recovering if/else, loops, and switch statements + from the flat P-Code graph), and mapping P-Code operations to C statements. -![Patchestry ls](img/patchestry-workflow.svg) +3. __C output.__ The Clang AST is emitted as human-readable C code that the + developer can inspect and edit. -Patchestry’s technical approach is designed to enable the following seven-step workflow: +4. __CIR and MLIR lowering.__ The Clang AST is lowered through ClangIR (CIR) + into the MLIR Tower of IRs. At this level, patches and contracts are applied + via the instrumentation engine. The result is lowered to LLVM IR for + recompilation. 
+ +## Developer Workflow + +Patchestry's technical approach enables the following workflow: 1. A developer is tasked with patching a vulnerability in a program binary -running on a device. How the user acquires a copy of the binary (e.g. -downloaded from a vendor’s website, extracted from a network capture, extracted -directly from a device over serial port or JTAG, etc.) is not part of the project. - -2. The developer loads the binary into the open-source Ghidra interactive -decompiler. Developers will be enabled to leverage Ghidra’s features and plugins -to locate the function(s) to patch, though previous binary analysis expertise is -not required. We anticipate that developers will apply tools such as BinDiff or -BSim, rely on symbol names or debug information, or apply reverse engineering -techniques. - -The Patchestry workflow includes Ghidra because it is open-source and actively -maintained by the National Security Agency and because it supports a wide -variety of binary file formats (ELF, COFF, PE, etc.) and machine code languages -used by medical devices. Ghidra also implements a battery of heuristics that act -as good first guesses as to the locations and references between functions and -data in the binary. Although perfect identification/recovery of functions, data, -and data types in a binary is intractable, the value of interactivity in Ghidra -is that the human developer can fix incorrect conclusions drawn by the -decompiler’s heuristics. - -There are two reasons why Patchestry’s workflow does not allow the developer to -modify Ghidra’s decompilation output and then re-compile that into a patchable -representation. First, Ghidra’s decompilation is not guaranteed to be -syntactically correct or compilable. This can be mitigated through developer -effort; however, the level of effort increases with the complexity of and number -of references in the target function(s). 
Second, Ghidra’s heuristic -decompilation pipeline has been proven to be unfaithful with respect to the -execution semantics of the machine code. In the worst case, this could result in -a developer inadvertently introducing new vulnerabilities into the program -during the patching process. - -Despite Ghidra’s decompilation not being precise enough for recompilation, our -experience from AMP tells us that Ghidra’s decompilation is good enough to be a -productivity multiplier for developers trying to locate functions that need -patching. - -Moreover, the modular design of Patchestry affords the flexibility to seamlessly -integrate more formally rigorous decompilers and their representations in the -future, as their capabilities align with our technical requirements. Currently, -the majority of existing tools are predominantly of a research-oriented nature, -often concentrating on x86 architecture or even just its subset, which is -not sufficient for the diverse nature of software. - -3. After locating the relevant function(s) in Ghidra, the Patchestry plugin will -present the developer with an editable decompilation of the target function(s). -Patchestry’s decompilations will be sound and precise with respect to the -available information in Ghidra’s analysis database. Regardless of how small the -patch size could be, Patchestry will always formulate the problem at the -function granularity. There are theoretical and pragmatic reasons why -Patchestry’s minimum patch recompilation granularity is function-at-a-time. - -From a theoretical standpoint, function granularity patches enable Patchestry to -leverage stronger guarantees about the application binary interface (ABI). It is -only at the entry and exit points of a compiled function that higher-level, -human-readable types can be reliably mapped to low level machine locations -(registers, memory). - -Patchestry leverages the open-source Clang compiler, which can already -target relevant platforms. 
A restriction in compilers like Clang that -nonetheless favors our approach is that functions are the smallest compilable -unit of code. Our task in Patchestry is thus to convert code for recompilation -into LLVM IR functions, which Clang can convert to machine code. - -4. The developer edits the decompiled function(s), enacting the necessary -changes to patch the vulnerability in the decompiled code. Patchestry’s highest -level decompiled code (C-like) will look approximately similar, regardless of -the platform/architecture of the medical device software. This will help improve -developer productivity. Moreover, the meta-patch library will allow the -developer to automate the patching process. - -At this stage, the binary-level patch has not yet been formulated. What -particular changes are needed to patch a given vulnerability are beyond the -scope of the project and require an external tool. Patchestry will, however, -provide a library of “patch intrinsics” such as “add bounds check.” These will -be formulated as templates of meta-patches. - -A developer can make near arbitrary changes within the body of the decompiled -code (e.g. add, remove or replace its portions). Although Patchestry aims to -provide verifiable guarantees about feature- and bug-compatibility of its -decompilation with respect to the Ghidra database, absent contracts or -specifications about the intended behavior of the code, Patchestry cannot make -guarantees about the correctness of the edited decompilation. That is, -Patchestry cannot prevent a developer from introducing new flaws into the -binary, nor can it guarantee that a patch comprehensively fixes the root cause -of the vulnerability. - -To mitigate the problem of developer- or decompiler-introduced emergent -behaviors, Patchestry will allow developers to leverage model- and -contract-based software verification techniques. 
These techniques are normally -challenging to apply to lifting/decompilation due to a lack of end-to-end -visibility into the lifting process; usually the techniques only apply at the -very last stage, on the decompiled/lifted result. However, Patchestry’s approach -to decompilation is multi-level: decompilation progresses through a stage of -increasingly high-level IRs. By taking a multi-level approach, Patchestry can -instrument contracts at various stages of the process. - -5. Verification of contracts. To ensure the reliability of patched code along -with associated contracts, Patchestry offers a toolset for generating output -compatible with both static and dynamic analysis tools. The optimal choice for -this purpose is LLVM IR, given its verification confirms the fulfillment of -contracts before its compilation. Patchestry allows for easy integration of -LLVM-based analysis tools such as KLEE or SeaHorn, automating the verification -process. - -6. Patchestry formulates the patch by compiling developer-edited decompiled -function(s), and packages the patch for use by a binary patching tool. -Patchestry will utilize a pre-existing tool, such as Patcherex or OFRAK, to -enact the patch process, creating a new version of the binary. - -7. Finally, the developer will load the new version of the binary onto the -device. How the developer loads the new version of the binary is not part of the -project. + running on a device. + +2. The developer loads the binary into Ghidra and locates the function(s) to + patch using Ghidra's features, plugins, symbol names, or tools such as + BinDiff or BSim. Previous binary analysis expertise is not required. + +3. The Patchestry Ghidra plugin serializes the target function(s) to P-Code + JSON. The `patchir-decomp` tool then produces an editable C decompilation + that is sound and precise with respect to the available information in + Ghidra's analysis database. + +4. 
The developer edits the decompiled function(s) to patch the vulnerability. + Alternatively, the developer defines patches declaratively using YAML + specifications and the meta-patching library (see + [Patching Interface](#patching-interface)). + +5. Contracts are verified. Patchestry generates output compatible with LLVM-based + analysis tools such as KLEE or SeaHorn to ensure that the patched code + satisfies developer-defined contracts (see + [Contracts Interface](#contracts-interface)). + +6. Patchestry recompiles the patched function(s) through the MLIR pipeline to + LLVM IR, then to machine code. The resulting binary patch is packaged for + insertion into the original binary using a tool such as Patcherex or OFRAK. + +7. The developer loads the patched binary onto the device. + +## Patching Interface + +Patchestry provides a declarative YAML-based interface for specifying patches +and their application. This separates _what_ the patch does (the C code) from +_where_ and _how_ it is applied (the meta-patch configuration). + +### Patch Specification + +Patches are defined in a YAML library file. Each patch references a C source +file containing the patch implementation: + +```yaml +apiVersion: patchestry.io/v1 +metadata: + name: usb-security-patches + version: "1.0.0" + +patches: + - name: usb_endpoint_write_validation + id: "USB-PATCH-001" + description: "Validate USB endpoint write operations" + category: usb_security + severity: high + code_file: "patches/patch_usbd_ep_write_packet.c" + function_name: "patch::before::usbd_ep_write_packet" + parameters: + - name: usb_device + type: "usb_device_t*" + - name: buffer + type: "const void*" +``` + +### Meta-Patch Configuration + +Meta-patches define _where_ patches are applied using match rules and _how_ +they are applied using action modes. 
This enables automated, declarative +patching across the codebase: + +```yaml +meta_patches: + - name: usb_security_meta_patches + description: "Meta patches for USB security" + optimization: + - "inline-patches" + patch_actions: + - id: "USB-PATCH-001" + description: "Pre-validation security check" + match: + - name: "usbd_ep_write_packet" + kind: "operation" + function_context: + - name: "bl_usb__send_message" + action: + - mode: "apply_before" + patch_id: "USB-PATCH-001" + arguments: + - name: "operand_0" + source: "operand" + index: 0 +``` + +The instrumentation engine supports three action modes: + +- __`apply_before`__: Insert the patch function call before the matched + operation. +- __`apply_after`__: Insert the patch function call after the matched operation. +- __`replace`__: Replace the matched operation entirely with the patch function. + +Arguments to the patch function can be sourced from: + +- __`operand`__: An operand of the matched call or operation by index. +- __`variable`__: A local variable by name. +- __`symbol`__: A global symbol (variable or function) by name. +- __`constant`__: A literal constant value. +- __`return_value`__: The return value of the matched call. + +### Configuration File + +A top-level YAML configuration file ties together the target binary, patch +libraries, contract libraries, meta-patches, and meta-contracts, along with +an execution order: + +```yaml +apiVersion: patchestry.io/v1 +metadata: + name: "usb-security-deployment" +target: + binary: "firmware.bin" + arch: "ARM:LE:32" + +libraries: + - "patches/usb_security_patches.yaml" + - "contracts/usb_security_contracts.yaml" + +execution_order: + - "meta_patches::usb_security" + - "meta_contracts::usb_security" + +meta_patches: + - name: usb_security + # ... patch actions ... + +meta_contracts: + - name: usb_security + # ... contract actions ... 
+``` + +## Contracts Interface + +Patchestry provides contracts as a mechanism for specifying and verifying +correctness properties of patched code. Unlike patches, contracts do not alter +program state. There are two types of contracts: + +- __Runtime contracts__: Checks that persist in the compiled binary and execute + at runtime. These address unexpected states during execution and are written + as C functions referencing a `code_file`, similar to patches. +- __Static contracts__: Constraints used exclusively for formal verification. + These are checked by tools such as KLEE or SeaHorn and do not persist in the + compiled binary. Static contracts can specify preconditions and postconditions + using predicates. + +### Contract Specification + +Contracts are defined in YAML and can be either runtime or static: + +```yaml +contracts: + # Runtime contract: C code that executes at runtime + - name: "buffer_size_check" + type: RUNTIME + severity: high + code_file: "contracts/buffer_check.c" + function_name: "check_buffer_bounds" + parameters: + - name: buffer + type: "const void*" + - name: size + type: "size_t" + + # Static contract: formal predicate checked by verifier + - name: "nonnull_pointer" + type: STATIC + severity: critical + preconditions: + - id: "pre-001" + description: "Pointer argument must not be null" + pred: + kind: nonnull + target: arg0 +``` + +Static contract predicates support: + +- __`nonnull`__: Assert that a target is not null. +- __`relation`__: Assert a relational constraint (e.g., `arg0 <= value`). +- __`alignment`__: Assert pointer alignment. +- __`range`__: Assert that a value falls within a min/max range. +- __`expr`__: Assert an arbitrary expression. + +### Meta-Contract Configuration + +Meta-contracts define where contracts are applied, similar to meta-patches. +Contracts can be applied at three points: + +- __`apply_before`__: Check the contract before the matched function or + operation. 
+- __`apply_after`__: Check the contract after the matched function or operation. +- __`apply_at_entrypoint`__: Check the contract at the entry of the enclosing + function. ## Architecture @@ -249,7 +412,6 @@ developer interaction. The developer plays a key role, providing the binary pieces to be patched, a patch description, and instructions on how to apply these patches using the meta-programming framework (meta-patches). Contracts are similarly specified and applied by instrumentation using the same meta-language. -Utilizing state-of-the-art tools, we perform decompilation and program analysis. A significant architectural innovation is the MLIR Tower of IRs, which serves as the connecting element. This tower facilitates the association of @@ -257,75 +419,52 @@ representations between decompiled programs, such as from P-Code and compilable and structured representations like LLVM IR. The tower's modularity allows for the specification of any DSL for the decompiled program, with the only requirement being the translation of this DSL to a layer of the tower. In our -case, Ghidra's P-Code serves as a suitable starting point layer. However, this -modular design allows new decompilers to be integrated into Patchestry in the -future while preserving the rest of the architecture. +case, Ghidra's P-Code serves as a suitable starting point layer. This modular +design allows new decompilers to be integrated into Patchestry in the future +while preserving the rest of the architecture. Utilizing the same representation (MLIR dialects) for both the decompiled binary and the compiled patched version facilitates seamless instrumentation and inlining of patches, ultimately producing a patched MLIR (Tower of IRs). The tower's various abstraction layers enable precise specification of points of -interest, surpassing the limitations of a single representation. 
Additionally, -the tower abstracts away from the decompiled representation (P-Code), -facilitating modular design in the future. - -Contract handling follows a similar pattern. Described in a C-like language, -contracts can take the form of static or runtime assertions or error handlers. -These are inserted into the code while it is in the IR Tower form. Runtime -checks are then compiled and remain in the patched binary. Static contracts are -checked using a formal verifier. The flexibility to invent new contract -mechanisms according to specific needs is a key feature. - -In the verification phase, which is the final step, Patchestry is designed to -accommodate various verification methods. The Tower allows to produce a -customized representation for the analysis, but it is advisable to stick to the -same representation as the compilation (such as LLVM IR) to prevent errors -during translation. Slicing the codebase into independent parts influenced by -the patch makes LLVM-based static analysis of the representation with contracts -tractable. We expect that most of the patches being local influence only a small -part of the program, therefore using the dependency analysis, we can isolate the -part of the program that needs to be verified. - -## CVE-2021-22156 Patching +interest, surpassing the limitations of a single representation. + +In the verification phase, Patchestry is designed to accommodate various +verification methods. The Tower can produce a customized representation for the +analysis, but it is advisable to stick to the same representation as the +compilation (such as LLVM IR) to prevent errors during translation. Slicing the +codebase into independent parts influenced by the patch makes LLVM-based static +analysis tractable. Since most patches are local and influence only a small part +of the program, dependency analysis can isolate the part of the program that +needs to be verified. 
+ +## Example: CVE-2021-22156 Patching An example of patching the CVE-2021-22156 vulnerability, addressing an integer overflow within the `calloc()` function of the C standard library. This vulnerability affects versions of the BlackBerry QNX Software Development -Platform (SDP) up to version 6.5.0SP1, QNX OS for Medical up to version 1.1, and -QNX OS for Safety up to version 1.0.1. A malicious actor could exploit this -integer overflow issue to execute arbitrary code or initiate a denial of service -attack. - -Consider the vulnerable code snippet below. Here, a user-defined function, -`get_num_elements()`, is employed to determine the size requirements for a dynamic -array of long integers assigned to the variable num_elements. During the -allocation of the buffer using `calloc()`, the variable num_elements is multiplied -by sizeof(long) to calculate the overall size requirements. If the resulting -multiplication exceeds the representable range of `size_t`, `calloc()` may allocate -a zeroed buffer of insufficient size. Subsequently, when data is copied into -this buffer, an overflow may occur, posing a potential security risk. - -A vulnerable code in which the standard library function `calloc()` may allocate -a zeroed buffer of insufficient size: - -```cpp -size_t num_elements = get_num_elements(); // from the outside environment +Platform (SDP) up to version 6.5.0SP1, QNX OS for Medical up to version 1.1, +and QNX OS for Safety up to version 1.0.1. 
-long *buffer = (long *)calloc(num_elements, sizeof(long));
+Vulnerable code in which `calloc()` may allocate a zeroed buffer of
+insufficient size:
+```c
+size_t num_elements = get_num_elements();
+long *buffer = (long *)calloc(num_elements, sizeof(long));
 if (buffer == NULL) {
     /* Handle error condition */
 }
 ```
-The desired result of applying Patchestry to fix the vulnerable code:
+The desired result after applying the patch:
-```cpp
-size_t num_elements = get_num_elements(); // from the outside environment
+```c
+size_t num_elements = get_num_elements();
 /* Patch start */
 if (num_elements > SIZE_MAX/sizeof(long)) {
-    /* Handle error condition */
+    /* Handle error condition */
 }
 /* Patch end */
@@ -334,61 +473,91 @@ if (buffer == NULL) {
     /* Handle error condition */
     return;
 }
-
 ```
-To create this simple patch, we require two essential components: the patch
-itself and the specific locations where the patch should be applied. In
-Patchestry, we offer developers a library, allowing them to articulate these
-components using familiar C-like syntax:
+The patch is written as a C function:
-```cpp
-// 1. Patch in C
-[[CVE-2021-22156]] void patch(size_t num_elements) {
+```c
+// patches/cve_2021_22156_patch.c
+void patch_calloc_overflow(size_t num_elements) {
     if (num_elements > SIZE_MAX/sizeof(long)) {
-        /* Handle error condition */
+        /* Handle error condition */
     }
 }
+```
-// 2. Developer-defined meta-patch transformation
-void meta_patch(source_module_t module) {
-    for (const callsite_t &place : module.calls("calloc")) {
-        place.apply_before("CVE-2021-22156::patch", { place.operand(1) });
-    }
-}
+The meta-patch configuration in YAML describes where and how to apply it:
+
+```yaml
+# Patch library
+patches:
+  - name: calloc_overflow_check
+    id: "CVE-2021-22156"
+    description: "Integer overflow check before calloc"
+    severity: high
+    code_file: "patches/cve_2021_22156_patch.c"
+    function_name: "patch_calloc_overflow"
+    parameters:
+      - name: num_elements
+        type: "size_t"
+
+# Meta-patch: apply before every call to calloc
+meta_patches:
+  - name: cve_2021_22156_meta
+    patch_actions:
+      - id: "CVE-2021-22156-ACTION"
+        match:
+          - name: "calloc"
+            kind: "operation"
+        action:
+          - mode: "apply_before"
+            patch_id: "CVE-2021-22156"
+            arguments:
+              - name: "num_elements"
+                source: "operand"
+                index: 0
 ```
-Patchestry introduces a metaprogramming interface enabling developer-defined
-code transformations. Patchestry’s interface supports common patching operations
-such as code insertion, replacement, alteration, and deletion. To obtain the
-source module from the program binary, we leverage state-of-the-art decompilers
-and seamlessly integrate them into the MLIR source representation, as elaborated
-later. Patchestry’s source representation can be modified and queried through
-the metaprogramming API.
-
-The second crucial aspect of Patchestry involves providing assurances regarding
-patches. Similar to the patching process, we empower developers to define
-contracts in our C-like language, and seamlessly embed these checks into the
-binary. Unlike patches, contracts are mandated not to alter program state. There
-are two types of contracts in Patchestry: runtime contracts, addressing
-unexpected states during runtime, and static contracts, exclusively used for
-formal verification without persisting in the compiled binary.
-
-```cpp
-// Contract
-[[CVE-2021-22156]] void contract(size_t num_elements) {
+Similarly, a contract can be defined to verify the property holds. A runtime
+contract is written in C:
+
+```c
+// contracts/cve_2021_22156_contract.c
+void check_calloc_bounds(size_t num_elements) {
     assert(num_elements <= SIZE_MAX/sizeof(long));
 }
-// Meta-contract
-void meta_contract(source_module_t module) {
-    for (const callsite_t &place : module.calls("calloc")) {
-        place.apply_before("CVE-2021-22156::contract", { place.operand(1) });
-    }
-}
+```
+And its meta-contract specifies where to apply it:
+
+```yaml
+contracts:
+  - name: calloc_bounds_contract
+    type: RUNTIME
+    severity: high
+    code_file: "contracts/cve_2021_22156_contract.c"
+    function_name: "check_calloc_bounds"
+    parameters:
+      - name: num_elements
+        type: "size_t"
+
+meta_contracts:
+  - name: cve_2021_22156_contract_meta
+    contract_actions:
+      - id: "CVE-2021-22156-CONTRACT"
+        match:
+          - name: "calloc"
+            kind: "function"
+            function_context:
+              - name: "*"
+        action:
+          - mode: "apply_before"
+            contract_id: "calloc_bounds_contract"
+            arguments:
+              - name: "num_elements"
+                source: "operand"
+                index: 0
 ```
-An example of Patchestry contract that defines the expected functionality of the program
-under test written in the C-like Patchestry contracts DSL. A contract is similar
-to a regression test or a behavioral assertion that an analysis tool like KLEE
-would check.
\ No newline at end of file
+A contract is similar to a regression test or a behavioral assertion that an
+analysis tool like KLEE or SeaHorn would check.
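The effect the `apply_before` patch is meant to have at a `calloc` call site can be sketched as a standalone C helper. This is only an illustration under stated assumptions: `mul_would_overflow` and `alloc_long_buffer` are hypothetical names that do not appear in the README or in Patchestry, and returning `NULL` stands in for the README's unspecified "Handle error condition" step.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Patchestry): reports whether
 * count * elem_size would exceed SIZE_MAX. */
static int mul_would_overflow(size_t count, size_t elem_size) {
    return elem_size != 0 && count > SIZE_MAX / elem_size;
}

/* Hypothetical wrapper showing the patched call site: the overflow
 * guard runs before calloc, mirroring the "apply_before" mode. */
long *alloc_long_buffer(size_t num_elements) {
    /* Patch start */
    if (mul_would_overflow(num_elements, sizeof(long))) {
        return NULL; /* assumed error handling: refuse the allocation */
    }
    /* Patch end */
    return (long *)calloc(num_elements, sizeof(long));
}
```

A caller treats a `NULL` result from `alloc_long_buffer` the same way as a failed `calloc`, so the existing `if (buffer == NULL)` branch in the README example covers both cases.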
diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt
index 174853be..21576a88 100644
--- a/test/CMakeLists.txt
+++ b/test/CMakeLists.txt
@@ -20,11 +20,6 @@ add_lit_testsuite(ghidra-output-tests
     ${CMAKE_CURRENT_SOURCE_DIR}/ghidra
 )
-add_lit_testsuite(pcode-translation-tests
-    "Running pcode-translate tests"
-    ${CMAKE_CURRENT_SOURCE_DIR}/pcode-translate
-)
-
 add_lit_testsuite(patchir-decomp-tests
     "Running patchir-decomp tests"
     ${CMAKE_CURRENT_SOURCE_DIR}/patchir-decomp
@@ -41,11 +36,6 @@ add_test(NAME ghidra-output-tests
     --param BUILD_TYPE=$
 )
-add_test(NAME pcode-translation-tests
-    COMMAND lit -v -j 4 "${CMAKE_CURRENT_BINARY_DIR}/pcode-translate"
-    --param BUILD_TYPE=$
-)
-
 add_test(NAME patchir-decomp-tests
     COMMAND lit -v -j 4 "${CMAKE_CURRENT_BINARY_DIR}/patchir-decomp"
     --param BUILD_TYPE=$
diff --git a/test/lit.cfg.py b/test/lit.cfg.py
index dd72d166..dbd0383e 100644
--- a/test/lit.cfg.py
+++ b/test/lit.cfg.py
@@ -58,7 +58,6 @@ def patchestry_tool_path(tool):
     return os.path.join(*path, tool)
 config.decompiler_headless_tool = os.path.join(config.patchestry_script_dir, 'decompile-headless.sh')
-config.pcode_translate_tool = patchestry_tool_path('pcode-translate')
 config.json_strip_comments = os.path.join(config.test_scripts_dir, 'strip-json-comments.sh')
@@ -121,7 +120,7 @@ def get_compiler_command(arch):
     ToolSubst('%host_cc', command=config.host_cc),
     ToolSubst('%host_cxx', command=config.host_cxx),
     ToolSubst('%decompile-headless', command=config.decompiler_headless_tool),
-    ToolSubst('%pcode-translate', command=config.pcode_translate_tool),
+
     ToolSubst('%patchir-decomp', command=config.patchir_decomp_tool),
     ToolSubst('%patchir-transform', command=config.patchir_transform_tool),
     ToolSubst('%patchir-cir2llvm', command=config.patchir_cir2llvm_tool),
diff --git a/test/pcode-translate/function.json b/test/pcode-translate/function.json
deleted file mode 100644
index 45670694..00000000
--- a/test/pcode-translate/function.json
+++ /dev/null
@@ -1,17 +0,0 @@
-// RUN: bash %strip-json-comments %s | %pcode-translate --deserialize-pcode | %file-check %s
-{
-    // CHECK: pc.func @function
-    "name": "function",
-    "basic_blocks": [
-        {
-            // CHECK: pc.block @fisrt_block
-            "label": "fisrt_block",
-            "instructions": []
-        },
-        {
-            // CHECK: pc.block @second_block
-            "label": "second_block",
-            "instructions": []
-        }
-    ]
-}
\ No newline at end of file
diff --git a/test/pcode-translate/help.json b/test/pcode-translate/help.json
deleted file mode 100644
index 7319eda2..00000000
--- a/test/pcode-translate/help.json
+++ /dev/null
@@ -1,2 +0,0 @@
-// RUN: %pcode-translate --help | %file-check %s
-// CHECK: --deserialize-pcode
diff --git a/tools/CMakeLists.txt b/tools/CMakeLists.txt
index 2c5880c4..ba8943f0 100644
--- a/tools/CMakeLists.txt
+++ b/tools/CMakeLists.txt
@@ -3,7 +3,6 @@
 # This source code is licensed in accordance with the terms specified in the
 # LICENSE file found in the root directory of this source tree.
-add_subdirectory(pcode-translate)
 add_subdirectory(patchir-decomp)
 add_subdirectory(patchir-cir2llvm)
 add_subdirectory(patchir-transform)
diff --git a/tools/pcode-translate/CMakeLists.txt b/tools/pcode-translate/CMakeLists.txt
deleted file mode 100644
index a22157b8..00000000
--- a/tools/pcode-translate/CMakeLists.txt
+++ /dev/null
@@ -1,33 +0,0 @@
-# Copyright (c) 2024, Trail of Bits, Inc.
-#
-# This source code is licensed in accordance with the terms specified in the
-# LICENSE file found in the root directory of this source tree.
-
-set(LLVM_LINK_COMPONENTS
-    Support
-)
-
-
-add_executable(pcode-translate
-    main.cpp
-)
-
-llvm_update_compile_flags(pcode-translate)
-target_link_libraries(pcode-translate
-    PRIVATE
-    MLIRIR
-    MLIRParser
-    MLIRPass
-    MLIRTranslateLib
-    MLIRSupport
-    patchestry::ghidra
-)
-
-mlir_check_link_libraries(pcode-translate)
-
-if (PATCHESTRY_INSTALL)
-    install(TARGETS pcode-translate
-        DESTINATION ${CMAKE_INSTALL_BINDIR}
-        COMPONENT patchestry-tools
-    )
-endif()
diff --git a/tools/pcode-translate/main.cpp b/tools/pcode-translate/main.cpp
deleted file mode 100644
index bc711f6f..00000000
--- a/tools/pcode-translate/main.cpp
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- * Copyright (c) 2024, Trail of Bits, Inc.
- *
- * This source code is licensed in accordance with the terms specified in
- * the LICENSE file found in the root directory of this source tree.
- */
-
-#include
-#include
-
-#include
-
-int main(int argc, char **argv) {
-    patchestry::ghidra::register_pcode_translation();
-    return mlir::failed(mlir::mlirTranslateMain(argc, argv, "P-Code translation driver\n"));
-}