|
1 | 1 | (* ============================================================================= |
2 | | - CodeHawk Binary Analyzer |
| 2 | + CodeHawk Binary Analyzer |
3 | 3 | Author: Henny Sipma |
4 | 4 | ------------------------------------------------------------------------------ |
5 | 5 | The MIT License (MIT) |
6 | | - |
7 | | - Copyright (c) 2022-2024 Aarno Labs, LLC |
| 6 | +
|
| 7 | + Copyright (c) 2022-2025 Aarno Labs, LLC |
8 | 8 |
|
9 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy |
10 | 10 | of this software and associated documentation files (the "Software"), to deal |
11 | 11 | in the Software without restriction, including without limitation the rights |
12 | 12 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
13 | 13 | copies of the Software, and to permit persons to whom the Software is |
14 | 14 | furnished to do so, subject to the following conditions: |
15 | | - |
| 15 | +
|
16 | 16 | The above copyright notice and this permission notice shall be included in all |
17 | 17 | copies or substantial portions of the Software. |
18 | | - |
| 18 | +
|
19 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
20 | 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
21 | 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
|
25 | 25 | SOFTWARE. |
26 | 26 | ============================================================================= *) |
27 | 27 |
|
28 | | -(** Sequence of consecutive assembly instructions that represents a semantic unit. |
29 | | - |
30 | | - Examples: |
31 | | - - switch statement constructed by |
32 | | - - table branch instructions (TBB/TBH), or |
33 | | - - load from table into pc |
34 | | - - indirect jump from table |
35 | | - |
36 | | - - question expression constructed by Thumb-2 if-then instruction |
37 | | - |
38 | | - In translation to CHIF, semantics are obtained from the anchor, all other |
39 | | - instructions belonging to the aggregate are ignored. |
| 28 | + |
| 29 | +(** {1 Instruction aggregates: Overview} *) |
| 30 | + |
| 31 | +(** {2 Description} |
| 32 | +
|
| 33 | + An instruction aggregate is a (usually, but not always, contiguous) |
| 34 | + sequence of two or more instructions that is treated as a single |
| 35 | + semantic unit. The semantics for the entire sequence is assigned to |
| 36 | + one of the instructions (often the last one in the sequence) and all |
| 37 | + other instructions are considered no-ops by all subsequent analyses. |
| 38 | +
|
| 39 | + Instruction aggregates cover a variety of constructs ranging from |
| 40 | + predicate and ternary assignments to jump tables. |
| 41 | +
|
| 42 | + {2 Motivation} |
| 43 | +
|
| 44 | + The rationale behind combining a sequence of instructions into an |
| 45 | + aggregate is that the collective action of the individual instructions |
| 46 | + combined is not easily apparent from considering their semantics in |
| 47 | + isolation. The instructions usually collaborate in a very specific |
| 48 | + way to achieve a particular result, with operands from different instructions |
| 49 | + playing a specific role. This is especially true of the jump |
| 50 | + table aggregates. For some of the other aggregate kinds the semantics |
| 51 | + of the individual instructions combined would be similar, but analysis |
| 52 | + of the aggregate is more efficient and precise. For example, this is |
| 53 | + the case with the predicate assignment: |
| 54 | + {[ |
| 55 | + <test> |
| 56 | + MOVcc Rx, #1 |
| 57 | + MOVcc' Rx, #0 where cc' is (not cc) |
| 58 | + ]} |
| 59 | + The two MOV instructions can be combined into one assignment of the |
| 60 | + boolean <test-cc>: |
| 61 | + {[ |
| 62 | + Rx := <test-cc> |
| 63 | + ]} |
| 64 | + Translating and analyzing the two instructions in isolation produces |
| 65 | + two joins, and potentially three reaching definitions for a subsequent |
| 66 | + use of Rx. In contrast, the two instructions combined into one |
| 67 | + predicate assignment produce zero joins, and only a single reaching |
| 68 | + definition for a subsequent use of Rx, a considerable improvement. |
| 69 | +
|
| 70 | + {2 Identification} |
| 71 | +
|
| 72 | + Instruction aggregates are identified during disassembly by pattern |
| 73 | + matching. Identification is entirely syntactic, so only the information |
| 74 | + present in the instructions themselves, such as opcode and operands, |
| 75 | + is used, not the possible values that those operands may have, unless |
| 76 | + they are immediates. |
| 77 | +
|
| 78 | + {2 Representation} |
| 79 | +
|
| 80 | + The aggregate is represented by objects of the class type |
| 81 | + [arm_instruction_aggregate_int]. These objects contain the kind of |
| 82 | + aggregate, a list of the contributing instructions, and information |
| 83 | + about its entry, exit, and anchor address. The anchor is the |
| 84 | + instruction to which the full semantics of the aggregate is assigned. |
| 85 | +
|
| 86 | + Cross references to instances are recorded in the [arm_assembly_instruction_int] |
| 87 | + instances of the instructions that make up the aggregate to enable |
| 88 | + triggering proper translation and analysis. The collection of all |
| 89 | + aggregates themselves is maintained in the singleton object of type |
| 90 | + [arm_assembly_instructions_int]. |
| 91 | +
|
| 92 | + Most aggregates are contained within a single basic block and do not |
| 93 | + affect control flow. The exceptions are jump tables, whose components |
| 94 | + are directly incorporated into the CFG during disassembly and require |
| 95 | + no further semantic handling. |
| 96 | +
|
| 97 | + {2 Translation into CHIF} |
| 98 | +
|
| 99 | + The semantics of the aggregate is explicated in the translation into |
| 100 | + CHIF. In general, instructions are individually translated into CHIF |
| 101 | + (in the function [translate_arm_instruction]). In the case of aggregates |
| 102 | + CHIF is generated for the full semantics at the anchor instruction, |
| 103 | + while all other instructions in the aggregate are treated as |
| 104 | + no-ops. In the case of jump tables the full semantics is already |
| 105 | + incorporated in the CFG during disassembly and all instructions in |
| 106 | + the aggregate are treated as no-ops. |
| 107 | +
|
| 108 | + {2 Transfer to front end} |
| 109 | +
|
| 110 | + The kind and structure of aggregates and analysis results involving |
| 111 | + their components are communicated to the (python) front end in the |
| 112 | + instruction results data contained in the function opcode dictionary |
| 113 | + (see bCHFnARMDictionary). The format varies with the kind of aggregate, |
| 114 | + as each kind has different types of values that characterize its |
| 115 | + operation. In the (python) front end these values are made accessible |
| 116 | + to the various instructions in the [InstrXData] objects created for |
| 117 | + each instruction. The documentation of each aggregate kind below and |
| 118 | + elsewhere presents more details on the actual format for each of these. |
| 119 | +
|
| 120 | + {2 Tests} |
| 121 | +
|
| 122 | + Currently only two of the kinds of aggregates have associated unit tests: |
| 123 | + - bCHARMJumpTableTest |
| 124 | + - bCHThumbITSequenceTest |
| 125 | + These tests contain instances of code fragments encountered in actual |
| 126 | + binaries and give some insight into their use. |
| 127 | +
|
| 128 | +
|
| 129 | + {1 Predicate Assignments} |
| 130 | +
|
| 131 | + {2 Description} |
| 132 | +
|
| 133 | + A predicate assignment aggregate consists of two assignment instructions |
| 134 | + that combine into a single assignment of a boolean predicate. The pattern |
| 135 | + for the aggregate is two adjacent MOV instructions of the form: |
| 136 | + {[ |
| 137 | + <test> |
| 138 | + MOVcc Rx, #1 |
| 139 | + MOVcc' Rx, #0 where cc' is (not cc) |
| 140 | + ]} |
| 141 | + or |
| 142 | + {[ |
| 143 | + <test> |
| 144 | + MOVcc' Rx, #0 |
| 145 | + MOVcc Rx, #1 where cc is (not cc') |
| 146 | + ]} |
| 147 | + In all executions exactly one of these two MOV instructions is executed. |
| 148 | + If the joint condition <test-cc> is true, Rx is assigned 1; if it is false, |
| 149 | + Rx is assigned 0, and thus the postcondition of these two instructions |
| 150 | + is essentially the assignment of the boolean <test-cc>. |
| 151 | +
|
| 152 | + {3 Example} |
| 153 | +
|
| 154 | + {[ |
| 155 | + <CMP R0, #5> |
| 156 | + MOVNE R1, #0 |
| 157 | + MOVEQ R1, #1 |
| 158 | + ]} |
| 159 | + is translated into |
| 160 | + {[ R1 := (R0 == 5) ]} |
| 161 | +
|
| 162 | + {2 Representation} |
| 163 | +
|
| 164 | + The [arm_instruction_aggregate] instance of the predicate assignment |
| 165 | + records the address of the second instruction as the anchor, the |
| 166 | + destination operand, and whether the <test-cc> predicate is to be |
| 167 | + inverted, where cc is the condition code for the anchor instruction. |
| 168 | + The predicate is to be inverted if the anchor instruction assigns 0. |
| 169 | +
|
| 170 | + {2 Translation to CHIF} |
| 171 | +
|
| 172 | + The first MOV instruction is translated into a no-op. The second |
| 173 | + MOV instruction is translated into the assignment |
| 174 | + {[ Rx := [not] predicate ]} |
| 175 | + with Rx as the single defined variable and, as used variables, all |
| 176 | + variables referenced in the predicate. |
| 177 | +
|
| 178 | + If a predicate cannot be constructed or derived, the destination |
| 179 | + variable Rx is abstracted. |
| 180 | +
|
| 181 | + {2 Transfer to front end} |
| 182 | +
|
| 183 | + The first instruction is tagged as 'subsumed' by the second |
| 184 | + instruction, and thus to be treated as a no-op. The second |
| 185 | + instruction is tagged with 'agg:predassign' with a list of |
| 186 | + dependents (the first instruction in this case). It lists as |
| 187 | + a variable the destination operand, along with its (high) |
| 188 | + uses. If a predicate was constructed, it lists |
| 189 | + (possibly the inverse of) the predicate in its original and |
| 190 | + rewritten form, along with its reaching definitions. |
40 | 191 | *) |
41 | 192 |
|
42 | 193 | (* bchlib *) |
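
The syntactic identification of a predicate assignment described in the new documentation can be illustrated with a minimal sketch. Everything in it is hypothetical: the types [cond], [instr] and [predassign] and the function [match_predassign] are simplified stand-ins invented for illustration, not the CodeHawk definitions. It assumes the caller has already established that the two MOV instructions are adjacent and follow the same test.

```ocaml
(* Hypothetical, simplified types; NOT the actual CodeHawk data structures. *)
type cond = EQ | NE | LT | GE | GT | LE

type instr = {
  addr : int;     (* instruction address *)
  cc : cond;      (* condition code of the conditional MOV *)
  dst : string;   (* destination register name *)
  imm : int;      (* immediate source operand *)
}

type predassign = {
  anchor : int;     (* address of the second MOV, which carries the semantics *)
  subsumed : int;   (* address of the first MOV, treated as a no-op *)
  dest : string;    (* destination register Rx *)
  inverted : bool;  (* true if the anchor instruction assigns 0 *)
}

(* Logical negation of a condition code (subset of ARM codes only). *)
let negate = function
  | EQ -> NE | NE -> EQ
  | LT -> GE | GE -> LT
  | GT -> LE | LE -> GT

(* Purely syntactic match: same destination, complementary condition
   codes, and immediates 0 and 1 in either order. Adjacency of i1 and i2
   is assumed to have been checked by the caller. *)
let match_predassign (i1 : instr) (i2 : instr) : predassign option =
  if i1.dst = i2.dst
     && i2.cc = negate i1.cc
     && ((i1.imm = 1 && i2.imm = 0) || (i1.imm = 0 && i2.imm = 1))
  then
    Some { anchor = i2.addr;
           subsumed = i1.addr;
           dest = i1.dst;
           inverted = (i2.imm = 0) }
  else
    None
```

Applied to the documented example (MOVNE R1, #0 followed by MOVEQ R1, #1), the sketch selects the second MOV as the anchor with no inversion, which corresponds to R1 := (R0 == 5) when the preceding test is CMP R0, #5.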
|