fn2hash - Function hashing and code similarity
fn2hash [--min-instructions=NUMBER] [--extra-data] [--basic-blocks] [--json=FILENAME] [--pretty-json[=INDENT]] [...Pharos options...] EXECUTABLE_FILE
fn2hash --help
fn2hash --rose-version
@PHAROS_OPTS_POD@
fn2hash calculates various function hashes for the functions in a program and dumps the data to stdout in the following CSV format:
filemd5, fn_addr, exact_hash, pic_hash, num_bytes, num_instructions, num_code_blocks, num_data_blocks
where those columns are:
- filemd5
-
The MD5 hash of the input file.
- fn_addr
-
The address of this function.
- exact_hash
-
The MD5 of the bytes of the function's instructions, concatenated in address order.
- pic_hash
-
Essentially the same as the exact_hash, but address references (except local relative ones) are replaced with zeros before hashing. The goal is to match functions that are identical except for references to locations in memory (other functions, imports, global data addresses, etc.) that might change between occurrences in different programs.
- num_bytes
-
The number of bytes that make up the instructions in the function.
- num_instructions
-
The number of instructions in the function.
- num_code_blocks
-
The number of code blocks (basic blocks) in the function.
- num_data_blocks
-
The number of data blocks in the function.
- exact_bytes
-
The bytes that generate the exact_hash when the MD5 is calculated. These are the bytes of the function's instructions in address order.
- pic_bytes
-
The bytes that generate the pic_hash when the MD5 is calculated. These are the bytes of the function instructions in address order, with bytes representing addresses replaced with zeros.
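The relationship between exact_hash and pic_hash can be illustrated with a minimal Python sketch. It is not fn2hash's implementation: the function name and the address_spans parameter are hypothetical, and the sketch assumes the caller already knows which operand bytes hold non-local addresses (fn2hash determines this through its own analysis).

```python
import hashlib

def exact_and_pic_hash(code, address_spans):
    """Return (exact_hash, pic_hash) as uppercase hex MD5 strings.

    code          -- the function's instruction bytes, already concatenated
    address_spans -- hypothetical list of (offset, length) pairs marking
                     the bytes that encode non-local address references
    """
    exact = hashlib.md5(code).hexdigest().upper()
    masked = bytearray(code)
    for offset, length in address_spans:
        # Replace the address bytes with zeros before hashing.
        masked[offset:offset + length] = bytes(length)
    pic = hashlib.md5(bytes(masked)).hexdigest().upper()
    return exact, pic

# Two copies of "mov eax, [addr]; ret" that differ only in the absolute address.
copy_a = bytes.fromhex("a100304000c3")
copy_b = bytes.fromhex("a100504100c3")
exact_a, pic_a = exact_and_pic_hash(copy_a, [(1, 4)])
exact_b, pic_b = exact_and_pic_hash(copy_b, [(1, 4)])
```

Under these assumptions the two exact hashes differ while the two PIC hashes match. A function with no address references gets the same value for both hashes.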
If the --json option is specified, fn2hash generates JSON output instead, with those same fields. When the --extra-data option is specified, the JSON output also contains the following additional fields:
- composite_pic_hash
-
A variant of the pic_hash that excludes the bytes of control flow related instructions. The MD5 of each basic block is computed separately (minus the control flow related bytes), the basic block MD5s are ordered and concatenated, and the resulting string is hashed. The goal is to tolerate minor differences in compiler output, such as the compiler choosing jz instead of jnz and reordering otherwise identical basic blocks as a result.
- cf_exact_hash
-
The control flow ordering version of the exact_hash. This ordering was the default ordering for the exact hash prior to June 2024. This ordering increased the complexity of the algorithm, but did not improve the usefulness of the hash. This hash is provided for backwards compatibility.
- cf_pic_hash
-
The control flow ordering version of the pic_hash. This ordering was the default ordering for the PIC hash prior to June 2024. This ordering increased the complexity of the algorithm, but did not improve the usefulness of the hash due to technical details of how the hash was calculated, specifically the inclusion of control flow instructions at the end of each block. This hash is provided for backwards compatibility.
- cf_exact_bytes
-
The bytes that generate the cf_exact_hash when the MD5 is calculated. These are the bytes of the function's instructions in an order that is dependent on the control flow of the function.
- cf_pic_bytes
-
The bytes that generate the cf_pic_hash when the MD5 is calculated. These are the bytes of the function instructions in an order that is dependent on the control flow of the function, with bytes representing addresses replaced with zeros.
- num_basic_blocks
-
The number of blocks (code and data blocks) in the function. The mildly confusing name is retained for backwards compatibility. The newer fields num_code_blocks and num_data_blocks should be used when possible.
- num_basic_blocks_in_cfg
-
The number of those blocks that are actually in the control flow graph of the function. This should generally match the number of code blocks in the function unless there is unusual control flow in the function.
- mnemonic_hash
-
Like the exact_hash but instead of concatenating the bytes of the instructions to hash, the mnemonics for the instructions are concatenated instead (without operands) and hashed.
- mnemonic_count_hash
-
This is a hash of a vector of ordered pairs of mnemonics and the number of occurrences of that mnemonic in the function.
- mnemonic_category_hash
-
Like the mnemonic_hash but the mnemonics are mapped to a smaller set of categories instead. The categories are:
- XFER
-
Data transfer insns (eg: mov, push, xchg).
- MATH
-
Arithmetic insns (eg: add, sub, lea).
- LOGIC
-
Bitwise operations (eg: and, or, not, xor, shl, ror).
- CMP
-
Comparison insns (eg: test, cmp).
- BR
-
Branching insns (eg: jmp, jcc, call).
- FLT
-
Floating point insns (eg: fadd, fmul, fld).
- SIMD
-
SIMD (MMX/SSE* related) insns (eg: addps, mulss, psadbw).
- CRYPTO
-
Insns that aid in cryptography (AES and SHA) calculations (eg: aesdec, sha256rnds2).
- VMM
-
Virtual Machine Monitor (hypervisor) related insns.
- SYS
-
Various "system" level and privileged insns (eg: int, sysenter).
- STR
-
String related insns (eg: movsb).
- I/O
-
Port related insns (eg: in, out, insb, outsb).
- UNCAT
-
Any insns that haven't been assigned to one of the above categories.
- mnemonic_category_counts_hash
-
Like mnemonic_count_hash but using the mnemonic categories instead of mnemonics.
- mnemonic_counts
-
The data used to generate the mnemonic count hash.
- mnemonic_category_counts
-
The data used to generate the mnemonic category counts hash.
- mnemonic_category_count_string
-
The actual vector used in mnemonic_category_count_hash.
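The composite_pic_hash construction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the per-block inputs are assumed to be already PIC-masked with control flow instruction bytes removed, and the choice to sort and join the uppercase hex digests is a guess at how "ordered and concatenated" is realized, so the values produced here are not expected to match fn2hash's output.

```python
import hashlib

def composite_pic_hash(block_bytes):
    """Hash a function from its per-block byte strings, ignoring block order.

    block_bytes -- iterable of bytes objects, one per basic block, assumed
                   to be PIC-masked and stripped of control flow insns
    """
    # Hash each basic block separately.
    block_hashes = [hashlib.md5(b).hexdigest().upper() for b in block_bytes]
    # Order the block hashes so the layout of the blocks does not matter.
    block_hashes.sort()
    # Hash the concatenation of the ordered block hashes.
    joined = "".join(block_hashes).encode("ascii")
    return hashlib.md5(joined).hexdigest().upper()

# The same two blocks laid out in a different order give the same result.
block_1 = bytes.fromhex("8b4508")
block_2 = bytes.fromhex("33c0")
same = composite_pic_hash([block_1, block_2]) == composite_pic_hash([block_2, block_1])
```

Sorting the per-block hashes is what makes the result insensitive to block reordering, which is the property the field description aims for.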
If the --basic-blocks option is specified in addition to --extra-data and --json, then the JSON output will also contain the following fields:
- opt_basic_block_data
-
Metadata about each basic block in the function.
- address
-
The starting address for the basic block.
- pic_hash
-
The PIC hash algorithm applied to the basic block.
- composite_pic_hash
-
The composite PIC hash algorithm (excluding control flow instructions) for the basic block.
- num_instructions
-
The number of instructions in the basic block.
- mnemonics
-
The category and mnemonic for each instruction in the block.
- opt_bb_cfg
-
This list describes the edges of the control flow graph for the function. Each entry is a pair of addresses indicating where control flow is from and to. Note that if there is only one basic block in the function, this list will be empty.
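As a hedged illustration of how the opt_bb_cfg edge list might be consumed, the sketch below turns a list of (from, to) address pairs into a successor map. The helper name is hypothetical, and the exact JSON representation of each pair (for example, hex strings versus integers) is an assumption, so the function accepts any hashable address values.

```python
from collections import defaultdict

def successors_from_edges(edges):
    """Build {from_address: [to_address, ...]} from (from, to) pairs."""
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    return dict(succ)

# Hypothetical edge list for a function with a simple if/then shape.
example_edges = [
    ("0x00401000", "0x00401010"),
    ("0x00401000", "0x00401020"),
    ("0x00401010", "0x00401020"),
]
example_cfg = successors_from_edges(example_edges)
```

A function with a single basic block has an empty edge list, which yields an empty map.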
Because the file MD5 is the first column in the output, the fn2hash output for multiple files can easily be combined if desired. This can be convenient when working with data from related sets of files.
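One use of combined output is to look for functions shared between files. The following Python sketch assumes the combined CSV has no header row and uses the eight columns listed earlier; the function name is hypothetical, and the second sample row (with an all-zero file MD5) is a made-up placeholder rather than real fn2hash output.

```python
import csv
from collections import defaultdict

COLUMNS = ["filemd5", "fn_addr", "exact_hash", "pic_hash",
           "num_bytes", "num_instructions", "num_code_blocks",
           "num_data_blocks"]

def shared_pic_hashes(lines):
    """Map each pic_hash seen in more than one file to the sorted file MD5s."""
    files_by_hash = defaultdict(set)
    for row in csv.DictReader(lines, fieldnames=COLUMNS):
        files_by_hash[row["pic_hash"]].add(row["filemd5"])
    return {h: sorted(f) for h, f in files_by_hash.items() if len(f) > 1}

combined = [
    # Row taken from the ooex1.exe example output.
    "1D28A053B8AD17E213278A05C5709E33,0x00411550,"
    "8E0F497DD360FC22C70B30C750A4727C,8E0F497DD360FC22C70B30C750A4727C,61,24,1,1",
    # Hypothetical row standing in for a second file.
    "00000000000000000000000000000000,0x00401000,"
    "8E0F497DD360FC22C70B30C750A4727C,8E0F497DD360FC22C70B30C750A4727C,61,24,1,1",
]
matches = shared_pic_hashes(combined)
```

In practice the lines would come from an open file holding the concatenated fn2hash output, for example `shared_pic_hashes(open("combined.csv", newline=""))`, where combined.csv is a placeholder file name.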
The following options are specific to the fn2hash program.
- --min-instructions=NUMBER, -m=NUMBER
-
The minimum number of instructions a function must contain for its data to be output. Functions below this instruction count will not appear in the output.
- --extra-data, -E
-
This option enables reporting of additional data in the JSON output (see description above for the details of the additional fields).
- --basic-blocks, -B
-
The -B option adds basic block data to the JSON output. The --extra-data option must also be specified.
- --json=FILENAME, -j=FILENAME
-
Output hash information in JSON format to FILENAME. If FILENAME is -, JSON will be written to stdout.
- --pretty-json[=INDENT], -p[=INDENT]
-
When outputting JSON, use newlines and indentation, making the output human-readable. INDENT is the indentation level, and defaults to 4.
@PHAROS_OPTIONS_POD@
$ fn2hash tests/ooex_vs2010/Debug/ooex1.exe >ooex1.fn2hash.csv
$ head -n3 ooex1.fn2hash.csv
1D28A053B8AD17E213278A05C5709E33,0x00411550,8E0F497DD360FC22C70B30C750A4727C,8E0F497DD360FC22C70B30C750A4727C,61,24,1,1
1D28A053B8AD17E213278A05C5709E33,0x004115A0,26E9F99BDB8200116761D199F8F08D16,26E9F99BDB8200116761D199F8F08D16,54,23,1,1
1D28A053B8AD17E213278A05C5709E33,0x004117FC,7651D12A0F1701880EB69780CC9DEF85,89047698F4380796A13F674942384C0D,6,1,1,0
$ fn2hash --pretty-json=2 --json=ooex1.fn2hash.json tests/ooex_vs2010/Debug/ooex1.exe
$ head -n12 ooex1.fn2hash.json
{
"analysis": [
{
"exact_hash": "8E0F497DD360FC22C70B30C750A4727C",
"filemd5": "1D28A053B8AD17E213278A05C5709E33",
"fn_addr": "0x00411550",
"num_bytes": 61,
"num_code_blocks": 1,
"num_data_blocks": 1,
"num_instructions": 24,
"pic_hash": "8E0F497DD360FC22C70B30C750A4727C"
},
$ fn2hash --pretty-json=4 --extra-data --basic-blocks --json=ooex1.fn2hash.json tests/ooex_vs2010/Debug/ooex1.exe
$ file ooex1.fn2hash.json
ooex1.fn2hash.json: JSON data
@PHAROS_ENV_POD@
@PHAROS_FILES_POD@
Written by the Software Engineering Institute at Carnegie Mellon University. The primary author was Charles Hines.
Copyright 2024 Carnegie Mellon University. All rights reserved. This software is licensed under a "BSD" license. Please see LICENSE.txt for details.
See fse.py and possibly fn2yara.