Commit d123265

fixup! feat(math): introduce Chi Square entropy in EntropyReport
1 parent 1d264d6 commit d123265

9 files changed: +183 -162 lines changed

docs/guide.md

Lines changed: 59 additions & 51 deletions
@@ -114,12 +114,18 @@ $ cat alpine-report.json
 ]
 ```
 
-### Entropy calculation
+### Randomness calculation
 
 If you are analyzing an unknown file format, it might be useful to know the
-entropy of the contained files, so you can quickly see for example whether the
+randomness of the contained files, so you can quickly see for example whether the
 file is **encrypted** or contains some random content.
 
+Two values are calculated as part of randomness measurements:
+- Shannon's entropy
+- χ² probability
+
+You can find detailed information about both measures [here](https://www.fourmilab.ch/random/).
+
 Let's make a file with fully random content at the start and end:
 
 ```console
@@ -128,59 +134,61 @@ $ dd if=/dev/random of=random2.bin bs=10M count=1
 $ cat random1.bin alpine-minirootfs-3.16.1-x86_64.tar.gz random2.bin > unknown-file
 ```
 
-A nice ASCII entropy plot is drawn on verbose level 3:
+A nice ASCII randomness plot is drawn on verbose level 3:
 
 ```console
 $ unblob -vvv unknown-file | grep -C 15 "Entropy distribution"
 
-2022-07-30 07:58.16 [debug ] Ended searching for chunks all_chunks=[0xa00000-0xc96196] pid=19803
-2022-07-30 07:58.16 [debug ] Removed inner chunks outer_chunk_count=1 pid=19803 removed_inner_chunk_count=0
-2022-07-30 07:58.16 [warning ] Found unknown Chunks chunks=[0x0-0xa00000, 0xc96196-0x1696196] pid=19803
-2022-07-30 07:58.16 [info ] Extracting unknown chunk chunk=0x0-0xa00000 path=unknown-file_extract/0-10485760.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Carving chunk path=unknown-file_extract/0-10485760.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Calculating entropy for file path=unknown-file_extract/0-10485760.unknown pid=19803 size=0xa00000
-2022-07-30 07:58.16 [debug ] Entropy calculated highest=99.99 lowest=99.98 mean=99.98 pid=19803
-2022-07-30 07:58.16 [warning ] Drawing plot pid=19803
-2022-07-30 07:58.16 [debug ] Entropy chart chart=
-Entropy distribution
-┌---------------------------------------------------------------------------┐
-100┤•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••│
-90┤ │
-80┤ │
-70┤ │
-60┤ │
-50┤ │
-40┤ │
-30┤ │
-20┤ │
-10┤ │
-0┤ │
-└┬---┬---┬---─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬┘
-1 4 7 12 16 20 24 29 33 37 41 46 50 54 59 63 67 71 76 80
-[y] entropy % [x] mB
-pid=19803
-2022-07-30 07:58.16 [info ] Extracting unknown chunk chunk=0xc96196-0x1696196 path=unknown-file_extract/13197718-23683478.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Carving chunk path=unknown-file_extract/13197718-23683478.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Calculating entropy for file path=unknown-file_extract/13197718-23683478.unknown pid=19803 size=0xa00000
-2022-07-30 07:58.16 [debug ] Entropy calculated highest=99.99 lowest=99.98 mean=99.98 pid=19803
-2022-07-30 07:58.16 [warning ] Drawing plot pid=19803
-2022-07-30 07:58.16 [debug ] Entropy chart chart=
-Entropy distribution
-┌---------------------------------------------------------------------------┐
-100┤•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••│
-90┤ │
-80┤ │
-70┤ │
-60┤ │
-50┤ │
-40┤ │
-30┤ │
-20┤ │
-10┤ │
-0┤ │
-└┬---┬---┬---─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬┘
-1 4 7 12 16 20 24 29 33 37 41 46 50 54 59 63 67 71 76 80
-[y] entropy % [x] mB
+2024-10-30 10:52.03 [debug ] Calculating chunk for pattern match handler=arc pid=1963719 real_offset=0x1685f5b start_offset=0x1685f5b
+2024-10-30 10:52.03 [debug ] Header parsed header=<arc_head archive_marker=0x1a, header_type=0x1, name=b'8\xa7i&po\xc77\xd5h\x9a\x9d\xf1', size=0x26d171fa, date=0x1bfd, time=0xe03f, crc=-0x3b95, length=0x349997d5> pid=1963719
+2024-10-30 10:52.03 [debug ] Ended searching for chunks all_chunks=[0xa00000-0xc96196] pid=1963719
+2024-10-30 10:52.03 [debug ] Removed inner chunks outer_chunk_count=1 pid=1963719 removed_inner_chunk_count=0
+2024-10-30 10:52.03 [warning ] Found unknown Chunks chunks=[0x0-0xa00000, 0xc96196-0x1696196] pid=1963719
+2024-10-30 10:52.03 [info ] Extracting unknown chunk chunk=0x0-0xa00000 path=unknown-file_extract/0-10485760.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Carving chunk path=unknown-file_extract/0-10485760.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Calculating randomness for file path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Shannon entropy calculated block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Chi square probability calculated block_size=0x20000 highest=97.88 lowest=3.17 mean=52.76 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Entropy chart chart=
+Randomness distribution
+┌───────────────────────────────────────────────────────────────────────────┐
+100┤ •• Shannon entropy (%) •••••••••♰••••••••••••••••••••••••••••••••••│
+90┤ ♰♰ Chi square probability (%) ♰ ♰ ♰♰♰♰ ♰ ♰ ♰ │
+80┤♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰♰♰♰♰♰♰♰ ♰ ♰♰♰♰♰♰ ♰♰ ♰♰ │
+70┤♰♰♰♰ ♰ ♰ ♰ ♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰♰♰♰♰♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰♰ ♰♰♰♰♰♰ │
+60┤♰♰♰♰ ♰♰ ♰♰ ♰ ♰♰♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰♰♰♰♰♰ ♰♰ ♰ ♰ ♰♰♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰♰ │
+50┤ ♰♰♰ ♰♰ ♰♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰ ♰ ♰ ♰ ♰♰♰♰♰ ♰ ♰♰♰ ♰ ♰♰♰♰♰ ♰ │
+40┤ ♰♰ ♰♰ ♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰ ♰♰ ♰♰ ♰♰♰♰♰♰ ♰ ♰♰♰ ♰ ♰♰♰♰ ♰♰ ♰│
+30┤ ♰ ♰♰ ♰♰ ♰♰♰♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰♰♰ ♰ ♰ ♰♰ ♰ ♰♰♰ ♰♰ ♰ │
+20┤ ♰♰ ♰♰ ♰♰♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ │
+10┤ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰ ♰ ♰♰ │
+0┤ ♰ ♰ │
+└─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
+0 2 5 7 11 16 20 23 27 30 34 38 42 47 51 56 60 63 68 71 76 79
+131072 bytes
+path=unknown-file_extract/0-10485760.unknown pid=1963719
+2024-10-30 10:52.03 [info ] Extracting unknown chunk chunk=0xc96196-0x1696196 path=unknown-file_extract/13197718-23683478.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Carving chunk path=unknown-file_extract/13197718-23683478.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Calculating randomness for file path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Shannon entropy calculated block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Chi square probability calculated block_size=0x20000 highest=99.03 lowest=0.23 mean=42.62 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Entropy chart chart=
+Randomness distribution
+┌───────────────────────────────────────────────────────────────────────────┐
+100┤ •• Shannon entropy (%) •••••••••••••••••••••♰••••••••••••••••••••••│
+90┤ ♰♰ Chi square probability (%) ♰ ♰♰ ♰ │
+80┤♰♰ ♰♰ ♰♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰♰ │
+70┤♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰♰ ♰♰♰ ♰ ♰♰ ♰♰ │
+60┤ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰♰♰♰♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰♰ ♰♰ ♰ ♰ ♰♰ ♰ │
+50┤ ♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰ ♰♰♰♰ ♰ ♰♰ ♰ ♰♰♰ ♰ ♰ ♰ ♰♰♰ ♰♰ ♰ ♰ ♰♰ ♰♰ ♰ │
+40┤ ♰♰♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰ ♰♰♰ ♰♰♰ ♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰♰♰ ♰♰ │
+30┤ ♰♰♰♰ ♰♰ ♰♰ ♰♰ ♰♰ ♰♰ ♰♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰│
+20┤ ♰♰♰ ♰ ♰ ♰♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰♰ ♰ ♰♰ ♰♰ ♰ ♰ │
+10┤ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰ │
+0┤ ♰ ♰ ♰♰ ♰ ♰♰ │
+└─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
+0 2 5 7 11 16 20 23 27 30 34 38 42 47 51 56 60 63 68 71 76 79
+131072 bytes
 ```
 
 ### Skip extraction with file magic
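
Note on the two measures named in the docs/guide.md change: the sketch below is an illustration only, not the Rust code this commit touches. It assumes that Shannon entropy is reported as a percentage of the 8 bits/byte maximum, and that the χ² probability is the upper-tail probability of a Pearson χ² statistic over byte frequencies with 255 degrees of freedom (the convention described on the linked fourmilab page). All function names are hypothetical. To stay within the Python standard library, the probability uses the Wilson-Hilferty normal approximation rather than an exact χ² distribution, so its values will differ slightly from an exact computation.

```python
import math
from collections import Counter

BLOCK_SIZE = 0x20000  # same value as block_size in the log output (131072 bytes)


def shannon_entropy_percent(block: bytes) -> float:
    """Shannon entropy of the byte histogram, scaled so 8 bits/byte is 100%."""
    if not block:
        return 0.0
    total = len(block)
    bits = -sum((n / total) * math.log2(n / total) for n in Counter(block).values())
    return bits / 8 * 100


def chi_square_statistic(block: bytes) -> float:
    """Pearson chi-square statistic against a uniform byte distribution."""
    if not block:
        raise ValueError("block must not be empty")
    expected = len(block) / 256
    counts = Counter(block)
    return sum((counts[v] - expected) ** 2 / expected for v in range(256))


def chi_square_probability_percent(block: bytes) -> float:
    """Approximate upper-tail probability (%) for 255 degrees of freedom.

    Assumption: Wilson-Hilferty normal approximation, not an exact CDF.
    """
    dof = 255
    x = chi_square_statistic(block)
    z = ((x / dof) ** (1 / 3) - (1 - 2 / (9 * dof))) / math.sqrt(2 / (9 * dof))
    return 0.5 * math.erfc(z / math.sqrt(2)) * 100


def summarize(data: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Per-block highest/lowest/mean for both measures, like the log lines."""
    if not data:
        raise ValueError("data must not be empty")
    blocks = [data[i : i + block_size] for i in range(0, len(data), block_size)]
    result = {}
    for name, measure in (
        ("shannon", shannon_entropy_percent),
        ("chi_square", chi_square_probability_percent),
    ):
        values = [measure(b) for b in blocks]
        result[name] = {
            "highest": max(values),
            "lowest": min(values),
            "mean": sum(values) / len(values),
        }
    return result
```

On uniformly random input, the Shannon percentage stays close to 100 in every block, while the χ² probability is expected to spread across the whole 0 to 100 range. That matches the shape of the two series in the new plots.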

docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -94,7 +94,7 @@ unblob identifies known and unknown chunks of data within a file:
   extracted content, looking for chunks in extracted files.
 
 - a report on metadata can be generated by unblob, providing detailed
-  information about identified chunks (format, offsets, size, entropy) and their
+  information about identified chunks (format, offsets, size, randomness) and their
   extracted content if available (ownership, permissions, timestamps, ...).
 
 ![unblob_architecture.webp](unblob_architecture.webp)
@@ -115,7 +115,7 @@ special **DirectoryExtractor**.
 - For extracting recognized formats, we use all kinds of different [Extractors](extractors.md).
 - For ELF analysis, we are using [LIEF](https://lief-project.github.io/) with
   its [Python bindings](https://pypi.org/project/lief/).
-- For CPU-intensive tasks (e.g. entropy calculation), we use
+- For CPU-intensive tasks (e.g. randomness calculation), we use
   [Rust](https://www.rust-lang.org/) to speed things up.
 - For the pretty command line interface, we are using the
   [Click library](https://click.palletsprojects.com/).

tests/test_cli.py

Lines changed: 3 additions & 3 deletions
@@ -184,7 +184,7 @@ def test_dir_for_file(tmp_path: Path):
 
 
 @pytest.mark.parametrize(
-    "params, expected_depth, expected_entropy_depth, expected_process_num, expected_verbosity, expected_progress_reporter",
+    "params, expected_depth, expected_randomness_depth, expected_process_num, expected_verbosity, expected_progress_reporter",
     [
         pytest.param(
             [],
@@ -233,7 +233,7 @@ def test_dir_for_file(tmp_path: Path):
 def test_archive_success(
     params,
     expected_depth: int,
-    expected_entropy_depth: int,
+    expected_randomness_depth: int,
     expected_process_num: int,
     expected_verbosity: int,
     expected_progress_reporter: Type[ProgressReporter],
@@ -263,7 +263,7 @@ def test_archive_success(
     config = ExtractionConfig(
         extract_root=tmp_path,
         max_depth=expected_depth,
-        entropy_depth=expected_entropy_depth,
+        entropy_depth=expected_randomness_depth,
         entropy_plot=bool(expected_verbosity >= 3),
         process_num=expected_process_num,
         handlers=BUILTIN_HANDLERS,
