Commit d86f9e4

Calculate and report weighted mean of entropy blocks
In most cases where entropy is calculated and the input is larger than 1K, the file is split up into multiple "blocks" of at least 1K each, and the entropy mean was calculated as the plain mean of the block entropies. This introduced a bias: the last block is usually not the same size as the others, so its entropy had disproportionate say in the mean relative to its size.

The effect is most significant for files smaller than 80K, as that is the limit at which a different block size can level out the differences in block sizes.

Let's look at an extreme example: an encrypted file of size 1025. This gives us 2 "blocks" of sizes 1024 and 1, with entropies scaled to "percentages" (0-100) of ~100 and 0 respectively. In this case the naive mean is

    ~50 = (~100 + 0) / 2

In contrast, the weighted mean is a much better approximation:

    ~99.9 = (~100 * 1024 + 0 * 1) / (1024 + 1)
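
The arithmetic in the commit message can be reproduced with a short sketch. The helper name `weighted_mean` is hypothetical and not part of the commit; the block percentages and sizes are the idealized values from the 1025-byte example above.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Idealized block entropies (as percentages) and block sizes
# from the 1025-byte example in the commit message.
block_percentages = [100.0, 0.0]
block_sizes = [1024, 1]

naive = mean(block_percentages)
weighted = weighted_mean(block_percentages, block_sizes)

print(f"naive mean:    {naive:.2f}")
print(f"weighted mean: {weighted:.2f}")
```

Weighting by block size means every byte of the file contributes equally, so a single trailing byte can no longer pull the reported mean down by half.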
1 parent 2971550 commit d86f9e4

3 files changed: 12 additions & 12 deletions

tests/test_processing.py

Lines changed: 4 additions & 6 deletions
@@ -378,19 +378,17 @@ def get_all(file_name, report_type: Type[T]) -> List[T]:
     # with a percentages (scaled up bits) of 64 items, for 0, 6, 8, 8, ... bits of entropies
     [unknown_chunk_report] = get_all("input-file", UnknownChunkReport)
     unknown_entropy = unknown_chunk_report.entropy
-    assert unknown_entropy == EntropyReport(
-        percentages=[0.0, 75.0] + [100.0] * 62,
-        block_size=1024,
-    )
     assert (
         unknown_entropy is not None
-    )  # removes pyright complaints for the below 3 lines :(
+    )  # removes pyright complaints for the below lines :(
+    assert unknown_entropy.percentages == [0.0, 75.0] + [100.0] * 62
+    assert unknown_entropy.block_size == 1024
     assert round(unknown_entropy.mean, 2) == 98.05  # noqa: PLR2004
     assert unknown_entropy.highest == 100.0  # noqa: PLR2004
     assert unknown_entropy.lowest == 0.0  # noqa: PLR2004
 
     # we should have entropy calculated for files without extractions, except for empty files
     assert [] == get_all("empty.txt", EntropyReport)
-    assert [EntropyReport(percentages=[100.0], block_size=1024)] == get_all(
+    assert [EntropyReport(percentages=[100.0], block_size=1024, mean=100.0)] == get_all(
         "0-255.bin", EntropyReport
     )
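
The expected mean of 98.05 in this test is unchanged by the commit. A small check suggests why: when all blocks have the same size, the weighted mean and the naive mean coincide. The sketch below assumes that each of the 64 blocks in the test input is a full 1024-byte block, which the diff does not state explicitly.

```python
from statistics import mean

# Percentages asserted by the test: 0, 6 and 8 bits of entropy scaled to 0-100.
percentages = [0.0, 75.0] + [100.0] * 62

# Assumption: every block is a full 1024-byte block.
block_sizes = [1024] * len(percentages)

naive = mean(percentages)
weighted = sum(p * s for p, s in zip(percentages, block_sizes)) / sum(block_sizes)

print(round(naive, 2), round(weighted, 2))
```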

unblob/processing.py

Lines changed: 7 additions & 1 deletion
@@ -521,13 +521,19 @@ def calculate_entropy(path: Path) -> EntropyReport:
         max_limit=1024 * 1024,
     )
 
+    entropy_sum = 0.0
     with File.from_path(path) as file:
         for chunk in iterate_file(file, 0, file_size, buffer_size=block_size):
             entropy = shannon_entropy(chunk)
             entropy_percentage = round(entropy / 8 * 100, 2)
             percentages.append(entropy_percentage)
+            entropy_sum += entropy * len(chunk)
 
-    report = EntropyReport(percentages=percentages, block_size=block_size)
+    report = EntropyReport(
+        percentages=percentages,
+        block_size=block_size,
+        mean=entropy_sum / file_size / 8 * 100,
+    )
 
     logger.debug(
         "Entropy calculated",
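
The accumulation pattern in this hunk can be tried in isolation with a minimal sketch. It works on an in-memory `bytes` object instead of a file, and `shannon_entropy_bits` and `entropy_means` are hypothetical stand-ins written for this example; they are not unblob's `shannon_entropy`, `iterate_file`, or `calculate_entropy`.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_means(data: bytes, block_size: int = 1024):
    """Return (naive_mean, weighted_mean) of block entropies as percentages."""
    if not data:
        raise ValueError("entropy mean is undefined for empty input")
    percentages = []
    entropy_sum = 0.0
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        entropy = shannon_entropy_bits(chunk)
        percentages.append(entropy / 8 * 100)
        # Weight each block's entropy by the number of bytes it covers.
        entropy_sum += entropy * len(chunk)
    naive = sum(percentages) / len(percentages)
    weighted = entropy_sum / len(data) / 8 * 100
    return naive, weighted


# 1024 bytes with a uniform byte distribution, followed by one extra byte.
data = bytes(range(256)) * 4 + b"\x00"
naive, weighted = entropy_means(data)
print(f"naive: {naive:.2f}, weighted: {weighted:.2f}")
```

As in the diff, the weighted mean is accumulated from the unrounded per-block entropy, so rounding of the individual percentages does not leak into the reported mean.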

unblob/report.py

Lines changed: 1 addition & 5 deletions
@@ -1,7 +1,6 @@
 import hashlib
 import os
 import stat
-import statistics
 import traceback
 from enum import Enum
 from pathlib import Path
@@ -177,10 +176,7 @@ class FileMagicReport(Report):
 class EntropyReport(Report):
     percentages: List[float]
     block_size: int
-
-    @property
-    def mean(self):
-        return statistics.mean(self.percentages)
+    mean: float
 
     @property
     def highest(self):
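
One consequence of this hunk is that `mean` becomes a stored field that the caller must supply, rather than a value derived from `percentages`. The sketch below illustrates that with a plain dataclass named `EntropyReportSketch`, which is a hypothetical stand-in: the real `Report` base class is not shown in the diff, and the bodies of `highest` and `lowest` are assumed here to be `max` and `min` over the percentages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EntropyReportSketch:
    percentages: List[float]
    block_size: int
    mean: float  # supplied by the caller, no longer derived from percentages

    @property
    def highest(self) -> float:
        # Assumption: the real property body is not visible in the diff.
        return max(self.percentages)

    @property
    def lowest(self) -> float:
        # Assumption: the real property body is not visible in the diff.
        return min(self.percentages)


report = EntropyReportSketch(percentages=[100.0, 0.0], block_size=1024, mean=99.9)
print(report.mean, report.highest, report.lowest)
```

Because the field has no default, every construction site has to pass `mean` explicitly, which matches the updated `EntropyReport(..., mean=100.0)` expectation in the test file.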
