Description
Context: I've had a few files with incorrect extensions (e.g. a RAR file with a .gz extension) fail to extract.
Rather than taking the file extension and mapping that to a decompressor function (https://github.com/golift/xtractr/blob/main/files.go#L34), it may be more robust to read the first few bytes of a file and match against a list of known file signatures.
For example, a ZIP file will start with one of these byte sequences (shown in hex):
50 4B 03 04
50 4B 05 06 (empty archive)
50 4B 07 08 (spanned archive)
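As a rough illustration of what such a check could look like in Go, here is a minimal sketch that compares the start of a byte slice against the three ZIP signatures listed above. The names `zipSignatures` and `isZip` are hypothetical and are not part of xtractr.

```go
package main

import "bytes"

// zipSignatures holds the three ZIP magic numbers listed above.
// Hypothetical name, for illustration only.
var zipSignatures = [][]byte{
	{0x50, 0x4B, 0x03, 0x04}, // regular archive
	{0x50, 0x4B, 0x05, 0x06}, // empty archive
	{0x50, 0x4B, 0x07, 0x08}, // spanned archive
}

// isZip reports whether header begins with any known ZIP signature.
// A header shorter than the signature never matches.
func isZip(header []byte) bool {
	for _, sig := range zipSignatures {
		if bytes.HasPrefix(header, sig) {
			return true
		}
	}
	return false
}
```

Calling `isZip` on the first few bytes of a file would be enough here, since every signature in the list is four bytes long.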
Another example is a RAR file with a .gz extension. If we run the common Linux command file (which uses these file signatures), it correctly identifies the file as a RAR archive despite the .gz ending:
file "/downloads/xxx.gz"
/downloads/xxx.gz: RAR archive data, v5
When Unpackerr tries to decompress this file, we can see it tries to use the gzip decompression function:
[ERROR] 2025/07/06 23:14:43.058115 handlers.go:216: Extraction Failed: xxx: gzip.NewReader: gzip: invalid header
If we read the first few bytes of this file, it's clear that the file is a RAR file based on the signature:
xxd -l 16 "/downloads/xxx.gz"
00000000: 5261 7221 1a07 0100 bd33 4e2f 1001 050c Rar!.....3N/....
(If you search for 52 61 on the Wikipedia page I linked before, you can see it maps to "Roshal ARchive compressed archive v5.00 onwards[24]", aka RAR)
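The same comparison can be sketched in Go. The example below assumes the two commonly documented RAR signatures: `52 61 72 21 1A 07 00` for the older 1.5 to 4.x format and `52 61 72 21 1A 07 01 00` for v5.00 onwards. The names `rar4Magic`, `rar5Magic`, and `rarVersion` are made up for this illustration.

```go
package main

import "bytes"

var (
	// Signature of the older RAR format (1.5 to 4.x).
	rar4Magic = []byte{0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00}
	// Signature of the RAR format from v5.00 onwards.
	rar5Magic = []byte{0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00}
)

// rarVersion returns 5 for a v5 header, 4 for the older format,
// and 0 when the header does not look like RAR at all.
func rarVersion(header []byte) int {
	switch {
	case bytes.HasPrefix(header, rar5Magic):
		return 5
	case bytes.HasPrefix(header, rar4Magic):
		return 4
	default:
		return 0
	}
}
```

Fed the 16 bytes from the xxd output above, `rarVersion` would take the v5 branch, which agrees with what the file command reported.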
I think the best action would be:
- Try to read the first 16 bytes of a file and match them against a lookup table of signatures that map to decompressor functions. If a match exists, use it!
- If the signature is not in the table, fall back to parsing the file extension as we currently do.
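The two steps above could be sketched roughly as follows. This is only an illustration under assumptions, not how xtractr is structured: `signature`, `signatures`, `formatFromHeader`, and `detectFormat` are hypothetical names, the table maps to extension-style keys rather than to real decompressor functions, and it covers only RAR, ZIP, and gzip (`1F 8B`).

```go
package main

import (
	"bytes"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// signature pairs a magic-number prefix with an extension-style key.
// A real table would map to decompressor functions instead.
type signature struct {
	magic []byte
	ext   string
}

// signatures is a deliberately small, illustrative lookup table.
var signatures = []signature{
	{[]byte("Rar!\x1a\x07"), ".rar"},
	{[]byte{0x50, 0x4B, 0x03, 0x04}, ".zip"},
	{[]byte{0x50, 0x4B, 0x05, 0x06}, ".zip"},
	{[]byte{0x50, 0x4B, 0x07, 0x08}, ".zip"},
	{[]byte{0x1F, 0x8B}, ".gz"},
}

// formatFromHeader returns the key of the first matching signature.
func formatFromHeader(header []byte) (string, bool) {
	for _, sig := range signatures {
		if bytes.HasPrefix(header, sig.magic) {
			return sig.ext, true
		}
	}
	return "", false
}

// detectFormat reads up to 16 bytes from path and consults the table,
// falling back to the lowercased file extension when nothing matches.
func detectFormat(path string) (string, error) {
	file, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer file.Close()

	header := make([]byte, 16)
	n, err := io.ReadFull(file, header)
	// Files shorter than 16 bytes are fine; only real read errors abort.
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return "", err
	}

	if ext, ok := formatFromHeader(header[:n]); ok {
		return ext, nil
	}
	return strings.ToLower(filepath.Ext(path)), nil
}
```

With this approach the mislabeled file from the example would resolve to `.rar` from its header, while a file with an unrecognized header would still be handled by its extension as it is today.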