Suggestion: Read file signature instead of extension when determining compression type #544

@ItzDerock

Context: I've had a few files with incorrect extensions (e.g. a RAR file with a .gz extension) fail to extract.

Rather than taking the file extension and mapping that to a decompressor function (https://github.com/golift/xtractr/blob/main/files.go#L34), it may be more robust to read the first few bytes of a file and match against a list of known file signatures.

For example, a ZIP file starts with one of these byte sequences (shown in hex):

50 4B 03 04
50 4B 05 06 (empty archive)
50 4B 07 08 (spanned archive)

As another example, take a RAR file with a .gz extension. The common Linux command file (which uses these file signatures) correctly identifies it as a RAR file despite the .gz ending:

file "/downloads/xxx.gz"   
/downloads/xxx.gz: RAR archive data, v5

When Unpackerr tries to decompress this file, it picks the gzip decompression function, which fails:

[ERROR] 2025/07/06 23:14:43.058115 handlers.go:216: Extraction Failed: xxx: gzip.NewReader: gzip: invalid header

If we read the first few bytes of this file, it's clear that the file is a RAR file based on the signature:

xxd -l 16 "/downloads/xxx.gz" 
00000000: 5261 7221 1a07 0100 bd33 4e2f 1001 050c  Rar!.....3N/....

(If you search for 52 61 on the Wikipedia page I linked above, you can see it maps to "Roshal ARchive compressed archive v5.00 onwards[24]", aka RAR.)

I think the best action would be:

  • Try to read the first 16 bytes of a file and match them against a lookup table of signatures that maps each one to a decompressor function. If a match exists, great, use it!
  • If the signature is not in the table, fall back to parsing the file extension as we currently do.
