Suggestion: Read file signature instead of extension when determining compression type #544

Context: I've had a few files with incorrect extensions (e.g. a RAR file with a .gz extension) fail to extract.

Rather than taking the file extension and mapping that to a decompressor function (https://github.com/golift/xtractr/blob/main/files.go#L34), it may be more robust to read the first few bytes of a file and match against a list of known [file signatures](https://en.wikipedia.org/wiki/List_of_file_signatures).

For example, a ZIP file will start with one of these byte sequences (shown in hex):
```
50 4B 03 04
50 4B 05 06 (empty archive)
50 4B 07 08 (spanned archive)
```
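As a rough illustration of what matching against these three sequences could look like in Go, here is a minimal sketch. The names `zipSignatures` and `looksLikeZip` are made up for this example and are not part of xtractr.

```go
package main

import "bytes"

// zipSignatures holds the three leading byte sequences listed above.
var zipSignatures = [][]byte{
	{0x50, 0x4B, 0x03, 0x04}, // regular archive
	{0x50, 0x4B, 0x05, 0x06}, // empty archive
	{0x50, 0x4B, 0x07, 0x08}, // spanned archive
}

// looksLikeZip reports whether header starts with any of the ZIP signatures.
// Hypothetical helper for illustration only.
func looksLikeZip(header []byte) bool {
	for _, sig := range zipSignatures {
		if bytes.HasPrefix(header, sig) {
			return true
		}
	}
	return false
}
```

`bytes.HasPrefix` returns false when the header is shorter than the signature, so very small files are simply reported as not matching.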

Another example is a RAR file with a .gz extension. The common Linux command `file` (which uses these file signatures) correctly identifies it as a RAR file despite the .gz ending:
```
file "/downloads/xxx.gz"   
/downloads/xxx.gz: RAR archive data, v5
```

When Unpackerr tries to decompress this file, we can see it tries to use the gzip decompression function:
```
[ERROR] 2025/07/06 23:14:43.058115 handlers.go:216: Extraction Failed: xxx: gzip.NewReader: gzip: invalid header
```

If we read the first few bytes of this file, it's clear that the file is a RAR file based on the signature:
```
xxd -l 16 "/downloads/xxx.gz" 
00000000: 5261 7221 1a07 0100 bd33 4e2f 1001 050c  Rar!.....3N/....
```

(If you search for 52 61 on the Wikipedia page linked above, you can see it maps to "Roshal ARchive compressed archive v5.00 onwards", a.k.a. RAR.)

I think the best approach would be:
- Try to read the first 16 bytes of the file and match them against a lookup table of signatures, each mapped to a decompressor function. If a match exists, use it.
- If the signature is not in the table, fall back to parsing the file extension, as is done currently.
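The two steps above could be sketched as follows. This is not how xtractr is structured today. The names `signature`, `signatures`, `extensions`, and `detectFormat` are invented for the example, and plain format strings stand in for the decompressor functions the real table would map to. The header is passed in as a byte slice so the matching logic stays separate from file I/O.

```go
package main

import (
	"bytes"
	"path/filepath"
	"strings"
)

// signature pairs a leading byte sequence with a format name.
// In a real implementation the value would be a decompressor function.
type signature struct {
	magic  []byte
	format string
}

// signatures is a small illustrative table, not an exhaustive list.
var signatures = []signature{
	{[]byte{0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00}, "rar"}, // RAR v5
	{[]byte{0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00}, "rar"},       // RAR v1.50 to v4
	{[]byte{0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C}, "7z"},
	{[]byte{0x50, 0x4B, 0x03, 0x04}, "zip"},
	{[]byte{0x50, 0x4B, 0x05, 0x06}, "zip"}, // empty archive
	{[]byte{0x50, 0x4B, 0x07, 0x08}, "zip"}, // spanned archive
	{[]byte{0x1F, 0x8B}, "gzip"},
}

// extensions is the fallback used when no signature matches.
var extensions = map[string]string{
	".rar": "rar",
	".7z":  "7z",
	".zip": "zip",
	".gz":  "gzip",
}

// detectFormat prefers the file signature and falls back to the extension.
// It returns an empty string when neither is recognized.
func detectFormat(path string, header []byte) string {
	for _, sig := range signatures {
		if bytes.HasPrefix(header, sig.magic) {
			return sig.format
		}
	}
	return extensions[strings.ToLower(filepath.Ext(path))]
}
```

With the header from the `xxd` output above, `detectFormat("/downloads/xxx.gz", header)` returns `"rar"` rather than `"gzip"`. The extension fallback still matters for formats such as tar, whose `ustar` marker sits at offset 257 and so would not show up in the first 16 bytes.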
