Commit 9e80f56

Version 2.0.0b1 (#107)
- Adding deep scan for improved accuracy #102 #94 #70 #69 #12 #3
- Changing to full semantic versioning to be able to denote bugfixes vs minor features
- Removing support for python 3.7, 3.8, 3.9, 3.10 and 3.11 please stick to 1.x release chain to support older versions
1 parent 01746b3 commit 9e80f56

39 files changed: +637 -237 lines changed

.github/workflows/pythonpublish.yml

Lines changed: 2 additions & 2 deletions
@@ -10,14 +10,14 @@ on:
 jobs:
   deploy:
 
-    runs-on: ubuntu-22.04
+    runs-on: ubuntu-latest
 
     steps:
     - uses: actions/checkout@v4
     - name: Set up Python
       uses: actions/setup-python@v5
       with:
-        python-version: '3.9.22'
+        python-version: '3.12'
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip

.github/workflows/tests.yml

Lines changed: 7 additions & 15 deletions
@@ -15,13 +15,13 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-22.04]
-        python-version: ["3.7.17", "3.8.18", "3.9.22", "3.10.17", "3.11.12", "3.12.10", "3.13.3"]
+        os: [ubuntu-latest]
+        python-version: ["3.12", "3.13"]
         include:
           - os: macos-latest
-            python-version: '3.13.3'
+            python-version: '3.13'
           - os: windows-latest
-            python-version: '3.13.3'
+            python-version: '3.13'
     runs-on: ${{ matrix.os }}
     steps:
     - uses: actions/checkout@v4
@@ -31,21 +31,17 @@ jobs:
       with:
         python-version: ${{ matrix.python-version }}
         allow-prereleases: true
+        cache: 'pip'
 
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        pip install -r requirements-test.txt
         pip install coveralls flake8 setuptools wheel twine
-
-    - name: Update coverage on newer Python versions
-      if: ${{ matrix.python-version != '3.7.17' && matrix.python-version != '3.8.18' }}
-      run: pip install coverage>=7.8.0 pytest-cov>=6.1.1
+        pip install -r requirements-test.txt --upgrade
+        pip install black==24.10.0
 
     - name: Verify Code with Black
-      if: ${{ matrix.python-version != '3.7.17' }}
       run: |
-        pip install black==24.4.2
         black --check puremagic test
 
     - name: Lint with flake8
@@ -54,14 +50,10 @@ jobs:
         flake8 puremagic --count --show-source --statistics
 
     - name: Test with pytest
-      env:
-        COVERALLS_REPO_TOKEN: ${{ secrets.COVERALLS_REPO_TOKEN }}
       run: |
         python -m pytest --cov=puremagic test/
-        coveralls || true
 
     - name: Check distribution log description
-      if: ${{ matrix.python-version == '3.9.22' }}
       shell: bash
       run: |
         python setup.py sdist bdist_wheel

.pre-commit-config.yaml

Lines changed: 6 additions & 6 deletions
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.6.0
+    rev: v5.0.0
     hooks:
       # Identify invalid files
       - id: check-ast
@@ -35,26 +35,26 @@ repos:
 
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.5.7
+    rev: v0.7.2
     hooks:
       - id: ruff
 
   - repo: https://github.com/ambv/black
-    rev: 24.8.0
+    rev: 24.10.0
     hooks:
       - id: black
 
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: 'v1.11.1'
+    rev: 'v1.13.0'
     hooks:
       - id: mypy
 
   - repo: https://github.com/tox-dev/pyproject-fmt
-    rev: 2.2.1
+    rev: v2.5.0
     hooks:
       - id: pyproject-fmt
 
   - repo: https://github.com/abravalheri/validate-pyproject
-    rev: v0.18
+    rev: v0.22
     hooks:
       - id: validate-pyproject

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -1,6 +1,13 @@
 Changelog
 =========
 
+Version 2.0.0
+-------------
+
+- Adding deep scan for improved accuracy #102 #94 #70 #69 #12 #3
+- Changing to full semantic versioning to be able to denote bugfixes vs minor features
+- Removing support for python 3.7, 3.8, 3.9, 3.10 and 3.11 please stick to 1.x release chain to support older versions
+
 Version 1.29
 ------------
 

MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 include puremagic/*.json
 include puremagic/py.typed
+include puremagic/scanners/*.py
 include LICENSE
 include AUTHORS.rst
 include CHANGELOG.md

README.rst

Lines changed: 3 additions & 22 deletions
@@ -5,8 +5,6 @@ puremagic
 puremagic is a pure python module that will identify a file based off
 it's magic numbers.
 
-|CoverageStatus| |License| |PyPi|
-
 It is designed to be minimalistic and inherently cross platform
 compatible. It is also designed to be a stand in for python-magic, it
 incorporates the functions from\_file(filename[, mime]) and
@@ -36,7 +34,9 @@ Disadvantages:
 Compatibility
 ~~~~~~~~~~~~~
 
-- Python 3.7+
+- Python 3.12+
+
+For use with with 3.7 use the 1.x branch.
 
 Using github ci to run continuous integration tests on listed platforms.
 
@@ -151,18 +151,6 @@ file standard. The subset signature will be longer, therefore report
 with greater confidence, because it will have both the base file type
 signature plus the additional subset one.
 
-*You don't have sliding offsets that could better detect plenty of
-common formats, why's that?*
-
-Design choice, so it will be a lot faster and more accurate. Without
-more intelligent or deeper identification past a sliding offset I don't
-feel comfortable including it as part of a 'magic number' library.
-
-*Your version isn't as complete as I want it to be, where else should I
-look?*
-
-Look into python modules that wrap around libmagic or use something like
-Apache Tika.
 
 Acknowledgements
 ----------------
@@ -182,10 +170,3 @@ License
 -------
 
 MIT Licenced, see LICENSE, Copyright (c) 2013-2025 Chris Griffith
-
-.. |CoverageStatus| image:: https://coveralls.io/repos/github/cdgriffith/puremagic/badge.svg?branch=develop
-   :target: https://coveralls.io/github/cdgriffith/puremagic?branch=develop
-.. |PyPi| image:: https://img.shields.io/pypi/v/puremagic.svg?maxAge=2592000
-   :target: https://pypi.python.org/pypi/puremagic/
-.. |License| image:: https://img.shields.io/pypi/l/puremagic.svg
-   :target: https://pypi.python.org/pypi/puremagic/

puremagic/magic_data.json

Lines changed: 2 additions & 0 deletions
@@ -839,6 +839,7 @@
 ["465753", 0, ".swf", "application/x-shockwave-flash", "Macromedia Shockwave Flash file"],
 ["1a0b", 0, ".pak", "application/pak", "Compressed archive file (often associated with Quake Engine games)"],
 ["7573746172", 257, ".tar", "application/x-tar", "Tape Archive file"],
+["7573746172", 257, ".cbt", "application/x-tar", "Comic Book in TAR Format"],
 ["2d6c68", 2, ".lzh", "application/octet-stream", "Compressed archive file"],
 ["504b0304", 0, ".zip", "application/zip", "PKZIP Archive file"],
 ["504b030414000100630000000000", 0, ".zip", "application/zip", "ZLock Pro Encrypted ZIP file"],
@@ -1645,6 +1646,7 @@
 ["1f8b08", 0, ".gz", "application/x-gzip", "GZIP Archive file"],
 ["fd377a585a00", 0, ".xz", "application/x-xz", "LMZA XZ Archive file"],
 ["377abcaf271c", 0, ".7z", "application/x-7z-compressed", "7-Zip Compressed file"],
+["377abcaf271c", 0, ".cb7", "application/x-7z-compressed", "Comic Book Archive 7z format"],
 ["04000000", 524, ".db", "application/octet-stream", "Windows Thumbs.db file"],
 ["23212f7573722f62696e2f656e7620707974686f6e", 0, ".py", "text/x-python", "Python file"],
 ["23202d2a2d20636f64696e67", 0, ".py", "text/x-python", "Python file"],

puremagic/main.py

Lines changed: 95 additions & 9 deletions
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 """
 puremagic is a pure python module that will identify a file based off it's
-magic numbers. It is designed to be minimalistic and inherently cross platform
+magic numbers. It is designed to be minimalistic and inherently cross-platform
 compatible, with no imports when used as a module.
 
 © 2013-2025 Chris Griffith - License: MIT (see LICENSE)
@@ -19,9 +19,13 @@
 from binascii import unhexlify
 from collections import namedtuple
 from itertools import chain
+from pathlib import Path
+
+if os.getenv("PUREMAGIC_DEEPSCAN") != "0":
+    from puremagic.scanners import zip_scanner, pdf_scanner, text_scanner
 
 __author__ = "Chris Griffith"
-__version__ = "1.29"
+__version__ = "2.0.0b1"
 __all__ = [
     "magic_file",
     "magic_string",
@@ -133,9 +137,6 @@ def _confidence(matches, ext=None) -> list[PureMagicWithConfidence]:
             if ext == magic_row.extension
         ]
 
-    if not results:
-        raise PureError("Could not identify file")
-
     return sorted(results, key=lambda x: (x.confidence, len(x.byte_match)), reverse=True)
 
 
@@ -196,11 +197,22 @@ def _identify_all(header: bytes, footer: bytes, ext=None) -> list[PureMagicWithC
     return _confidence(matches, ext)
 
 
-def _magic(header: bytes, footer: bytes, mime: bool, ext=None) -> str:
+def _magic(header: bytes, footer: bytes, mime: bool, ext=None, filename=None) -> str:
     """Discover what type of file it is based on the incoming string"""
     if not header:
         raise ValueError("Input was empty")
-    info = _identify_all(header, footer, ext)[0]
+    infos = _identify_all(header, footer, ext)
+    if filename and os.getenv("PUREMAGIC_DEEPSCAN") != "0":
+        results = _run_deep_scan(infos, filename, header, footer, raise_on_none=True)
+        if results:
+            if results[0].extension == "":
+                raise PureError("Could not identify file")
+            if mime:
+                return results[0].mime_type
+            return results[0].extension
+    if not infos:
+        raise PureError("Could not identify file")
+    info = infos[0]
     if mime:
         return info.mime_type
     return info.extension if not isinstance(info.extension, list) else info[0].extension
@@ -268,7 +280,7 @@ def from_file(filename: os.PathLike | str, mime: bool = False) -> str:
     """
 
     head, foot = _file_details(filename)
-    return _magic(head, foot, mime, ext_from_filename(filename))
+    return _magic(head, foot, mime, ext_from_filename(filename), filename=filename)
 
 
 def from_string(string: str | bytes, mime: bool = False, filename: os.PathLike | str | None = None) -> str:
@@ -321,6 +333,8 @@ def magic_file(filename: os.PathLike | str) -> list[PureMagicWithConfidence]:
     except PureError:
         info = []
     info.sort(key=lambda x: x.confidence, reverse=True)
+    if os.getenv("PUREMAGIC_DEEPSCAN") != "0":
+        return _run_deep_scan(info, filename, head, foot, raise_on_none=False)
     return info
 
 
@@ -343,7 +357,10 @@ def magic_string(string, filename: os.PathLike | str | None = None) -> list[Pure
     return info
 
 
-def magic_stream(stream, filename: os.PathLike | str | None = None) -> list[PureMagicWithConfidence]:
+def magic_stream(
+    stream,
+    filename: os.PathLike | None = None,
+) -> list[PureMagicWithConfidence]:
     """Returns tuple of (num_of_matches, array_of_matches)
     arranged highest confidence match first
     If filename is provided it will be used in the computation.
@@ -361,6 +378,75 @@ def magic_stream(stream, filename: os.PathLike | str | None = None) -> list[Pure
     return info
 
 
+def _single_deep_scan(
+    bytes_match: bytes | bytearray | None,
+    filename: os.PathLike | str,
+    head=None,
+    foot=None,
+):
+    if os.getenv("PUREMAGIC_DEEPSCAN") == "0":
+        return None
+    if not isinstance(filename, os.PathLike):
+        filename = Path(filename)
+    match bytes_match:
+        case zip_scanner.match_bytes:
+            return zip_scanner.main(filename, head, foot)
+        case pdf_scanner.match_bytes:
+            return pdf_scanner.main(filename, head, foot)
+        case None | b"":
+            for scanner in (text_scanner, pdf_scanner):
+                result = scanner.main(filename, head, foot)
+                if result:
+                    return result
+    return None
+
+
+def _run_deep_scan(
+    matches: list[PureMagicWithConfidence],
+    filename: os.PathLike | str,
+    head=None,
+    foot=None,
+    raise_on_none=True,
+):
+    if not matches or matches[0].byte_match == b"":
+        try:
+            result = _single_deep_scan(None, filename, head, foot)
+        except Exception:
+            pass
+        else:
+            if result:
+                return [
+                    PureMagicWithConfidence(
+                        confidence=result.confidence,
+                        byte_match=None,
+                        offset=None,
+                        extension=result.extension,
+                        mime_type=result.mime_type,
+                        name=result.name,
+                    )
+                ]
+        if raise_on_none:
+            raise PureError("Could not identify file")
+
+    for pure_magic_match in matches:
+        try:
+            result = _single_deep_scan(pure_magic_match.byte_match, filename, head, foot)
+        except Exception:
+            continue
+        if result:
+            return [
+                PureMagicWithConfidence(
+                    confidence=result.confidence,
+                    byte_match=pure_magic_match.byte_match,
+                    offset=pure_magic_match.offset,
+                    extension=result.extension,
+                    mime_type=result.mime_type,
+                    name=result.name,
+                )
+            ]
+    return matches
+
+
 def command_line_entry(*args):
     import sys
     from argparse import ArgumentParser

puremagic/scanners/__init__.py

Whitespace-only changes.

puremagic/scanners/helpers.py

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+from dataclasses import dataclass
+
+
+@dataclass
+class Match:
+    extension: str
+    name: str
+    mime_type: str
+    confidence: float = 1
