@NebularNerd NebularNerd commented Dec 10, 2025

Closes #116

MPEG Audio Scanner Version 2

This is my second pass at an MP3 scanner (I scrapped v1 after discovering some issues in my detection logic and workflows, see #120). It ended up being a whole lot more than that: it can scan and test any valid MPEG 1, 2 or 2.5 audio stream. In short, if it's an .mp1, .mp2 or .mp3, this should understand what it is.

The decoder grew in scope far beyond what PureMagic is aimed at, so I'm eventually going to release a fully featured decoder under its own repo as a standalone tool. This is not to compete with PureMagic, but it will provide features outside of the PureMagic goals (e.g. tag recovery/conversion for obscure formats, data stream checking, etc.). As I develop either this or that, code enhancements will pass back and forth, so this scanner will see updates.

Deepscan

I've altered the Best Match line to print Deepscan Match if a scanner returns a positive confidence=1 result. This makes it clearer to users that we're 100% certain the file is what it is. If the file fails the test we politely return None and let a regular magic_data match offer a best match.
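
The fallback described above could be sketched roughly as follows. This is only an illustration: `Match`, `pick_best` and the scanner callables are hypothetical names, not puremagic's actual internals.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Match:
    """Hypothetical stand-in for a puremagic match result."""
    name: str
    extension: str
    confidence: float


def pick_best(
    data: bytes,
    scanners: Sequence[Callable[[bytes], Optional[Match]]],
    magic_matches: Sequence[Match],
) -> Tuple[str, Optional[Match]]:
    """Return a (label, match) pair for display."""
    for scan in scanners:
        result = scan(data)
        # A deep scanner either returns a fully confident match or None.
        if result is not None and result.confidence == 1:
            return "Deepscan Match", result
    if magic_matches:
        # No scanner was certain: fall back to the regular magic_data matches.
        return "Best Match", max(magic_matches, key=lambda m: m.confidence)
    return "No Match", None
```

The point of the design is that a scanner never returns a "maybe": it is either certain or it steps aside.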

magic_data.json

Sorry @cdgriffith, I've made it bigger again.

  • I realised that all but the assumed typical ffbb header for raw MPEG audio files (those without ID3v2) were missing. I've now added all valid byte combinations of .mp1, .mp2 and .mp3
  • Some extra tags for ID3v2 came to light and they have been added.
  • Renaming of all MPEG Audio entries to a consistent format (with some caveats, see note 1 below).
  • Changed extension on ffbb from the less common .mpga to .mp3

1: There is an issue with raw MPEG audio streams where the same header can be used for different MPEG version and Layer revisions. For the magic_data I have gone with what should be the most common file for that header; however, there will be fringe cases where the magic_data may give the wrong MPEG version/Layer. This is a limitation of the magic_data system (or in fact any file ID tool that relies solely on magic bytes) that we cannot compensate for. If you're streaming or byte-stringing a file this could affect you; if you supply an MPEG audio file, the decoder will give you the correct details in all cases.

MPEG Audio Scanner:

Overview

To fully test whether the file is a true MP3, I've ended up building something close to a fully featured MP3 decoder:

  • If the decoder fails during any of its tests, it will return None and puremagic will fall back to the magic_data.json. If this happens there is a high chance it's not an MPEG Audio file (or is highly corrupted).
  • If the decoder returns a match then you can be certain it's a bona fide MPEG Audio file, with correct version, layer and other information.
  • This decoder does not trust the Xing/Info header as a test of VBR/CBR, as these can provide false results (see the test files below). For VBR/CBR testing we compare the bitrate across a few frames: if it changes it's VBR/ABR, if not it's almost certainly CBR.
  • This will also scan and check pretty much every TAG/Metadata style out there, if it finds a valid one it will add it to the details.
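
As a rough illustration of the frame-by-frame bitrate check, here is a minimal sketch. It is not the scanner's actual code: `frame_bitrates` and `bitrate_mode` are hypothetical names, and to stay short it only understands MPEG-1 Layer III frames (the real scanner handles all versions and layers).

```python
from typing import List, Optional

# Bitrates in kbps for MPEG-1 Layer III, indexed by the 4-bit bitrate field.
# Index 0 ("free" format) and 15 (invalid) are rejected by this sketch.
V1_L3_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
V1_SAMPLE_RATES = [44100, 48000, 32000]


def frame_bitrates(data: bytes, start: int = 0, max_frames: int = 8) -> Optional[List[int]]:
    """Walk consecutive MPEG-1 Layer III frames and collect their bitrates.

    Returns None as soon as a header fails validation.
    """
    rates: List[int] = []
    pos = start
    while len(rates) < max_frames and pos + 4 <= len(data):
        b0, b1, b2, _ = data[pos:pos + 4]
        # 0xFF then 0xFA/0xFB: sync bits, MPEG-1, Layer III (with or without CRC).
        if b0 != 0xFF or (b1 & 0xFE) != 0xFA:
            return None
        bitrate_idx = b2 >> 4
        rate_idx = (b2 >> 2) & 0x03
        padding = (b2 >> 1) & 0x01
        if bitrate_idx in (0, 15) or rate_idx == 3:
            return None
        kbps = V1_L3_KBPS[bitrate_idx]
        rates.append(kbps)
        # MPEG-1 Layer III frame length in bytes; jump to the next header.
        pos += 144 * kbps * 1000 // V1_SAMPLE_RATES[rate_idx] + padding
    return rates or None


def bitrate_mode(rates: List[int]) -> str:
    """One bitrate across all sampled frames means CBR, otherwise VBR/ABR."""
    return "CBR" if len(set(rates)) == 1 else "VBR"
```

Note that nothing here looks at a Xing/Info header: the verdict comes only from the frames themselves.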

Features

Deep scans any MPEG Audio files, tests and scans for:

  • MPEG versions 1, 2 and 2.5
  • Layer I (MP1), II (MP2), III (MP3)
  • Tests files with/without ID3v2 tags (ID3 header)
  • Tests ID3v2 for correct structure and size
  • Tests CBR/VBR files that are not LAME encoded
  • Tests CBR/VBR files that are LAME Xing VBR/ABR or Info CBR encoded files
  • Tests VBR files that are VBRI encoded
  • Tests the various End of File TAG formats (see Tags tested below)
  • Tests various factors to confirm it's a valid file
  • Returns: MPEG (1,2,2.5) Audio Layer (I,II,III) file [bitrate samplerate Stereo/Mono VBR/CBR LAME/VBRI(if present) metatags]

Issues, Limitations and MPEG Audio quirks

All issues I had in v1 of the scanner have been overcome, but some limitations still apply, and MPEG Audio files have some quirks/misconceptions.

  • End of file tags: These are a nightmare. They can push each other around, meaning they fall out of their accepted spec locations yet are still, in part, valid. We test for each tag in its expected location per the specifications; if a tag is out of position we do not add it to the found tags. I could add the ability to scan around and try to locate them, but this introduces a couple of issues, such as how to present out-of-position tags in the output and how much time we spend hunting around the file, and I believe it starts heading out of puremagic's core mission profile.
  • ID3v2: MP3s are not the only files to use them. The scanner will start decoding any file that has an ID3 header, but will give up when the audio data is invalid. There is not much we can do about this; indeed, if other scanners added later have to test for ID3, the same thing will crop up there as well. Performance-wise this means that ID3-headed files will take slightly longer as each successive scanner has a go at decoding them. This can't be avoided, as you need to know where the start of the audio data is. The mysterious .koz format is a prime example, as it's an encrypted file with an ID3v2 at the front. I've found other files and formats that use them while working on this scanner.
  • ✅ FIXED MP2 (MPEG-1 Layer-2) files can be encoded with wonky frame data, this causes them to be identified as VBR which is not correct.
  • ✅ FIXED End of file tags: Most of these seem to have issues with their specs (padding, wrong size calculations, etc.) even when not being pushed around, so we have to use broad searches to overcome their quirks. I spent a long time on this and all tags are now calculated correctly and to specification.
  • ✅ FIXED Refactoring: This scanner is pretty solid but I know it's not as efficient or easy to read as it should be. I wanted to get feedback from real world use. As mentioned, I'm going to be developing a stand-alone tool, so I shall start the decoder from scratch to compartmentalise checks and improve things across the board. The refactor is solid and its logic is much easier to follow.
  • ℹ️Misconception: That they all start with ID3: If the file has ID3v2 tags then ID3 will be present at byte 0, otherwise it's the MPEG header frame starting with hex ffeX or fffX
  • ℹ️Misconception: For LAME encoded files Xing means VBR and Info means CBR. An MP3 can be encoded with Info and still be VBR. 🤦 MediaInfo (an awesome tool) checks these flags and bases VBR/CBR-ness on them; if you hex edit Info to Xing or vice versa it changes its report (see test files below).
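
To make the "ID3 or ffeX/fffX" point concrete, here is a small sketch that classifies the first bytes of a file and, for a raw stream, decodes the version and layer bits of the second header byte. `describe_start` is a hypothetical helper for illustration, not a function from the scanner.

```python
from typing import Optional, Tuple

# 2-bit fields from the second header byte; one value of each is reserved.
VERSIONS = {0b00: "2.5", 0b10: "2", 0b11: "1"}   # 0b01 is reserved
LAYERS = {0b01: "III", 0b10: "II", 0b11: "I"}    # 0b00 is reserved


def describe_start(data: bytes) -> Optional[Tuple[str, ...]]:
    """Classify the first bytes of a candidate MPEG audio file."""
    if data[:3] == b"ID3":
        # ID3v2 tag first; the audio frames start after the tag.
        return ("ID3v2",)
    # Frame sync is 11 set bits: 0xFF followed by a byte matching 111xxxxx.
    if len(data) < 2 or data[0] != 0xFF or (data[1] & 0xE0) != 0xE0:
        return None
    version = VERSIONS.get((data[1] >> 3) & 0x03)
    layer = LAYERS.get((data[1] >> 1) & 0x03)
    if version is None or layer is None:
        return None
    return ("MPEG", version, layer)
```

For example, `describe_start(b"\xff\xfd")` gives `("MPEG", "1", "II")` and `describe_start(b"\xff\xfe")` gives `("MPEG", "1", "I")`, in line with the byte matches shown in the example outputs further down.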

Tags Tested

ID3v1.x

TAG 128 bytes from EOF. The original MP3 tags: limited, but everything knows what to do with them. Validation relies on the 'TAG' signature AND either a 4-digit year (1700-3000 seems a sensible range) OR four null bytes in the Year field OR four spaces (hex 20, used by non-compliant encoders/taggers).
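
A minimal sketch of that validation rule, assuming the standard ID3v1 layout where the 4-byte Year field sits at offset 93 of the 128-byte tag. `has_id3v1` is a hypothetical name, not the scanner's own function.

```python
def has_id3v1(data: bytes) -> bool:
    """Check the last 128 bytes for a plausible ID3v1 tag."""
    if len(data) < 128:
        return False
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return False
    # Layout: TAG(3) + title(30) + artist(30) + album(30) + year(4) + ...
    year = tag[93:97]
    if year in (b"\x00" * 4, b" " * 4):
        # Unpopulated year: nulls, or spaces from non-compliant taggers.
        return True
    return year.isdigit() and 1700 <= int(year) <= 3000
```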

ID3v2.2, ID3v2.3 and ID3v2.4

ID3 at start of file. These are the current standard: big lumps of data for all sorts of info. v2.2, v2.3 and v2.4 all differ slightly but are all handled. We test for the correct size and validity of the tag with a few other checks. Audio data follows after this, so we do quite a bit to make sure the tag is valid.
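
Finding where the audio starts comes down to decoding the 10-byte ID3v2 header, whose size field is a 28-bit "synchsafe" integer (7 bits per byte). The sketch below shows only that step; `id3v2_audio_offset` is a hypothetical name and the scanner's own checks go further than this.

```python
from typing import Optional


def id3v2_audio_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte after the ID3v2 tag, or None."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    size_bytes = data[6:10]
    if major not in (2, 3, 4) or revision == 0xFF:
        return None
    # Synchsafe integer: the top bit of every size byte must be clear.
    if any(b & 0x80 for b in size_bytes):
        return None
    size = 0
    for b in size_bytes:
        size = (size << 7) | b
    offset = 10 + size
    # ID3v2.4 may append a 10-byte footer, signalled by flag bit 0x10.
    if major == 4 and flags & 0x10:
        offset += 10
    return offset if offset <= len(data) else None
```

A scanner can then look for the MPEG frame sync at the returned offset, which is why files with an ID3 header cost a little extra work before they can be rejected.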

The following tags can be moved out of position by each other rendering them invalid in our tests (see test files below). The data may be there but would be 'invisible' to any player software.

APE Tag

APETAGEX at the absolute end of file or just before ID3v1 TAG. The APE tag competes with/complements the standard ID3 tags. Both APEv1 and APEv2 tags are detected. These are complicated tags, so we currently test for the most common variants.

  • v1 with APETAGEX footer, at end of file or before ID3v1
  • v2 with APETAGEX header and footer, at end of file or before ID3v1

We currently do not test for weird variants such as:

  • v1 lacking the APETAGEX footer
  • v2 lacking the APETAGEX header, footer or both
  • v2 placed at the start of the file

I doubt there are any encoders/taggers that use these; if sample files with these ever appear we can look to test for them.

Validation relies on:

  • v1: finding the APETAGEX footer AND decoding the tag for size and fixed marker checks.
  • v2: finding the APETAGEX header and footer AND decoding the tag for size and fixed marker checks.
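
The footer check could look something like the sketch below. It assumes the commonly documented 32-byte APE footer layout (8-byte preamble, then little-endian version, tag size, item count and flags, then 8 reserved zero bytes, with the size covering items plus footer but not the header). `ape_version` is a hypothetical name, and the scanner's real checks may differ.

```python
import struct
from typing import Optional


def ape_version(data: bytes, id3v1_present: bool = False) -> Optional[int]:
    """Return 1 or 2 if a valid APE tag ends where expected, else None."""
    # The tag must end at EOF, or right before a 128-byte ID3v1 tag.
    end = len(data) - (128 if id3v1_present else 0)
    if end < 32:
        return None
    footer = data[end - 32:end]
    if footer[:8] != b"APETAGEX":
        return None
    version, size, _items, _flags = struct.unpack("<4I", footer[8:24])
    # Fixed markers: known version number and all-zero reserved bytes.
    if version not in (1000, 2000) or footer[24:] != bytes(8):
        return None
    if size < 32 or size > end:
        return None
    if version == 2000:
        # v2 as tested here also carries a 32-byte header before the items.
        header_at = end - size - 32
        if header_at < 0 or data[header_at:header_at + 8] != b"APETAGEX":
            return None
    return version // 1000
```

Because the footer is only looked for at its expected position, a tag pushed out of place by another end-of-file tag simply is not reported.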

ID3v1.2 Enhanced Tag (see note 2)

EXT: 256 bytes from EOF. A niche standard used in the late 1990s, invented by BirdCageSoft. Their software supports it but I can't find anything else that does. It was designed to overcome the limits of ID3v1 by offering extra tacked-on space for tags. Validation relies on the 'EXT' signature and correct tag size; we are unable to validate further as the tag has no fixed content.

Links:

ID3v1 Enhanced Tag (see note 2)

TAG+ at 227 bytes from EOF. Another niche standard aimed at addressing similar shortfalls in ID3v1 tags as EXT.
One tool created by the spec creators, called SpeedTag, exists on the WaybackMachine (linked below) for those wanting to play.
There is also a later tool, MP3Manager (see LYRICS), created by one or more of the SpeedTag/TAG+ authors. This may have supported TAG+ in an earlier form, but its latest version seems to ignore them.

Validation relies on the 'TAG+' signature, correct tag size, AND either the approved speed bytes (01=slow, 02=medium, 03=fast, 04=hardcore) OR a null byte (00) if unpopulated.
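
A minimal sketch of those three checks, given the candidate 227-byte block. The speed byte offset of 184 is an assumption based on the commonly documented TAG+ layout (4-byte signature followed by three 60-byte text fields), and `is_valid_tag_plus` is a hypothetical name.

```python
TAG_PLUS_SIZE = 227
# Assumption: signature (4) + title (60) + artist (60) + album (60) = 184.
SPEED_OFFSET = 184
# 00 = unpopulated, 01 = slow, 02 = medium, 03 = fast, 04 = hardcore.
VALID_SPEEDS = {0, 1, 2, 3, 4}


def is_valid_tag_plus(block: bytes) -> bool:
    """Validate a candidate TAG+ block by size, signature and speed byte."""
    return (
        len(block) == TAG_PLUS_SIZE
        and block[:4] == b"TAG+"
        and block[SPEED_OFFSET] in VALID_SPEEDS
    )
```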

Links:

2: Additional notes regarding TAG+ and EXT
Due to the nature of these tags it is entirely possible for entries to be corrupted easily by other tag editors. In addition to getting pushed out of the byte window by other EOF tags, there is the possibility of a regular ID3v1 tag editor altering the base TAG without affecting these two in any way.

    Both TAG+ and EXT work in the same way, say you have a Title longer than 30 characters (the limit of v1) like:
    Neon Reflections of a Thousand Forgotten Summer Dreams
                                  ^
    The ^ represents where this title would be carried over into the `TAG+` or `EXT` data, but if an ID3v1 editor were to change this:
    Neon Reflections of Summer     Forgotten Summer Dreams
                                  ^
    Now we have a corrupted title for `TAG+` or `EXT`, as the editor only handles the data before ^.

This is what really put the nail in the coffin for these extended formats. They were a great idea at the time, but splitting the tags between two data fields caused weird or short names on devices that could not read them, and they could easily be corrupted by other tag editors.
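
The split and the resulting corruption can be reproduced in a few lines. This is purely illustrative: `split_title` and `joined_title` are hypothetical helpers that model the 30-character ID3v1 field and the overflow carried in TAG+ or EXT.

```python
from typing import Tuple

ID3V1_TITLE_LEN = 30  # fixed width of the ID3v1 title field


def split_title(title: str) -> Tuple[str, str]:
    """Split a title into the ID3v1 part and the overflow for TAG+/EXT."""
    return title[:ID3V1_TITLE_LEN], title[ID3V1_TITLE_LEN:]


def joined_title(id3v1_part: str, overflow: str) -> str:
    """What an extended-tag aware player would display."""
    return id3v1_part.ljust(ID3V1_TITLE_LEN) + overflow


base, overflow = split_title("Neon Reflections of a Thousand Forgotten Summer Dreams")
# base     == "Neon Reflections of a Thousand"
# overflow == " Forgotten Summer Dreams"

# An ID3v1-only editor rewrites the first field and never touches the overflow.
edited = joined_title("Neon Reflections of Summer", overflow)
```

Here `edited` ends up as the corrupted title from the example above: the new, shorter ID3v1 text padded to 30 characters, followed by the stale overflow.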

LYRICS

Large block before ID3v1 TAG prefixed by LYRICSBEGIN. Created both to address the shortfalls of ID3v1 tags and to add lyrics to your song. It seems to have been created in part or whole by some of the TAG+ developers. Lyrics3 (v1 and v2) became one of the first widely used standards to successfully add lyric information to MP3s. Lyrics3v2 upgrades allowed for timestamped lyrics for karaoke and other enhancements. These are large tags (up to 1MB) and should be located at either:

  • Up to 1024 bytes from end of file if no ID3v1
  • Up to 1152 bytes from end of file if ID3v1 present

Validation relies on:

  • v1: LYRICSBEGIN and LYRICSEND markers AND a scan for a metatag to see if any are present.
    Unable to validate further as the tag has no fixed content.
  • v2: LYRICSBEGIN and LYRICS200 markers
    AND a scan for a metatag to see if any are present
    AND a check that the size of the found tag matches the size metatag data.
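
For Lyrics3v2, the marker and size checks can be sketched as below. This relies on the Lyrics3v2 convention that the tag ends with a 6-digit decimal size followed by LYRICS200, where the size counts LYRICSBEGIN plus the fields but not those last 15 bytes. `lyrics3v2_span` is a hypothetical name, and the metatag scan mentioned above is left out to keep the sketch short.

```python
from typing import Optional, Tuple


def lyrics3v2_span(data: bytes, id3v1_present: bool = True) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of a Lyrics3v2 tag, or None."""
    # The tag sits at EOF, or directly before the 128-byte ID3v1 tag.
    end = len(data) - (128 if id3v1_present else 0)
    if end < 11 + 15 or data[end - 9:end] != b"LYRICS200":
        return None
    size_field = data[end - 15:end - 9]
    if not size_field.isdigit():
        return None
    # The declared size must lead back exactly to the LYRICSBEGIN marker.
    start = end - 15 - int(size_field)
    if start < 0 or data[start:start + 11] != b"LYRICSBEGIN":
        return None
    return start, end
```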

Links:

3DI Tag

3DI 10 bytes before the ID3v1 TAG. This is a super niche tag. According to the Library of Congress link, it was meant to be placed 10 bytes before the ID3v1 TAG marker, or 10 bytes before the end of the file if there is no ID3v1 tag.
Its purpose, as summarised by Google Gemini (about the only source of information I could find on what it was for):

While the structure varied slightly across different early applications, the 10-byte extension most commonly broke down like this, focused entirely on track information:
   Bytes 0-2: Identifier "3DI" (3 bytes).
   Byte 3: Track Number (1 byte, typically 1 to 255). This was the most important piece of data.
   Byte 4: Disc Number (1 byte, typically 1 to 255).
   Bytes 5-9: Reserved/Padding (5 bytes). These were often left empty or used inconsistently for things like a simple file checksum by specific tagging programs.

Once ID3v1.1 came along this became less relevant, and obviously ID3v2 killed any need for it stone dead. I have no real test files (but made one by hex editing a file), so this is a theoretical implementation; if a real file should appear we can test/adapt if needed. Validation relies on the '3DI' signature and correct tag size; we are unable to validate further as the tag has no fixed content.
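
Since only the signature and position can be checked, the test reduces to something like this sketch. `has_3di` is a hypothetical name and, like the implementation described above, it is untested against real-world files.

```python
def has_3di(data: bytes) -> bool:
    """Look for a 10-byte 3DI block before the ID3v1 tag, or at EOF."""
    end = len(data)
    # If an ID3v1 tag is present, the 3DI block should end where it starts.
    if end >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return end >= 10 and data[end - 10:end - 7] == b"3DI"
```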

Links:

Sample files

For testing, all files with mp3_vbr and mpeg2_mp3 in the filename in test\resources\audio are based on '3-second synth melody' from https://samplelib.com/sample-mp3.html; the files there are free of any use restrictions. Filenames are based on the order in which the tags appear, e.g. 3di_id3v1 means a 3DI tag followed by an ID3v1.

| File | Notes |
| --- | --- |
| test_mp3_vbr_info_128k_notags.mp3 | VBR file with hex edited Info header and no tags. MediaInfo will incorrectly call this a CBR; our tests show it's really a VBR |
| test_mp3_vbr_xing_128k_3di_id3v1.mp3 | VBR Xing, 3DI and ID3v1 tags |
| test_mp3_vbr_xing_128k_apev1_id3v1.mp3 | VBR Xing, APEv1 and ID3v1 tags |
| test_mp3_vbr_xing_128k_apev1.mp3 | VBR Xing, APEv1 tags |
| test_mp3_vbr_xing_128k_apev2_tagplus_id3v1.mp3 | VBR Xing, APEv2, TAG+ and ID3v1 tags. This file is an abomination: the APE tag has been pushed out of its location by TAG+, so our test will not show the APE tag in the results |
| test_mp3_vbr_xing_128k_ext_id3v1.mp3 | VBR Xing, ID3v1.2 EXT and ID3v1 tags |
| test_mp3_vbr_xing_128k_lyrics3v2_id3v1.mp3 | VBR Xing, LYRICS3v2 and ID3v1 tags |
| test_mp3_vbr_xing_128k_notags.mp3 | VBR Xing, no tags |
| test_mp3_vbr_xing_128k_tagplus_apev2_id3v1.mp3 | VBR Xing, TAG+, APEv2 and ID3v1 tags. Another abomination: the TAG+ tag has been pushed out of its location by APE, so our test will not show the TAG+ tag in the results |
| test_mp3_vbr_xing_128k_tagplus_id3v1.mp3 | VBR Xing, TAG+ and ID3v1 tags |
| test_mpeg2_mp3_VBR_128k_id3v2_24.mp3 | MPEG-2 VBR, ID3v2.4 tags |

For MP3 VBRI a test file can be found here:

Some good files for testing various MPEG versions and Layers (especially 2.5) can be found here:

For testing MPEG 1 and MPEG 2 files, some samples can be found at the links below. I've not added them to the repo due to lack of notices/questionable wording regarding usage:

Example outputs:

These all come from real files (don't judge me for my music tastes 🤣).

'D:\Data\03-TORIENA_-_Cockscomb_Jingle_Bells.mp3' : .mp3
Total Possible Matches: 1

        Deepscan Match
        Name: MPEG-1 Audio Layer 3 (MP3) audio file [320k 44.1Khz Joint-Stereo CBR ID3v2.4 ID3v1]
        Confidence: 100%
        Extension: .mp3
        Mime Type: audio/mpeg
        Byte Match: b'ID3'
        Offset: 0
'D:\Data\01-sieg_heilman-cruel_angels_thesis.mp3' : .mp3
Total Possible Matches: 1

        Deepscan Match
        Name: MPEG-1 Audio Layer 3 (MP3) audio file [320k 44.1Khz Joint-Stereo CBR ID3v2.4 APEv2 ID3v1]
        Confidence: 100%
        Extension: .mp3
        Mime Type: audio/mpeg
        Byte Match: b'ID3'
        Offset: 0
'D:\Data\Symphony No.6 (1st movement).mp2' : .mp2
Total Possible Matches: 1

        Deepscan Match
        Name: MPEG-1 Audio Layer 2 (MP2) audio file [384k 44.1Khz Stereo CBR]
        Confidence: 100%
        Extension: .mp2
        Mime Type: audio/mpeg
        Byte Match: b'\xff\xfd'
        Offset: 0
'D:\Data\mp1-sample.mp1' : .mp1
Total Possible Matches: 1

        Deepscan Match
        Name: MPEG-1 Audio Layer 1 (MP1) audio file [384k 32.0Khz Stereo CBR]
        Confidence: 100%
        Extension: .mp1
        Mime Type: audio/mpeg
        Byte Match: b'\xff\xfe'
        Offset: 0
'D:\Data\30-ff-16b-2c-44100hz.mp3' : .mp3
Total Possible Matches: 1

        Deepscan Match
        Name: MPEG-1 Audio Layer III (MP3) file [64k 44.1Khz Stereo VBR LAME(Info) ID3v2.4]
        Confidence: 100%
        Extension: .mp3
        Mime Type: audio/mpeg
        Byte Match: b'ID3'
        Offset: 0

Deep scans MPEG Audio files
Supports MP1, MP2, MP3
1) Major refactor of the decoder.
2) Major update of magic_data.json
3) Add plenty of test files.
@NebularNerd NebularNerd changed the base branch from master to develop December 10, 2025 09:34
@@ -0,0 +1,1107 @@
# cSpell:disable
Owner

I think you mis-committed a duplicate ("- Copy") file.

Contributor Author

🤦 I've deleted that, we should be all good now.

Adds another CFB entry for Outlook .msg files
@NebularNerd
Contributor Author

Added a magic_json match for .msg to partially resolve #119

@cdgriffith
Owner

This is truly an amazing addition, thank you so much @NebularNerd !

@cdgriffith cdgriffith merged commit f886e81 into cdgriffith:develop Dec 18, 2025
5 checks passed
@NebularNerd
Contributor Author

This is truly an amazing addition, thank you so much @NebularNerd !

Glad you like it, it was an interesting challenge, learnt loads making it. 😁
