Skip to content

Conversation

@johannedin
Copy link
Contributor

@johannedin johannedin commented Nov 4, 2025

Summary

This PR fixes an issue where EmbededImagesRepairToolBase.GetImageFormat could not correctly identify certain image formats due to incomplete header checks.

Changes

  • Added support for additional JPEG APP markers (ICC profile, Photoshop IRB, Adobe).
  • Improved detection for PNG and TIFF by checking full signatures.

Why

Previously, some valid images were misclassified as Authentication errors or not detected at all. This update ensures broader and more accurate format detection. (Might still have gaps).

Testing

  • Added unit tests for JPEG variants (APP2, APP13, APP14) and other formats.
  • Verified detection works for PNG, TIFF, BMP, GIF, both for "fake headers" and real test assets.

Closes #3048

Summary by CodeRabbit

  • New Features

    • Added SVG image format support.
  • Bug Fixes

    • Improved detection and validation for BMP, GIF, PNG, TIFF and JPEG, with more robust handling of short or null inputs.
  • Tests

    • Added comprehensive tests for image format detection and included image assets for test runs.
  • Chores

    • Updated gitignore to exclude development container files.

@johannedin johannedin requested a review from MrHinsh as a code owner November 4, 2025 15:02
@CLAassistant
Copy link

CLAassistant commented Nov 4, 2025

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Copy link

coderabbitai bot commented Nov 4, 2025

Walkthrough

Refactors image-format detection to use length-validated, sequential byte-signature checks (BMP, GIF, PNG, TIFF LE/BE, JPEG via SOI, SVG via "<svg"), adds SVG to ImageFormat, introduces unit tests and test assets, updates csproj to copy assets, and adds .devcontainer to .gitignore.

Changes

Cohort / File(s) Summary
Repository configuration
\.gitignore
Added rule to ignore the .devcontainer directory.
Test project assets
src/MigrationTools.Tests/MigrationTools.Tests.csproj
Added six Content entries for image assets (bmp, gif, jpg, png, svg, tiff) with CopyToOutputDirectory=PreserveNewest.
New tests
src/MigrationTools.Tests/Tools/Infrastructure/EmbededImagesRepairToolBaseTests.cs
New test class exercising image-format detection from files and byte headers; introduces TestableEmbededImagesRepairTool wrapper, TestToolOptions, and LoadAsset helper; asserts formats jpeg/png/tiff/gif/svg/bmp.
Core detection logic
src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs
Reimplemented GetImageFormat(byte[]) with null/length guards and sequential signature checks for BMP, GIF, PNG, TIFF (LE/BE), JPEG (SOI 0xFF 0xD8), and SVG; added svg to the ImageFormat enum; defaults to unknown.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Caller as CallGetImageFormat
    participant Detector as GetImageFormat
    Note over Detector: Validate null / minimum length
    Caller->>Detector: byte[] bytes
    alt null or too short
        Detector-->>Caller: ImageFormat.unknown / early return
    else
        Note over Detector: Sequential signature checks
        Detector->>Detector: check BMP (0x42 0x4D)
        alt matches
            Detector-->>Caller: ImageFormat.bmp
        else
            Detector->>Detector: check GIF ("GIF87a"/"GIF89a")
            alt matches
                Detector-->>Caller: ImageFormat.gif
            else
                Detector->>Detector: check PNG (8-byte sig)
                alt matches
                    Detector-->>Caller: ImageFormat.png
                else
                    Detector->>Detector: check TIFF (II*/MM*)
                    alt matches
                        Detector-->>Caller: ImageFormat.tiff
                    else
                        Detector->>Detector: check JPEG (SOI 0xFF 0xD8)
                        alt matches
                            Detector-->>Caller: ImageFormat.jpeg
                        else
                            Detector->>Detector: check SVG ("<svg" or XML with svg)
                            alt matches
                                Detector-->>Caller: ImageFormat.svg
                            else
                                Detector-->>Caller: ImageFormat.unknown
                            end
                        end
                    end
                end
            end
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus review on:
    • Boundary and null checks in GetImageFormat (off-by-one / slice safety).
    • JPEG detection: confirm SOI-only check is intentional and sufficient.
    • SVG detection: text encoding handling and false-positive risk.
    • Test asset paths and csproj Content entries to ensure assets are available in CI.

Poem

Bytes once baffled, now neatly read,
SOI waves flags where others fled,
PNG and GIF stand in a row,
SVG whispers from text below,
Tests applaud — the formats go. 🎉

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: improving GetImageFormat to support additional image header variants, which directly aligns with the changeset modifications.
Linked Issues check ✅ Passed All coding objectives from #3048 are met: JPEG detection now relies on SOI marker (FF D8), support added for PNG, TIFF, BMP, GIF, and SVG, with comprehensive unit tests validating all formats.
Out of Scope Changes check ✅ Passed All changes are directly scoped to #3048: .gitignore and project file changes support test infrastructure, test suite validates GetImageFormat logic, and implementation provides required detection improvements.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 941cfde and bfd9ec5.

⛔ Files ignored due to path filters (6)
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/bmpsample.bmp is excluded by !**/*.bmp
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/gifsample.gif is excluded by !**/*.gif
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/jpgsample.jpg is excluded by !**/*.jpg
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/pngsample.png is excluded by !**/*.png
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/svgsample.svg is excluded by !**/*.svg
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/tiffsample.tiff is excluded by !**/*.tiff
📒 Files selected for processing (1)
  • src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs (2 hunks)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
.gitignore (1)

120-121: Simplify the .devcontainer patterns.

The two patterns are redundant. Using .devcontainer is sufficient to ignore the directory and its contents; .devcontainer/* adds nothing. Better still, use the explicit directory pattern for clarity and consistency with git conventions.

-.devcontainer
-.devcontainer/*
+/.devcontainer/

The trailing slash signals "directory only" and anchors to root, which is clearer and idiomatic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29aaddf and a7b5b22.

⛔ Files ignored due to path filters (6)
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/bmpsample.bmp is excluded by !**/*.bmp
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/gifsample.gif is excluded by !**/*.gif
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/jpgsample.jpg is excluded by !**/*.jpg
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/pngsample.png is excluded by !**/*.png
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/svgsample.svg is excluded by !**/*.svg
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/tiffsample.tiff is excluded by !**/*.tiff
📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • src/MigrationTools.Tests/MigrationTools.Tests.csproj (1 hunks)
  • src/MigrationTools.Tests/Tools/Infrastructure/EmbededImagesRepairToolBaseTests.cs (1 hunks)
  • src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/MigrationTools.Tests/Tools/Infrastructure/EmbededImagesRepairToolBaseTests.cs (2)
src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs (3)
  • EmbededImagesRepairToolBase (14-143)
  • EmbededImagesRepairToolBase (19-22)
  • FixEmbededImages (28-28)
src/MigrationTools/Tools/Infrastructure/ITool.cs (1)
  • ITool (7-9)

Comment on lines 88 to 110
var svgTag = System.Text.Encoding.ASCII.GetBytes("<svg");

if (jpeg3.SequenceEqual(bytes.Take(jpeg3.Length)))
return ImageFormat.jpeg;
// Check GIF
if (gif87a.SequenceEqual(bytes.Take(gif87a.Length)) ||
gif89a.SequenceEqual(bytes.Take(gif89a.Length)))
return ImageFormat.gif;

if (jpeg4.SequenceEqual(bytes.Take(jpeg4.Length)))
return ImageFormat.jpeg;
// Check PNG
if (png.SequenceEqual(bytes.Take(png.Length)))
return ImageFormat.png;

// Check TIFF
if (tiffLE.SequenceEqual(bytes.Take(tiffLE.Length)) ||
tiffBE.SequenceEqual(bytes.Take(tiffBE.Length)))
return ImageFormat.tiff;

if (jpeg5.SequenceEqual(bytes.Take(jpeg5.Length)))
// Check JPEG
if (jpegSOI.SequenceEqual(bytes.Take(jpegSOI.Length)))
return ImageFormat.jpeg;

// Check SVG (text-based)
if (bytes.Length >= 4 && svgTag.SequenceEqual(bytes.Take(svgTag.Length)))
return ImageFormat.svg;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

SVG detection must tolerate prologs and BOMs

Most real SVG assets start with a UTF‑8 BOM and/or an XML declaration (<?xml …?>) before the <svg> element. Requiring the very first four bytes to be <svg means we will continue to misclassify those files—the exact scenario this fix set out to solve. Strip the BOM, peel off the XML declaration, and perform a case-insensitive <svg check within the first chunk so that common SVG variants are recognised. Please add a regression test covering an SVG with an XML declaration once you wire this up.

-            // SVG: starts with "<svg" (text-based)
-            var svgTag = System.Text.Encoding.ASCII.GetBytes("<svg");
+            // SVG: starts with "<svg" (text-based), but real files often include a BOM or XML declaration first
@@
-            // Check SVG (text-based)
-            if (bytes.Length >= 4 && svgTag.SequenceEqual(bytes.Take(svgTag.Length)))
-                return ImageFormat.svg;
+            // Check SVG (text-based)
+            if (bytes.Length > 0)
+            {
+                var prefix = Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, 256))
+                    .TrimStart('\uFEFF', ' ', '\t', '\r', '\n');
+
+                if (prefix.StartsWith("<?xml", StringComparison.OrdinalIgnoreCase))
+                {
+                    var terminator = prefix.IndexOf("?>", StringComparison.Ordinal);
+                    if (terminator >= 0 && terminator + 2 < prefix.Length)
+                    {
+                        prefix = prefix.Substring(terminator + 2).TrimStart(' ', '\t', '\r', '\n');
+                    }
+                }
+
+                if (prefix.StartsWith("<svg", StringComparison.OrdinalIgnoreCase))
+                    return ImageFormat.svg;
+            }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs
around lines 88 to 110, the SVG detection only checks the first four bytes for
"<svg" and therefore misses files that start with a UTF‑8 BOM and/or an XML
declaration; change the logic to read a small initial slice (e.g., first 512
bytes), decode it tolerant of invalid sequences while stripping a UTF‑8 BOM if
present, remove any leading XML declaration (skip up through the first "?>" if
it appears), then perform a case‑insensitive search for "<svg" within that
cleaned head to determine SVG; make sure the code is null/exception safe for
non‑text bytes and retains existing binary checks order; add a regression unit
test that supplies an SVG file starting with a BOM and an XML prolog and asserts
the method returns ImageFormat.svg.

@MrHinsh
Copy link
Member

MrHinsh commented Nov 4, 2025

@johannedin can you sort out the user issue that CLAassistant is complaining of? Other than that Im inclined to approve.

@johannedin
Copy link
Contributor Author

@johannedin can you sort out the user issue that CLAassistant is complaining of? Other than that Im inclined to approve.

Will do! I try to sort out some of the other review comments as well.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs (2)

70-71: Stop throwing—this breaks the graceful-degradation contract.

This guard detonates on short buffers instead of returning unknown, which aborts the entire repair pass for truncated downloads or legitimately small payloads. Your own BMP check at line 62 tolerates 2-byte buffers, and JPEG SOI is also 2 bytes, so this 4-byte floor is arbitrary and wrong. The established contract is to fall back to unknown when detection fails, not to throw.

Apply this diff:

-            if (bytes == null || bytes.Length < 4)
-                throw new ArgumentException("Byte array too short to determine image format.");
+            if (bytes == null)
+                return ImageFormat.unknown;

108-110: SVG detection is still broken—real SVGs have BOMs and XML prologs.

You're checking the first four bytes for <svg, which fails on any SVG that starts with a UTF-8 BOM (\uFEFF) or an XML declaration (<?xml ...?>). That's the majority of production SVG files. Decode the first ~256 bytes as UTF-8, strip the BOM, skip past the XML prolog if present, trim whitespace, then look for <svg case-insensitively.

Refer to the previous review comment for a concrete implementation that handles these cases. Add a test asset with an XML declaration to verify the fix.

🧹 Nitpick comments (3)
.gitignore (1)

120-121: Drop the redundant wildcard pattern on line 121.

In .gitignore, ignoring a directory automatically ignores all its contents. The /.devcontainer/* entry on line 121 is redundant.

  .devcontainer
- .devcontainer/*
src/MigrationTools.Tests/MigrationTools.Tests.csproj (1)

27-46: Simplify these paths—lose the redundant navigation.

The .. navigation is pointless here: you're going up from MigrationTools.Tests to src, then diving straight back into MigrationTools.Tests. Just reference the files directly relative to the project root: Tools\Infrastructure\Assets\bmpsample.bmp (or use forward slashes for better cross-platform hygiene).

Apply this diff to clean up the paths:

   <ItemGroup>     
-    <Content Include="..\MigrationTools.Tests\Tools\Infrastructure\Assets\bmpsample.bmp">
+    <Content Include="Tools/Infrastructure/Assets/bmpsample.bmp">
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </Content>
-    <Content Include="..\MigrationTools.Tests\Tools\Infrastructure\Assets\gifsample.gif">
+    <Content Include="Tools/Infrastructure/Assets/gifsample.gif">
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </Content>
-    <Content Include="..\MigrationTools.Tests\Tools\Infrastructure\Assets\jpgsample.jpg">
+    <Content Include="Tools/Infrastructure/Assets/jpgsample.jpg">
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </Content>
-    <Content Include="..\MigrationTools.Tests\Tools\Infrastructure\Assets\tiffsample.tiff">
+    <Content Include="Tools/Infrastructure/Assets/tiffsample.tiff">
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </Content>
-    <Content Include="..\MigrationTools.Tests\Tools\Infrastructure\Assets\pngsample.png">
+    <Content Include="Tools/Infrastructure/Assets/pngsample.png">
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </Content>
-    <Content Include="..\MigrationTools.Tests\Tools\Infrastructure\Assets\svgsample.svg">
+    <Content Include="Tools/Infrastructure/Assets/svgsample.svg">
       <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
     </Content>
   </ItemGroup>
src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs (1)

73-93: Consider direct byte-index checks instead of Take + SequenceEqual.

The Take().SequenceEqual() pattern works but allocates enumerables unnecessarily. For fixed-length signatures, a manual loop or Span<byte> comparison is cleaner and faster, especially since you're already doing length validation (or should be, once line 70 is fixed).

Example alternative approach:

// Check GIF
if (bytes.Length >= 6)
{
    if ((bytes[0] == 0x47 && bytes[1] == 0x49 && bytes[2] == 0x46 && 
         bytes[3] == 0x38 && bytes[5] == 0x61 && (bytes[4] == 0x37 || bytes[4] == 0x39)))
        return ImageFormat.gif;
}

Or use Span<byte> slicing if targeting a modern runtime.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a7b5b22 and 941cfde.

⛔ Files ignored due to path filters (6)
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/bmpsample.bmp is excluded by !**/*.bmp
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/gifsample.gif is excluded by !**/*.gif
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/jpgsample.jpg is excluded by !**/*.jpg
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/pngsample.png is excluded by !**/*.png
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/svgsample.svg is excluded by !**/*.svg
  • src/MigrationTools.Tests/Tools/Infrastructure/Assets/tiffsample.tiff is excluded by !**/*.tiff
📒 Files selected for processing (4)
  • .gitignore (1 hunks)
  • src/MigrationTools.Tests/MigrationTools.Tests.csproj (1 hunks)
  • src/MigrationTools.Tests/Tools/Infrastructure/EmbededImagesRepairToolBaseTests.cs (1 hunks)
  • src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/MigrationTools.Tests/Tools/Infrastructure/EmbededImagesRepairToolBaseTests.cs
🔇 Additional comments (2)
src/MigrationTools/Tools/Infrastructure/EmbededImagesRepairEnricherBase.cs (2)

104-106: JPEG detection fixed—SOI is the correct marker.

Checking the mandatory SOI marker (FF D8) instead of specific 4-byte JFIF/Exif headers solves the original bug. This will catch all JPEG variants regardless of APP segment order.


140-141: Enum extension looks good.

Adding svg to the ImageFormat enum is the right move to support SVG detection, once the detection logic itself is fixed.

@MrHinsh MrHinsh merged commit d3570a6 into nkdAgility:main Nov 5, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: certain jpeg attachments not recognized by EmbededImagesRepairToolBase.GetImageFormat method

3 participants