Contributor

nikneym commented Sep 23, 2025

This PR refactors Mime and introduces new parsing logic that prefers integer comparisons where possible, at some cost to readability.

Collaborator

karlseguin left a comment

I approve of changes for the sake of changes, but can you benchmark it to compare? At least then we'll have a better idea of whether further changes like this are worth it.


/// Matches the first 3 characters of data with given characters.
inline fn match3(data: *const [3]u8, c0: u8, c1: u8, c2: u8) bool {
    return data[0] == c0 and data[1] == c1 and data[2] == c2;
Collaborator

Why not use a u24 here? (genuine question)

Contributor Author

I wasn't sure whether to prefer arbitrary bit widths; a u24 would definitely suit here.
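
For reference, a minimal sketch of what a u24-based match3 might look like, assuming the same signature as in the diff. It is not the PR's code; it only illustrates comparing three bytes as a single integer. Both sides are bit-cast the same way, so the result does not depend on endianness.

```zig
const std = @import("std");

/// Hypothetical u24 variant of match3: one integer comparison
/// instead of three separate byte comparisons.
inline fn match3(data: *const [3]u8, c0: u8, c1: u8, c2: u8) bool {
    // Both arrays are reinterpreted identically, so byte order cancels out.
    const expected: u24 = @bitCast([3]u8{ c0, c1, c2 });
    const actual: u24 = @bitCast(data.*);
    return actual == expected;
}

test "match3 compares three bytes as one u24" {
    const input = "xml;";
    try std.testing.expect(match3(input[0..3], 'x', 'm', 'l'));
    try std.testing.expect(!match3(input[1..4], 'x', 'm', 'l'));
}
```

Whether this is actually faster than the three byte comparisons depends on what the compiler already emits for the original, so it would need measuring.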

continue;
}

const charset_: u64 = @bitCast([_]u8{ 'c', 'h', 'a', 'r', 's', 'e', 't', '=' });
Collaborator

Dunno if it's valid, but Go parses "charset = hello" (space after charset)

Contributor Author

I thought the same, but the format is specified as:

  parameter       = parameter-name "=" parameter-value
  parameter-name  = token
  parameter-value = ( token / quoted-string )
  
  token          = 1*tchar

  tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                 / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
                 / DIGIT / ALPHA
                 ; any VCHAR, except delimiters

I've also tested it in the wild; web servers seem to comply.
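
The grammar above can be read as a small predicate. Here is a sketch with hypothetical names, isTchar and isToken (neither appears in the PR), showing why a parameter name followed by a space, as in "charset = hello", does not fit the quoted rules:

```zig
const std = @import("std");

/// True when `c` is a tchar according to the grammar quoted above.
fn isTchar(c: u8) bool {
    return switch (c) {
        '!', '#', '$', '%', '&', '\'', '*', '+', '-', '.', '^', '_', '`', '|', '~' => true,
        '0'...'9', 'a'...'z', 'A'...'Z' => true,
        else => false,
    };
}

/// True when `s` is a non-empty run of tchars (token = 1*tchar).
fn isToken(s: []const u8) bool {
    if (s.len == 0) return false;
    for (s) |c| {
        if (!isTchar(c)) return false;
    }
    return true;
}

test "a space is not a tchar, so it cannot be part of a parameter name" {
    try std.testing.expect(isToken("charset"));
    try std.testing.expect(!isToken("charset "));
    try std.testing.expect(!isToken(""));
}
```

Since parameter is parameter-name "=" parameter-value with nothing in between, whitespace on either side of "=" falls outside the grammar. A lenient parser such as Go's may still choose to accept it.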

return parseOther(normalized);
},
// Perfect cases.
text_xml => break :blk .{ .text_xml = {} },
Collaborator

Will this match text/xmlover9000! ?

Contributor Author

It fails, since the bytes following text/xml must contain charset=, though text/xmlcharset=utf-8 passes, sadly...

if (value.len == 0) {
return error.Invalid;
}
if (match4(rem[0..4], 'a', 's', 'c', 'r') and match3(rem[4..7], 'i', 'p', 't')) {
Collaborator

Same here: would this match text/javascript_is_java_written_in_cursive?

Contributor Author

Same as before: text/javascriptcharset=utf-8 passes. I'll handle the semicolon case.
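
One way the semicolon case could be handled is a boundary check after the matched type. The sketch below uses a hypothetical helper, matchesEssence, which is not part of the PR. It assumes the input is already lowercased and trimmed, and it deliberately ignores optional whitespace before the semicolon:

```zig
const std = @import("std");

/// Hypothetical helper: true when `input` is exactly `essence`,
/// or `essence` immediately followed by ';' (start of parameters).
fn matchesEssence(input: []const u8, essence: []const u8) bool {
    if (!std.mem.startsWith(u8, input, essence)) return false;
    // Exact match: nothing follows the type.
    if (input.len == essence.len) return true;
    // Otherwise the very next byte must open the parameter list.
    return input[essence.len] == ';';
}

test "prefix alone is not enough to match" {
    try std.testing.expect(matchesEssence("text/xml", "text/xml"));
    try std.testing.expect(matchesEssence("text/xml;charset=utf-8", "text/xml"));
    try std.testing.expect(!matchesEssence("text/xmlover9000!", "text/xml"));
    try std.testing.expect(!matchesEssence("text/xmlcharset=utf-8", "text/xml"));
}
```

With a check like this, both text/xmlcharset=utf-8 and text/javascriptcharset=utf-8 would be rejected by the fast path rather than accepted.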

continue;
}

const charset_: u64 = @bitCast([_]u8{ 'c', 'h', 'a', 'r', 's', 'e', 't', '=' });
Collaborator

Enough repeated logic here that I'd be tempted to try to DRY it if possible.
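
As a rough idea of what DRYing this could look like, the repeated charset= comparison might move into one helper. The name startsWithCharsetEq is hypothetical and the surrounding parser loop is not shown; this only sketches the shared check, reusing the u64 trick from the diff:

```zig
const std = @import("std");

/// Hypothetical shared helper: true when `rem` begins with "charset=".
/// The eight bytes are compared as a single u64, as in the diff.
fn startsWithCharsetEq(rem: []const u8) bool {
    const charset_: u64 = @bitCast([_]u8{ 'c', 'h', 'a', 'r', 's', 'e', 't', '=' });
    // Guard the length first so the fixed-size slice below is in bounds.
    if (rem.len < 8) return false;
    const head: u64 = @bitCast(rem[0..8].*);
    return head == charset_;
}

test "charset= prefix check" {
    try std.testing.expect(startsWithCharsetEq("charset=utf-8"));
    try std.testing.expect(!startsWithCharsetEq("charset"));
    try std.testing.expect(!startsWithCharsetEq("charset =utf-8"));
}
```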

Contributor Author

nikneym commented Sep 24, 2025

> I approve of changes for the sake of changes, but can you benchmark it to compare? At least then we'll have a better idea of whether further changes like this are worth it.

I've benchmarked both implementations in a separate environment with large data. I don't think the performance gain is worth the added code complexity:

Benchmark 1: ./zig-out/bin/new
  Time (mean ± σ):       0.3 ms ±   0.0 ms    [User: 0.2 ms, System: 0.1 ms]
  Range (min … max):     0.2 ms …   0.4 ms    3929 runs

Benchmark 2: ./zig-out/bin/old
  Time (mean ± σ):       0.3 ms ±   0.0 ms    [User: 0.2 ms, System: 0.1 ms]
  Range (min … max):     0.2 ms …   0.5 ms    3605 runs

Summary
  './zig-out/bin/new' ran
    1.00 ± 0.13 times faster than './zig-out/bin/old'

Closing in favor of this.

nikneym closed this Sep 24, 2025
github-actions bot locked and limited conversation to collaborators Sep 24, 2025
nikneym deleted the nikneym/mime-changes-v2 branch September 24, 2025 17:28