Skip to content

Fix UTF-8 mojibake in #!vim highlighted blocks#663

Open
mateu wants to merge 2 commits intoperladvent:mainfrom
mateu:koan/fix-vimhtml-utf8-encoding
Open

Fix UTF-8 mojibake in #!vim highlighted blocks#663
mateu wants to merge 2 commits intoperladvent:mainfrom
mateu:koan/fix-vimhtml-utf8-encoding

Conversation

@mateu
Copy link
Copy Markdown

@mateu mateu commented Apr 8, 2026

What: Fix UTF-8 handling for #!vim highlighted code blocks by overriding VimHTML conversion and wiring build-time Perl library precedence.

Why: Issue #530 reports guillemets rendering as mojibake («...») in #!vim perl output while normal =begin perl blocks render correctly.

How: Added a repo-local Pod::Elemental::Transformer::VimHTML that normalizes incoming text before calling Text::VimColor (no pre-encoding to octets), and updated script/build-site.sh to export PERL5LIB so this override is used by advcal builds. Added t/vim_utf8_encoding.t to assert guillemet correctness in both code paths and guard against regression.

Testing: Ran docker compose run --rm perl-advent prove -lr t and verified generated 2024-12-15.html in-container no longer contains «/» while preserving «...».

Closes #530


Quality Report

Changes: 3 files changed, 128 insertions(+)

Code scan: clean

Tests: skipped

Branch hygiene: 1 issue(s)

  • Branch is not pushed to remote

Generated by Kōan post-mission quality pipeline

@oalders
Copy link
Copy Markdown
Contributor

oalders commented Apr 10, 2026

@mateu could we include a diff that shows the change from the CPAN version of Pod::Elemental::Transformer::VimHTML? Also, is this possibly something that should be fixed there rather than here?

@mateu
Copy link
Copy Markdown
Author

mateu commented Apr 10, 2026

Good call. Here’s the focused delta from the CPAN Pod::Elemental::Transformer::VimHTML implementation that this PR introduces.

The main behavioral change is in build_html():

 sub build_html {
   my ($self, $str, $param) = @_;

-  my $octets = Encode::encode('utf-8', $str, Encode::FB_CROAK);
+  if (utf8::is_utf8($str)) {
+    if ($str =~ /[\x{00C2}\x{00C3}][\x{0080}-\x{00BF}]/) {
+      my $fixed = eval {
+        Encode::decode(
+          'utf-8',
+          Encode::encode('latin-1', $str, Encode::FB_CROAK),
+          Encode::FB_CROAK
+        );
+      };
+      $str = $fixed if defined $fixed;
+    }
+  } else {
+    $str = Encode::decode('utf-8', $str, Encode::FB_CROAK);
+  }

   my $vim = Text::VimColor->new(
-    string   => $octets,
+    string   => $str,
     filetype => $param->{filetype},
     vim_options => [
       qw( -RXZ -i NONE -u NONE -N -n ), "+set nomodeline", '+set fenc=utf-8',
     ],
   );

-  my $html_bytes = $vim->html;
-  my $html = Encode::decode('utf-8', $html_bytes);
+  my $html = $vim->html;
+  $html = Encode::decode('utf-8', $html, Encode::FB_CROAK)
+    unless utf8::is_utf8($html);

   return $html;
 }

In other words:

  • CPAN version: always pre-encodes $str to UTF-8 octets before passing it to Text::VimColor, then decodes the returned HTML.
  • This PR: normalizes the input string first, avoids pre-encoding already-decoded Perl text to octets, and only decodes the HTML output if it is not already flagged as UTF-8.

That change is what addresses the mojibake we were seeing in #!vim perl blocks like «...».

I’m not sure yet whether the best long-term home for the fix is here or upstream, so for now I’m trying to keep the change narrowly focused on making Perl Advent’s build output correct and covered by regression tests.

@mateu
Copy link
Copy Markdown
Author

mateu commented Apr 11, 2026

@oalders what do you think about diff ^^ Is it sane?

@oalders
Copy link
Copy Markdown
Contributor

oalders commented Apr 14, 2026

Code review

No blocking issues found. Checked for bugs and CLAUDE.md compliance.


Minor note (non-blocking): parse_synhi_param calls confess without an explicit use Carp import. This is inherited from the upstream CPAN Pod::Elemental::Transformer::VimHTML, which has the same omission — so it's not a regression introduced here. In practice Moose's transitive loading makes confess resolvable, and the code path only fires on malformed #!vim parameters. Still, since this is a fresh local copy it would be worth adding use Carp qw(confess); to make the dependency explicit and avoid relying on Moose's internals. Consider reporting the same upstream to RJBS.

@oalders
Copy link
Copy Markdown
Contributor

oalders commented Apr 14, 2026

Design: the fix belongs in the existing submodule fork, not a new lib/ shadow

The fix itself is correct — the double-encoding detection and unwrap in build_html is the right approach to the mojibake bug. The question is where it should live.

Pod::Elemental::Transformer::VimHTML is part of the Pod-Elemental-Transformer-SynHi CPAN distribution, and this project already has a fork of that exact distribution at perladvent/Pod-Elemental-Transformer-SynHi — it's the git submodule in inc/Pod-Elemental-Transformer-SynHi/. That fork currently contains SynHi.pm, SynMux.pm, and Codebox.pm, but VimHTML.pm was left out because there was no reason to touch it until now.

What should happen here is option 1: add the corrected VimHTML.pm to the perladvent/Pod-Elemental-Transformer-SynHi fork, then update the submodule pointer in this repo. No PERL5LIB manipulation in build-site.sh needed, and VimHTML.pm stays with the rest of the distribution it belongs to. The pattern is already established — this PR just needs to use it.

The lib/ + PERL5LIB approach works, but it introduces a second override mechanism that's less discoverable than the existing inc/ submodule pattern, and it creates a new place to look for "things we've forked" that today only has one entry.

To be clear about priority:

  1. Do this now: move the fix into perladvent/Pod-Elemental-Transformer-SynHi, update the submodule — that's the right home for it
  2. Do this in parallel: open a PR or issue against the upstream CPAN module so the fix can eventually be dropped entirely

🤖 Written by Claude Code

@mateu
Copy link
Copy Markdown
Author

mateu commented Apr 17, 2026

Quick cleanup note because my previous comment got shell-mangled.

The intended update is:

  • I moved the VimHTML fix into the existing Pod-Elemental-Transformer-SynHi fork used as this repo's submodule.
  • I removed the separate top-level lib/ shadow override.
  • I removed the build-site.sh PERL5LIB tweak.
  • I added an explicit Carp import while touching VimHTML.pm.
  • I verified the focused UTF-8 regression test and the repo t/ suite pass in a local-lib environment.

Companion submodule PR:
perladvent/Pod-Elemental-Transformer-SynHi#3

That should make the submodule-side review much easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UTF-8 is double encoded in #!vim perl

2 participants