Skip to content

refactor: extract _rewrite_html() to avoid double mistune in rewrite_comment()#399

Open
its-me-maady wants to merge 5 commits intoopenzim:mainfrom
its-me-maady:fix/refactor-rewrite-html
Open

refactor: extract _rewrite_html() to avoid double mistune in rewrite_comment()#399
its-me-maady wants to merge 5 commits intoopenzim:mainfrom
its-me-maady:fix/refactor-rewrite-html

Conversation

@its-me-maady
Copy link
Copy Markdown
Contributor

rewrite_comment() previously called self.rewrite() which internally called self.markdown() again, causing comment text to go through mistune twice.

Extract the BeautifulSoup half of rewrite() into _rewrite_html(). rewrite() now calls _rewrite_html() after its markdown step. rewrite_comment() calls _rewrite_html() directly, skipping the second markdown pass.

Split the single try/except in rewrite() into two separate ones — one for the markdown step, one for BS4 init in _rewrite_html() — preserving the same error behavior (return content on failure).

Fixes #398

…comment()

rewrite_comment() previously called self.rewrite() which internally
called self.markdown() again, causing comment text to go through
mistune twice.

Extract the BeautifulSoup half of rewrite() into _rewrite_html().
rewrite() now calls _rewrite_html() after its markdown step.
rewrite_comment() calls _rewrite_html() directly, skipping the
second markdown pass.

Split the single try/except in rewrite() into two separate ones —
one for the markdown step, one for BS4 init in _rewrite_html() —
preserving the same error behavior (return content on failure).

Fixes openzim#398
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 42.85714% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 9.84%. Comparing base (49b1247) to head (23d8953).

Files with missing lines Patch % Lines
src/sotoki/utils/html.py 42.85% 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #399      +/-   ##
========================================
- Coverage   9.91%   9.84%   -0.08%     
========================================
  Files         26      26              
  Lines       2441    2439       -2     
  Branches     316     316              
========================================
- Hits         242     240       -2     
- Misses      2186    2188       +2     
+ Partials      13      11       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@its-me-maady its-me-maady marked this pull request as draft March 27, 2026 08:10
@its-me-maady its-me-maady marked this pull request as ready for review March 27, 2026 09:54
Copy link
Copy Markdown
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was here before this PR, but it is now quite obvious that we should strive to make all these soup / markdown errors fatal, raising an Exception which will stop the scraper. I see little to no reason for these errors to happen, and should they happen it would make more sense to identify and fix the bug rather than creating broken ZIMs without noticing it.

I've checked few production log and I've never seen such errors, so it probably never happens on current dumps. Or it is a rare edge case which is probably worth fixing or detecting better to make more informed decisions in this edge case.

Let markdown and soup errors raise instead of being caught and
returning content silently. A broken ZIM produced without any
visible error is worse than a scraper crash.
@its-me-maady
Copy link
Copy Markdown
Contributor Author

Good point — I've removed all the try/except blocks around the markdown and soup steps and let them raise instead. If these ever fail, it's better to crash loud than silently produce a broken ZIM.

@its-me-maady
Copy link
Copy Markdown
Contributor Author

PR is ready for review.

Also wanted to propose an idea: while working on this repo I kept forgetting to run ruff and black --check before committing, and missed adding the CHANGELOG.md entry. Would it make sense to add a git pre-commit hook (or a CI step) that enforces both automatically? Happy to work on that if it sounds like something worth having.

@its-me-maady its-me-maady requested a review from benoit74 March 27, 2026 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix rewriting logic of comments

2 participants