refactor: extract _rewrite_html() to avoid double mistune in rewrite_comment() #399
its-me-maady wants to merge 5 commits into openzim:main
rewrite_comment() previously called self.rewrite(), which internally called self.markdown() again, causing comment text to go through mistune twice.
Extract the BeautifulSoup half of rewrite() into _rewrite_html(). rewrite() now calls _rewrite_html() after its markdown step. rewrite_comment() calls _rewrite_html() directly, skipping the second markdown pass.
Split the single try/except in rewrite() into two separate ones (one for the markdown step, one for BS4 init in _rewrite_html()), preserving the same error behavior (return content on failure).
Fixes openzim#398
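For readers who have not opened the diff, the shape of the refactor can be sketched roughly as follows. This is a toy illustration, not the code from this PR: the `Rewriter` class, the `markdown_calls` counter, `rewrite_comment_old()` and the bodies of `markdown()` and `_rewrite_html()` are hypothetical stand-ins (the real methods use mistune and BeautifulSoup), and it assumes that rewrite_comment() runs its own markdown step before the HTML rewrite.

```python
import html


class Rewriter:
    """Toy stand-in for the scraper's rewriter (hypothetical, stdlib only)."""

    def __init__(self) -> None:
        # Counter added only for this illustration.
        self.markdown_calls = 0

    def markdown(self, text: str) -> str:
        # Stand-in for the mistune pass: escape the text and wrap it in <p>.
        self.markdown_calls += 1
        return f"<p>{html.escape(text)}</p>"

    def _rewrite_html(self, html_text: str) -> str:
        # Stand-in for the BeautifulSoup half: it only touches HTML.
        return html_text.replace("<p>", '<p class="rewritten">')

    def rewrite(self, content: str) -> str:
        # Markdown step first, then the HTML rewrite.
        return self._rewrite_html(self.markdown(content))

    def rewrite_comment_old(self, comment: str) -> str:
        # Old shape: rendered comment HTML is fed back into rewrite(),
        # so it goes through markdown() a second time.
        return self.rewrite(self.markdown(comment))

    def rewrite_comment(self, comment: str) -> str:
        # New shape: render once, then call _rewrite_html() directly.
        return self._rewrite_html(self.markdown(comment))
```

In this toy version the second pass shows up as double escaping: the old path escapes the already rendered `<p>` tag, while the new path leaves the rendered HTML intact.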
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

@@           Coverage Diff            @@
##            main    #399     +/-   ##
========================================
- Coverage   9.91%   9.84%   -0.08%
========================================
  Files         26      26
  Lines       2441    2439      -2
  Branches     316     316
========================================
- Hits         242     240      -2
- Misses      2186    2188      +2
+ Partials      13      11      -2
benoit74 left a comment
I know this was here before this PR, but it is now quite obvious that we should strive to make all these soup / markdown errors fatal, raising an Exception which will stop the scraper. I see little to no reason for these errors to happen, and should they happen it would make more sense to identify and fix the bug rather than creating broken ZIMs without noticing it.
I've checked a few production logs and have never seen such errors, so this probably never happens on current dumps. Or it is a rare edge case, which is probably worth fixing or detecting better so that we can make a more informed decision about it.
Let markdown and soup errors raise instead of being caught and returning content silently. A broken ZIM produced without any visible error is worse than a scraper crash.
Good point. I've removed all the try/except blocks around the markdown and soup steps and let them raise instead. If these ever fail, it's better to crash loudly than to silently produce a broken ZIM.
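A minimal sketch of the difference in error handling, for illustration only. The helper names `render_lenient`, `render_strict` and `broken_renderer` are hypothetical and do not appear in the PR; the lenient variant mirrors the old "return content on failure" behavior, and the strict variant mirrors the new behavior where the exception propagates and stops the scraper.

```python
from typing import Callable


def render_lenient(render: Callable[[str], str], content: str) -> str:
    # Old behavior: swallow the error and return the input unchanged,
    # which can hide a broken page inside the resulting ZIM.
    try:
        return render(content)
    except Exception:
        return content


def render_strict(render: Callable[[str], str], content: str) -> str:
    # New behavior: no try/except, so any failure propagates to the caller.
    return render(content)


def broken_renderer(content: str) -> str:
    # Hypothetical renderer that always fails, used to exercise both paths.
    raise ValueError("simulated parser failure")
```

With `broken_renderer`, the lenient helper hands back the raw input with no visible error, while the strict helper raises `ValueError`, which is the outcome the review asks for.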
PR is ready for review. Also wanted to propose an idea: while working on this repo I kept forgetting to run |