Conversation
|
This draft is currently a proof of concept and disregards any other renderers it might break. |
|
I created some example ZIM files including no JavaScript, all JavaScript or only JavaScript trusted not to request external content. Please take a look at them and let me know how you feel about them.
|
|
I tried the PoC ZIMs. For some reason, the "all JavaScript" version is clearly causing problems in the web console, but nothing really visible in the articles. The "trusted javascript" version works very well. I'm not sufficiently expert to assess the list of trusted modules, or that of the "addModules" version, but this is a detail we can refine over the years anyway. The only thing I can say is that I confirm that "trusted javascript with additional modules" is required and sufficient to fix the #1491 and #1911 articles. "Trusted javascript" alone is not enough. I've tested with JS disabled in Chrome and I confirm the ZIM degrades nicely (i.e. it looks like we get the same content as we currently do without this PR merged). All the points above make me think that we might want to change the default scraper behavior (compared to the current PR behavior), so that the scraper creates "near-perfect" ZIMs for "newbies" without needing additional flags. This is especially important for people wanting to ZIM Wikipedia, which is arguably our primary audience:
@Markus-Rost Thanks a lot anyway, very impressive change. Glad you made it work! I have not looked at the code yet since you specified it is only a PoC. Let me know when it is worth reviewing. |
Oh, I'm getting that as well. Not sure what happened there. I tested it locally using 100.tsv which worked fine and created the 10.tsv version to fit the GitHub upload limit, but didn't actually test that one again.
Yes, those tables need the
Do you mean having all JS included by default? That doesn't seem like a good idea as some JS not expecting a request to fail can easily break stuff on the page.
I've already added
I've looked through basically all the modules I could find loaded on Wikipedia articles as well as a few other wikis and those are all the modules which won't try to load any non-scraped content. Sadly all three MathJax extensions either load external JS or from an extension path we can't easily rewrite or guess ahead of time. @benoit74 let me know about any specific scripts to add you have in mind and I'll take a look at them. |
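A minimal sketch of how such a trusted-module filter could work, assuming a hand-maintained allowlist. The function name `selectModules`, its parameters, and the `example.*` module names are placeholders for illustration only; they are not the PR's actual code or its real module list. The sketch also assumes that additional modules simply extend the allowlist, which may differ from what the PR does.

```typescript
// Hypothetical sketch: all names and module lists here are illustrative assumptions.
type JavaScriptMode = "none" | "trusted" | "all";

// Decide which of the modules requested by a page should be kept in the ZIM.
function selectModules(
  requested: string[],
  mode: JavaScriptMode,
  trusted: string[],
  addModules: string[],
): string[] {
  // "none": drop every module.
  if (mode === "none") return [];
  // "all": keep everything the page asked for (may include modules that load external content).
  if (mode === "all") return requested.slice();
  // "trusted": keep only allowlisted modules, plus any explicitly added ones (assumption).
  const allowed = trusted.concat(addModules);
  return requested.filter((name) => allowed.indexOf(name) !== -1);
}
```

With an allowlist like this, a module that tries to load non-scraped content is simply never shipped unless the user opts in explicitly.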
|
OK, then what I meant is having And what I meant by having more modules was having everything needed for wikipedia (having the ancestry table in mind). But if this comes from the |
Yeah, the |
@benoit74 Oh please take a look at the code. I assume that I've broken all renderers besides ActionParse, though I'm also not familiar at all with those renderers and have no idea how badly broken they are. I'm certainly gonna need some help with getting those issues fixed. |
|
Apologies for the delay. I've now tested these archives in Kiwix JS Browser Extension, with the same results as @benoit74 reported.
I don't see any differences in formatting, suggesting to me that at least for the tested pages JS_none is as good as the other formats, and would be faster to load on older devices because fewer resources need to be extracted. It would, however, be necessary to remove references to scripts not in the archive to avoid the multiple console errors from attempts to load them. That said, JS_trusted seems like a good compromise to me, so long as calls to untrusted resources are removed from the landing page. Regarding the core issue of calling external scripts or resources (#2310), I didn't notice that problem on any of the test pages in any of these test archives, but possibly I didn't test deeply enough. |
Currently the scraper has indeed more or less solved #2310 by removing all JS, so that we de facto no longer load external scripts. But this is not really a long-term solution. This is why we kept #2310 open, saying this was more of an interim fix and that the medium-term solution is this PR, adding back "just enough" scripts. |
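As a rough illustration of removing references to scripts that are not in the archive, here is a minimal sketch. It assumes the set of archived script paths is already known and uses a regular expression purely for brevity; a real scraper would more likely work on a parsed DOM. The function name `stripMissingScripts` and the example paths are hypothetical.

```typescript
// Hypothetical sketch: regex-based for brevity, not how the scraper necessarily does it.
// Removes <script src="..."></script> tags whose src is not part of the archive.
function stripMissingScripts(html: string, archivedScripts: string[]): string {
  const scriptTag = /<script\b[^>]*\bsrc\s*=\s*(["'])(.*?)\1[^>]*>\s*<\/script>/gi;
  return html.replace(scriptTag, (tag: string, _quote: string, src: string) =>
    // Keep the tag only if the referenced script was actually scraped.
    archivedScripts.indexOf(src) !== -1 ? tag : "",
  );
}
```

Inline scripts (those without a `src` attribute) are left untouched by this sketch.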
The calling of external scripts seen in #2310 is mostly just the attempted loading of JS modules. By rewriting startup.js to load those modules from Of course JS_all can still have scripts attempting to load external resources, just that en.wikipedia is a bad example for that because they have very little JS doing that in the first place. Getting JS_all on es.wikipedia for example will load external scripts through the
I have the suspicion that my filtering of JS modules stopped the already existing landing page JS from being included 😅 Ah, I'll fix JS_none including startup.
You should notice small things like the navboxes at the bottom of articles being collapsed by default now or the video/audio player popup on "Michael Jackson", but yes there aren't a lot of differences in formatting as en.wikipedia does not rely as heavily on JS as other wikis might. |
299a154 to efdfefe
|
@kelson42 we need your decision on this one as well; it has sat idle for way too long. Again, I see no reason to be worried about the inline scripts because, as stated by @Markus-Rost, the ZIM will fall back to no-JS anyway even if we remove these inline scripts (because startup.js also requires I tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-desktop (it was your main concern) and it works like a charm, including all JS-based functions (like show/hide at the bottom of the "Michael Jackson" page). Taking measures to remove these inline scripts without providing any added value to end users looks like a waste of our precious resources and risks introducing bugs. The no-JS fallback works exactly as it does today (without this PR merged), i.e. the ZIM will still be of high enough quality. I understand you would prefer to provide the same ZIM behavior on all readers, but at some point I feel it is important to recognize that the effort is not worth it. I've also opened openzim/overview#62 to discuss the general policy for openZIM |
|
I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it falls back very nicely to no-JS mode (i.e. the ZIM looks just like a current production ZIM that does not support JS at all). |
That one has a bunch of violations in ServiceWorkerLocal mode in Chrome, with the CSP violations looking pretty rough:
But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM!
|
I'm only seeing the two CSP violations from the two inline scripts existing. These are expected (and can be reduced to just one).
This is worrying, as only the startup module does those requests and has been modified in this PR to make JS work properly in the first place. The only way I can think of for you to get these errors in the PR ZIM again is if your service worker is somehow using a startup.js file cached from a different Wikipedia ZIM, which seems like a major security issue. @Jaifroid |
@Markus-Rost Just to confirm that I only do these tests in the Browser Extension, which takes a purist approach and only uses the resources supplied in the ZIM except (currently) for an added dark CSS if the user selects that (to be changed soon to native dark). (Even in the PWA, I never supply JS, only CSS for some other transforms.) |
|
I have less time to work on this now, but I would like to avoid this PR becoming stale. @benoit74 can you review the PR further (besides the inline script existing), so that I have some pointers to work on whenever I'm available? |
benoit74 left a comment
What about DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP? It looks like these were "useful" at some point, but they have indeed not been called for years because the module name was incorrect, so startup.js was never fixed.
The failing unit test is about inline scripts, it will be easily resolved once discussion about this has settled.
I've added a comment about what looks like the root cause of all failing e2e tests.
And this PR is obviously missing a significant share of the unit and e2e tests needed to ensure we do not easily break this very important feature in the future.
I'm unsure what
No idea what |
|
OK then this PR should completely remove |
e951b80 to 36c6b9a
4641ad5 to 51c5d46
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2483      +/-   ##
==========================================
+ Coverage   76.44%   77.13%   +0.69%
==========================================
  Files          49       49
  Lines        3273     3359      +86
  Branches      720      736      +16
==========================================
+ Hits         2502     2591      +89
+ Misses        645      644       -1
+ Partials      126      124       -2
|
|
Stuff left to do for this PR:
|
|
I'm not sure at all that |
51c5d46 to bd38aa8
d6359cf to fb3f68f
|
@Markus-Rost I'm currently on holiday until the end of the month; I will review it ASAP in early February |





Fix #2310
--javaScript option with values "none", "trusted" or "all" (default being "trusted").
--addModules option for additional ResourceLoader modules
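
To make the option semantics above concrete, here is a minimal sketch of how the two options could be validated. Only the option names, the three allowed values, and the "trusted" default come from the description above; the function name `parseJsOptions`, the `JsOptions` shape, and the comma-separated format for additional modules are assumptions made for illustration.

```typescript
// Hypothetical sketch: parseJsOptions and JsOptions are illustrative names, not the PR's code.
type JavaScriptMode = "none" | "trusted" | "all";

const JAVASCRIPT_MODES: readonly JavaScriptMode[] = ["none", "trusted", "all"];

interface JsOptions {
  javaScript: JavaScriptMode;
  addModules: string[];
}

function parseJsOptions(raw: { javaScript?: string; addModules?: string }): JsOptions {
  // Default is "trusted", as stated in the PR description.
  const mode = raw.javaScript ?? "trusted";
  if (JAVASCRIPT_MODES.indexOf(mode as JavaScriptMode) === -1) {
    throw new Error(`--javaScript must be one of: ${JAVASCRIPT_MODES.join(", ")}`);
  }
  // Assumption: additional modules are passed as a comma-separated list.
  const addModules = (raw.addModules ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  return { javaScript: mode as JavaScriptMode, addModules };
}
```

Rejecting unknown values early keeps a typo such as "trust" from silently falling back to another mode.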