Skip to content

Add JavaScript option#2483

Open
Markus-Rost wants to merge 1 commit intomainfrom
enable-js
Open

Add JavaScript option#2483
Markus-Rost wants to merge 1 commit intomainfrom
enable-js

Conversation

@Markus-Rost
Copy link
Collaborator

Fix #2310

  • Add --javaScript option with values "none", "trusted" or "all" (default being "trusted").
  • Add --addModules option for additional ResourceLoader modules

@Markus-Rost
Copy link
Collaborator Author

This draft is currently made as a proof of concept with a disregard for any other renderer it might be breaking.

@Markus-Rost
Copy link
Collaborator Author

Markus-Rost commented Aug 18, 2025

I created some example ZIM files including no JavaScript, all JavaScript or only JavaScript trusted not to request external content. Please take a look at them and let me know how you feel about them.

  • All JavaScript disabled
  • Only trusted JavaScript enabled
  • Trusted JavaScript enabled, as well as some additional user created scripts not requesting external content
    • wikipedia_en_10_JS_trusted_addModules_2025-08.zip
    • mwoffliner --mwUrl=https://en.wikipedia.org/ --articleList=http://download.openzim.org/wp1/enwiki/tops/10.tsv --javaScript=trusted --addModules=site,ext.gadget.ReferenceTooltips,oojs,oojs-ui,ext.gadget.switcher,ext.gadget.Calculator,@wikimedia/codex
  • All JavaScript enabled, including scripts requesting external content

@benoit74
Copy link
Contributor

I tried PoC ZIMs.

For some reasons, the wikipedia_en_10_JS_trusted_addModules_2025-08.zim is causing a significant problem to my kiwix-serve / kiwix-apple readers. Both simply fails to load the ZIM main article (or any other article in kiwix-serve). Surprisingly I rebuild a ZIM with same command and few other articles and it works fine. Don't know what is broken, but something is.

All javascripts version is obviously causing problems in the web console ... but nothing really visible in the articles.

The "trusted javascript" works very well.

I'm not sufficiently export to assess the list of trusted modules, or the one of "addModules" version, but this is more a detail we can refine over the years anyway.

The only things I can say is that I confirm that the "trusted javascript with additional modules" is required/enough to solve #1491 and #1911 articles. Only "trusted javascript" is not enough.

I've tested with JS disabled in Chrome and I confirm the ZIM degrades nicely (i.e. it looks like we get the same content than currently without this PR merged).

All points above makes me consider that we might want to change default scraper behavior (compared to current PR behavior), so that scraper creates "near-perfect" ZIM for "newbies", without needing additional flags. Especially important for people willing to ZIM wikipedia which is somehow our primary audience:

  • we should probably have a --nojavascript option instead of a --javascript option, for those willing to not have any JS (even if ZIM with JS seems to degrade nicely in browsers with JS disabled, pretty sure some folk would prefer to not have any JS code in their ZIMs)
  • we should maybe add more scripts to the trusted list

@Markus-Rost Thanks a lot anyway, very impressive change. Glad you made it work! I did not had any look at the code since you specified it is only a PoC. Let me know when it is worth it.

@Markus-Rost
Copy link
Collaborator Author

For some reasons, the wikipedia_en_10_JS_trusted_addModules_2025-08.zim is causing a significant problem to my kiwix-serve / kiwix-apple readers. Both simply fails to load the ZIM main article (or any other article in kiwix-serve). Surprisingly I rebuild a ZIM with same command and few other articles and it works fine. Don't know what is broken, but something is.

Oh, I'm getting that as well. Not sure what happened there. I tested it locally using 100.tsv which worked fine and created the 10.tsv version to fit the GitHub upload limit, but didn't actually test that one again.

The only things I can say is that I confirm that the "trusted javascript with additional modules" is required/enough to solve #1491 and #1911 articles. Only "trusted javascript" is not enough.

Yes, those tables need the site module.

All points above makes me consider that we might want to change default scraper behavior (compared to current PR behavior), so that scraper creates "near-perfect" ZIM for "newbies", without needing additional flags. Especially important for people willing to ZIM wikipedia which is somehow our primary audience:

Do you mean having all JS included by default? That doesn't seem like a good idea as some JS not expecting a request to fail can easily break stuff on the page.

  • we should probably have a --nojavascript option instead of a --javascript option, for those willing to not have any JS (even if ZIM with JS seems to degrade nicely in browsers with JS disabled, pretty sure some folk would prefer to not have any JS code in their ZIMs)

I've already added --javaScript=none, does no js really need it's own option?

  • we should maybe add more scripts to the trusted list

I've looked through basically all the modules I could find loaded on Wikipedia articles as well as a few other wikis and those are all the modules which won't try to load any non-scraped content. Sadly all three MathJax extensions either load external JS or from an extension path we can't easily rewrite or guess ahead of time. @benoit74 let me know about any specific scripts to add you have in mind and I'll take a look at them.

@benoit74
Copy link
Contributor

OK, then what I meant is having --javascript=trusted as default. If we have --javascript=none, this is OK.

And what I meant by having more modules was having everything needed for wikipedia (having the ancestry table in mind). But if this comes from the site modules, and since (correct me if I'm wrong) we could have anything in the site module, then we can't do that. Or maybe we could add the site modules automatically only for wikimedia projects... not sure it is such a good idea.

@Markus-Rost
Copy link
Collaborator Author

OK, then what I meant is having --javascript=trusted as default. If we have --javascript=none, this is OK.

--javaScript=trusted is already the default in this PR

And what I meant by having more modules was having everything needed for wikipedia (having the ancestry table in mind). But if this comes from the site modules, and since (correct me if I'm wrong) we could have anything in the site module, then we can't do that. Or maybe we could add the site modules automatically only for wikimedia projects... not sure it is such a good idea.

Yeah, the site module can contain anything and change at any time. We also can't just include it for all Wikimedia projects, as on some wikis it might load external content (like on es.wikipedia.org).

@Markus-Rost
Copy link
Collaborator Author

@Markus-Rost Thanks a lot anyway, very impressive change. Glad you made it work! I did not had any look at the code since you specified it is only a PoC. Let me know when it is worth it.

@benoit74 Oh please take a look at the code. I assume that I've broken all renderers besides ActionParse, though I'm also not familiar at all with those renderers and have no idea how badly broken they are. I'm certainly gonna need some help with getting those issues fixed.

@Jaifroid
Copy link
Collaborator

Jaifroid commented Aug 22, 2025

Apologies for the delay. I've now tested these archives in Kiwix JS Browser Extension, with the same results as @benoit74 reported.

  • The JS_all shows errors in console relating to failures to fetch some scripts (but I'm not actually seeing blocked calls to external resources).
  • JS_trusted shows a bunch of failures on the landing page, but not on other pages I tested.
  • JS_none also shows failures on the landing page (clearly references to modules not included in the archive have not been removed), and also on article pages a failure to find C/_mw_/startup.js.
  • JS_trusted_addModules fails to find the landing page, but article pages are fine.

I don't see any differences in formatting, suggesting to me that at least for the tested pages JS_none is as good as other formats, and would be faster to load on older devices due to fewer resources needing to be extracted. However, it would be necessary to remove references to scripts not in the archive to avoid the multiple console errors from attempts to load these. However, JS_trusted seems like a good compromise to me, so long as calls to untrusted resources are removed from the landing page.

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

@benoit74
Copy link
Contributor

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

Currently the scraper has indeed more or less solved #2310 by removing all JS, so that we de-facto do not load anymore external scripts. But this is not really a long term solution. This is why we kept #2310 open saying this was more an interim fix and the medium term solution is this PR, adding back "just enough" scripts.

@Markus-Rost
Copy link
Collaborator Author

  • The JS_all shows errors in console relating to failures to fetch some scripts (but I'm not actually seeing blocked calls to external resources).

Regarding the core issue of calling external scripts or resources #2310, it seems none of the test pages I noticed had that problem in any of these test archives, but possibly I didn't test deeply enough.

The calling of external scripts seen in #2310 is mostly just the attempted loading of JS modules. By rewriting startup.js to load those modules from C/_mw_/MODULE.js, they are now the failures to fetch some scripts you noticed in JS_all (though a lot of the called scripts are included in the ZIM).

Of course JS_all can still have scripts attempting to load external resources, just that en.wikipedia is a bad example for that because they have very little JS doing that in the first place. Getting JS_all on es.wikipedia for example will load external scripts through the site module as mentioned in an earlier comment.

  • JS_trusted shows a bunch of failures on the landing page, but not on other pages I tested.
  • JS_none also shows failures on the landing page (clearly references to modules not included in the archive have not been removed), and also on article pages a failure to find C/_mw_/startup.js.
  • JS_trusted_addModules fails to find the landing page, but article pages are fine.

I have the suspicion that my filtering of JS modules stopped the already existing landing page JS from being included 😅 Ah, I'll fix JS_none including startup.

I don't see any differences in formatting, suggesting to me that at least for the tested pages JS_none is as good as other formats, and would be faster to load on older devices due to fewer resources needing to be extracted. However, it would be necessary to remove references to scripts not in the archive to avoid the multiple console errors from attempts to load these. However, JS_trusted seems like a good compromise to me, so long as calls to untrusted resources are removed from the landing page.

You should notice small things like the navboxes at the bottom of articles being collapsed by default now or the video/audio player popup on "Michael Jackson", but yes there aren't a lot of differences in formatting as en.wikipedia does not rely as heavily on JS as other wikis might.

@benoit74
Copy link
Contributor

benoit74 commented Oct 2, 2025

@kelson42 we need your decision on this one as well, it sat idle for way too long.

I again see no reason to be worried about the inline scripts, because as stated by @Markus-Rost the ZIM will anyway fallback to no-JS even if we remove these inline scripts (because startup.js also requires eval which is blocked in same context than inline scripts).

I tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-desktop (it was your main concern) and it works like a charm, including all JS-based functions (like show/hide at the bottom of "Michael Jackson" page).

Taking measures to remove these inline scripts without providing any added value to the end users looks like a waste of our precious resources and a risk to introduce bugs.

The no-JS fallback is working exactly as-of today (without this PR merged), i.e. the ZIM will still be high-quality enough.

I understand you would prefer to provide same ZIM behavior on all readers, but at some point I feel like it is important to recognize that the effort is not worth it.

I've also opened openzim/overview#62 to discuss the general policy for openZIM

@benoit74
Copy link
Contributor

benoit74 commented Oct 2, 2025

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

@Jaifroid
Copy link
Collaborator

Jaifroid commented Oct 2, 2025

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

That one has a bunch of violations in ServiceWorkerLocal mode in Chrome, with the CSP violations looking pretty rough:

image

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

@Markus-Rost
Copy link
Collaborator Author

Markus-Rost commented Oct 2, 2025

I also tried the wikipedia_en_10_JS_trusted_2025-08.zip ZIM in kiwix-js and kiwix-pwa in Restricted mode and I confirm it fallbacks very nice to no-JS mode (i.e. the ZIM looks just like a currently prod ZIM not supporting JS at all).

That one has a bunch of violations in ServiceWorkerLocal mode in Chrome, with the CSP violations looking pretty rough:

image

I'm only seeing the two CSP violations from the two inline scripts existing. These are expected (and can be reduced to just one).

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

This is worrying, as only the startup module does those requests and has been modified in this PR to make JS work properly in the first place. The only way I can think of for you to get these errors in the PR ZIM again is if your service worker is somehow using a startup.js file cached from a different Wikipedia ZIM, which seems like a major security issue. @Jaifroid

@Markus-Rost
Copy link
Collaborator Author

But more worryingly, in ServiceWorker mode, we get a bunch of attempts to access an external PHP server (wikipedia.org/w/load.php). At the very least those calls should be removed ISTM! -

image

Oh, I found where this is coming from. It's the sourceMappingURL for modules loaded from local storage. Should be fairly easy to just remove that annotation, though the source map is only loaded by the browser dev tools anyway.

@Jaifroid
Copy link
Collaborator

Jaifroid commented Oct 2, 2025

This is worrying, as only the startup module does those requests and has been modified in this PR to make JS work properly in the first place. The only way I can think of for you to get these errors in the PR ZIM again is if your service worker is somehow using a startup.js file cached from a different Wikipedia ZIM, which seems like a major security issue. @Jaifroid

@Markus-Rost Just to confirm that I only do these tests in the Browser Extension, which takes a purist approach and only uses the resources supplied in the ZIM except (currently) for an added dark CSS if the user selects that (to be changed soon to native dark). (Even in the PWA, I never supply JS, only CSS for some other transforms.)

@Markus-Rost
Copy link
Collaborator Author

I'm having less time to work on this now, but I would like to avoid this PR becoming stale. @benoit74 can you review the PR further (besides the inline script existing), so that I have some pointers to work on whenever I'm available?

@Markus-Rost Markus-Rost requested a review from benoit74 October 10, 2025 17:02
Copy link
Contributor

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP? Looks like these were "useful" things at some point, but indeed never called since years since the module name was not correct so startup.js was never fixed.

The failing unit test is about inline scripts, it will be easily resolved once discussion about this has settled.

I've added a comment about what looks like the root cause of all failing e2e tests.

And this PR obviously misses a significant share of unit and e2e tests to ensure we will not break too easily this very important feature in the future.

@Markus-Rost
Copy link
Collaborator Author

What about DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP? Looks like these were "useful" things at some point, but indeed never called since years since the module name was not correct so startup.js was never fixed.

I'm unsure what DO_PROPAGATION was supposed to solve. It removed a short timeout before each modules got loaded. I suspect the reason was to avoid using mw.requestIdleCallback because the mw class wasn't properly initialized before. Regarding the reason of the timeout, the MW script has a comment about it:

 		// Yield for two reasons:
		// * Allow successive calls to mw.loader.impl() from the same
		//   load.php response, or from the same asyncEval() to be in the
		//   propagation batch.
		// * Allow the browser to breathe between the reception of
		//   module source code and the execution of it.
		//
		// Use a high priority because the user may be waiting for interactions
		// to start being possible. But, first provide a moment (up to 'timeout')
		// for native input event handling (e.g. scrolling/typing/clicking).
		mw.requestIdleCallback( doPropagation, { timeout: 1 } );

ALL_READY_FUNCTION just disabled the check if all dependency modules are loaded. Again likely existed because the resource loader didn't work fully, but is no longer needed as the resource loader now works as intended inside the ZIM.

No idea what LOAD_PHP was intended to do, it does not match anything on any recent startup module (besides possibly nuking half the minified module code due to an overly agressive regex)

@benoit74
Copy link
Contributor

OK then this PR should completely remove DO_PROPAGATION, ALL_READY_FUNCTION and LOAD_PHP for the time being, no need to keep in codebase stuff we do not really understand and does not seem to provide much added value.

@codecov
Copy link

codecov bot commented Dec 21, 2025

Codecov Report

❌ Patch coverage is 91.93548% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.13%. Comparing base (3c2d018) to head (4b89619).

Files with missing lines Patch % Lines
src/sanitize-argument.ts 20.00% 2 Missing and 2 partials ⚠️
src/util/dump.ts 92.30% 3 Missing ⚠️
src/Downloader.ts 83.33% 1 Missing ⚠️
src/renderers/abstractDesktop.render.ts 0.00% 0 Missing and 1 partial ⚠️
src/util/misc.ts 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2483      +/-   ##
==========================================
+ Coverage   76.44%   77.13%   +0.69%     
==========================================
  Files          49       49              
  Lines        3273     3359      +86     
  Branches      720      736      +16     
==========================================
+ Hits         2502     2591      +89     
+ Misses        645      644       -1     
+ Partials      126      124       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Markus-Rost
Copy link
Collaborator Author

Stuff left to do for this PR:

  1. Write tests for the new parameters.
  2. Figure out what to do with jsConfigVars, it's a custom JS module shared across all articles for some renderers, but for this PR I need it as a key-value object specific to each article. (Also did this ever work properly for the renderers having it shared across all articles? It's being pulled from the html of individual articles so some config variables are most likely page specific.)
  3. Make webpHandler work again, it's the only custom module in ActionParse renderer so I just commented it out for now:
    if (Downloader.webp) {
    articleJsScripts.push(...['webpHandler'].map((oneJsDep: string) => genHeaderScript(config, oneJsDep, articleId, config.output.dirs.webp)))
    }*/

@benoit74
Copy link
Contributor

I'm not sure at all that jsConfigVars ever worked correctly in other renderers. Note that we plan to also implement #2210 for this same release, so looks like you should probably not be too ashamed of making things incompatible with other renderers.

@Markus-Rost Markus-Rost force-pushed the enable-js branch 4 times, most recently from d6359cf to fb3f68f Compare December 28, 2025 22:45
@Markus-Rost Markus-Rost requested a review from benoit74 December 29, 2025 06:06
@Markus-Rost Markus-Rost marked this pull request as ready for review December 29, 2025 13:22
@benoit74
Copy link
Contributor

@Markus-Rost I'm currently in holidays till the end of the month ; I will review it asap early February

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wikipedia articles scraped with ActionParse API contain scripts that attempt to load external content

4 participants