
Usual noise

Matti Schneider edited this page Nov 9, 2022 · 7 revisions

Generally speaking, noise is unwanted content in versions. This page lists the different types of noise that we have encountered.

Irrelevant content

Content that is not related to the terms.

CSS selectors are a first step, as they make it possible to select an area instead of the whole page, but they still let through content such as headers, footers, buttons, drop-down lists…

Filtering makes it possible to get rid of the remaining irrelevant content.

Example with a drop-down

A drop-down list lets users select which document they would like to see, but this list is of no interest in the final document.

HTML file

<div class="filter-holder">
  <select class="filter-options">
      <option value="https://www.redditinc.com/policies/user-agreement" selected>User Agreement</option>
      <option value="https://www.redditinc.com/policies/privacy-policy">Privacy Policy</option>
      <option value="https://www.redditinc.com/policies/content-policy">Content Policy</option>
      <option value="https://www.redditinc.com/policies/broadcasting-content-policy">Broadcasting Content Policy</option>
  </select>
</div>
<h1>Reddit User Agreement</h1>

Markdown file

User Agreement Privacy Policy Content Policy Broadcasting Content Policy Moderator Guidelines Transparency Report 2017 Transparency Report 2018 Guidelines for Law Enforcement Transparency Report 2019

Reddit User Agreement
=====================

Desired Markdown file


Reddit User Agreement
=====================

Filter in JavaScript

export function removeOptionsList(document) {
  document.querySelectorAll('.filter-holder').forEach(element => element.remove());
}

Invisible HTML elements

Elements that are invisible in the original page but become visible in the version, or that otherwise disrupt version rendering.

These elements were usually hidden in the original page via CSS stylesheets.

Example with an invisible paragraph

An invisible paragraph (with display: none style) appearing in the version:

HTML file

<h1>Twitter Terms of Service</h1>
<p style="display: none;">goglobalwithtwitterbanner</p>

Markdown file

Twitter Terms of Service
========================
goglobalwithtwitterbanner

Desired Markdown file

Twitter Terms of Service
========================

Filter in JavaScript

export function removeNotDisplayedElements(document) {
  document.querySelectorAll('[style="display: none;"]').forEach(element => element.remove());
}
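
The selector above only matches elements whose style attribute is exactly `display: none;`. As a minimal sketch, assuming the element is hidden through an inline style (elements hidden through a stylesheet would not be caught), a more tolerant variant could inspect every style attribute. The names `hasInlineDisplayNone` and `removeInlineHiddenElements` are hypothetical and only serve this illustration.

```javascript
// Hypothetical helper: true when an inline style declares `display: none`,
// whatever the spacing, letter case, or neighbouring declarations.
function hasInlineDisplayNone(styleValue) {
  return /(^|;)\s*display\s*:\s*none\s*(!important\s*)?(;|$)/i.test(styleValue || '');
}

// Hypothetical variant of the filter above; add `export` when placing it in a filters file.
function removeInlineHiddenElements(document) {
  document.querySelectorAll('[style]').forEach(element => {
    if (hasInlineDisplayNone(element.getAttribute('style'))) {
      element.remove();
    }
  });
}
```

With this variant, `style="display:none"` and `style="color: red; display: none"` would both be removed, while `style="display: block"` would be kept.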

Invisible elements that disrupt Markdown rendering usually do so because the HTML to Markdown conversion takes them into account, whereas they were not taken into account when rendering the original page.

Example with invisible links

Invisible links disrupt numbering.

HTML file

<h2>AGREEMENT</h2>
<ol>
  <a id="1"></a>
  <li>
    <span>Eligibility</span>
  </li>
  <div class="divider"></div>
  <a id="2"></a>
  <li>
    <span>Term, Terms and Termination</span>
  </li>
</ol>

Markdown file

AGREEMENT
---------

2.  Eligibility

5.  Term, Terms and Termination

Desired Markdown file

AGREEMENT
---------

1.  Eligibility

2.  Term, Terms and Termination

Filter in JavaScript

export function numberListCorrectly(document) {
  document.querySelectorAll('ol').forEach(listToClean => {
    // Remove every direct child that is not a list item, so that only <li> elements are numbered
    Array.from(listToClean.children)
      .filter(element => element.tagName !== 'LI')
      .forEach(element => element.remove());
  });
}
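
The filter above removes every non-`<li>` child of an ordered list, including any that carry text. Where that is too aggressive, a more conservative sketch could remove only the empty ones, such as the anchors and the divider in this example. The name `removeEmptyNonListItems` is hypothetical, and treating a blank `textContent` as "empty" is an assumption of this sketch.

```javascript
// Hypothetical variant: only remove non-<li> children that contain no text.
// Add `export` when placing it in a filters file.
function removeEmptyNonListItems(document) {
  document.querySelectorAll('ol').forEach(list => {
    Array.from(list.children)
      .filter(element => element.tagName !== 'LI' && element.textContent.trim() === '')
      .forEach(element => element.remove());
  });
}
```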

Content generating frequent and legally irrelevant changes

Content whose changes are both too frequent and legally irrelevant. We found that such content usually consists of hypertext links, since two links can point to the same website yet be written differently. A case in point is links that pass parameters: a change in parameters will not change where the link points.

Example with a link parameter

A link has an h= parameter that changes too frequently and is irrelevant to the address the link points to.

HTML file

You can only use our copyrights or <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fen.facebookbrand.com%2Ftrademarks%2F&amp;h=AT0_izDHO3yJuXJuJJeWQyJFVilQqIDOA3oMwr51t6gEq1q4UbyH2VtU7UhNzhg1LH0YzUHAjw0TADuoufWgb_YEuzoFpvyIR8_4rkUfjDXxUw3q1KmpsYL_H3C4OIm3xHzrUZRatmWQ6PAk">trademarks (or any similar marks)</a>

Markdown file

You can only use our copyrights or [trademarks (or any similar marks)](https://l.facebook.com/l.php?u=https%3A%2F%2Fen.facebookbrand.com%2Ftrademarks%2F&h=AT1XEFWtw25SbFSSD7W2MOS1LQIsUwaUrq4qh5dNmI21qm42JE5lUiv9g8MsTSnvi3DjYfJxOPoBxEKyBQjo7qkxfcUkDzedQzBLWgGJYWC6CwDBI0S5pefB4oiuh8Jo63phreoUKQ3BF4O5)

Desired Markdown file

You can only use our copyrights or [trademarks (or any similar marks)](https://l.facebook.com/l.php?u=https%3A%2F%2Fen.facebookbrand.com%2Ftrademarks%2F)

Filter in JavaScript

export function cleanUrls(document) {
  const links = document.querySelectorAll('[href*="https://l.facebook.com/l.php?"]');
  links.forEach(link => {
    link.href = link.href.replace(/&h=\S*/, '');
  });
}
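
The regular expression above drops everything from `&h=` up to the next whitespace, which would also discard any parameter that follows `h`. As a hedged alternative, the standard `URL` and `URLSearchParams` APIs can delete a single parameter and keep the others. The names `removeQueryParameter` and `cleanTrackingParameter` are hypothetical. This sketch assumes absolute URLs, and serializing the URL again may normalize the encoding of the remaining parameters.

```javascript
// Hypothetical helper: remove one query parameter and keep all the others.
function removeQueryParameter(href, parameterName) {
  const url = new URL(href); // throws on relative URLs, hence the absolute-URL assumption
  url.searchParams.delete(parameterName);
  return url.toString();
}

// Hypothetical variant of the filter above; add `export` when placing it in a filters file.
function cleanTrackingParameter(document) {
  document.querySelectorAll('[href*="https://l.facebook.com/l.php?"]').forEach(link => {
    link.href = removeQueryParameter(link.href, 'h');
  });
}
```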

Email address obfuscation

Some services rely on tools such as Cloudflare Email Address Obfuscation to help prevent spam: email addresses are hidden from harvesters and other bots while remaining visible to site visitors.

If we don't wait until the document is loaded and its scripts are executed, we can record the following kind of blink.

Example of recorded blink

HTML file

Recorded blink on Markdown file

Two successive recorded Markdown versions:

[dpo.a\[email protected\]](https://airmalta.com/cdn-cgi/l/email-protection#removed).
[\[email protected\]](https://airmalta.com/cdn-cgi/l/email-protection#removed).

Desired Markdown file

[\[email protected\]](https://airmalta.com/cdn-cgi/l/email-protection#removed).

Solution

Add executeClientScript: true to your declaration file.

{
  "name": "Air Malta",
  "documents": {
    "Privacy Policy": {
      "fetch": "...",
      "select": "...",
      "remove": "...",
      "executeClientScript": true
    }
  }
}
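
To spot documents that may be affected, one option is to look for the `/cdn-cgi/l/email-protection` path seen in the recorded versions above. This is only a rough sketch: the function name `containsObfuscatedEmail` is hypothetical, and a match merely hints that email obfuscation is in use, not that a blink will occur.

```javascript
// Hypothetical check: does this recorded Markdown link to the email protection path
// that appears in the example above?
function containsObfuscatedEmail(markdown) {
  return markdown.includes('/cdn-cgi/l/email-protection');
}
```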