diff --git a/content/terms/explanation/filters.md b/content/terms/explanation/filters.md new file mode 100644 index 00000000..5e574e9e --- /dev/null +++ b/content/terms/explanation/filters.md @@ -0,0 +1,38 @@ +--- +title: "Filters" +weight: 3 +--- + +# Filters + +Filters remove [noise]({{< relref "/terms/guideline/declaring/#usual-noise" >}}) from versions when it cannot be addressed by directly selecting or removing content with selectors. + +## When filters are needed + +Use filters when: + +- **Content selectors are insufficient**, for example when noise appears within content that can't be targeted with CSS selectors or [range selectors]({{< relref "terms/explanation/range-selectors" >}}) with the [`select`]({{< relref "terms/reference/declaration/#ref-select" >}}) and [`remove`]({{< relref "terms/reference/declaration/#ref-remove" >}}) properties. +- **Content is dynamically generated**, for example when elements change on each page load with changing classes or IDs that cannot be targeted with [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors). +- **Complex tasks are needed**, for example when content transformation is required, such as converting images to base64 to store them in the terms version or converting date-based content to a stable format (like “Updated X days ago” to “Last updated on YYYY-MM-DD”). + +## How filters work + +Filters are JavaScript functions that can manipulate the DOM structure directly. They modify the document structure and content in place. + +## Filter design principles + +Filters should follow these core principles: + +- **Specific**: target only the noise to remove. Avoid broad selectors that might accidentally remove important content. + + > For example, if a filter converts relative dates to absolute dates, make sure to scope the targeted dates. 
This might translate to selecting with `.metadata time`, not `time`, which might also affect important effective dates within the terms content. + +- **Idempotent**: filters should produce the same result even if run multiple times on their own output. This ensures consistency. + + > For example, if a filter adds section numbers like "1." to headings, it should check if the numbers already exist, to prevent "1. Privacy Policy" from becoming "1. 1. Privacy Policy" on repeated runs. + +- **Efficient**: DOM queries should be optimised and filters should avoid unnecessary operations, processing only the elements needed. + + > For example, if a filter updates timestamp elements with a specific class, using `document.querySelectorAll('.timestamp')` is more efficient than `document.querySelectorAll('*')` followed by filtering for timestamp elements. + +- **Safe**: filters must not accidentally remove important content. The generated version should always be checked after adding a filter to ensure it still contains the whole terms content. diff --git a/content/terms/how-to/apply-filters.md b/content/terms/how-to/apply-filters.md new file mode 100644 index 00000000..b250e8a6 --- /dev/null +++ b/content/terms/how-to/apply-filters.md @@ -0,0 +1,161 @@ +--- +title: Apply filters +weight: 7 +--- + +# How to apply filters + +This guide explains how to add filters to existing declarations to remove meaningless content that cannot be removed with CSS selectors, to prevent noise in the versions. + +## Prerequisites + +- An existing terms declaration file. +- Having already identified the noise to remove and having double-checked it cannot be removed with CSS selectors with the [`remove`]({{< relref "terms/reference/declaration/#ref-remove" >}}) property. + +## Step 1: Check for built-in filters + +Built-in filters are pre-defined functions that handle common noise patterns. They are the easiest way to clean up content. 
+ +Review the available [built-in filters]({{< relref "/terms/reference/built-in-filters" >}}) to see whether one matches your needs. + +If you find a suitable built-in filter, proceed to [Step 3](#step-3-declare-the-filter); otherwise, you will need to create a custom filter. + +## Step 2: Create a custom filter _(optional)_ + +If no built-in filter matches your needs, you will need to create a custom filter. This requires JavaScript knowledge and familiarity with DOM manipulation. + +### Create the filter file + +Create a JavaScript file in the same folder and with the same name as your service declaration, but with the `.filters.js` extension. + +> For example, if your declaration is `declarations/MyService.json`, create `declarations/MyService.filters.js`. + +### Write the filter function + +Define your filter function with the following signature: + +```js +// `parameters` is optional: omit it for filters that take no parameters +export function myCustomFilter(document, parameters) { + // Your filter logic here +} +``` + +#### Parameters + +- `document`: JSDOM document instance representing the web page +- `parameters`: values passed from the declaration _(optional)_ + +#### Example: Remove session IDs from text content + +Suppose you want to remove session IDs from text content: + +```html +

We collect your data for the following purposes:

+ +

Last updated on 2023-12-07 (Session: abc123def456)

+``` + +You can implement this filter as follows: + +```js +export function removeSessionIds(document) { + // Find all paragraphs that might contain session IDs + const paragraphs = document.querySelectorAll('.session-id'); + + paragraphs.forEach(paragraph => { + let text = paragraph.textContent; + // Remove session ID patterns like "Session: abc123" or "(Session: def456)" + text = text.replace(/\s*\(?Session:\s*[a-zA-Z0-9]+\)?/g, ''); + paragraph.textContent = text.trim(); + }); +} +``` + +Result after applying the filter: + +```diff +

We collect your data for the following purposes:

+ +-

Last updated on 2023-12-07 (Session: abc123def456)

++

Last updated on 2023-12-07

+``` + +## Step 3: Declare the filter + +Open your service declaration file (e.g. `declarations/MyService.json`) and locate the `filter` property of the specific terms you want to apply the filter to. If it doesn't exist, add it as an array. + +### Filter without parameters + +For filters that don’t require parameters, add the filter name as a string: + +```json +{ + "name": "MyService", + "terms": { + "Privacy Policy": { + "fetch": "https://my.service.example/en/privacy-policy", + "select": ".textcontent", + "filter": [ + "removeSessionIds" + ] + } + } +} +``` + +### Filter with parameters + +For filters that take parameters, use an object format, for example with the built-in filter `removeQueryParams` to remove query parameters from URLs: + +```json +{ + "name": "MyService", + "terms": { + "Privacy Policy": { + "fetch": "https://my.service.example/en/privacy-policy", + "select": ".textcontent", + "filter": [ + { + "removeQueryParams": ["utm_source", "utm_medium", "utm_campaign"] + } + ] + } + } +} +``` + +### Multiple filters + +You can combine multiple filters in the same declaration: + +```json +{ + "name": "MyService", + "terms": { + "Privacy Policy": { + "fetch": "https://my.service.example/en/privacy-policy", + "select": ".textcontent", + "filter": [ + { + "removeQueryParams": ["utm_source", "utm_medium"] + }, + "removeSessionIds" + ] + } + } +} +``` + +## Step 4: Test the filter + +After adding the filter, test your declaration to ensure it works correctly: + +1. Start the terms tracking process +2. Check that the noise has been removed +3. 
Verify that important content is preserved diff --git a/content/terms/reference/built-in-filters.md b/content/terms/reference/built-in-filters.md new file mode 100644 index 00000000..c3075330 --- /dev/null +++ b/content/terms/reference/built-in-filters.md @@ -0,0 +1,27 @@ +--- +title: "Built-in filters" +--- + +# Built-in filters + +This reference details all available built-in [filters]({{< relref "terms/explanation/filters" >}}) that can be applied to avoid noise in versions. + +{{< refItem + name="removeQueryParams" + description="Removes specified query parameters from URLs in links and images." +>}} + +```json +"filter": [ + { + "removeQueryParams": ["utm_source", "utm_medium"] + } +] +``` + +```diff +-

Read the list of our affiliates.

++

Read the list of our affiliates.

+``` + +{{< /refItem >}} diff --git a/content/terms/reference/declaration.md b/content/terms/reference/declaration.md index 454d895d..3a8b5c5a 100644 --- a/content/terms/reference/declaration.md +++ b/content/terms/reference/declaration.md @@ -139,10 +139,18 @@ As an array of those: {{< refItem name="filter" - type="array of strings" - description="Array of filter function names to apply. Function will be executed in the order of the array. See the [Filters]({{< relref \"terms/reference/filters\" >}}) section for more information." - example="[\"filterName1\", \"filterName2\"]" -/>}} + type="array of strings or objects" + description="Array of filter functions to apply. Each item can be either a string (function name) or an object (function name as key, parameters as value). Functions will be executed in the order of the array. See the [Filters]({{< relref \"terms/reference/filters\" >}}) section for more information." +>}} +```json +"filter": [ + "filterName1", + { + "filterName2": "param" + } +] +``` +{{< /refItem >}} {{< refItem name="combine" diff --git a/content/terms/reference/filters.md b/content/terms/reference/filters.md index 6006c8a0..6031c2c6 100644 --- a/content/terms/reference/filters.md +++ b/content/terms/reference/filters.md @@ -4,51 +4,224 @@ title: "Filters" # Filters -Some documents require more complex filtering beyond basic element selection and removal. For example, web pages often contain dynamically generated content like tracking IDs in URLs that change on each page load. While these elements are part of the page, they are not meaningful to the terms content itself. If such dynamic content is included in the archived versions, it creates a lot of insignificant versions and pollutes the archive with noise that makes it harder to identify actual changes to the terms. - -Filters address this need by providing a way to programmatically clean up and normalize the content before archiving. 
They are implemented as JavaScript functions that can manipulate the downloaded web page using the [DOM API](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model), allowing for sophisticated content transformations beyond what's possible with simple CSS selectors. - -Filters take the document DOM and the terms declaration as parameters and are: +Filters are JavaScript functions that take the document DOM as a parameter and are: - **in-place**: they modify the document structure and content directly; - **idempotent**: they return the same document structure and content even if run repeatedly on their own result. +- **ordered**: they are run sequentially in the order specified in the declaration. -Filters are loaded automatically from files named after the service they operate on. For example, filters for the Meetup service, which is declared in `declarations/Meetup.json`, are loaded from `declarations/Meetup.filters.js`. +Learn more about the concept and constraints in the [filters explanation]({{< relref "terms/explanation/filters" >}}). + +## Signature The generic function signature for a filter is: +- For filters that take no parameter: + +```js +export [async] function filterName(document, [documentDeclaration]) +``` + +- For filters that take parameters: + +```js +export [async] function filterName(document, parameters, [documentDeclaration]) +``` + +Each filter is exposed as a named function export that takes a `document` parameter, which behaves like the `document` object in a browser DOM. The `document` parameter is actually a [JSDOM](https://github.com/jsdom/jsdom) document instance. + +These functions can be `async`, but they will still run sequentially.
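To illustrate what "sequentially" means for a mix of synchronous and `async` filters, here is a minimal sketch. The `applyFilters` runner, the two filters and the stand-in document are all hypothetical names introduced for this illustration; this is not the engine's actual implementation, only one way such ordering can be guaranteed.

```js
// Hypothetical runner (an assumption, NOT the engine's actual code):
// applies filters one at a time, in array order.
async function applyFilters(document, filters) {
  for (const filter of filters) {
    // Awaiting inside the loop means the next filter only starts once the
    // previous one has finished, whether it is synchronous or async.
    await filter(document);
  }

  return document;
}

// Hypothetical async filter: finishes after a short delay.
async function slowFilter(document) {
  await new Promise(resolve => setTimeout(resolve, 20));
  document.log.push('slow');
}

// Hypothetical synchronous filter: finishes immediately.
function fastFilter(document) {
  document.log.push('fast');
}

// Stand-in for a JSDOM document: it only records the order in which filters ran.
const fakeDocument = { log: [] };
```

Calling `applyFilters(fakeDocument, [slowFilter, fastFilter])` records `slow` before `fast`, even though `fastFilter` would complete first if both were started at the same time. A filter can therefore rely on the changes made by the filters declared before it.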
+ +## Usage + +### Filters that take no parameter + ```js -export [async] function filterName(document, documentDeclaration) +// .filters.js +export function customFilter(document) { + // filter logic here +} ``` -Each filter is exposed as a named function export that takes a `document` parameter and behaves like the `document` object in a browser DOM. These functions can be `async`, but they will still run sequentially. The whole document declaration is passed as second parameter. +Can be used as follows in the declaration: + +```json +// .json +{ + "name": "", + "terms": { + "": { + "fetch": "", + "select": "", + "filter": [ + "customFilter" + ] + } + } +} +``` -> The `document` parameter is actually a [JSDOM](https://github.com/jsdom/jsdom) document instance. +#### Example -You can learn more about usual noise and ways to handle it [in the guidelines]({{< relref "/terms/guideline/declaring#usual-noise" >}}). +```js +export function convertTimeAgoToDate(document) { + const timeElements = document.querySelectorAll('.metadata time'); + + timeElements.forEach(timeElement => { + const dateTimeValue = timeElement.getAttribute('datetime'); + const textNode = document.createTextNode(dateTimeValue); + timeElement.parentNode.replaceChild(textNode, timeElement); + }); +} +``` -### Example +```json +{ + "name": "MyService", + "terms": { + "Privacy Policy": { + "fetch": "https://my.service.example/privacy", + "select": ".content", + "filter": [ + "convertTimeAgoToDate" + ] + } + } +} +``` -Let's assume a service adds a unique `clickId` parameter in the query string of all link destinations. These parameters change on each page load, leading to recording noise in versions. Since links should still be recorded, it is not appropriate to use `remove` to remove the links entirely. Instead, a filter will manipulate the links destinations to remove the always-changing parameter. 
Concretely, the goal is to apply the following filter: +Result: ```diff -- Read the list of our affiliates. -+ Read the list of our affiliates. +- ++ ``` -The code below implements this filter: +### Filter with parameters ```js -function removeTrackingIdsQueryParam(document) { - const QUERY_PARAM_TO_REMOVE = 'clickId'; +// .filters.js +export function customParameterizedFilter(document, params) { + // filter logic here +} +``` + +Can be used as follows in the declaration: + +```json +// .json +{ + "name": "", + "terms": { + "": { + "fetch": "", + "select": "", + "filter": [ + { + "customParameterizedFilter": ["param1", "param2"] + } + ] + } + } +} +``` - document.querySelectorAll('a').forEach(link => { // iterate over every link in the page - const url = new URL(link.getAttribute('href'), document.location); // URL is part of the DOM API, see https://developer.mozilla.org/en-US/docs/Web/API/URL - const params = new URLSearchParams(url.search); // URLSearchParams is part of the DOM API, see https://developer.mozilla.org/en-US/docs/Web/API/URLSearchParams +#### Example 1 - params.delete(QUERY_PARAM_TO_REMOVE); // we use the DOM API instead of RegExp because we can't know in advance in which order parameters will be written - url.search = params.toString(); // store the query string without the parameter - link.setAttribute('href', url.toString()); // write the destination URL without the parameter +```js +export function removeLinksWithText(document, textArray) { + const links = document.querySelectorAll('a'); + const textsToRemove = Array.isArray(textArray) ? 
textArray : [textArray]; + + links.forEach(link => { + if (textsToRemove.includes(link.textContent.trim())) { + link.remove(); + } }); } ``` + +```json +{ + "name": "MyService", + "terms": { + "Privacy Policy": { + "fetch": "https://my.service.example/privacy", + "select": ".content", + "filter": [ + { "removeLinksWithText": ["Return to previous section", "Go to next section"] } + ] + } + } +} +``` + +Result: + +```diff +
+- Go to next section +

...

+
+ + +``` + +#### Example 2 + +```js +import fetch from 'isomorphic-fetch'; + +export async function convertImagesToBase64(document, selector, documentDeclaration) { + const images = Array.from(document.querySelectorAll(selector)); + + return Promise.all(images.map(async ({ src }, index) => { + if (src.startsWith('data:')) { + return; // Already a data-URI, skip + } + + const imageUrl = new URL(src, documentDeclaration.fetch).href; // Ensure url is absolute + const response = await fetch(imageUrl); + const mimeType = response.headers.get('content-type'); + const content = await response.arrayBuffer(); + + const base64Content = btoa(String.fromCharCode(...new Uint8Array(content))); + + images[index].src = `data:${mimeType};base64,${base64Content}`; + })); + +} +``` + +```json +{ + "name": "MyService", + "terms": { + "Privacy Policy": { + "fetch": "https://my.service.example/privacy", + "select": ".content", + "filter": [ + { "convertImagesToBase64": ".meaningful-illustration" } + ] + } + } +} +``` + +Result: + +```diff +- ++ +``` + +## Third-party libraries + +As can be seen in the last example, third-party libraries can be imported in the filters. These should be declared in the `package.json` of the collection to be available. 
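One caveat about the base64 step in Example 2: spreading every byte into `String.fromCharCode(...)` passes one argument per byte, which can exceed the JavaScript engine's argument limit for large images. The sketch below shows an alternative, assuming the filter runs in Node.js, where the global `Buffer` is available. The `toBase64` and `toDataUri` helpers are hypothetical names introduced for this illustration, not part of any library.

```js
// Hypothetical helper (an assumption, not an existing API): encodes an
// ArrayBuffer or typed array as base64 using Node.js's global Buffer,
// without spreading bytes into function arguments.
function toBase64(content) {
  return Buffer.from(content).toString('base64');
}

// Hypothetical helper: builds the data URI that the filter assigns to `src`.
function toDataUri(mimeType, content) {
  return `data:${mimeType};base64,${toBase64(content)}`;
}
```

In Example 2, the `btoa(...)` expression could then be replaced with `toBase64(content)`, where `content` is the `ArrayBuffer` returned by `response.arrayBuffer()`.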
diff --git a/themes/opentermsarchive/assets/css/components/refItem.css b/themes/opentermsarchive/assets/css/components/refItem.css index 5c84ee57..df6f53cb 100644 --- a/themes/opentermsarchive/assets/css/components/refItem.css +++ b/themes/opentermsarchive/assets/css/components/refItem.css @@ -6,6 +6,14 @@ box-shadow: inset 0 1px var(--colorBlack200); } +.refItem-name { + position: relative; + display: flex; + align-items: center; + gap: 0.5rem; + color:inherit +} + .refItem-name code { font-weight: 600; overflow-y: auto; @@ -27,6 +35,26 @@ font-weight: 400; } +.refItem-anchor-icon { + opacity: 0; + color: var(--colorBlack600); + text-decoration: none; + transition: opacity 0.1s ease; +} + +.refItem-name:hover .refItem-anchor-icon { + opacity: 1; +} + +.refItem-anchor-icon:hover { + color: var(--colorBlack800); +} + +.refItem-anchor-icon { + width: 1em; + height: 1em; +} + .refItem-details { display: flex; } diff --git a/themes/opentermsarchive/assets/css/components/textContent.css b/themes/opentermsarchive/assets/css/components/textContent.css index 58cc7ab7..05ad1ff2 100644 --- a/themes/opentermsarchive/assets/css/components/textContent.css +++ b/themes/opentermsarchive/assets/css/components/textContent.css @@ -214,8 +214,10 @@ } & code { - background-color: var(--colorBlack200); + background-color: var(--colorBlack100); font-size: 0.9em; + white-space: pre-wrap; + border-radius: 0.2em; } & button { diff --git a/themes/opentermsarchive/assets/css/elements/syntax.css b/themes/opentermsarchive/assets/css/elements/syntax.css index f40c741e..c47fdd4c 100644 --- a/themes/opentermsarchive/assets/css/elements/syntax.css +++ b/themes/opentermsarchive/assets/css/elements/syntax.css @@ -1,5 +1,5 @@ -/* Background */ .bg { background-color: var(--colorBlack200); } -/* PreWrapper */ .chroma { background-color: var(--colorBlack200); font-size: 1.4rem; } +/* Background */ .bg { background-color: var(--colorBlack100); border-radius: 0.2em; } +/* PreWrapper */ .chroma { 
background-color: var(--colorBlack100); font-size: 1.4rem; border-radius: 0.2em;} /* Other */ .chroma .x { } /* Error */ .chroma .err { color: var(--colorError); } /* CodeLine */ .chroma .cl { } @@ -72,11 +72,11 @@ /* CommentPreproc */ .chroma .cp { color: #67707b; font-style: italic } /* CommentPreprocFile */ .chroma .cpf { color: #67707b; font-style: italic } /* Generic */ .chroma .g { } -/* GenericDeleted */ .chroma .gd { } +/* GenericDeleted */ .chroma .gd { background-color: rgb(255, 206, 203); } /* GenericEmph */ .chroma .ge { } /* GenericError */ .chroma .gr { } /* GenericHeading */ .chroma .gh { } -/* GenericInserted */ .chroma .gi { } +/* GenericInserted */ .chroma .gi { background-color: rgb(172, 238, 187); } /* GenericOutput */ .chroma .go { } /* GenericPrompt */ .chroma .gp { } /* GenericStrong */ .chroma .gs { } diff --git a/themes/opentermsarchive/assets/css/elements/titles.css b/themes/opentermsarchive/assets/css/elements/titles.css index 031f60c1..1d9a4c3c 100644 --- a/themes/opentermsarchive/assets/css/elements/titles.css +++ b/themes/opentermsarchive/assets/css/elements/titles.css @@ -120,17 +120,26 @@ h6, line-height: 1.25; } +.title-link { + color: inherit; + text-decoration: none; +} + +.title-link:hover { + color: inherit; +} + .title-anchor { - color: var(--colorBlack400); font-size: 0.8em; font-weight: normal; - display: none; + opacity: 0; + transition: opacity 0.1s ease; } h2, h3, h4, h5, h6 { - &:hover { - a.title-anchor { - display: inline; + .title-link:hover { + .title-anchor { + opacity: 1; } } } diff --git a/themes/opentermsarchive/assets/js/icons.js b/themes/opentermsarchive/assets/js/icons.js index ff135d18..d67b247f 100644 --- a/themes/opentermsarchive/assets/js/icons.js +++ b/themes/opentermsarchive/assets/js/icons.js @@ -1,7 +1,7 @@ import { ChevronDown, X, - + Link, createIcons, } from 'lucide'; @@ -9,6 +9,7 @@ createIcons({ icons: { X, ChevronDown, + Link, }, attrs: { 'aria-hidden': true }, }); diff --git 
a/themes/opentermsarchive/layouts/_default/_markup/render-heading.html b/themes/opentermsarchive/layouts/_default/_markup/render-heading.html index 106e65a5..2b7737c1 100644 --- a/themes/opentermsarchive/layouts/_default/_markup/render-heading.html +++ b/themes/opentermsarchive/layouts/_default/_markup/render-heading.html @@ -1,6 +1,12 @@ + {{ if ne .Level 1 }} + + {{ end }} {{ .Text | safeHTML }} {{ if ne .Level 1 }} - 🔗 + + {{ end }} + + diff --git a/themes/opentermsarchive/layouts/shortcodes/refItem.html b/themes/opentermsarchive/layouts/shortcodes/refItem.html index 8bd96902..c0804292 100644 --- a/themes/opentermsarchive/layouts/shortcodes/refItem.html +++ b/themes/opentermsarchive/layouts/shortcodes/refItem.html @@ -7,15 +7,17 @@ {{/* Get description either from attribute or nested content */}} {{ $description := .Get "description" }} {{ $example := .Get "example" }} +{{ $anchorID := $name | lower | replaceRE "[^a-z0-9-]" "-" | replaceRE "-+" "-" | replaceRE "^-" "" | replaceRE "-$" "" }} -
-
- {{ $name }} - {{ $type }} - {{ with $required }} - {{ if eq . true }}required{{ else }}{{ . | markdownify }}{{ end }} - {{ end }} -
+