Update filters documentation #198
Changes from 9 commits
0afb29f
a0beaa7
6a1fb28
65d3606
796a4e7
1664c0d
829ce0f
05292b1
1e346d8
be7bb36
d19bad9
f97d959
75acc97
1ef4725
4bd20bc
f825bf8
f5afdf7
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
--- | ||
title: "Filters" | ||
weight: 3 | ||
--- | ||
|
||
# Filters | ||
|
||
Filters remove noise from terms versions when it cannot be addressed by directly selecting or removing content with CSS selectors or range selectors. | ||
|
||
## Why filters are needed | ||
|
||
Web pages often contain dynamically generated content, or content that cannot be targeted with CSS selectors, which creates noise in the archive: | ||
Ndpnt marked this conversation as resolved.
|
||
|
||
- Tracking parameters in URLs, for example `utm_source`, `utm_medium`, … | ||
- Date-based content that changes between visits, for example "Updated X days ago", which can be converted to "Last updated on YYYY-MM-DD". | ||
- Dynamic elements with changing classes or IDs | ||
|
||
Without filters, this dynamic content creates changes that are not meaningful to the terms. | ||
|
||
## How filters work | ||
|
||
Filters are JavaScript functions that receive a JSDOM document instance and can manipulate the DOM structure directly. They modify the document structure and content in-place and they run sequentially in the order specified in the declaration. | ||
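As a rough illustration of the sequential, in-place behavior described above, the sketch below runs a list of filters over one shared document. The `applyFiltersInOrder` runner, the two toy filters and the plain object standing in for a JSDOM document are all assumptions made for this example; the engine's actual loading and execution code is not shown in this documentation.

```js
// Hypothetical runner, for illustration only: not the engine's actual code.
async function applyFiltersInOrder(document, filters) {
  for (const filter of filters) {
    // Awaiting each filter before starting the next keeps execution sequential,
    // even when a filter is declared `async`.
    await filter(document);
  }

  return document; // the same object that was passed in, modified in place
}

// Two toy filters working on a plain object that stands in for a JSDOM document.
function trimTitle(document) {
  document.title = document.title.trim();
}

async function upperCaseTitle(document) {
  document.title = document.title.toUpperCase();
}

// Usage: filters run in the order given, each one seeing the previous one's changes.
const page = { title: '  Privacy Policy  ' };
const done = applyFiltersInOrder(page, [trimTitle, upperCaseTitle]);
```

Because every filter works on the same object, the order of the declaration matters: a later filter only ever sees what earlier filters left behind.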
|
||
## Filter design principles | ||
|
||
When designing filters, follow these core principles: | ||
|
||
- **Be specific**: Target only the noise you want to remove. Avoid broad selectors that might accidentally remove important content. | ||
- **Be safe**: Ensure your filter doesn't accidentally remove important content. Always check that the generated version still contains the whole terms content. | ||
- **Be idempotent**: Your filter should produce the same result even if run multiple times on its own output. This ensures consistency and prevents unexpected behavior. | ||
- **Be efficient**: Use efficient DOM queries and avoid unnecessary operations. Process only the elements you need to modify. | ||
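The idempotence principle can be checked mechanically. The sketch below is a minimal illustration working on plain strings rather than on a DOM; `isIdempotent`, `collapseWhitespace` and `prefixUpdated` are hypothetical names introduced for this example only.

```js
// Returns true when applying `transform` a second time changes nothing.
function isIdempotent(transform, input) {
  const once = transform(input);
  const twice = transform(once);

  return once === twice;
}

// Idempotent: once whitespace is collapsed, collapsing again has no effect.
const collapseWhitespace = text => text.replace(/\s+/g, ' ').trim();

// Not idempotent: every run adds another "Last " in front of "Updated".
const prefixUpdated = text => text.replace(/Updated/g, 'Last Updated');

console.log(isIdempotent(collapseWhitespace, 'Updated   3 days ago ')); // true
console.log(isIdempotent(prefixUpdated, 'Updated 3 days ago')); // false
```

The same check applies to a real filter: run it twice on the same document and compare the serialized result after each run.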
|
||
|
||
## When to use filters | ||
|
||
|
||
Use filters when: | ||
|
||
- **CSS selectors are insufficient**: When noise appears within content that can't be targeted with selectors or [range selectors]({{< relref "terms/explanation/range-selectors" >}}) with the [`select`]({{< relref "terms/reference/declaration/#ref-select" >}}) and [`remove`]({{< relref "terms/reference/declaration/#ref-remove" >}}) properties. | ||
- **Meaningful content is dynamic**: When meaningful elements change on each page load. For example, "Updated X days ago" can be converted to "Last updated on YYYY-MM-DD". | ||
- **Patterns are complex**: When simple removal isn't possible, for example removing all the tracking parameters in URLs. |
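To make the "Updated X days ago" case concrete, here is a minimal sketch of the text conversion, assuming the date on which the page was fetched is available. The function name `toAbsoluteDate` and the `referenceDate` parameter are hypothetical, and the sketch works on a string rather than on DOM nodes.

```js
const MILLISECONDS_PER_DAY = 24 * 60 * 60 * 1000;

// Replaces "Updated N days ago" with an absolute date computed from `referenceDate`.
function toAbsoluteDate(text, referenceDate) {
  return text.replace(/Updated (\d+) days? ago/g, (match, days) => {
    const date = new Date(referenceDate.getTime() - Number(days) * MILLISECONDS_PER_DAY);

    // toISOString() is always in UTC, so the result does not depend on the local time zone.
    return `Last updated on ${date.toISOString().slice(0, 10)}`;
  });
}

const fetchDate = new Date('2023-12-10T00:00:00Z'); // assumed fetch date for the example
console.log(toAbsoluteDate('Updated 3 days ago', fetchDate)); // Last updated on 2023-12-07
```

The converted text no longer matches the pattern, so running the conversion again leaves it unchanged.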
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,159 @@ | ||
--- | ||
title: Apply filters | ||
weight: 7 | ||
--- | ||
|
||
# Apply filters | ||
|
||
This guide explains how to apply filters to existing declarations. Filters remove meaningless content that changes on each page load, or that cannot be removed with CSS selectors, so that it does not create noise in the terms change history. | ||
|
||
## Prerequisites | ||
|
||
- An existing terms declaration file | ||
- The noise you want to remove has been identified, and you have confirmed that it cannot be removed with CSS selectors using the [`remove`]({{< relref "terms/reference/declaration/#ref-remove" >}}) property. | ||
|
||
|
||
## Step 1: Check for built-in filters | ||
I think I'll use a structure based on the principle that we most often use a built-in filter and optionally a custom filter, so I would put everything related to creating a custom filter on a dedicated how-to page.
I think we currently do not have enough built-in filters to justify splitting into two pages. |
||
|
||
Built-in filters are pre-defined functions that handle common noise patterns. They're the easiest way to clean up content without writing custom code. | ||
|
||
Review the available [built-in filters]({{< relref "/terms/reference/built-in-filters" >}}) to see whether one matches your needs. | ||
|
||
If you find a suitable built-in filter, proceed to [Step 2](#step-2-declare-the-filter). Otherwise, you will need to create a custom filter. | ||
|
||
### Create a custom filter (optional) | ||
|
||
If no built-in filter matches your needs, you'll need to create a custom filter. This requires JavaScript knowledge and familiarity with DOM manipulation. | ||
|
||
#### Create the filter file | ||
|
||
Create a JavaScript file with the same name as your service declaration but with the `.filters.js` extension. For example, if your declaration is `declarations/MyService.json`, create `declarations/MyService.filters.js`. | ||
|
||
#### Write the filter function | ||
|
||
Define your filter function following this signature: | ||
|
||
```js | ||
export function myCustomFilter(document, parameters) { | ||
// `parameters` is optional: it holds the values passed from the declaration, if any | ||
// Your filter logic here | ||
} | ||
``` | ||
|
||
**Parameters:** | ||
|
||
- `document`: JSDOM document instance representing the web page | ||
- `parameters`: Values passed from the declaration (optional) | ||
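To show how the optional second argument can be used, here is a minimal sketch of a parameterized filter that strips a list of attributes from every element carrying them. The filter name `removeAttributes` is hypothetical and not part of the project. In a real `.filters.js` file the function would be declared with `export`, which is omitted here so the snippet can run on its own.

```js
// Hypothetical parameterized filter: removes each listed attribute wherever it appears.
// `attributeNames` corresponds to the values passed from the declaration.
function removeAttributes(document, attributeNames = []) {
  attributeNames.forEach(name => {
    // `[name]` is a CSS attribute selector matching every element that has the attribute
    document.querySelectorAll(`[${name}]`).forEach(element => {
      element.removeAttribute(name);
    });
  });
}
```

Defaulting `attributeNames` to an empty array keeps the filter harmless when it is declared without parameters.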
|
||
**Example: Remove session IDs from text content** | ||
|
||
Suppose the page includes a session ID in its text content: | ||
|
||
```html | ||
<p>We collect your data for the following purposes:</p> | ||
<ul> | ||
<li>To provide our services</li> | ||
<li>To improve user experience</li> | ||
</ul> | ||
<p class="session-id">Last updated on 2023-12-07 (Session: abc123def456)</p> | ||
``` | ||
|
||
You can implement this filter as follows: | ||
|
||
```js | ||
export function removeSessionIds(document) { | ||
// Find all paragraphs that might contain session IDs | ||
const paragraphs = document.querySelectorAll('p.session-id'); | ||
|
||
paragraphs.forEach(paragraph => { | ||
let text = paragraph.textContent; | ||
// Remove session ID patterns like "Session: abc123" or "(Session: def456)" | ||
text = text.replace(/\s*\(?Session:\s*[a-zA-Z0-9]+\)?/g, ''); | ||
paragraph.textContent = text.trim(); | ||
}); | ||
} | ||
``` | ||
|
||
Result after applying the filter: | ||
|
||
```diff | ||
<p>We collect your data for the following purposes:</p> | ||
<ul> | ||
<li>To provide our services</li> | ||
<li>To improve user experience</li> | ||
</ul> | ||
- <p class="session-id">Last updated on 2023-12-07 (Session: abc123def456)</p> | ||
+ <p class="session-id">Last updated on 2023-12-07</p> | ||
``` | ||
|
||
## Step 2: Declare the filter | ||
|
||
Open your service declaration file (e.g., `declarations/MyService.json`) and locate the `filter` property of the specific terms you want to apply the filter to. If it doesn't exist, add it as an array. | ||
|
||
### Filter without parameters | ||
|
||
For filters that don't require parameters, add the filter name as a string: | ||
|
||
```json | ||
{ | ||
"name": "MyService", | ||
"terms": { | ||
"Privacy Policy": { | ||
"fetch": "https://my.service.com/en/privacy-policy", | ||
"select": ".textcontent", | ||
"filter": [ | ||
"removeSessionIds" | ||
] | ||
} | ||
} | ||
} | ||
``` | ||
|
||
### Parameterized filter | ||
|
||
For filters that require parameters, use an object whose key is the filter name and whose value holds the parameters. For example, the built-in filter `removeQueryParams` removes query parameters from URLs: | ||
|
||
```json | ||
{ | ||
"name": "MyService", | ||
"terms": { | ||
"Privacy Policy": { | ||
"fetch": "https://my.service.com/en/privacy-policy", | ||
"select": ".textcontent", | ||
"filter": [ | ||
{ | ||
"removeQueryParams": ["utm_source", "utm_medium", "utm_campaign"] | ||
} | ||
] | ||
} | ||
} | ||
} | ||
``` | ||
|
||
### Multiple filters | ||
|
||
You can combine multiple filters in the same declaration: | ||
|
||
```json | ||
{ | ||
"name": "MyService", | ||
"terms": { | ||
"Privacy Policy": { | ||
"fetch": "https://my.service.com/en/privacy-policy", | ||
"select": ".textcontent", | ||
"filter": [ | ||
{ | ||
"removeQueryParams": ["utm_source", "utm_medium"] | ||
}, | ||
"removeSessionIds" | ||
] | ||
} | ||
} | ||
} | ||
``` | ||
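The two entry shapes used above (a plain string, or an object with a single key) can be read as a filter name plus optional parameters. The sketch below shows one way to interpret them; `parseFilterEntry` is a hypothetical helper written for this example and is not taken from the engine.

```js
// Hypothetical helper: turns one entry of the `filter` array into a name and its parameters.
function parseFilterEntry(entry) {
  if (typeof entry === 'string') {
    return { name: entry, parameters: undefined }; // filter without parameters
  }

  // Object form: a single key (the filter name) mapped to its parameters.
  const [[name, parameters]] = Object.entries(entry);

  return { name, parameters };
}

const declaredFilters = [
  { removeQueryParams: ['utm_source', 'utm_medium'] },
  'removeSessionIds',
];

// Entries keep their declaration order, which is also the order in which filters run.
console.log(declaredFilters.map(parseFilterEntry));
```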
|
||
## Step 3: Test the filter | ||
|
||
After adding the filter, test your declaration to ensure it works correctly: | ||
|
||
1. Start the terms tracking process | ||
2. Check that the noise has been removed | ||
3. Verify that important content is preserved |
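Checks 2 and 3 can be partly automated once you have the text of the generated version. The sketch below is one possible approach, assuming you have that text as a string; `checkFilteredText`, `noisePatterns` and `requiredPhrases` are hypothetical names, not part of the project's tooling.

```js
// Reports noise that is still present and important phrases that went missing.
function checkFilteredText(filteredText, { noisePatterns = [], requiredPhrases = [] }) {
  // Use regular expressions without the `g` flag: `test` is stateful on global patterns.
  const remainingNoise = noisePatterns.filter(pattern => pattern.test(filteredText));
  const missingPhrases = requiredPhrases.filter(phrase => !filteredText.includes(phrase));

  return {
    ok: remainingNoise.length === 0 && missingPhrases.length === 0,
    remainingNoise,
    missingPhrases,
  };
}

const report = checkFilteredText(
  'We collect your data for the following purposes: Last updated on 2023-12-07',
  {
    noisePatterns: [/Session:/],
    requiredPhrases: ['We collect your data', 'Last updated on 2023-12-07'],
  },
);

console.log(report.ok); // true
```

A check like this does not replace reading the generated version, but it catches the most obvious regressions when a filter is changed later.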
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
--- | ||
title: "Built-in filters" | ||
--- | ||
|
||
# Built-in filters | ||
|
||
This reference documentation details all available built-in filters that can be used to avoid noise in the terms content. | ||
|
||
## Filters | ||
|
||
{{< refItem | ||
name="removeQueryParams" | ||
description="Removes specified query parameters from URLs in links and images within the terms content" | ||
>}} | ||
|
||
```json | ||
"filter": [ | ||
{ | ||
"removeQueryParams": ["utm_source", "utm_medium"] | ||
} | ||
] | ||
``` | ||
|
||
Result: | ||
|
||
```diff | ||
- <p>Read the <a href="https://example.com/example-page?utm_source=OGB&utm_medium=website&lang=en">list of our affiliates</a>.</p> | ||
+ <p>Read the <a href="https://example.com/example-page?lang=en">list of our affiliates</a>.</p> | ||
``` | ||
|
||
{{< /refItem >}} |
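For readers curious about how such a filter can work, here is a minimal sketch built on the standard `URL` API. It is not the actual implementation of the built-in `removeQueryParams` filter, which is not shown on this page; the name `removeQueryParamsSketch` is hypothetical, and this sketch deliberately leaves relative or malformed URLs untouched.

```js
// Illustrative sketch only: not the built-in filter's real code.
function removeQueryParamsSketch(document, paramNames = []) {
  // Pairs of CSS selector and the attribute holding the URL
  const targets = [['a[href]', 'href'], ['img[src]', 'src']];

  targets.forEach(([selector, attribute]) => {
    document.querySelectorAll(selector).forEach(element => {
      let url;

      try {
        url = new URL(element.getAttribute(attribute));
      } catch {
        return; // relative or malformed URL: left unchanged in this sketch
      }

      // Using the URL API rather than a regular expression avoids depending on parameter order
      paramNames.forEach(name => url.searchParams.delete(name));
      element.setAttribute(attribute, url.toString());
    });
  });
}
```

Note that `url.toString()` returns a normalized URL, so a rewritten attribute can differ slightly from the original beyond the removed parameters, for example by gaining a trailing slash after a bare host name.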
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,51 +4,76 @@ title: "Filters" | |
|
||
# Filters | ||
|
||
Some documents require more complex filtering beyond basic element selection and removal. For example, web pages often contain dynamically generated content like tracking IDs in URLs that change on each page load. While these elements are part of the page, they are not meaningful to the terms content itself. If such dynamic content is included in the archived versions, it creates a lot of insignificant versions and pollutes the archive with noise that makes it harder to identify actual changes to the terms. | ||
|
||
Filters address this need by providing a way to programmatically clean up and normalize the content before archiving. They are implemented as JavaScript functions that can manipulate the downloaded web page using the [DOM API](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model), allowing for sophisticated content transformations beyond what's possible with simple CSS selectors. | ||
|
||
Filters take the document DOM and the terms declaration as parameters and are: | ||
Filters are JavaScript functions that take the document DOM as a parameter and are: | ||
|
||
- **in-place**: they modify the document structure and content directly; | ||
- **idempotent**: they return the same document structure and content even if run repeatedly on their own result. | ||
Comment on lines 9 to 10: That sounds like explanations. |
||
|
||
Filters are loaded automatically from files named after the service they operate on. For example, filters for the Meetup service, which is declared in `declarations/Meetup.json`, are loaded from `declarations/Meetup.filters.js`. | ||
|
||
The generic function signature for a filter is: | ||
|
||
```js | ||
export [async] function filterName(document, documentDeclaration) | ||
export [async] function filterName(document, [parameters]) | ||
|
||
``` | ||
|
||
Each filter is exposed as a named function export that takes a `document` parameter and behaves like the `document` object in a browser DOM. These functions can be `async`, but they will still run sequentially. The whole document declaration is passed as second parameter. | ||
|
||
Each filter is exposed as a named function export that takes a `document` parameter, which behaves like the `document` object in a browser DOM. | ||
> The `document` parameter is actually a [JSDOM](https://github.com/jsdom/jsdom) document instance. | ||
|
||
You can learn more about usual noise and ways to handle it [in the guidelines]({{< relref "/terms/guideline/declaring#usual-noise" >}}). | ||
These functions can be `async`, but they will still run sequentially. | ||
|
||
## Usage | ||
|
||
### Example | ||
### Simple filter | ||
|
||
Let's assume a service adds a unique `clickId` parameter in the query string of all link destinations. These parameters change on each page load, leading to recording noise in versions. Since links should still be recorded, it is not appropriate to use `remove` to remove the links entirely. Instead, a filter will manipulate the links destinations to remove the always-changing parameter. Concretely, the goal is to apply the following filter: | ||
```js | ||
// <service name>.filters.js | ||
export function customFilter(document) { | ||
// filter logic here | ||
} | ||
``` | ||
|
||
```diff | ||
- Read the <a href="https://example.com/example-page?clickId=349A2033B&lang=en">list of our affiliates</a>. | ||
+ Read the <a href="https://example.com/example-page?lang=en">list of our affiliates</a>. | ||
Can be used as follows in the declaration: | ||
|
||
```json | ||
// <service name>.json | ||
{ | ||
"name": "<service name>", | ||
"terms": { | ||
"<terms type>": { | ||
"fetch": "<URL>", | ||
"select": "<CSS or Range selectors>", | ||
"filter": [ | ||
"customFilter" | ||
] | ||
} | ||
} | ||
} | ||
``` | ||
|
||
The code below implements this filter: | ||
### Filter with parameters | ||
|
||
```js | ||
function removeTrackingIdsQueryParam(document) { | ||
const QUERY_PARAM_TO_REMOVE = 'clickId'; | ||
|
||
document.querySelectorAll('a').forEach(link => { // iterate over every link in the page | ||
const url = new URL(link.getAttribute('href'), document.location); // URL is part of the DOM API, see https://developer.mozilla.org/en-US/docs/Web/API/URL | ||
const params = new URLSearchParams(url.search); // URLSearchParams is part of the DOM API, see https://developer.mozilla.org/en-US/docs/Web/API/URLSearchParams | ||
// <service name>.filters.js | ||
export function customParameterizedFilter(document, params) { | ||
// filter logic here | ||
} | ||
``` | ||
|
||
params.delete(QUERY_PARAM_TO_REMOVE); // we use the DOM API instead of RegExp because we can't know in advance in which order parameters will be written | ||
url.search = params.toString(); // store the query string without the parameter | ||
link.setAttribute('href', url.toString()); // write the destination URL without the parameter | ||
}); | ||
Can be used as follows in the declaration: | ||
|
||
```json | ||
// <service name>.json | ||
{ | ||
"name": "<service name>", | ||
"terms": { | ||
"<terms type>": { | ||
"fetch": "<URL>", | ||
"select": "<CSS or Range selectors>", | ||
"filter": [ | ||
{ | ||
"customParameterizedFilter": ["param1", "param2"] | ||
} | ||
] | ||
} | ||
} | ||
} | ||
``` |