Commit 3ae5c07

Add draft of local contribution guide

Cli4d and clementbiron authored and committed
1 parent 71e8dea commit 3ae5c07

1 file changed: 369 additions & 0 deletions

content/contributing-terms.en.md

@@ -28,6 +28,375 @@ To add a declaration, you need to follow these steps:
6. After you've properly added your selectors and structured your JSON file, test and validate it by running `npx ota validate --services [service name]` from the root of the repository. This validates the declaration and highlights any changes required.
7. If all tests pass, open a pull request to the main repository.

> If you have a hard time finding the service name, check out the [practical guidelines to find the service name](declarations-guidelines.md#service-name), and feel free to mention your uncertainties in the pull request! We will help you improve the service name if necessary 🙂

### Service ID

The service ID is exposed to developers. It should be easy to handle with scripts and other tools.

- Non-ASCII characters are not supported. Service IDs are derived from the service name by normalising it into ASCII.
  - _Example: `RTÉ` → `RTE`_.
  - _Example: `historielærer.dk` → `historielaerer.dk`_.
  - _Example: `туту.ру` → `tutu.ru`_.
  - _Example: `抖音短视频` → `Douyin`_.
- Punctuation is supported, except characters that have meaning at filesystem level (`:`, `/`, `\`). These are replaced with a dash (`-`).
  - _Example: `Booking.com` → `Booking.com`_.
  - _Example: `Yahoo!` → `Yahoo!`_.
  - _Example: `re:start` → `re-start`_.
  - _Example: `we://` → `we---`_.
- Capitals and spaces are supported. Casing and spacing are expected to reflect the official service name casing and spacing.
  - _Example: `App Store` → `App Store`_.
  - _Example: `DeviantArt` → `DeviantArt`_.

> If you have a hard time defining the service ID, check out the [practical guidelines to derive the ID from the service name](declarations-guidelines.md#service-id), and feel free to mention your uncertainties in the pull request! We will help you improve the service ID if necessary 🙂

> More details on the ID and naming constraints and recommendations can be found in the relevant [decision record](https://github.com/OpenTermsArchive/engine/blob/main/decision-records/0001-service-name-and-id.md).

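
The mechanical part of the rules above can be sketched in a few lines of JavaScript. The `deriveServiceId` helper below is hypothetical and is not part of the Open Terms Archive engine. It only strips diacritics and replaces the filesystem-significant characters; names that need real transliteration (such as `historielærer.dk`, `туту.ру` or `抖音短视频`) still have to be converted by hand.

```js
// Hypothetical helper, for illustration only: not an engine API.
function deriveServiceId(serviceName) {
  return serviceName
    .normalize('NFD')            // split accented letters into base letter + combining mark
    .replace(/\p{M}/gu, '')      // drop the combining marks, keeping the base letters
    .replace(/[:\/\\]/g, '-');   // replace `:`, `/` and `\` with a dash
}

console.log(deriveServiceId('RTÉ'));
console.log(deriveServiceId('re:start'));
console.log(deriveServiceId('we://'));
```

Casing, spacing and other punctuation are left untouched, which matches the `App Store` and `Yahoo!` examples above.
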
### Service declaration

Once you have the [service name](#service-name) and the [service ID](#service-id), create a JSON file in the `declarations` folder named after the ID of the service you want to add, with the following structure:

```json
{
  "name": "<service name>",
  "documents": {}
}
```

Within the `documents` JSON object, we will now declare terms.

- - -

## Declaring terms

Terms are declared in a service declaration file, under the `documents` property.

Most of the time, terms are written in only one source document (for example [Facebook Terms of Service](https://www.facebook.com/legal/terms)) but sometimes terms can be spread across multiple online source documents, and their combination constitutes the terms (for example [Facebook Community Guidelines](https://transparency.fb.com/policies/community-standards/)).

#### Source document

The way in which a source document is obtained is defined in a JSON object:

```json
{
  "fetch": "The URL where the document can be found",
  "executeClientScripts": "A boolean to execute client-side JavaScript loaded by the document before accessing the content, in case the DOM modifications are needed to access the content; defaults to false (fetch HTML only)",
  "filter": "An array of service specific filter function names",
  "remove": "A CSS selector, a range selector or an array of selectors that target the insignificant parts of the document that have to be removed. Useful to remove parts that are inside the selected parts",
  "select": "A CSS selector, a range selector or an array of selectors that target the meaningful parts of the document, excluding elements such as headers, footers and navigation"
}
```

- For HTML files, `fetch` and `select` are mandatory.
- For PDF files, only `fetch` is mandatory.

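
As a rough illustration of the two rules above, the sketch below lists the mandatory keys missing from a source document object. Both function names are hypothetical, and detecting a PDF from a `.pdf` path extension is an assumption made for this example only; it is not necessarily how the engine decides.

```js
// Assumption for this sketch: a document is treated as a PDF when its URL path ends with ".pdf".
function isPdfUrl(url) {
  return new URL(url).pathname.toLowerCase().endsWith('.pdf');
}

// Hypothetical helper: returns the names of the mandatory keys that are missing.
function missingMandatoryKeys(sourceDocument) {
  const missing = [];

  if (!sourceDocument.fetch) {
    missing.push('fetch'); // `fetch` is mandatory for every source document
  }

  const isPdf = Boolean(sourceDocument.fetch) && isPdfUrl(sourceDocument.fetch);

  if (!isPdf && !sourceDocument.select) {
    missing.push('select'); // `select` is mandatory for HTML documents only
  }

  return missing;
}

console.log(missingMandatoryKeys({ fetch: 'https://example.com/terms' }));
console.log(missingMandatoryKeys({ fetch: 'https://example.com/terms.pdf' }));
```

For actual validation, the `npx ota validate` command remains the reference.
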
Let’s start by defining these keys!

#### `fetch`

This property should simply contain the URL at which the terms you want to track can be downloaded. HTML and PDF files are supported.

When terms coexist in different languages and jurisdictions, please refer to the [scope of the collection](../README.md#collections) to which you are contributing. This scope is usually defined in the README.

#### `select`

_This property is not needed for PDF documents._

Most of the time, contractual documents are exposed as web pages, with a header, a footer, navigation menus, possibly ads… We aim to track only the significant parts of the document. To achieve that, the `select` property extracts only those parts in the process of [converting from snapshot to version](../README.md#how-it-works).

The `select` value can be either a [CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors), a [range selector](#range-selectors) or an array of those.

##### CSS selectors

CSS selectors should be provided as a string. See the [specification](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) for how to write CSS selectors.

> For example, the following selector will select the content in the `<main>` tag of the HTML document:
>
> ```json
> "select": "main"
> ```

##### Range selectors

A range selector is defined with a _start_ and an _end_ CSS selector. It is also necessary to define whether the range starts before or after the element targeted by the _start_ CSS selector, and whether it ends before or after the element targeted by the _end_ CSS selector.

To that end, a range selector is a JSON object containing two of the four available keys, one for the start and one for the end: `startBefore`, `startAfter`, `endBefore` and `endAfter`.

```json
{
  "start[Before|After]": "<CSS selector>",
  "end[Before|After]": "<CSS selector>"
}
```

> For example, the following selector will select the content between the element targeted by the CSS selector `#privacy-eea`, including it, and the element targeted by the CSS selector `footer`, excluding it:
>
> ```json
> {
>   "startBefore": "#privacy-eea",
>   "endBefore": "footer"
> }
> ```

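
The shape described above (exactly one start key and one end key, each holding a CSS selector string) can be checked with a small function. This is a minimal sketch; `isValidRangeSelector` is a hypothetical name, not an engine API, and it does not check that the CSS selectors themselves are valid.

```js
// Hypothetical helper: checks the shape of a range selector object.
function isValidRangeSelector(selector) {
  if (selector === null || typeof selector !== 'object' || Array.isArray(selector)) {
    return false; // plain CSS selector strings and arrays are not range selectors
  }

  const keys = Object.keys(selector);
  const startKeys = keys.filter(key => key === 'startBefore' || key === 'startAfter');
  const endKeys = keys.filter(key => key === 'endBefore' || key === 'endAfter');

  return keys.length === 2              // no extra keys
    && startKeys.length === 1           // exactly one way to start the range
    && endKeys.length === 1             // exactly one way to end the range
    && keys.every(key => typeof selector[key] === 'string' && selector[key].length > 0);
}

console.log(isValidRangeSelector({ startBefore: '#privacy-eea', endBefore: 'footer' }));
console.log(isValidRangeSelector({ startBefore: 'h1', startAfter: 'h2' }));
```
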
#### `remove`

_This property is optional._

Beyond [selecting a subset of a web page](#select), some documents will have non-significant parts in the middle of otherwise significant parts. For example, they can have “go to top” links or banner ads. These can be removed by listing [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors), [range selectors](#range-selectors) or an array of them under the `remove` property.

##### Example

Let's assume a web page contains the following content:

```html
<main>
  <div class="filter-holder">
    <select class="filter-options">
      <option value="https://www.example.com/policies/user-agreement" selected>User Agreement</option>
      <option value="https://www.example.com/policies/privacy-policy">Privacy Policy</option>
      <option value="https://www.example.com/policies/content-policy">Content Policy</option>
      <option value="https://www.example.com/policies/broadcasting-content-policy">Broadcasting Content Policy</option>
    </select>
  </div>
  <h1>User Agreement</h1>
  <div>…terms…</div>
</main>
```

If only `main` is used in `select`, the following version will be extracted:

```md
User Agreement Privacy Policy Content Policy Broadcasting Content Policy

User Agreement
==============

…terms…
```

Whereas we want instead:

```md
User Agreement
==============

…terms…
```

This result can be obtained with the following declaration:

```json
{
  "fetch": "https://example.com/user-agreement",
  "select": "main",
  "remove": ".filter-holder"
}
```

##### Complex selector examples

```json
{
  "fetch": "https://support.google.com/adsense/answer/48182",
  "select": ".article-container",
  "remove": ".print-button, .go-to-top"
}
```

```json
{
  "fetch": "https://www.wechat.com/en/service_terms.html",
  "select": "#agreement",
  "remove": {
    "startBefore": "#wechat-terms-of-service-usa-specific-terms-",
    "endBefore": "#wechat-terms-of-service-european-union-specific-terms-"
  }
}
```

```json
{
  "fetch": "https://fr-fr.facebook.com/legal/terms/plain_text_terms",
  "select": "div[role=main]",
  "remove": [
    {
      "startBefore": "[role=\"separator\"]",
      "endAfter": "body"
    },
    "[style=\"display:none\"]"
  ]
}
```

#### `executeClientScripts`

_This property is optional._

In some cases, the content of the document is only loaded (or is modified dynamically) by client scripts.
When this boolean property is set to `true`, the page is loaded in a headless browser, which loads all assets and executes client scripts before the document contents are retrieved.

Since the performance cost of this approach is high, it is set to `false` by default, relying on the HTML content only.

#### `filter`

_This property is optional._

Finally, some documents will need more complex filtering beyond simple element selection and removal, for example to remove noise (changes in textual content that are not meaningful to the terms of service). Such filters are declared as JavaScript functions that modify the downloaded web page through the [DOM API](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model).

Filters take the document DOM and the terms declaration as parameters and are:

- **in-place**: they modify the document structure and content directly;
- **idempotent**: they should return the same document structure and content even if run repeatedly on their own result.

Filters are loaded automatically from files named after the service they operate on. For example, filters for the Meetup service, which is declared in `declarations/Meetup.json`, are loaded from `declarations/Meetup.filters.js`.

The generic function signature for a filter is:

```js
export [async] function filterName(document, documentDeclaration)
```

Each filter is exposed as a named function export that takes a `document` parameter, which behaves like the `document` object in a browser DOM. These functions can be `async`, but they will still run sequentially. The whole document declaration is passed as the second parameter.

> The `document` parameter is actually a [JSDOM](https://github.com/jsdom/jsdom) document instance.

You can learn more about usual noise and ways to handle it [in the guidelines](declarations-guidelines.md#Usual-noise).

##### Example

Let's assume a service adds a unique `clickId` parameter in the query string of all link destinations. These parameters change on each page load, leading to recording noise in versions. Since links should still be recorded, it is not appropriate to use `remove` to remove the links entirely. Instead, a filter will manipulate the link destinations to remove the always-changing parameter. Concretely, the goal is to apply the following change:

```diff
- Read the <a href="https://example.com/example-page?clickId=349A2033B&lang=en">list of our affiliates</a>.
+ Read the <a href="https://example.com/example-page?lang=en">list of our affiliates</a>.
```

The code below implements this filter:

```js
export function removeTrackingIdsQueryParam(document) {
  const QUERY_PARAM_TO_REMOVE = 'clickId';

  document.querySelectorAll('a[href]').forEach(link => { // iterate over every link in the page that has a destination
    const url = new URL(link.getAttribute('href'), document.location); // URL is part of the DOM API, see https://developer.mozilla.org/en-US/docs/Web/API/URL
    const params = new URLSearchParams(url.search); // URLSearchParams is part of the DOM API, see https://developer.mozilla.org/en-US/docs/Web/API/URLSearchParams

    params.delete(QUERY_PARAM_TO_REMOVE); // we use the DOM API instead of RegExp because we can't know in advance in which order parameters will be written
    url.search = params.toString(); // store the query string without the parameter
    link.setAttribute('href', url.toString()); // write the destination URL without the parameter
  });
}
```

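
To illustrate the idempotence requirement on the `clickId` example, here is a minimal sketch that applies the same query-string cleanup to a single link destination, outside of any DOM. `removeQueryParam` is a hypothetical helper written for this illustration; running it a second time on its own result leaves the URL unchanged.

```js
// Hypothetical helper: removes one query parameter from a link destination.
function removeQueryParam(href, paramName, baseUrl) {
  const url = new URL(href, baseUrl);   // resolve relative destinations against the page URL
  url.searchParams.delete(paramName);   // no effect if the parameter is already absent
  return url.toString();
}

const pageUrl = 'https://example.com/';
const once = removeQueryParam('https://example.com/example-page?clickId=349A2033B&lang=en', 'clickId', pageUrl);
const twice = removeQueryParam(once, 'clickId', pageUrl);

console.log(once);
console.log(once === twice); // an idempotent filter gives the same result when run again
```
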
##### Example usage of declaration parameter

The second parameter can be used to access the defined document URL or selector inside the filter.

Let's assume a service stores some of its legally-binding terms in images. To track these changes properly, images should be stored as part of the terms. By default, images are not stored since they significantly increase the document size. The filter below will store images inline in the terms, encoded in a [data URL](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs). In order to download the images for conversion, the base URL of the web page is needed to resolve relative links. This information is obtained from the declaration.

```js
import fetch from 'isomorphic-fetch';

export async function convertImagesToBase64(document, documentDeclaration) {
  const { fetch: baseUrl, select: selector } = documentDeclaration; // this example assumes `select` is a single CSS selector string

  const images = Array.from(document.querySelectorAll(`${selector} img`));

  return Promise.all(images.map(async ({ src }, index) => {
    const imageAbsoluteUrl = new URL(src, baseUrl).href; // resolve relative links against the page URL
    const response = await fetch(imageAbsoluteUrl);
    const mimeType = response.headers.get('content-type');
    const content = await response.arrayBuffer();

    const base64Image = Buffer.from(content).toString('base64'); // Node.js Buffer handles large images, unlike spreading every byte into String.fromCharCode

    images[index].src = `data:${mimeType};base64,${base64Image}`;
  }));
}
```

#### Terms with a single source document

In the case where terms are extracted from one single source document, they are declared by simply declaring that source document:

```json
"documents": {
  "<terms type>": {
    "fetch": "",
    "executeClientScripts": "",
    "filter": "",
    "remove": "",
    "select": ""
  }
}
```

#### Terms with multiple source documents

When the terms are spread across multiple source documents, they should be declared by declaring their combination:

```json
"documents": {
  "<terms type>": {
    "combine": [
      {
        "fetch": "",
        "executeClientScripts": "",
        "filter": "",
        "remove": "",
        "select": ""
      },
      {
        "fetch": "",
        "executeClientScripts": "",
        "filter": "",
        "remove": "",
        "select": ""
      }
    ]
  }
}
```

If some parts of the source documents are repeated, they can be factorised. For example, it is common for the structure of HTML pages to be similar from page to page, so `select`, `remove` and `filter` would be the same. These elements can be shared instead of being duplicated:

```json
"documents": {
  "<terms type>": {
    "executeClientScripts": "",
    "filter": "",
    "remove": "",
    "select": "",
    "combine": [
      {
        "fetch": ""
      },
      {
        "fetch": ""
      }
    ]
  }
}
```

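
One way to picture the factorised form is to expand it back into one complete object per source document. The sketch below does that; `expandSourceDocuments` is a hypothetical helper, and letting a key set inside a `combine` entry take precedence over the shared key is an assumption of this sketch rather than documented engine behaviour.

```js
// Hypothetical helper: expands a terms declaration into a list of full source documents.
function expandSourceDocuments(termsDeclaration) {
  const { combine, ...shared } = termsDeclaration;

  if (!combine) {
    return [shared]; // single source document: nothing to expand
  }

  // Assumption: keys defined in a `combine` entry override the shared ones.
  return combine.map(sourceDocument => ({ ...shared, ...sourceDocument }));
}

const expanded = expandSourceDocuments({
  select: 'main',
  remove: '.filter-holder',
  combine: [
    { fetch: 'https://example.com/policies/user-agreement' },
    { fetch: 'https://example.com/policies/privacy-policy', select: 'article' },
  ],
});

console.log(expanded);
```
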
## Contributing new declarations

This is a step-by-step guide to help you add declarations to the [contrib-declarations](https://github.com/OpenTermsArchive/contrib-declarations) repository. This repository is dedicated to volunteer contributions of declarations to Open Terms Archive.

Now that you have a brief understanding of how a declaration is structured in JSON, here are the concrete steps to add these JSON files to the repository:

1. Clone the [contrib-declarations](https://github.com/OpenTermsArchive/contrib-declarations) repository to your local machine.
2. Create a branch that describes your contribution, e.g. `add-Open-Terms-Archive-ToS` or `add-firefox-privacy-policy`.
3. Run `npm install`. This installs all the dependencies, including the Open Terms Archive engine, which will allow you to test and validate your declaration.
4. Create a JSON file with the name of the service you are adding the declaration for. This JSON file should be in the `declarations` folder of the repository. To learn more about selecting the right service name, please read the [declaring a new service](#declaring-a-new-service) section of our docs.
5. Visit the declaration URL and use [browser developer tools](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Tools_and_setup/What_are_browser_developer_tools) to inspect the page and find the right selectors for the significant section containing the terms you want to declare.
6. After you've properly added your selectors and structured your JSON file, test and validate it by running `npx ota validate --services [service name]` from the root of the repository. This validates the declaration and highlights any changes required.
7. If all tests pass, open a pull request to the main repository.

You can read more about the [CLI](https://docs.opentermsarchive.org/#cli) to learn about other tests and linting you can run on your declaration.

#### Terms type
